Building a Search Engine pt 2: The Searcher
19 July 02020
So I consulted with my brother, a textbook, and the internet about parallel processing and how to deal with saving the index. The conclusion I came to was that you can't really just task another core with saving the whole index. The options I had were either multithreading (creating another thread on the same core) or multiprocessing (using two separate cores). I couldn't use multithreading because saving the file is CPU-bound, so creating a second thread wouldn't help. And multiprocessing wouldn't work because two processes can't share the same memory (they can, but things can get wonky when it happens). If I wanted to use multiprocessing, I would need to copy the index from one process to another, which is the exact problem I was trying to get around. That's my understanding anyway; it may be incorrect since I don't have any experience with parallel processing. So instead, I decided to split the index into a bunch of different files and merge them into one big file, either in parallel (which is preferable) or after the crawling is over (which is easier).
Changing the crawler to save partial indices is a surprisingly easy modification; it only took 7 lines of code. I did some basic analysis: prior to the change, we could crawl 190 pages in 130 seconds; without saving anything, we could crawl 425 pages in 120 seconds; and saving small files, we got 337 pages in about 110 seconds. I suspect the time savings are even higher, given that performance degrades significantly over time for the script that saved the large file, and I only tested the first 2 minutes. For now I've decided to just do the merging in series rather than in parallel, since that's easier and I won't have to deal with multiprocessing. I'll revisit that later if it becomes an issue and/or I need to continuously crawl the web.
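The serial merge is straightforward: read each shard back in and concatenate the postings word by word. This sketch assumes the shards are JSON files mapping each word to a list of postings; the file names and structure are illustrative.

```python
import glob
import json
from collections import defaultdict


def merge_shards(pattern="index_shard_*.json", out="index_full.json"):
    """Merge all partial index files into one big index, in series.

    Each shard maps word -> list of postings; the merged index simply
    concatenates the postings lists for each word across all shards.
    """
    merged = defaultdict(list)
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            shard = json.load(f)
        for word, postings in shard.items():
            merged[word].extend(postings)
    with open(out, "w") as f:
        json.dump(merged, f)
    return merged
```

Running this once after the crawl finishes gives the single big index file, and it could later be run periodically alongside the crawler if continuous merging becomes necessary.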
What really gave me trouble is the ranking function. I couldn't seem to find any information on how to actually rank the results based on how information appears on the HTML page. I have some information from the textbook I'm following (Croft, Metzler, and Strohman, pg 128), but it's pretty cryptic and abstract. Google's original paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine, has some hints in section 4.5.1, but the way they've set things up is just different enough from the way I've set things up to not be of much use. It seems like what I should be doing is taking the dot product of two functions, g and f. g takes in the query and assigns a number to each word in the query (and permutations of the words) based on how important those words are. f takes in the document and assigns a number to how many times each word appears in the document, weighted by what tags are associated with that word (e.g. words in headings are assigned more weight). After you take the dot product of the two, you then multiply it by the PageRank value of the document. Of the two, the g function is giving me more trouble, since it's hard to determine which words in a query are more important. I think one shortcut is to just get rid of g entirely and create a function that simply looks at the weight of each occurrence of the words in the query and adds them together. There's still the issue of multi-word queries though, and I'm still wrestling with how to deal with the distance between words. A simple Euclidean distance function would work for two words (sqrt((location of first word - location of second word)^2), which is just the absolute difference), but if you add a third word you get into the travelling salesman problem, which I don't want to deal with. I started by just dealing with single-word queries and hoping that the solution to multi-word queries would fall out of that, which is more or less what ended up happening.
It ended up just being a whole lot of for loops: you take the distance between all the different word locations, multiply by the weights, and then sum up all those distances. It's a pretty crude method, but it more or less works for now. I can always come back to it if it's not up to snuff.
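In spirit, the nest of for loops looks something like this (a sketch, not my actual code): for every pair of query words, compare every occurrence of one against every occurrence of the other, and score closer pairs higher. Looking at pairs only is what sidesteps the travelling-salesman territory.

```python
def proximity_score(word_positions):
    """Crude multi-word proximity score.

    word_positions: one list of occurrence positions per query word,
    e.g. [[3, 17], [4]] for a two-word query. For each pair of query
    words, every pair of occurrences contributes an inverse-distance
    term, so occurrences that sit close together score higher.
    """
    score = 0.0
    words = list(word_positions)
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            for a in words[i]:
                for b in words[j]:
                    # abs(a - b) is the same as sqrt((a - b)^2);
                    # the +1 avoids dividing by zero for adjacent terms.
                    score += 1.0 / (1.0 + abs(a - b))
    return score
```

The inverse-distance form is an assumption on my part about how to turn "smaller distance is better" into a number you can add to the tag-weight score; summing raw distances would reward far-apart words instead.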
That's the main news so far: the search functionality is rough and the code is messy, but it works for now. I have some other stuff to do, like stemming words and implementing a spellchecker, but once I have that done I should be able to move on to designing and hooking up the front end, probably using either Flask or Django. I'm leaning towards Flask since it seems better suited to a lightweight web app. I also plan on making a script that makes continuous merging and crawling possible, so I can just leave the crawler running for a long period of time.