Building a Search Engine pt 4: Shipping It

18 August 02020

The search engine is finally in a good enough state to show off. If you just want to check it out without reading the whole post, it's at wholesearchcatalog.com. As of my last post it was technically working, in that you could enter a query and get results back, but it had a lot of bugs, was missing necessary features, and was unbearably slow. Since then I've fixed a lot of those problems, and while there are still more features to add and more blogs to crawl, the core experience is there. So, on with the project update.

One of the biggest lessons I learned is to exploit caching and RAM as much as possible if you want fast code. But before I explain, some context. As I mentioned before, the search engine uses MongoDB as its database, which is organized as database -> collection -> documents. The search engine has two databases: the word index and the page index. In the word index, each collection is a particular word, and it contains the urls and locations of each occurrence of that word. So for the "building" collection in the word index database, each page containing the word "building" gets a corresponding document with the url of the page, the locations where that word occurs, and when it was last updated. In the page index, each collection is a webpage, and each document contains the tag information for that page. So the collection "johnpatrickbender.com" would contain one document listing all the locations of the h1 tags, one document for the li tags, and so on.
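To make that concrete, here's a sketch of what those documents might look like. The exact field names here are my illustration, not necessarily the real schema:

```python
# Hypothetical document in the word index: one collection per word,
# one document per page that contains the word.
word_index_doc = {
    "url": "johnpatrickbender.com/building-a-search-engine-pt4",
    "locations": [2, 57, 103],   # positions of "building" on the page
    "last_updated": "2020-08-18",
}

# Hypothetical document in the page index: one collection per site,
# one document per tag type, listing where that tag appears.
page_index_doc = {
    "url": "johnpatrickbender.com/building-a-search-engine-pt4",
    "tag": "h1",
    "locations": [(0, 8)],       # (start, end) offsets of each h1
}
```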

So how does exploiting RAM relate to this? Say I search "building". The urls for this series of blog posts are in the "building" collection, since "building" is in the title, so all those urls need to be scored to see how well they fit the query. The scoring function takes that list of urls and iterates through each one to score its page, and for each url it in turn iterates through the tags to weight each instance of the word. The issue was that for each tag, the code made a database call to pull up that particular document with the tag information, something like webdb.building.find_one({'url': url, 'tag': tag}), where url and tag are whatever url and tag I was on at the time. The problem is that calling the database takes a long time. For a simple search with fewer than 50 urls I was making around 2000 calls, and each call took about 5 milliseconds, which totaled about 10 seconds, an eternity for a search engine. Rather than load each document individually, I now load every document in the "building" collection with a single webdb.building.find() call, then do the rest of the scoring in memory. So instead of spending 10 seconds loading information from the database, the code spends 0.25 seconds loading the same amount of information. I could take this a step further by pulling up all the urls at once, but on first glance that seems to require significant restructuring of the database and a lot more hard drive space. One easier way of speeding up database calls is to cache results in a new database when a term is searched, and load those instead of recomputing common queries every time.
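The pattern is simple: one query, then an in-memory dictionary for every subsequent lookup. Here's a sketch using a plain Python list in place of a real MongoDB cursor, with illustrative field names and a toy scoring rule:

```python
# Stand-in for the documents returned by one find() call over a collection.
documents = [
    {"url": "a.com/post", "tag": "h1", "locations": [0]},
    {"url": "a.com/post", "tag": "p",  "locations": [5, 12]},
    {"url": "b.com/blog", "tag": "h1", "locations": [1]},
]

# One pass over the query's results builds a RAM-resident index...
by_url_and_tag = {(d["url"], d["tag"]): d for d in documents}

# ...so the scoring loop does dictionary lookups instead of
# thousands of find_one() round trips to the database.
def score(url, tags):
    total = 0
    for tag in tags:
        doc = by_url_and_tag.get((url, tag))
        if doc:
            total += len(doc["locations"])  # toy weighting: count occurrences
    return total
```

With this in place, `score("a.com/post", ["h1", "p"])` touches only RAM, so the per-lookup cost drops from milliseconds to microseconds.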

That was the big breakthrough for me these past few weeks. Other than that, I patched up a lot of bugs I ran across, largely resulting from people having weird formats for their websites. There are a lot of weird standards out on the internet, especially for blogs, and that can create havoc (i.e. bugs) when you erroneously assume all websites will have a sensible layout. As a fun example, one website the crawler looks at deletes every closing tag at the beginning of a line, but only when the scraper downloads it. When I downloaded the html page from my browser, those closing tags were still there, which I eventually figured out after about 3 hours of assuming my code was to blame. That then began the process of figuring out how to add closing tags when none exist, which I'm sure will introduce more errors down the line through further erroneous assumptions.

One of the things my many bug fixes taught me is to use list comprehensions rather than for loops to build lists. I ran into a bug a few times where I would write a for loop to iterate through a list while deleting elements of that list if certain conditions were met. This is almost always a bad idea, since deleting parts of a thing you're iterating through will cause it to skip elements. There are ways to get around it, but it's much easier to instead write something like [x for x in y if x not in z], which handles everything on the back end. You can even nest loops or conditions, so something like [x for x in y if all(x not in z for z in a)]. For those of you familiar with set theory or real analysis, this way of writing lists will be second nature.
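Here's a small demonstration of the skipping bug and the comprehension that replaces it, using toy data rather than the crawler's actual lists:

```python
# Removing elements while iterating: the iterator's index keeps
# advancing over the shrinking list, so adjacent matches get skipped.
words = ["the", "cat", "the", "the", "dog"]
for w in words:
    if w == "the":
        words.remove(w)
# words is now ["cat", "the", "dog"] -- one "the" slipped through.

# The comprehension builds a fresh list and never mutates the list
# it is reading, so every element gets examined exactly once.
words = ["the", "cat", "the", "the", "dog"]
stopwords = {"the"}
filtered = [w for w in words if w not in stopwords]
# filtered == ["cat", "dog"]
```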

Another thing I learned about was the python module NLTK, which stands for Natural Language Toolkit. My brother told me about it since he's worked with natural language processing before, and I found it to be extraordinarily helpful. Specifically, I used its stemming package to stem words both when crawling and searching. Stemming is the process of obtaining the root word, so going from 'twitching' -> 'twitch' or 'likes' -> 'like'. This makes searching a lot easier because you don't need to search for the exact word that you want, and it makes indexing a little easier by reducing the number of words you need to keep track of. With the NLTK module it took two lines of code, which is amazing, especially since I was dreading implementing a stemmer for a long time. Here's the tutorial I used.
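For reference, it really is about two lines with NLTK's Porter stemmer (assuming you've pip-installed nltk; no corpus download is needed for this):

```python
from nltk.stem import PorterStemmer

# The Porter algorithm strips common suffixes to recover the root.
stemmer = PorterStemmer()
print(stemmer.stem("twitching"))  # twitch
print(stemmer.stem("likes"))      # like
```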

Finally, I registered a domain, hooked up DNS routes, and set up some basic analytics using Simple Analytics. So the search engine is up for public use if you would like to try it out; it's at wholesearchcatalog.com. For now it's free to use and you don't need to sign up for anything. I've crawled around 50 sites, mostly things that hackers would be interested in: a lot of computer science and math focused blogs, plus a few corporate engineering blogs, research groups, and security blogs. If you have any feedback or suggested sites, let me know either via email (me@johnpatrickbender.com), via comment on this post on r/hnblogs, or via the comment/suggestion box on the website.

Next up for this project is implementing the caching of search results I mentioned earlier, implementing inverse document frequency, and building a better architecture by splitting up the database, server, and crawler. Inverse document frequency, which is often combined with another technique called term frequency, is a way of weighting the importance of a particular keyword in a query. It basically says that keywords that are common in our universe of pages are less important, and keywords that are rare are more important. I originally learned about inverse document frequency from this blog post by Arpit Bhayani, which I found while looking up something unrelated on the whole search catalog, which I take as an endorsement that the project is on the right track. The architecture currently has everything on a single ec2 instance. Instead, what I should do is have a dedicated database instance, a dedicated crawler instance, and server instance(s) behind a load balancer. That seems to be the correct way of doing it on first pass, anyway; I know almost nothing about systems administration or server architecture, so hopefully nothing falls apart.
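The idf weighting I'm planning can be sketched in a few lines. This uses the common log(N/df) formulation; the exact variant I end up shipping may differ:

```python
import math

def idf(term, pages):
    """Inverse document frequency: rare terms score high,
    ubiquitous terms score near zero."""
    df = sum(1 for words in pages if term in words)  # document frequency
    return math.log(len(pages) / df) if df else 0.0

# Toy corpus: each page reduced to its set of words.
pages = [
    {"building", "a", "search", "engine"},
    {"a", "blog", "about", "math"},
    {"a", "security", "blog"},
]
# "a" appears on every page, so its idf is 0.0;
# "security" appears on one page of three, so its idf is log(3) ~ 1.1.
```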