Building a Search Engine pt 1: The Crawler and Indexer
16 July 02020
A few weeks ago I decided to write a search engine from scratch, specifically a search engine focused on personal blogs. This is partly an attempt to revive the blogosphere from its collapse around the early 2010s. A few reasons for that collapse come to mind: writers being pushed into walled gardens (Facebook, Medium, and their ilk), competition from SEO spam pushing blogs' search rankings down, and lower barriers to entry making it easier for bad content to drown out the good. The idea behind the search engine is to only crawl pre-approved sites, and those sites will be limited to high-quality, small-scale blogs owned by individuals or small groups of people.
So let's briefly go over what a search engine actually is. A search engine is composed of three main parts: the crawler, the indexer, and the searcher. The crawler is the code that requests HTML files and follows links in those files to yet more files. Starting from a URL such as http://www.johnpatrickbender.com/, it would find all the links to my various posts, then the links from those posts to other HTML files (say, http://www.wikipedia.org/), and so on. I used Scrapy, a Python library geared toward scraping web pages, to do most of the heavy lifting with crawling: it makes it very easy to scrape up all the links on a page and follow them. Following only links from pre-approved sites takes a bit of work, since links come in all shapes and sizes, but with enough trial and error I think I have most of the edge cases covered.
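The allow-list filtering might look something like the sketch below (the domains and helper name are hypothetical, not the post's actual code). The fiddly part is that links on a page can be relative, absolute, or not HTTP at all, so each one has to be resolved against the page it came from before its host can be checked:

```python
from urllib.parse import urljoin, urlparse

# Hypothetical allow-list of pre-approved blog hosts
APPROVED_HOSTS = {"www.johnpatrickbender.com"}

def is_approved(base_url, href):
    """Resolve a (possibly relative) link and check its host against the allow-list."""
    absolute = urljoin(base_url, href)  # handles "/posts/foo.html" and full URLs alike
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, etc.
    return parsed.netloc.lower() in APPROVED_HOSTS
```

In a Scrapy spider, a check like this would sit in the `parse` callback, deciding which extracted links get passed to `response.follow`.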
The indexer takes the HTML files that our crawler downloads and creates what is called a reverse index from them. Normally the web works forwards: you request a particular URL and get back content in the form of an HTML file, which is basically just a string of words. When searching, we want the reverse of that: we supply particular words and want back the URLs that contain them. So naturally the reverse index is a dictionary mapping words to the URLs that contain those words, plus some extra information that will be helpful when searching, such as where in the page each word appears and how many times.
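As a minimal sketch of that structure (assuming the pages have already been stripped down to plain text, and ignoring real-world tokenization details like punctuation), the index can map each word to a dictionary of URLs and the positions where the word occurs; the position list gives you both "where" and "how many times" for free:

```python
from collections import defaultdict

def build_reverse_index(pages):
    """pages: dict of url -> plain text. Returns word -> {url: [word positions]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(pos)
    return index
```

Looking up a word is then a single dictionary access, and `len(index[word][url])` recovers the occurrence count.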
I haven't actually done much work on the search functionality yet, so I'll write about it another time when I have a better grasp of it.
It took a few days of fiddling with the scraper and indexer to get the indices exactly how I wanted them, but then I ran into a snag: actually saving the indices takes a long time. When the reverse index was a mere 6 MB it took almost half a second to save, and since I was saving every time I crawled a new site, that adds up fast. There are a few ways I could overcome this. The easiest would be to save only after crawling x files, but that only delays the problem, since the file will just keep growing and growing, and a crash could lose up to x-1 files' worth of work. Another solution would be to save a bunch of small JSON files and then merge them on another thread. A third would be to set up multithreading and use a second thread to do the saving directly. I think I'm going to go with the last one, since that seems like a pretty durable solution that's not that difficult to implement, though I've never worked with multithreading, so who knows. If the multithreaded saves end up taking too long, I can borrow the first idea, saving every x files, and dynamically tune x so that the two threads take approximately equal amounts of time.
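That third approach might look roughly like this (a sketch, not the post's actual code): the crawler hands snapshots of the index to a queue, and a background thread drains the queue and does the slow JSON write, so crawling never blocks on disk I/O. Note the `dict(index)` shallow copy, which snapshots the top level so the crawler can keep adding words while the write is in flight:

```python
import json
import queue
import threading

class IndexSaver:
    """Writes index snapshots to disk on a background thread."""

    def __init__(self, path):
        self.path = path
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._worker, daemon=True)
        self.thread.start()

    def _worker(self):
        while True:
            index = self.queue.get()
            if index is None:  # sentinel: shut the worker down
                break
            with open(self.path, "w") as f:
                json.dump(index, f)

    def save(self, index):
        # Shallow snapshot so the crawler can keep mutating its own copy
        self.queue.put(dict(index))

    def close(self):
        self.queue.put(None)
        self.thread.join()
```

The queue also gives the dynamic-x fallback a natural home: if snapshots pile up faster than the worker can write them, the crawler can space its `save` calls further apart.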