Building a Search Engine pt 3: The Database and Front End

29 July 02020

A lot has happened since my last post, and the quick update is that I have a very crude v1 working. Right after posting part 2, I let the crawler, indexer, and merger run overnight, and it crashed my computer. The reason was that I was storing all the data in a gigantic JSON file that ended up being about 1.3 GB, and apparently computers don't like reading and writing 1.3 GB of data once a minute for 8 hours. So I decided to migrate to an actual database, since with a database you don't have to load the whole thing into memory just to save a change. I went with MongoDB because I had played with it a bit in a previous project, and it has a Python module that lets Flask hook into it easily. The migration from raw JSON to an actual database didn't take too long, and once it was running it made my life a lot easier.
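For anyone curious what that kind of migration looks like, here's a rough sketch of how I'd move a monolithic JSON index into Mongo with pymongo. The file name, database name, and document shape are all made up for the example; the real index structure is whatever your crawler produces.

```python
import json


def index_to_documents(index):
    """Flatten a {url: page_data} dict into a list of Mongo documents,
    so each page becomes its own record and updates only touch one document."""
    return [{"url": url, **data} for url, data in index.items()]


if __name__ == "__main__":
    from pymongo import MongoClient  # assumes pymongo is installed

    client = MongoClient("mongodb://localhost:27017")
    collection = client["search_engine"]["pages"]  # hypothetical names

    # Load the old giant JSON file once, then never again
    with open("index.json") as f:
        index = json.load(f)

    collection.insert_many(index_to_documents(index))
```

The win is that after this, the crawler and merger can update individual documents instead of rewriting the whole file every cycle.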

Speaking of Flask, I decided to go with Flask over Django, and so far it's going alright for someone who has never done any front-end work before. One of the biggest lessons I've learned is that there's a lot of fiddling with configuration settings to make Apache talk to Flask. It involves a thing called a Web Server Gateway Interface (WSGI), and getting the WSGI setup to work requires a lot of patience and the ability to Google.
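For reference, the Apache side of that fiddling ends up looking something like the mod_wsgi config below. All the paths and names here are placeholders, not my actual setup:

```apache
# Hypothetical /etc/apache2/sites-available/search.conf
<VirtualHost *:80>
    ServerName example.com

    # Run the app in its own daemon process; python-path tells mod_wsgi
    # where the app's modules live so imports resolve
    WSGIDaemonProcess search python-path=/var/www/search
    WSGIScriptAlias / /var/www/search/app.wsgi

    <Directory /var/www/search>
        WSGIProcessGroup search
        WSGIApplicationGroup %{GLOBAL}
        Require all granted
    </Directory>
</VirtualHost>
```

The `app.wsgi` file itself is just a couple of lines that import the Flask app object under the name `application`, which is what mod_wsgi looks for.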

Unfortunately, right around the time I was starting to poke into Flask, my Mongo database crashed in a way that would have taken a long time to fix, or at least longer than I was willing to invest. The crash happened because I didn't safely shut down the program while it was still writing to the database, so the data got corrupted. I tried to repair it, but the repair would always bump into the limit on the number of files one process can open. Since I was going to move onto an AWS server anyway, I decided I might as well do it now. Porting everything from my machine to the AWS instance was pretty easy once I installed all the necessary Python modules. I ran into the same open-file limit with the new Mongo database, but discovered the fix was a simple two-line edit to the limits.conf file under /etc/security. I even got the multithreading I mentioned in the last post up and running, so I can now crawl and merge at the same time.
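For anyone hitting the same wall, those two lines look roughly like this (the limit value is just an example; tune it to your setup):

```
# /etc/security/limits.conf — raise the max open files for all users
*    soft    nofile    65535
*    hard    nofile    65535
```

You'll need to log out and back in (or restart the service) for the new limits to take effect.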

After I got it all configured, I decided to make just two pages: the index, where users type in their queries, and the results page, where users receive the answers to their queries. It's a very basic Flask setup, and I'm sure I'm doing a bunch of things wrong, or at least improperly, but it works more or less. One thing I've learned about Flask is that the people who write Flask code really, really like to separate everything out into multiple files. One of the hardest bugs I solved was that I couldn't import one of my files because the Apache WSGI configuration didn't let one file see another, even though it was in the same directory. Honestly, the biggest takeaway from this whole experience is that while Apache is the easiest way to serve dynamic content, it still sucks, and misconfigurations happen all the time.
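The two-page setup is about as small as a Flask app gets. Here's a stripped-down sketch of the shape of it; the `search_index` stub and the inline HTML are stand-ins, since the real version queries Mongo and uses proper templates:

```python
from flask import Flask, request

app = Flask(__name__)


def search_index(query):
    """Stub standing in for the Mongo lookup; the real version
    would query the pages collection and rank the hits."""
    return []


@app.route("/")
def index():
    # The index page: just a search box that submits to /results
    return """
        <form action="/results">
            <input name="q" type="text">
            <input type="submit" value="Search">
        </form>
    """


@app.route("/results")
def results():
    # The results page: read the query from the URL and render the hits
    query = request.args.get("q", "")
    items = "".join(f"<li>{url}</li>" for url in search_index(query))
    return f"<h1>Results for {query}</h1><ul>{items}</ul>"
```

Two routes, one form, and Flask handles the rest of the request plumbing.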

I still have some stuff to do before I'd consider it ready to show to people, mostly things that are easy enough to work out, like hooking it up to a domain name and fixing a bug that prevents multi-word queries, but it's pretty close. The biggest issue is that every time I try to crawl and merge sites, the instance crashes. I suspect the bug is due to the server's limited resources; I may be maxing out the RAM or the hard drive or something like that. Unfortunately, every time it happens my terminal freezes, since the instance won't send me any data. If it ends up being a problem with the code itself, I'll just have to print a bunch of stuff to a log file to track down the source of the problem. Luckily I'm already using Python's logging module, so that should be as easy as changing one line of code. I'm leaving for vacation soon, though, so I won't actually be able to work on it for a few days.
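That one-line change is just pointing `basicConfig` at a file instead of the console. A minimal sketch (file name and format are examples, not my actual config):

```python
import logging

# Adding filename= sends all log records to a file instead of stderr.
# force=True (Python 3.8+) replaces any handlers that were already set up.
logging.basicConfig(
    filename="crawler.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,
)

logging.info("crawler started")
```

Since the instance freezes on crash, the log file survives a reboot even when the terminal output doesn't, which is exactly what I need for debugging this.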