Music Auto Tracker pt 1: Fun with Spectrograms
14 October 2019
A couple months back I was watching my girlfriend play guitar from an online tab, and she had to pause playing to scroll down. So for fun, I decided to see if I could figure out a way to automatically follow along with a song, so that you don’t have to pause to scroll down. While I could probably figure out a way to detect the particular chords being played directly from the frequencies, I decided to use deep learning since that seems more fun.
I found a couple of resources online that I could use. The first one is GuitarSet, a dataset of about 3 hours of guitar playing with really in-depth and accurate annotations. Specifically, the annotations are in the form of .jams files, which are JSON-like files with a variety of information about the .mp3 file and, most importantly, the time stamps for particular chords. I also discovered two libraries, librosa and scipy.signal, which were useful for working with MP3s and, crucially, converting them into spectrograms.
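To make the "time stamps for particular chords" idea concrete, here is a minimal sketch of pulling chord segments out of a .jams-style file with nothing but the standard library. The sample document is made up, and its layout (an `annotations` list whose entries carry a `namespace` and a `data` list of `time`/`duration`/`value` observations) is my assumption about the format, so check it against a real GuitarSet file before relying on it. The helper names `chord_segments` and `chord_at` are hypothetical.

```python
import json

# Made-up sample in an ASSUMED .jams-like layout; not taken from GuitarSet.
SAMPLE_JAMS = """
{
  "annotations": [
    {
      "namespace": "chord",
      "data": [
        {"time": 0.0, "duration": 2.0, "value": "G:maj", "confidence": null},
        {"time": 2.0, "duration": 2.0, "value": "C:maj", "confidence": null}
      ]
    }
  ]
}
"""


def chord_segments(jams_text):
    """Return (start, end, label) tuples from the first chord annotation."""
    doc = json.loads(jams_text)
    for annotation in doc["annotations"]:
        if annotation["namespace"] == "chord":
            return [
                (obs["time"], obs["time"] + obs["duration"], obs["value"])
                for obs in annotation["data"]
            ]
    return []  # no chord annotation found


def chord_at(segments, t):
    """Look up the chord label that covers time t (seconds), or None."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return None


segments = chord_segments(SAMPLE_JAMS)
print(chord_at(segments, 0.5))
```

A lookup like `chord_at` is what would let each short audio chunk be paired with a chord label based on where the chunk starts in the recording.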
According to people in the Music Information Retrieval community, spectrograms are a very poor way of getting information from audio; however, I don’t really care because this is just for fun and I don’t need really high fidelity. The first thing we need to do is break up the audio files into chunks. I chose 0.2 seconds, since that’s probably long enough to extract meaningful information, but not so long that latency becomes an issue.
The second choice you need to make is the resolution. By default, scipy.signal and librosa both make the spectrograms go from around 20 hertz up to the tens of thousands, but we don’t actually need that, so we can crop it to roughly 100 to 500 hertz, since that’s about the range of guitar. We also need to choose the size of the fast Fourier transforms we make, which essentially sets the resolution of the spectrogram. Unfortunately, the way that scipy.signal works is that you can either have really good horizontal (time) resolution with bad vertical (frequency) resolution (Δx is small → Δy is big) or the reverse. There may be a workaround, but I couldn’t find it. I chose to have a relatively small Δy (about 10 hertz) and a larger Δx (about 0.02 seconds).
I also removed the axes to make it easier for the AI to understand. In the example spectrogram, the dimmer lines above the main, bright line are the overtones of the original frequency.
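Here is a minimal sketch of how those choices might map onto `scipy.signal.spectrogram`. It is not the code from the notebook. It assumes a sample rate of 22050 Hz (librosa's default when loading audio), the function name `chunk_spectrogram` is hypothetical, and the 220 Hz test tone stands in for a real recording. The window length is picked so that neighbouring frequency bins are about 10 Hz apart, and `noverlap` is set so that neighbouring columns are about 0.02 seconds apart. Overlapping the windows like this may be one partial workaround for the trade-off, although the frequency detail is still tied to the window length.

```python
import numpy as np
from scipy import signal


def chunk_spectrogram(chunk, sr, f_lo=100.0, f_hi=500.0, df=10.0, dt=0.02):
    """Spectrogram of one audio chunk, cropped to the band [f_lo, f_hi] Hz."""
    nperseg = int(round(sr / df))  # window length giving roughly df Hz per bin
    hop = int(round(sr * dt))      # samples between neighbouring columns
    freqs, times, power = signal.spectrogram(
        chunk, fs=sr, nperseg=nperseg, noverlap=nperseg - hop
    )
    keep = (freqs >= f_lo) & (freqs <= f_hi)  # drop bins outside the band
    return freqs[keep], times, power[keep, :]


sr = 22050                                 # assumed sample rate
t = np.arange(int(0.2 * sr)) / sr          # one 0.2 second chunk
chunk = np.sin(2 * np.pi * 220.0 * t)      # synthetic 220 Hz test tone

freqs, times, power = chunk_spectrogram(chunk, sr)
peak_hz = freqs[power.mean(axis=1).argmax()]  # brightest row of the image
print(power.shape, peak_hz)
```

With these settings each window is 0.1 seconds long, so a 0.2 second chunk only yields a handful of columns. That is the trade-off in practice: sharper frequency rows cost you time detail.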
Additionally, I chose to only save spectrograms that contain a note onset (i.e., the beginning of a note being played). To do that you can measure the total sound energy at a particular time, take the root mean square of it, and then take the derivative. I saved every clip whose derivative was higher than some threshold and threw out everything below it. This was primarily to save space, since I would have ended up using about 10x the storage if I didn’t selectively throw out spectrograms. It is also an artifact of earlier testing on raw audio I recorded from my microphone. I may or may not end up keeping it in the final product.
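The onset filter can be sketched roughly like this: compute the RMS energy over short frames, difference it to approximate the derivative, and keep the clip if the steepest rise exceeds a threshold. The frame length, hop size, and threshold of 5.0 are illustrative guesses rather than the values I actually used, and the function names are hypothetical.

```python
import numpy as np


def frame_rms(audio, frame_len, hop):
    """Root mean square energy of each frame of the signal."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    return np.array([
        np.sqrt(np.mean(audio[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])


def has_onset(audio, sr, threshold, frame_len=1024, hop=512):
    """True if the RMS energy rises faster than `threshold` per second."""
    rms = frame_rms(audio, frame_len, hop)
    slope = np.diff(rms) * sr / hop  # finite-difference derivative of the RMS
    return bool(slope.size and slope.max() > threshold)


sr = 22050                               # assumed sample rate
t = np.arange(int(0.2 * sr)) / sr        # one 0.2 second clip
steady = np.sin(2 * np.pi * 220.0 * t)   # tone that is already ringing
attack = steady.copy()
attack[: len(attack) // 2] = 0.0         # silence, then the note starts

print(has_onset(attack, sr, threshold=5.0), has_onset(steady, sr, threshold=5.0))
```

The clip with the sudden attack passes the filter while the steady tone does not, which is the behaviour you want if the goal is to keep only the moments where a new note or chord begins.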