I've been working on a genetic algorithm to figure out how to play a guitar chord just by listening to it. I have it all working now, but the system I am using to compare how 'similar' a guess is to the target seems less than robust. Here is the basic workflow of the algorithm:
- I have a WAV file of the 'target' chord, along with its spectrogram, which I am trying to have the algorithm recreate
- I create a population of 'chords': each one simulates how a person would place their fingers on a fretboard, and I then convert that fingering into the set of pitches it produces (roughly the first sketch after this list).
- Then, I run each chord through the simulation: I play what that chord would sound like through a MIDI library and, at the same time, record the sound to convert into a spectrogram. This is not ideal, because I have to wait for each chord to play in isolation, but to my knowledge there is no way to just dream up a spectrogram; I have to actually record it.
- Then I assign each chord a 'fitness' value: I trim any silence at the start and end of each WAV file, turn it into a spectrogram, and average across the time dimension, so time is not a factor and, instead of a plot of frequency vs. time, each spectrogram is just a list of frequencies and their corresponding amplitudes. The fitness is then the pixel-wise mean squared (or mean absolute) error between the target's averaged spectrogram and the chord's (roughly the second sketch below).
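
To make the chord representation a bit more concrete, the fingering-to-pitches conversion is roughly along these lines (a simplified sketch, not my actual code; standard tuning and all the names here are just for illustration):

```python
# Rough sketch of the fingering -> pitches step (standard tuning assumed;
# names are made up for illustration).
STANDARD_TUNING = [40, 45, 50, 55, 59, 64]  # MIDI notes E2 A2 D3 G3 B3 E4

def fingering_to_pitches(fingering):
    """fingering: one entry per string, a fret number or None for a muted string."""
    pitches = []
    for open_note, fret in zip(STANDARD_TUNING, fingering):
        if fret is not None:
            pitches.append(open_note + fret)  # each fret raises the pitch one semitone
    return pitches

# Example: open G major shape (3 2 0 0 0 3) -> [43, 47, 50, 55, 59, 67]
print(fingering_to_pitches([3, 2, 0, 0, 0, 3]))
```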
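
And the fitness step boils down to roughly this (again a simplified sketch: it skips the MIDI playback/recording part and just assumes librosa and numpy; file names and parameters are placeholders):

```python
# Rough sketch of the fitness step: trim silence, compute a magnitude spectrogram,
# average it over time, then compare pixel-wise against the target.
import numpy as np
import librosa

def averaged_spectrum(path, sr=22050, n_fft=2048):
    y, _ = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=30)   # cut leading/trailing silence
    S = np.abs(librosa.stft(y, n_fft=n_fft))    # magnitude spectrogram (freq x time)
    return S.mean(axis=1)                       # average over time -> 1D spectrum

def fitness(target_spec, candidate_path, use_mse=True):
    cand_spec = averaged_spectrum(candidate_path)
    diff = target_spec - cand_spec
    return float(np.mean(diff ** 2)) if use_mse else float(np.mean(np.abs(diff)))

# target_spec = averaged_spectrum("target_chord.wav")
# score = fitness(target_spec, "candidate_chord.wav")  # lower = more similar
```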
However, I have noticed that this way of computing each organism's fitness doesn't work great. Even when I run the same chord shape through multiple times, the fitness it receives fluctuates a lot from run to run. The recording should be at roughly the same volume each time, so I can't imagine normalization will help, but I could try it (something like the snippet below). Any other ideas about how to make it more reliable? Are spectrograms not the way to go? Or maybe there is a smarter option than pixel-wise error? Thanks!
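
For reference, the normalization I had in mind would just be scaling each averaged spectrum before the comparison, something like this (untested, just to show what I mean):

```python
# Scale each averaged spectrum so overall level differences drop out of the error.
import numpy as np

def normalize_spectrum(spec, eps=1e-8):
    return spec / (np.linalg.norm(spec) + eps)  # unit L2 norm; max or sum would also work

# fitness would then compare normalize_spectrum(target_spec) vs normalize_spectrum(cand_spec)
```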