Please check [this answer][1], which describes a few approaches to the same problem. Given that bird song is a monophonic signal (only one fundamental frequency at any point in time, as opposed to polyphonic), and given that the timbre is irrelevant, the most interesting feature to extract for this classification task is the pitch contour, i.e. the fundamental frequency tracked over time.
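
To make "pitch contour" concrete, here is a minimal sketch of one way to extract it: a frame-by-frame autocorrelation pitch estimator using only NumPy. This is my own illustration, not necessarily one of the methods from the linked answer. The function name `pitch_contour`, the frame and hop sizes, the 500 to 8000 Hz search range and the `min_strength` threshold are all assumptions you would need to tune for your recordings.

```python
import numpy as np

def pitch_contour(signal, sample_rate, frame_len=1024, hop=256,
                  fmin=500.0, fmax=8000.0, min_strength=0.3):
    """Estimate one fundamental frequency (Hz) per frame via autocorrelation.

    Frames with weak periodicity (or silence) are reported as NaN.
    fmin/fmax are assumed bounds for the song's pitch, not universal values.
    """
    signal = np.asarray(signal, dtype=float)
    # Convert the frequency search range into a lag search range (in samples).
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(frame_len - 1, int(sample_rate / fmin))
    window = np.hanning(frame_len)
    contour = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frame = (frame - frame.mean()) * window
        # Autocorrelation via FFT, zero-padded to avoid circular wrap-around.
        spectrum = np.fft.rfft(frame, 2 * frame_len)
        acf = np.fft.irfft(np.abs(spectrum) ** 2)[:frame_len]
        if acf[0] <= 0:
            contour.append(np.nan)  # silent frame, no pitch
            continue
        acf = acf / acf[0]  # normalise so that lag 0 has strength 1
        # The strongest peak inside the allowed lag range gives the period.
        lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
        contour.append(sample_rate / lag if acf[lag] >= min_strength else np.nan)
    return np.array(contour)

# Usage: a synthetic 2 kHz tone should give a flat contour near 2000 Hz.
sr = 44100
t = np.arange(sr // 2) / sr
print(np.nanmedian(pitch_contour(np.sin(2 * np.pi * 2000.0 * t), sr)))
```

Note that the resolution is limited by the integer lag, so the estimates are quantised to `sample_rate / lag`; interpolating around the autocorrelation peak would refine this if you need it.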


 [1]: http://dsp.stackexchange.com/questions/8220/performing-classification-based-on-fft-results