The problem I'm trying to solve is extracting a "pure" version of a one-minute musical track that has been mixed together with voices speaking over it, across roughly 50 different recorded versions.
The mixing was done in a DAW, so the background musical track should have essentially identical samples at identical volume in every version, and I can align all 50 versions perfectly, down to the sample.
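(For what it's worth, here's roughly how I'd do the alignment in Python; just a sketch using cross-correlation in numpy/scipy, and it assumes the offsets are whole samples and that every clip contains the same one minute of music:)

```python
import numpy as np
from scipy.signal import correlate

def align_to_reference(ref, x):
    """Return x shifted so it lines up sample-for-sample with ref.

    Sketch only: assumes an integer-sample offset and fully
    overlapping clips of (roughly) the same length.
    """
    corr = correlate(x, ref, mode="full")
    lag = int(corr.argmax()) - (len(ref) - 1)   # how many samples x lags ref
    if lag > 0:
        x = x[lag:]                              # x starts late: drop leading samples
    elif lag < 0:
        x = np.concatenate([np.zeros(-lag), x])  # x starts early: pad the front
    return x[:len(ref)]                          # trim to the reference length
```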
Obviously I could just average all 50 versions together, but the voices would still be audible as a general muddiness, which won't work at all.
It occurred to me that a solution could be to convert each recording into a spectrogram using identical settings, and then take, for each "pixel", the lowest value that occurs across any of the recordings. Since the voices talking on top are only additive, this should reveal the pure musical soundtrack underneath almost perfectly.
(A friend of mine once did a similar thing with photography -- he photographed a busy traffic intersection thousands of times over a few hours and took the median value for each pixel, and produced an "empty" image of the busy intersection without any cars or pedestrians. For pixels you need a median since objects aren't additive, but since audio is additive a minimum should work.)
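In Python terms, what I'm imagining is roughly the sketch below, just to make the idea concrete. It uses scipy's STFT, assumes all recordings are already aligned and trimmed to the same length, and keeps the phase of whichever recording has the smallest magnitude in each bin (since you need some phase to get back to audio):

```python
import numpy as np
from scipy.signal import stft, istft

def extract_background(recordings, fs, nperseg=4096, noverlap=3072):
    """Per-bin minimum across sample-aligned recordings (sketch only).

    recordings: list of 1-D arrays, already aligned and trimmed to the
    same length; fs is the shared sample rate.
    """
    # STFT of every recording with identical settings, stacked into one array
    specs = np.array([
        stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)[2]
        for x in recordings
    ])  # shape: (n_recordings, n_freqs, n_frames)

    # For each time-frequency bin, keep the complex value whose magnitude is
    # smallest across all recordings, so a plausible phase comes along with it.
    idx = np.abs(specs).argmin(axis=0)                    # (n_freqs, n_frames)
    z_min = np.take_along_axis(specs, idx[None, :, :], axis=0)[0]

    # Back to a waveform
    _, background = istft(z_min, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return background
```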
So I have two questions about this.
First, is there any reason this wouldn't work, or is there a better approach?
And secondly, is anyone aware of a tool that can accomplish this? I know people frequently layer together a hundred exposures in Photoshop and use blending modes to calculate a desired effect, but I've never heard of a tool that lets you layer spectrograms and calculate things like minimums across them. I don't really want to have to hand-code something like this in Python, but I will if I have to -- so I'd also appreciate hearing if there's a particular package that would be best suited for this.