$\begingroup$

I'm trying to extract a "pure" version of a one-minute musical track that has been mixed with voices speaking over it, working from, let's say, 50 different recorded versions.

The mixing has been done in a DAW so the background musical track should have essentially identical samples at identical volume, and I can align all 50 versions perfectly down to the sample.

Obviously I could just average all 50 versions together, but the voices would still be audible as a general muddiness, which won't work at all.

It occurred to me that a solution could be to convert each recording into a spectrogram using identical settings, and then take the lowest value for each "pixel" across all of the recordings. Since the voices talking on top are only additive, this should essentially perfectly reveal the pure musical soundtrack underneath.

(A friend of mine once did a similar thing with photography -- he photographed a busy traffic intersection thousands of times over a few hours and took the median value for each pixel, and produced an "empty" image of the busy intersection without any cars or pedestrians. For pixels you need a median since objects aren't additive, but since audio is additive a minimum should work.)

So I have two questions about this.

First, is there any reason this wouldn't work, or is there a better approach?

And secondly, is anyone aware of a tool that can accomplish this? I know people frequently layer together a hundred exposures in Photoshop and use blending modes to calculate a desired effect. But I've never heard of a tool that allows you to layer spectrograms and calculate things like minimums across them. I don't really want to have to hand-code something like this in Python, but I will if I have to -- so I would also appreciate hearing if there's a particular package that would be best suited for this.

$\endgroup$

1 Answer

$\begingroup$

First, is there any reason this wouldn't work,

That's a reasonable approach, but the devil is in the details. You will have to figure out the right framing approach (frame size, window, overlap, etc.) to avoid time-domain aliasing, which sounds really weird, and you need to properly manage the phase of the spectra.

since audio is additive a minimum should work.

That is NOT a safe assumption. The spectra are complex, and depending on the relative phase between music and speech, the magnitude of the sum can be smaller than the magnitude of the music alone. It's correct to assume that the speech is additive "on average", but that may not hold for each individual spectral line.
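To make the phase point concrete, here is a small sketch for a single spectral line. The helper name `mixed_magnitude` and the example magnitudes (1.0 for music, 0.5 for speech) are made up for illustration only.

```python
import cmath
import math

def mixed_magnitude(music_mag, speech_mag, phase_diff):
    """Magnitude of one spectral line after adding speech to music.

    phase_diff is the phase of the speech relative to the music, in radians.
    """
    music = cmath.rect(music_mag, 0.0)          # music as a complex value
    speech = cmath.rect(speech_mag, phase_diff)  # speech as a complex value
    return abs(music + speech)

# Same magnitudes, three different relative phases.
for degrees in (0, 90, 180):
    mag = mixed_magnitude(1.0, 0.5, math.radians(degrees))
    print(f"{degrees:3d} deg -> {mag:.3f}")
```

At 180 degrees the mixed magnitude is 0.5, which is below the music-only magnitude of 1.0, so a per-bin minimum would pick the version with speech in it rather than the clean one.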

or is there a better approach?

I think your best shot here is to find "speech free" frames and splice them together. Assuming that the recordings are all sample aligned from start to finish and level matched, you could do this even in the time domain. Go through the files sample by sample and calculate the PDF (probability density function) across all 50 recordings. There ought to be a clear peak in there: choose the sample value that has the most occurrences.
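A minimal sketch of the "most occurrences" selection in the time domain, assuming the recordings are already loaded as integer PCM into one NumPy array of shape `(n_recordings, n_samples)` and are sample aligned. The function name `most_common_sample` is hypothetical, and loading the audio files is left out.

```python
import numpy as np

def most_common_sample(stack: np.ndarray) -> np.ndarray:
    """Pick, for every sample position, the value that occurs most often.

    stack: integer array of shape (n_recordings, n_samples).
    Ties are resolved in favour of the earliest recording.
    """
    n_rec, n_samples = stack.shape
    counts = np.empty((n_rec, n_samples), dtype=np.int32)
    for k in range(n_rec):
        # How many recordings share recording k's value at each position?
        counts[k] = (stack == stack[k]).sum(axis=0)
    winner = counts.argmax(axis=0)  # index of the best recording per sample
    return stack[winner, np.arange(n_samples)]
```

This compares every recording against every other one, so for a full one-minute track it may be worth processing the samples in chunks to keep memory use down.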

That approach only works if the music is truly identical (no sample rate conversion or clock adjustments, no compression, no mastering, no dithering, etc.). If that's not the case, the same approach could be applied to the short-time Fourier transform (STFT) as well. Don't choose the median, minimum, or mean: choose the value that occurs the most.
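The STFT variant could be sketched roughly as follows, using `scipy.signal.stft` and `scipy.signal.istft`. Complex STFT values are rarely bit-identical, so this sketch treats "occurs the most" as "has the most other recordings within a tolerance `tol`". That interpretation, the function name `most_common_stft`, and the default parameter values are assumptions of this example rather than something stated in the answer. Because whole complex values are compared, magnitude and phase are selected together.

```python
import numpy as np
from scipy.signal import stft, istft

def most_common_stft(recordings, fs=44100, nperseg=1024, tol=1e-6):
    """Per time-frequency bin, keep the complex value most recordings agree on.

    recordings: float array of shape (n_recordings, n_samples), sample aligned.
    tol: assumed absolute tolerance for two complex values to count as equal.
    """
    # Z has shape (n_recordings, n_freqs, n_frames); default Hann window, 50% overlap.
    _, _, Z = stft(recordings, fs=fs, nperseg=nperseg)
    counts = np.empty(Z.shape, dtype=np.int32)
    for k in range(Z.shape[0]):
        # Number of recordings whose bin value lies within tol of recording k's.
        counts[k] = (np.abs(Z - Z[k]) <= tol).sum(axis=0)
    winner = counts.argmax(axis=0)  # shape (n_freqs, n_frames)
    picked = np.take_along_axis(Z, winner[None], axis=0)[0]
    _, estimate = istft(picked, fs=fs, nperseg=nperseg)
    return estimate[: recordings.shape[1]]
```

A sensible `tol` depends on how closely the music actually matches between recordings, so it would need tuning against real data.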

$\endgroup$
  • $\begingroup$ Thank you, great point that minimum values may not work, but the value that occurs most commonly will. It also makes me realize that I haven't considered the phase information of the FFT, so I'll need to be selecting the most common phase as well, or phase+value. $\endgroup$ Commented Jun 21, 2024 at 20:35
  • $\begingroup$ Whoops, I hit enter before I finished typing and can't edit it anymore. Finishing my previous comment: Finding speech-free frames isn't an option -- the speech is basically continuous conversation, it's not occasional narration. Thanks for the advice -- I'm assuming you haven't come across anyone else trying to do this before, or any tool designed for this? $\endgroup$ Commented Jun 21, 2024 at 20:42
  • $\begingroup$ @crazygringo: "continuous speech" may be less of a problem than you think. Natural human speech has a fairly low duty cycle, so there are plenty of pauses and gaps in there (if you choose a good frame size). $\endgroup$ Commented Jun 22, 2024 at 0:29
