
For my project I have to detect whether two audio files are similar and whether the first audio file is contained in the second. My problem is that I tried to use librosa and numpy.correlate, but I don't know if I'm doing it the right way. How can I detect whether one audio file is contained in another audio file?

import librosa
import numpy

long_audio_series, long_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\long_file.mp3")
short_audio_series, short_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\short_file.mka")

for long_stream_id, long_stream in enumerate(long_audio_series):
    for short_stream_id, short_stream in enumerate(short_audio_series):
        print(numpy.correlate(long_stream, short_stream))
  • What kind of audio are these events? How long is a typical event?
  • @jonnor The long audio is 30 minutes and the short audio is 1:30 minutes.

2 Answers


Simply comparing the audio signals long_audio_series and short_audio_series probably won't work. What I'd recommend instead is audio fingerprinting: essentially a poor man's version of what Shazam does. There is of course the patent and the paper, but you might want to start with this very readable description. Here's the central image, the constellation map (CM), from that article:

Constellation Map image from https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/

If you don't want to scale to very many songs, you can skip the whole hashing part and concentrate on peak finding.

So what you need to do is:

  1. Create a power spectrogram (easy with librosa.core.stft).
  2. Find local peaks in all your files (can be done with scipy.ndimage.filters.maximum_filter) to create CMs, i.e., 2D images containing only the peaks; a short sketch for steps 1 and 2 follows right after this list. The resulting CM is typically binary, i.e., containing 0 for no peaks and 1 for peaks.
  3. Slide your query CM (based on short_audio_series) over each of your database CMs (based on long_audio_series). For each time step, count how many "stars" (i.e., 1s) align and store the count along with the slide offset (essentially the position of the short audio in the long audio).
  4. Pick the max count and return the corresponding short audio and position in the long audio. You will have to convert frame numbers back to seconds.
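
For steps 1 and 2, a minimal, untested sketch could look like this (it assumes a mono signal y already loaded with librosa; the FFT size, hop length, and peak-picking neighborhood are illustrative values, not tuned recommendations):

import numpy as np
import librosa
import scipy.ndimage

def constellation_map(y, n_fft=2048, hop_length=512, neighborhood=20):
    # step 1: power spectrogram with time frames on axis 0 and frequency bins on axis 1
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)).T ** 2
    # step 2: a bin counts as a peak if it equals the maximum of its local neighborhood
    local_max = scipy.ndimage.maximum_filter(S, size=neighborhood)
    cm = (S == local_max).astype(np.uint8)
    # in practice you would also discard very low-energy "peaks",
    # otherwise silent regions fill the CM with spurious ones
    return cm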

Example for the "slide" (untested sample code):

import numpy as np

scores = {}
cm_short = ...  # 2d constellation map for the short audio
cm_long = ...   # 2d constellation map for the long audio

# we assume that dim 0 is the time frame
# and dim 1 is the frequency bin
# both CMs contain only 0 or 1

frames_short = cm_short.shape[0]
frames_long = cm_long.shape[0]

for offset in range(frames_long - frames_short):
    cm_long_excerpt = cm_long[offset:offset + frames_short]
    score = np.sum(np.multiply(cm_long_excerpt, cm_short))
    scores[offset] = score

# TODO: find the highest score in "scores" and
# convert its offset back to seconds
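
The TODO could be resolved along these lines (a sketch; hop_length and sr stand for whatever hop length and sample rate you used when computing the STFT, so treat them as placeholders):

best_offset = max(scores, key=scores.get)      # frame offset with the highest star count
best_seconds = best_offset * hop_length / sr   # convert the frame offset back to seconds
print(best_offset, best_seconds)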

Now, if your database is large, this will lead to way too many comparisons and you will also have to implement the hashing scheme, which is also described in the article I linked to above.

Note that the described procedure only matches identical recordings, but it allows for noise and slight distortion. If that is not what you want, please define similarity a little better, because it could refer to all kinds of things (drum patterns, chord sequence, instrumentation, ...). A classic, DSP-based way to find similarities for such features is the following: extract the appropriate feature for short frames (e.g. 256 samples) and then compute the similarity. For example, if harmonic content is of interest to you, you could extract chroma vectors and then calculate a distance between chroma vectors, e.g., the cosine distance. When you compute the similarity of each frame in your database signal with every frame in your query signal, you end up with something similar to a self-similarity matrix (SSM) or recurrence matrix (RM). Diagonal lines in the SSM/RM usually indicate similar sections.
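
To illustrate the chroma variant, here is a rough sketch (y_query and y_db are assumed to be signals already loaded at the same sample rate sr; the hop length is arbitrary):

import numpy as np
import librosa

def chroma_similarity_matrix(y_query, y_db, sr, hop_length=512):
    # one 12-dimensional chroma vector per frame, frames along axis 1
    c_query = librosa.feature.chroma_stft(y=y_query, sr=sr, hop_length=hop_length)
    c_db = librosa.feature.chroma_stft(y=y_db, sr=sr, hop_length=hop_length)
    # normalize each frame so the dot product becomes the cosine similarity
    c_query = c_query / (np.linalg.norm(c_query, axis=0, keepdims=True) + 1e-9)
    c_db = c_db / (np.linalg.norm(c_db, axis=0, keepdims=True) + 1e-9)
    # entry [i, j] = cosine similarity of query frame i and database frame j;
    # diagonal stripes of high values indicate matching sections
    return c_query.T @ c_db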


7 Comments

What do you mean by "CM"? I don't have any DB.
CM = constellation map
Usually the problem is formulated as "querying a database of audio documents with a sample". If you only have one long file, then that's your database. Your short file is your query.
How can I slide my CM to match it against the query? Sorry, I am a beginner in audio processing.
Create a CM for your long document and for your short document. Using numpy slicing, create an excerpt from the long document that is as long as your short document. Then simply np.multiply the two images and np.sum the result. That's your count. Now, to slide, choose a different excerpt from the long CM, shifted by one frame, and so on.

I guess you only need to find an offset, but either way, here is how to first measure the similarity and then how to find the offset of the short file within the long file.

Measuring Similarity

First you need to decode them into PCM and ensure they have a specific sample rate, which you can choose beforehand (e.g. 16 kHz). You'll need to resample songs that have a different sample rate. A high sample rate is not required since you need a fuzzy comparison anyway, but too low a sample rate will lose too much detail.

You can use the following commands for that:

ffmpeg -i audio1.mkv -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -c:a pcm_s24le output2.wav
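
Note that these commands keep the source sample rate; to force a common rate such as 16 kHz as mentioned above, you can add ffmpeg's -ar option, for example:

ffmpeg -i audio1.mkv -ar 16000 -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -ar 16000 -c:a pcm_s24le output2.wav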

And below is Python code that computes a similarity score between 0 and 1 for two audio files. It works by generating fingerprints from the audio files and comparing them using cross correlation.

It requires Chromaprint (for the fpcalc tool) and FFmpeg to be installed. It also doesn't work for short audio files; if that is a problem, you can always reduce the speed of the audio as in this guide, but be aware that this adds a little noise.
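
For reference, the fpcalc call that the script issues looks like this on the command line (using one of the WAV files produced above as an example):

fpcalc -raw -length 500 output1.wav

It prints a FINGERPRINT= line of comma-separated integers, which the script parses into a list.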

# correlation.py
import subprocess
import numpy

# seconds to sample audio file for
sample_time = 500
# number of points to scan cross correlation over
span = 150
# step size (in points) of cross correlation
step = 1
# minimum number of points that must overlap in cross correlation
# exception is raised if this cannot be met
min_overlap = 20
# report match when cross correlation has a peak exceeding threshold
threshold = 0.5

# calculate fingerprint
def calculate_fingerprints(filename):
    fpcalc_out = subprocess.getoutput('fpcalc -raw -length %i %s' % (sample_time, filename))
    fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
    # convert fingerprint to list of integers
    fingerprints = list(map(int, fpcalc_out[fingerprint_index:].split(',')))
    return fingerprints

# returns correlation between lists
def correlation(listx, listy):
    if len(listx) == 0 or len(listy) == 0:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('Empty lists cannot be correlated.')
    if len(listx) > len(listy):
        listx = listx[:len(listy)]
    elif len(listx) < len(listy):
        listy = listy[:len(listx)]

    covariance = 0
    for i in range(len(listx)):
        covariance += 32 - bin(listx[i] ^ listy[i]).count("1")
    covariance = covariance / float(len(listx))
    return covariance / 32

# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        offset = -offset
        listy = listy[offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        return
        # raise Exception('Overlap too small: %i' % min(len(listx), len(listy)))
    return correlation(listx, listy)

# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('span >= sample size: %i >= %i\n' % (span, min(len(listx), len(listy)))
                        + 'Reduce span, reduce crop or increase sample_time.')
    corr_xy = []
    for offset in numpy.arange(-span, span + 1, step):
        corr_xy.append(cross_correlation(listx, listy, offset))
    return corr_xy

# return index of maximum value in list
def max_index(listx):
    max_index = 0
    max_value = listx[0]
    for i, value in enumerate(listx):
        if value > max_value:
            max_value = value
            max_index = i
    return max_index

def get_max_corr(corr, source, target):
    max_corr_index = max_index(corr)
    max_corr_offset = -span + max_corr_index * step
    print("max_corr_index = ", max_corr_index, "max_corr_offset = ", max_corr_offset)
    # report matches
    if corr[max_corr_index] > threshold:
        print('%s and %s match with correlation of %.4f at offset %i'
              % (source, target, corr[max_corr_index], max_corr_offset))

def correlate(source, target):
    fingerprint_source = calculate_fingerprints(source)
    fingerprint_target = calculate_fingerprints(target)
    corr = compare(fingerprint_source, fingerprint_target, span, step)
    max_corr_offset = get_max_corr(corr, source, target)

if __name__ == "__main__":
    # SOURCE_FILE and TARGET_FILE are placeholders: set them to the paths of your two audio files
    correlate(SOURCE_FILE, TARGET_FILE)
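
As a hypothetical usage example, after converting the files with ffmpeg as above you could replace the last line with a concrete call (fpcalc must be on the PATH):

if __name__ == "__main__":
    # compare the two WAV files produced by the ffmpeg step
    correlate("output1.wav", "output2.wav")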

Code converted to Python 3 from: https://shivama205.medium.com/audio-signals-comparison-23e431ed2207

Finding offset

As before, you need to decode the files into PCM and ensure they have a specific sample rate.

Again, you can use the following commands for that:

ffmpeg -i audio1.mkv -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -c:a pcm_s24le output2.wav

Then you can use the following code. It normalizes the PCM data (i.e., finds the maximum sample value and rescales all samples so that the sample with the largest amplitude uses the entire dynamic range of the data format), converts it to the spectral domain (FFT), finds the peak via cross correlation, and finally returns the offset in seconds.

Depending on your case, you may want to avoid normalizing the PCM data, in which case you would need to change the code below a little.

import argparse
import librosa
import numpy as np
from scipy import signal

def find_offset(within_file, find_file, window):
    y_within, sr_within = librosa.load(within_file, sr=None)
    y_find, _ = librosa.load(find_file, sr=sr_within)
    c = signal.correlate(y_within, y_find[:sr_within*window], mode='valid', method='fft')
    peak = np.argmax(c)
    offset = round(peak / sr_within, 2)
    return offset

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--find-offset-of', metavar='audio file', type=str, help='Find the offset of file')
    parser.add_argument('--within', metavar='audio file', type=str, help='Within file')
    parser.add_argument('--window', metavar='seconds', type=int, default=10, help='Only use first n seconds of a target audio')
    args = parser.parse_args()
    offset = find_offset(args.within, args.find_offset_of, args.window)
    print(f"Offset: {offset}s")

if __name__ == '__main__':
    main()
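
Assuming the script above is saved as find_offset.py (the name is just an example), it could be run like this, where the file names are again only illustrative:

python find_offset.py --find-offset-of short_file.wav --within long_file.wav --window 10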

Source and further explanation: https://dev.to/hiisi13/find-an-audio-within-another-audio-in-10-lines-of-python-1866

Then, depending on your case, you would need to combine these two pieces of code; maybe you only want to find the offset in cases where the audio is similar, or the other way around.
