You may only need the offset, but either way, here is how to first measure the similarity and then find the offset of the short file within the long file.
Measuring Similarity
First you need to decode both files to PCM at one specific sample rate, which you choose beforehand (e.g. 16 kHz). You'll need to resample any file that has a different sample rate. A high sample rate is not required, since the comparison is fuzzy anyway, but too low a sample rate will lose too much detail.
You can use the following commands for that (add -ar 16000 before the output file name to force a 16 kHz sample rate):
ffmpeg -i audio1.mkv -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -c:a pcm_s24le output2.wav
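If you prefer to drive the decoding step from Python, the command can be built as an argument list. This is only a sketch: the helper names build_decode_command and decode_to_wav are hypothetical, and the 16 kHz rate and the mono downmix (-ac 1) are assumptions that you should adjust to your material.

```python
import subprocess


def build_decode_command(src, dst, sample_rate=16000, channels=1):
    """Build an ffmpeg argument list that decodes `src` to a PCM WAV file.

    The default sample rate and channel count are assumptions, not
    requirements of the technique.
    """
    return [
        "ffmpeg",
        "-i", src,
        "-ar", str(sample_rate),  # resample to the chosen rate
        "-ac", str(channels),     # downmix to this many channels
        "-c:a", "pcm_s24le",      # same codec as the commands above
        dst,
    ]


def decode_to_wav(src, dst, sample_rate=16000, channels=1):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_decode_command(src, dst, sample_rate, channels), check=True)


command = build_decode_command("audio1.mkv", "output1.wav")
```

Passing the arguments as a list rather than one shell string means file names with spaces need no quoting.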
Below is Python code that reports a similarity score between 0 and 1 (multiply by 100 for a percentage) for two audio files. It works by generating a fingerprint for each file and comparing the fingerprints using cross-correlation.
It requires Chromaprint (which provides the fpcalc tool) and FFmpeg to be installed. It doesn't work for short audio files; if this is a problem, you can slow the audio down, like in this guide, but be aware that this adds a little noise.
# correlation.py
import subprocess
import sys

# seconds to sample audio file for
sample_time = 500
# number of points to scan cross correlation over
span = 150
# step size (in points) of cross correlation
step = 1
# minimum number of points that must overlap in cross correlation
# offsets with less overlap than this are scored as 0
min_overlap = 20
# report match when cross correlation has a peak exceeding threshold
threshold = 0.5


# calculate fingerprint
def calculate_fingerprints(filename):
    # pass the arguments as a list so file names with spaces work
    fpcalc_out = subprocess.check_output(
        ['fpcalc', '-raw', '-length', str(sample_time), filename], text=True)
    fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
    # convert fingerprint to list of integers
    fingerprints = list(map(int, fpcalc_out[fingerprint_index:].split(',')))
    return fingerprints


# returns correlation between lists
def correlation(listx, listy):
    if len(listx) == 0 or len(listy) == 0:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('Empty lists cannot be correlated.')
    if len(listx) > len(listy):
        listx = listx[:len(listy)]
    elif len(listx) < len(listy):
        listy = listy[:len(listx)]

    covariance = 0
    for i in range(len(listx)):
        # count the bits (out of 32) on which the two values agree
        covariance += 32 - bin(listx[i] ^ listy[i]).count("1")
    covariance = covariance / float(len(listx))

    return covariance / 32


# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        offset = -offset
        listy = listy[offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        # Overlap too small to be meaningful: score it as 0 so this offset
        # is never picked as the best match (returning None here would
        # break the comparison in max_index under Python 3).
        return 0.0
    return correlation(listx, listy)


# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('span >= sample size: %i >= %i\n'
                        % (span, min(len(listx), len(listy)))
                        + 'Reduce span, reduce crop or increase sample_time.')
    corr_xy = []
    for offset in range(-span, span + 1, step):
        corr_xy.append(cross_correlation(listx, listy, offset))
    return corr_xy


# return index of maximum value in list
def max_index(listx):
    max_index = 0
    max_value = listx[0]
    for i, value in enumerate(listx):
        if value > max_value:
            max_value = value
            max_index = i
    return max_index


def get_max_corr(corr, source, target):
    max_corr_index = max_index(corr)
    max_corr_offset = -span + max_corr_index * step
    print("max_corr_index = ", max_corr_index, "max_corr_offset = ", max_corr_offset)
    # report matches
    if corr[max_corr_index] > threshold:
        print('%s and %s match with correlation of %.4f at offset %i'
              % (source, target, corr[max_corr_index], max_corr_offset))
    # the offset is counted in fingerprint points, not in seconds
    return max_corr_offset


def correlate(source, target):
    fingerprint_source = calculate_fingerprints(source)
    fingerprint_target = calculate_fingerprints(target)

    corr = compare(fingerprint_source, fingerprint_target, span, step)
    return get_max_corr(corr, source, target)


if __name__ == "__main__":
    # usage: python correlation.py SOURCE_FILE TARGET_FILE
    SOURCE_FILE, TARGET_FILE = sys.argv[1], sys.argv[2]
    correlate(SOURCE_FILE, TARGET_FILE)
Code converted into python 3 from: https://shivama205.medium.com/audio-signals-comparison-23e431ed2207
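To make the scoring step concrete, here is a minimal sketch of how two raw fingerprint values are compared: XOR them, count the differing bits, and report the fraction of the 32 bits that agree. The function name bit_similarity is hypothetical, and the 32-bit mask is a defensive assumption in case a tool reports fingerprint values as signed integers.

```python
def bit_similarity(a, b):
    """Fraction of the 32 bits on which two fingerprint values agree."""
    # mask to 32 bits so negative (signed) inputs are handled correctly
    differing = bin((a ^ b) & 0xFFFFFFFF).count("1")
    return (32 - differing) / 32


identical = bit_similarity(0b1011, 0b1011)         # no bits differ
one_bit_off = bit_similarity(0b1011, 0b1010)       # exactly one bit differs
opposite = bit_similarity(0x00000000, 0xFFFFFFFF)  # all 32 bits differ

print(identical, one_bit_off, opposite)  # 1.0 0.96875 0.0
```

Two unrelated random bit patterns agree on about half of their bits on average, so a score close to 0.5 says little; it is probably worth experimenting with a threshold above 0.5 on your own files.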
Finding offset
As before, you need to decode both files to PCM at one specific sample rate.
Again you can use the following code for that:
ffmpeg -i audio1.mkv -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -c:a pcm_s24le output2.wav
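Then you can use the following code. It loads both files as floating-point PCM (librosa scales the samples to the range -1 to 1 and mixes down to mono by default), resamples the short file to the sample rate of the long file, and cross-correlates the two signals using an FFT-based method. The position of the correlation peak, divided by the sample rate, is the offset in seconds.
Note that the code does not peak-normalize the audio (i.e. find the maximum sample value and rescale all samples so that the sample with the largest amplitude uses the entire dynamic range). Depending on your case you may want that step, in which case you would need to change the code below a little.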
import argparse

import librosa
import numpy as np
from scipy import signal


def find_offset(within_file, find_file, window):
    # load the long file at its native sample rate
    y_within, sr_within = librosa.load(within_file, sr=None)
    # load the short file, resampled to the same sample rate
    y_find, _ = librosa.load(find_file, sr=sr_within)

    # cross-correlate the long file with the first `window` seconds of the short file
    c = signal.correlate(y_within, y_find[:sr_within*window], mode='valid', method='fft')
    # the index of the peak is the best alignment, in samples
    peak = np.argmax(c)
    offset = round(peak / sr_within, 2)

    return offset


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--find-offset-of', metavar='audio file', type=str, required=True, help='Find the offset of file')
    parser.add_argument('--within', metavar='audio file', type=str, required=True, help='Within file')
    parser.add_argument('--window', metavar='seconds', type=int, default=10, help='Only use first n seconds of a target audio')
    args = parser.parse_args()
    offset = find_offset(args.within, args.find_offset_of, args.window)
    print(f"Offset: {offset}s")


if __name__ == '__main__':
    main()
Source and further explanation: https://dev.to/hiisi13/find-an-audio-within-another-audio-in-10-lines-of-python-1866
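The two ideas in this section, peak normalization and locating the cross-correlation peak, can be tried on a toy signal without any audio libraries. This is only an illustrative sketch: the names peak_normalize and find_offset_samples are hypothetical, the sample rate of 4 is a toy value, and the brute-force loop stands in for the FFT-based correlation, which gives the same peak position much faster on real audio.

```python
def peak_normalize(samples):
    """Rescale so that the loudest sample has absolute value 1.0."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # silence: nothing to rescale
    return [s / loudest for s in samples]


def find_offset_samples(within, find):
    """Return the lag (in samples) at which `find` best matches `within`.

    Slides `find` over `within` and keeps the lag with the largest dot
    product, i.e. the peak of a 'valid'-mode cross-correlation.
    """
    if len(find) > len(within):
        raise ValueError("`find` must not be longer than `within`")
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(within) - len(find) + 1):
        score = sum(w * f for w, f in zip(within[lag:], find))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


sample_rate = 4  # toy value: samples per second
long_signal = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, -0.5, 0.4, 0.1, 0.0, 0.0]
short_signal = [0.1, -0.25, 0.2, 0.05]  # same shape at half the volume

lag = find_offset_samples(peak_normalize(long_signal), peak_normalize(short_signal))
offset_seconds = lag / sample_rate
print(lag, offset_seconds)  # 6 1.5
```

For the NumPy arrays returned by librosa, the equivalent of peak_normalize is y / np.max(np.abs(y)), guarded against an all-zero signal.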
Finally, depending on your case, you will need to combine these two pieces of code: for example, you may want to find the offset only when the audio is similar, or the other way around.
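One possible way to combine them is a small wrapper that computes the similarity first and only looks for the offset when the score clears a threshold. This is a sketch under assumptions: offset_if_similar, similarity_fn and offset_fn are hypothetical names, and you would need to adapt the two scripts so that one returns its best correlation and the other is callable with (long_file, short_file).

```python
def offset_if_similar(long_file, short_file, similarity_fn, offset_fn, threshold=0.5):
    """Return (score, offset); offset is None when the files are not similar enough.

    similarity_fn(long_file, short_file) -> score between 0 and 1
    offset_fn(long_file, short_file)     -> offset in seconds
    Both callables are supplied by the caller (assumed interfaces).
    """
    score = similarity_fn(long_file, short_file)
    if score is None or score <= threshold:
        return score, None  # skip the more expensive offset search
    return score, offset_fn(long_file, short_file)


# Stand-in callables so the sketch runs without any audio files.
def fake_similarity(long_file, short_file):
    return 0.82


def fake_offset(long_file, short_file):
    return 12.5


match = offset_if_similar("long.wav", "short.wav", fake_similarity, fake_offset)
no_match = offset_if_similar("long.wav", "short.wav", fake_similarity, fake_offset, threshold=0.9)
print(match, no_match)  # (0.82, 12.5) (0.82, None)
```

Swapping the order (find the offset first, then score only the aligned region) follows the same pattern with the two calls reversed.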