Cryptographic hashes like SHA-256 cannot be used to measure the distance between two audio files. They are deliberately designed to be unpredictable and, ideally, to reveal no information about the input that was hashed.
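To make that concrete, here is a minimal sketch (the byte strings are made-up stand-ins for audio data, and the helper name is my own) that hashes two inputs differing in a single byte and counts how many of the 256 output bits differ. For a well-designed cryptographic hash you should expect roughly half of the bits to flip, so the count tells you nothing about how similar the inputs were.

```python
import hashlib


def sha256_bit_difference(a: bytes, b: bytes) -> int:
    """Count the differing bits between the SHA-256 digests of a and b."""
    digest_a = int.from_bytes(hashlib.sha256(a).digest(), "big")
    digest_b = int.from_bytes(hashlib.sha256(b).digest(), "big")
    # XOR leaves a 1 wherever the two digests disagree.
    return bin(digest_a ^ digest_b).count("1")


# Hypothetical inputs: 1024 bytes that differ only in the final byte.
original = b"\x00" * 1024
tweaked = b"\x00" * 1023 + b"\x01"

print(sha256_bit_difference(original, original))  # identical input, identical digest
print(sha256_bit_difference(original, tweaked))   # near-identical input, unrelated digest
```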
However, there are many suitable acoustic fingerprinting algorithms that accept a segment of audio and return a fingerprint vector. Then, you can measure the similarity of two audio clips by seeing how close together their corresponding fingerprint vectors are.
Picking an acoustic fingerprinting algorithm
Chromaprint is a popular open source acoustic fingerprinting algorithm with bindings and reimplementations in many popular languages. Chromaprint is used by the AcoustID project, which is building an open source database to collect fingerprints and metadata for popular music.
The researcher Joren Six has also written and open-sourced the acoustic fingerprinting libraries Panako and Olaf. However, they are currently both licensed as AGPLv3 and might possibly infringe upon still-active US patents.
Several companies, such as Pex, sell APIs for checking whether arbitrary audio files contain copyrighted material. If you sign up for Pex, they will give you a closed-source SDK that generates acoustic fingerprints using their algorithm.
Generating and comparing fingerprints
I will assume that you chose Chromaprint and that you want to compare fingerprints using Python, although the general principle applies to other fingerprinting libraries.
- Install libchromaprint (along with an FFT library for it to use) or the fpcalc command line tool.
- Install the pyacoustid Python library from PyPI. It will look for your existing installation of libchromaprint or fpcalc.
- Normalize your audio files to remove differences that could confuse Chromaprint, such as silence at the beginning of an audio file. Also keep in mind that Chromaprin
- While I typically measure the distance between vectors using NumPy, many Chromaprint users compare two audio files by computing the xor function between the fingerprints and counting the number of 1 bits.
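If you installed only the fpcalc command line tool, one option is to call it directly and parse its output. The sketch below assumes that `fpcalc -raw` prints a `FINGERPRINT=` line containing comma-separated integers, which is what the versions I have seen do; check the output of your own fpcalc build before relying on it. The function names are my own.

```python
import subprocess
from typing import List


def parse_fpcalc_raw(output: str) -> List[int]:
    """Extract the integer fingerprint from the text printed by `fpcalc -raw`.

    Assumes the output contains a line of the form FINGERPRINT=1,2,3.
    """
    prefix = "FINGERPRINT="
    for line in output.splitlines():
        if line.startswith(prefix):
            return [int(x) for x in line[len(prefix):].split(",") if x]
    raise ValueError("no FINGERPRINT= line found in fpcalc output")


def fingerprint_with_fpcalc(filename: str) -> List[int]:
    """Run fpcalc on an audio file and return its raw fingerprint."""
    result = subprocess.run(
        ["fpcalc", "-raw", filename],
        capture_output=True,
        text=True,
        check=True,  # raise if fpcalc exits with an error
    )
    return parse_fpcalc_raw(result.stdout)
```

The parser is kept separate from the subprocess call so that you can test it without having fpcalc or any audio files on hand.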
Here is some quick-and-dirty Python code for computing the distance between two fingerprints. If I were building a production service, though, I'd implement the comparison in C++ or Rust.
    from operator import xor
    from typing import List

    # These imports should be in your Python module path
    # after installing the `pyacoustid` package from PyPI.
    import acoustid
    import chromaprint


    def get_fingerprint(filename: str) -> List[int]:
        """
        Reads an audio file from the filesystem and returns a fingerprint.

        Args:
            filename: The filename of an audio file on the local
                filesystem to read.

        Returns:
            Returns a list of 32-bit integers. Two fingerprints can be
            roughly compared by counting the number of corresponding
            bits that are different from each other.
        """
        # fingerprint_file returns (duration, encoded fingerprint).
        _, encoded = acoustid.fingerprint_file(filename)
        # decode_fingerprint returns (list of integers, algorithm version).
        fingerprint, _ = chromaprint.decode_fingerprint(encoded)
        return fingerprint


    def fingerprint_distance(
        f1: List[int],
        f2: List[int],
        fingerprint_len: int,
    ) -> float:
        """
        Returns a normalized distance between two fingerprints.

        Args:
            f1: The first fingerprint.
            f2: The second fingerprint.
            fingerprint_len: Only compare the first `fingerprint_len`
                integers in each fingerprint. This is useful when
                comparing audio samples of different lengths.

        Returns:
            Returns a number between 0.0 and 1.0 representing the
            distance between two fingerprints, i.e. the fraction of
            compared bits that differ.
        """
        max_hamming_weight = 32 * fingerprint_len
        # XOR each pair of integers and count the 1 bits in the result.
        hamming_weight = sum(
            sum(c == "1" for c in bin(xor(f1[i], f2[i])))
            for i in range(fingerprint_len)
        )
        return hamming_weight / max_hamming_weight
The above functions would let you compare two fingerprints as follows:
    >>> f1 = get_fingerprint("1.mp3")
    >>> f2 = get_fingerprint("2.mp3")
    >>> f_len = min(len(f1), len(f2))
    >>> fingerprint_distance(f1, f2, f_len)
    0.35  # for example
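Comparing fingerprints position by position assumes that both clips start at the same moment. If one file has extra audio or silence at the beginning, a rough workaround is to slide one fingerprint against the other and keep the best match. The following is only a sketch of that idea, not something Chromaprint or pyacoustid provides: the function names and the `max_offset` and `min_overlap` parameters are my own, and the values you need will depend on your audio.

```python
from typing import List, Tuple


def bit_distance(f1: List[int], f2: List[int]) -> float:
    """Fraction of differing bits over the overlapping prefix of f1 and f2."""
    n = min(len(f1), len(f2))
    if n == 0:
        raise ValueError("fingerprints must overlap by at least one integer")
    # Mask to 32 bits so that negative (signed) values are counted correctly.
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(f1, f2)
    )
    return differing / (32 * n)


def best_aligned_distance(
    f1: List[int],
    f2: List[int],
    max_offset: int = 10,
    min_overlap: int = 1,
) -> Tuple[float, int]:
    """Try shifting the fingerprints against each other.

    A positive offset skips that many integers at the start of f1;
    a negative offset skips them at the start of f2.

    Returns (smallest distance found, offset that produced it).
    """
    best = None
    for offset in range(-max_offset, max_offset + 1):
        a = f1[offset:] if offset >= 0 else f1
        b = f2 if offset >= 0 else f2[-offset:]
        # Very short overlaps can match by chance, so skip them.
        if min(len(a), len(b)) < min_overlap:
            continue
        distance = bit_distance(a, b)
        if best is None or distance < best[0]:
            best = (distance, offset)
    if best is None:
        raise ValueError("no offset left enough overlap to compare")
    return best
```

In practice you would want to set `min_overlap` well above 1, since a match over only a handful of integers is not meaningful.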
You can read more about how to use Chromaprint to compute the distance between different audio files. This mailing list thread describes the theory of how to compare Chromaprint fingerprints. This GitHub Gist offers another implementation.