Actually, this question is deeper than it seems at first glance, as it involves both the analog and the digital signal processing points of view.
In nature, because of the superposition principle, we are able to hear multiple sounds at the same time. Let's assume that we have two sound sources and that their propagating pressure waves arrive at a transducer, that is, a microphone. If the electrical signal due to the first source is $a(t)$ and the signal due to the second source is $b(t)$, then at the output of the microphone:
$$c(t) = a(t) + b(t) \rightarrow C(f) = A(f) + B(f)$$
$c(t)$ is the combination of the two electrical signals as a result of superposition. Since the same linearity applies to the Fourier transform, the combined spectrum $C(f)$ is likewise made up of the spectral content of each waveform. In my opinion, the Fourier transform is perhaps the most important tool to examine here, because it exposes amplitudes as well as the hidden information (frequency and phase) that is responsible for the subtle audible effects we experience when we listen to something. That was the philosophical part of the basic mechanism we experience every day.
So, let's go back to the point of the question. Frankly, when we deal with WAVE files, we examine discrete audio signals. If we use the same signals:
$$c[n] = a[n] + b[n] \rightarrow C(e^{j\omega}) = A(e^{j\omega}) + B(e^{j\omega})$$
Superposition is still in effect when discrete signals and their discrete-time Fourier transforms are concerned. Therefore, if we want to mix two or more WAVE files, it is better (again, from my perspective) to make use of that property: summing the corresponding amplitude values inside the WAVE files gives results that are closer to what happens naturally. So I decided to approach the problem from that perspective.
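As a small illustration of that idea (a minimal sketch only; the two signals here are synthetic stand-ins, not the WAVE files discussed below), mixing really is nothing more than sample-wise addition:

    import numpy as np

    fs = 44100                                # sampling frequency in Hz
    t = np.arange(fs) / fs                    # one second of time samples
    a = 0.5 * np.sin(2 * np.pi * 120 * t)     # first source a[n]: a 120 Hz tone
    b = 0.1 * np.random.randn(fs)             # second source b[n]: some noise
    c = a + b                                 # superposition: c[n] = a[n] + b[n]
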
I created two mono WAVE files using Audacity: the first one is a metronome rhythm and the second is a 120 Hz tone. Each file uses the signed 16-bit representation and a 44.1 kHz sampling frequency. I used Python and some required packages to generate a combination of these files. The relevant WAVE files are given at my GitHub link below:
The GitHub link for the WAVE files; the following scripts can be found there, too.
And the script for this methodology is as follows:

    import wave
    import pyaudio
    import numpy as np


    def ltbe_int16(data: np.uint8) -> np.int16:
        # Little- to big-endian conversion. Python cannot be told directly to decide whether
        # a binary number represents a negative value by looking at its sign bit, so this
        # mask is used to test the MSB of the unsigned 8-bit number.
        data = int(data)  # work with a plain Python integer so the shift below cannot overflow an 8-bit type
        big_endian_sign_test_mask = 0x80
        if data & big_endian_sign_test_mask == 0:
            # MSB is 0: the corresponding ADC output value (in decimal form) is positive.
            result = (data << 8) & 0xFF00
            return result
        else:
            # MSB is 1: the ADC output is a negative decimal number, so the two's complement
            # is taken explicitly to obtain the value in the signed 16-bit representation.
            intermediate = (data << 8) & 0xFF00
            result = -1 * ((intermediate ^ 0xFFFF) + 1)
            return result


    def btle_int16(data: np.int16) -> np.uint8:
        # Big- to little-endian conversion: keep only the high byte of the signed 16-bit value.
        # Masking with 0xFF keeps the result in the unsigned 8-bit range the LE array expects.
        result = (int(data) >> 8) & 0xFF
        return result


    with wave.open("Rhythm.wav", "rb") as wf_1:
        # In the PyAudio documentation, "frames" refer to the audio samples, so a frame is
        # actually a sample. I didn't say "samples" in order to avoid confusion.
        FRAMES_1 = wf_1.getnframes()
        CHANNELS_1 = wf_1.getnchannels()    # number of channels; our WAVE files are mono
        FRAME_RATE_1 = wf_1.getframerate()  # sampling frequency; 44.1 kHz is used for all files
        raw_audio_information_1 = wf_1.readframes(FRAMES_1)

    with wave.open("Tone.wav", "rb") as wf_2:
        FRAMES_2 = wf_2.getnframes()
        CHANNELS_2 = wf_2.getnchannels()
        FRAME_RATE_2 = wf_2.getframerate()
        raw_audio_information_2 = wf_2.readframes(FRAMES_2)

    # Raw little-endian bytes of each file; both recordings are assumed to be equally long.
    raw_audio_information_1_array_LE = np.frombuffer(raw_audio_information_1, np.uint8)
    raw_audio_information_2_array_LE = np.frombuffer(raw_audio_information_2, np.uint8)

    raw_audio_information_1_array_BE = np.zeros(len(raw_audio_information_1_array_LE), np.int16)
    raw_audio_information_2_array_BE = np.zeros(len(raw_audio_information_2_array_LE), np.int16)

    for i in range(len(raw_audio_information_1_array_BE)):
        raw_audio_information_1_array_BE[i] = raw_audio_information_1_array_BE[i] + ltbe_int16(raw_audio_information_1_array_LE[i])
        raw_audio_information_2_array_BE[i] = raw_audio_information_2_array_BE[i] + ltbe_int16(raw_audio_information_2_array_LE[i])

    combined_raw_audio_information_array_BE = np.zeros(len(raw_audio_information_1_array_BE), np.int16)
    combined_raw_audio_information_array_LE = np.zeros(len(combined_raw_audio_information_array_BE), np.uint8)

    j = 0
    while j < len(combined_raw_audio_information_array_BE):
        # Scale each source by 0.5 before summing so the mix stays inside the signed 16-bit range.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + 0.5 * raw_audio_information_1_array_BE[j] + 0.5 * raw_audio_information_2_array_BE[j]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1

    combined_raw_audio_information_LE_list = combined_raw_audio_information_array_LE.tolist()
    combined_raw_audio_information_LE_byte_array = bytearray(combined_raw_audio_information_LE_list)
    combined_raw_audio_information_LE_bytes = bytes(combined_raw_audio_information_LE_byte_array)

    p = pyaudio.PyAudio()  # only needed to translate the sample format into a byte width
    with wave.open("Combined (Rhythm and Tone).wav", "wb") as wf_3:
        SAMPLE_VALUE_FORMAT = pyaudio.paInt16
        CHANNELS_3 = 1
        FRAME_RATE_3 = 44100
        wf_3.setsampwidth(p.get_sample_size(SAMPLE_VALUE_FORMAT))
        wf_3.setnchannels(CHANNELS_3)
        wf_3.setframerate(FRAME_RATE_3)
        wf_3.writeframes(combined_raw_audio_information_LE_bytes)
    p.terminate()

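As an aside, the same mixing idea can be sketched much more compactly if np.frombuffer() is allowed to interpret whole 16-bit samples at once; this sketch assumes the sample data can be read directly as little-endian signed 16-bit integers (the usual layout for 16-bit PCM WAVE files), and the output file name is just a placeholder:

    import wave
    import numpy as np

    with wave.open("Rhythm.wav", "rb") as wf_1:
        a = np.frombuffer(wf_1.readframes(wf_1.getnframes()), dtype="<i2")
    with wave.open("Tone.wav", "rb") as wf_2:
        b = np.frombuffer(wf_2.readframes(wf_2.getnframes()), dtype="<i2")

    n = min(len(a), len(b))                          # guard against unequal lengths
    mix = (0.5 * a[:n] + 0.5 * b[:n]).astype("<i2")  # scaled sum, back to little-endian int16

    with wave.open("Combined (compact sketch).wav", "wb") as wf_3:
        wf_3.setnchannels(1)       # mono
        wf_3.setsampwidth(2)       # 2 bytes per sample (16 bits)
        wf_3.setframerate(44100)   # 44.1 kHz
        wf_3.writeframes(mix.tobytes())
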
I tried to explain the lines of code that may need attention in the comments, but I'd also like to discuss a few aspects that are important for understanding the algorithm.
The algorithm interprets the frames (samples) read from each WAVE file by using the np.frombuffer() function. This function should be used carefully; otherwise, unwanted results are likely. So, let me explain the issue here. Consider a bytes object of the following form:
b'\xab\x72\x36\x23'
Since the PyAudio package is used, Python's own features come into play. While writing a WAVE file from scratch with that package, we need to inject the little-endian version of the original ADC output values, which are big-endian by default (at least on my computer), because Python's bytearray() and bytes() functions are used to write the bytes that form the WAVE file. Ultimately, those functions require that each element (i.e., each decimal number) be in the range [0, 256), which means the bytes are stored in the unsigned 8-bit format. (This is why one observes decimal values from 0 to 255 when trying to plot raw audio data without doing the little- to big-endian conversion.) Let's head back to our example. If the data type passed to the function is np.uint8, each hex value is distinguished easily, because the function picks up one np.uint8 number at a time, and the output becomes:
[171 114 54 35]
However, if the data type is, say, np.uint16, the function starts to pick up two 8-bit numbers at a time. As a consequence, the output may become the following, depending on how np.frombuffer() is called and on the computer's byte order (it is good practice to check the function's documentation):
[29355 9014]
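If you prefer not to rely on the machine's default byte order, NumPy also accepts dtype strings with an explicit endianness; a small sketch with the same example bytes (the printed values are what I would expect from those dtypes):

    import numpy as np

    raw = b'\xab\x72\x36\x23'

    print(np.frombuffer(raw, dtype=np.uint8))  # [171 114  54  35]
    print(np.frombuffer(raw, dtype='<u2'))     # little-endian 16-bit: [29355  9014]
    print(np.frombuffer(raw, dtype='>u2'))     # big-endian 16-bit:    [43890 13859]
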
Apart from that detail, in the code I've scaled each WAVE file's original ADC output values by 0.5 so that, if two amplitude values ever happen to take an extremum value (−32,768 or +32,512 in this representation), the sum becomes at most one of those extremum values. This, in turn, avoids clipped and distorted analog audio signals at the output of the DAC.
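A quick numeric check of that worst case (just a sketch; the np.clip() call is an extra guard that the script above does not use):

    import numpy as np

    a = -32768                   # worst-case sample from the first file
    b = -32768                   # worst-case sample from the second file

    unscaled = a + b             # -65536: outside the signed 16-bit range, would clip or wrap
    scaled = 0.5 * a + 0.5 * b   # -32768.0: still representable as int16

    safe = np.int16(np.clip(scaled, -32768, 32767))  # optional guard for arbitrary inputs
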
The graphs of Rhythm.wav and Tone.wav are given below, in that order:


And the graph of the mixed file, Combined.wav, looks like this. You can see that, because of destructive interference, there are some attenuations with respect to each individual graph. Additionally, the sine tone appears as one huge, wide band because I wanted to capture the whole temporal extent.

In spite of that satisfactory result, I attempted to create different combinations, like mixing whistles and speech, but the result wasn't as good, since I didn't use advanced DSP techniques to get something professional-sounding. So I decided to add another method for playing individual sound files at the same time: the stereo arrangement. The following code accomplishes that task.

    import wave
    import pyaudio
    import numpy as np


    def ltbe_int16(data: np.uint8) -> np.int16:
        # Little- to big-endian conversion (see the comments in the first script).
        data = int(data)  # work with a plain Python integer so the shift below cannot overflow an 8-bit type
        big_endian_sign_test_mask = 0x80
        if data & big_endian_sign_test_mask == 0:
            result = (data << 8) & 0xFF00
            return result
        else:
            intermediate = (data << 8) & 0xFF00
            result = -1 * ((intermediate ^ 0xFFFF) + 1)
            return result


    def btle_int16(data: np.int16) -> np.uint8:
        # Big- to little-endian conversion: keep only the high byte of the signed 16-bit value.
        result = (int(data) >> 8) & 0xFF
        return result


    with wave.open("Rhythm.wav", "rb") as wf_1:
        FRAMES_1 = wf_1.getnframes()
        CHANNELS_1 = wf_1.getnchannels()
        FRAME_RATE_1 = wf_1.getframerate()
        raw_audio_information_1 = wf_1.readframes(FRAMES_1)

    with wave.open("Tone.wav", "rb") as wf_2:
        FRAMES_2 = wf_2.getnframes()
        CHANNELS_2 = wf_2.getnchannels()
        FRAME_RATE_2 = wf_2.getframerate()
        raw_audio_information_2 = wf_2.readframes(FRAMES_2)

    raw_audio_information_1_array_LE = np.frombuffer(raw_audio_information_1, np.uint8)
    raw_audio_information_2_array_LE = np.frombuffer(raw_audio_information_2, np.uint8)

    raw_audio_information_1_array_BE = np.zeros(len(raw_audio_information_1_array_LE), np.int16)
    raw_audio_information_2_array_BE = np.zeros(len(raw_audio_information_2_array_LE), np.int16)

    for i in range(len(raw_audio_information_1_array_BE)):
        raw_audio_information_1_array_BE[i] = raw_audio_information_1_array_BE[i] + ltbe_int16(raw_audio_information_1_array_LE[i])
        raw_audio_information_2_array_BE[i] = raw_audio_information_2_array_BE[i] + ltbe_int16(raw_audio_information_2_array_LE[i])

    # The stereo stream holds twice as many bytes: in every frame, the two bytes of the
    # left channel (Rhythm) are followed by the two bytes of the right channel (Tone).
    combined_raw_audio_information_array_BE = np.zeros(2 * len(raw_audio_information_1_array_BE), np.int16)
    combined_raw_audio_information_array_LE = np.zeros(len(combined_raw_audio_information_array_BE), np.uint8)

    j = 0
    while j < len(combined_raw_audio_information_array_BE):
        # Left channel, first byte of the current Rhythm sample.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + raw_audio_information_1_array_BE[int(j / 2)]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1
        # Left channel, second byte of the current Rhythm sample.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + raw_audio_information_1_array_BE[int((j + 1) / 2)]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1
        # Right channel, first byte of the current Tone sample.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + raw_audio_information_2_array_BE[int((j - 2) / 2)]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1
        # Right channel, second byte of the current Tone sample.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + raw_audio_information_2_array_BE[int((j - 1) / 2)]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1

    combined_raw_audio_information_LE_list = combined_raw_audio_information_array_LE.tolist()
    combined_raw_audio_information_LE_byte_array = bytearray(combined_raw_audio_information_LE_list)
    combined_raw_audio_information_LE_bytes = bytes(combined_raw_audio_information_LE_byte_array)

    p = pyaudio.PyAudio()  # only needed to translate the sample format into a byte width
    with wave.open("Combined (Rhythm and Tone in the Stereo Arrangement).wav", "wb") as wf_3:
        SAMPLE_VALUE_FORMAT = pyaudio.paInt16
        CHANNELS_3 = 2
        FRAME_RATE_3 = 44100
        wf_3.setsampwidth(p.get_sample_size(SAMPLE_VALUE_FORMAT))
        wf_3.setnchannels(CHANNELS_3)
        wf_3.setframerate(FRAME_RATE_3)
        wf_3.writeframes(combined_raw_audio_information_LE_bytes)
    p.terminate()

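The byte-level bookkeeping in the loop above boils down to the standard interleaving of stereo WAVE frames: one left-channel sample followed by one right-channel sample, repeated. If the two channels were already available as int16 arrays, the same interleaving could be sketched as follows (the array names are placeholders; both channels are assumed to be equally long):

    import numpy as np

    def interleave_stereo(left: np.ndarray, right: np.ndarray) -> bytes:
        # One row per frame: column 0 is the left channel, column 1 is the right channel.
        frames = np.empty((len(left), 2), dtype="<i2")
        frames[:, 0] = left
        frames[:, 1] = right
        return frames.tobytes()  # row-major flattening gives L0 R0 L1 R1 ... as little-endian bytes
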
The resulting file gives the impression that the individual tracks are played independently and from different locations, which imitates the real-life scenario I discussed in the first paragraphs of this post. The graph of this file is given below.

Obviously, there are many more things I could say about my source code and algorithm that would clarify the process even further.