Actually, this question is deeper than it seems at first glance, as it involves both the analog and the digital signal processing points of view.
In nature, because of the superposition principle, we are able to hear multiple sounds at the same time. Let's assume that we have two sound sources and that their propagating pressure waves arrive at a transducer, that is, a microphone. If the electrical signal due to the first source is $a(t)$ and the signal due to the second source is $b(t)$, then at the output of the microphone:
$$c(t) = a(t) + b(t) \rightarrow C(f) = A(f) + B(f)$$
$c(t)$ is the combination of the two electrical signals as a result of superposition. Since the same linearity applies to the Fourier transform, the combined spectrum $C(f)$ is likewise made up of the spectral content of each waveform. In my opinion, the Fourier transform is perhaps the most important tool to examine here, because it exposes amplitudes as well as the hidden information (frequency and phase) that is responsible for the subtle audible effects we experience when we listen to something. That was the philosophical part of the basic mechanism we experience every day.
So, let's go back to the point of the question. Frankly, when we deal with WAVE files, we examine discrete audio signals. If we use the same signals:
$$c[n] = a[n] + b[n] \rightarrow C(e^{j\omega}) = A(e^{j\omega}) + B(e^{j\omega})$$
Superposition is still in effect when discrete signals and their discrete-time Fourier transforms are concerned. Therefore, if we want to mix two or more WAVE files, it is better (again, from my perspective) to make use of that property: summing the corresponding amplitude values inside the WAVE files gives results that are closer to what happens naturally. So I decided to approach the problem from that perspective.
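As a small illustration of that idea (a minimal sketch only; the two signals here are synthetic stand-ins, not the WAVE files discussed below), mixing really is nothing more than sample-wise addition:

    import numpy as np

    fs = 44100                                # sampling frequency in Hz
    t = np.arange(fs) / fs                    # one second of time samples
    a = 0.5 * np.sin(2 * np.pi * 120 * t)     # first source a[n]: a 120 Hz tone
    b = 0.1 * np.random.randn(fs)             # second source b[n]: some noise
    c = a + b                                 # superposition: c[n] = a[n] + b[n]
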
I created two mono WAVE files using Audacity: the first one is a metronome rhythm and the second is a 120 Hz tone. Each file uses the signed 16-bit representation and a 44.1 kHz sampling frequency. I used Python and some required packages to generate a combination of these files. The relevant WAVE files are given at my GitHub link below:
The GitHub link for the WAVE files; the following scripts can be found there, too.
And the script for this methodology is as follows:

    import wave
    import pyaudio
    import numpy as np


    def ltbe_int16(data: np.uint8) -> np.int16:
        # Little- to big-endian conversion. Python cannot be told directly to decide whether
        # a binary number represents a negative value by looking at its sign bit, so this
        # mask is used to test the MSB of the unsigned 8-bit number.
        data = int(data)  # work with a plain Python integer so the shift below cannot overflow an 8-bit type
        big_endian_sign_test_mask = 0x80
        if data & big_endian_sign_test_mask == 0:
            # MSB is 0: the corresponding ADC output value (in decimal form) is positive.
            result = (data << 8) & 0xFF00
            return result
        else:
            # MSB is 1: the ADC output is a negative decimal number, so the two's complement
            # is taken explicitly to obtain the value in the signed 16-bit representation.
            intermediate = (data << 8) & 0xFF00
            result = -1 * ((intermediate ^ 0xFFFF) + 1)
            return result


    def btle_int16(data: np.int16) -> np.uint8:
        # Big- to little-endian conversion: keep only the high byte of the signed 16-bit value.
        # Masking with 0xFF keeps the result in the unsigned 8-bit range the LE array expects.
        result = (int(data) >> 8) & 0xFF
        return result


    with wave.open("Rhythm.wav", "rb") as wf_1:
        # In the PyAudio documentation, "frames" refer to the audio samples, so a frame is
        # actually a sample. I didn't say "samples" in order to avoid confusion.
        FRAMES_1 = wf_1.getnframes()
        CHANNELS_1 = wf_1.getnchannels()    # number of channels; our WAVE files are mono
        FRAME_RATE_1 = wf_1.getframerate()  # sampling frequency; 44.1 kHz is used for all files
        raw_audio_information_1 = wf_1.readframes(FRAMES_1)

    with wave.open("Tone.wav", "rb") as wf_2:
        FRAMES_2 = wf_2.getnframes()
        CHANNELS_2 = wf_2.getnchannels()
        FRAME_RATE_2 = wf_2.getframerate()
        raw_audio_information_2 = wf_2.readframes(FRAMES_2)

    # Raw little-endian bytes of each file; both recordings are assumed to be equally long.
    raw_audio_information_1_array_LE = np.frombuffer(raw_audio_information_1, np.uint8)
    raw_audio_information_2_array_LE = np.frombuffer(raw_audio_information_2, np.uint8)

    raw_audio_information_1_array_BE = np.zeros(len(raw_audio_information_1_array_LE), np.int16)
    raw_audio_information_2_array_BE = np.zeros(len(raw_audio_information_2_array_LE), np.int16)

    for i in range(len(raw_audio_information_1_array_BE)):
        raw_audio_information_1_array_BE[i] = raw_audio_information_1_array_BE[i] + ltbe_int16(raw_audio_information_1_array_LE[i])
        raw_audio_information_2_array_BE[i] = raw_audio_information_2_array_BE[i] + ltbe_int16(raw_audio_information_2_array_LE[i])

    combined_raw_audio_information_array_BE = np.zeros(len(raw_audio_information_1_array_BE), np.int16)
    combined_raw_audio_information_array_LE = np.zeros(len(combined_raw_audio_information_array_BE), np.uint8)

    j = 0
    while j < len(combined_raw_audio_information_array_BE):
        # Scale each source by 0.5 before summing so the mix stays inside the signed 16-bit range.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + 0.5 * raw_audio_information_1_array_BE[j] + 0.5 * raw_audio_information_2_array_BE[j]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1

    combined_raw_audio_information_LE_list = combined_raw_audio_information_array_LE.tolist()
    combined_raw_audio_information_LE_byte_array = bytearray(combined_raw_audio_information_LE_list)
    combined_raw_audio_information_LE_bytes = bytes(combined_raw_audio_information_LE_byte_array)

    p = pyaudio.PyAudio()  # only needed to translate the sample format into a byte width
    with wave.open("Combined (Rhythm and Tone).wav", "wb") as wf_3:
        SAMPLE_VALUE_FORMAT = pyaudio.paInt16
        CHANNELS_3 = 1
        FRAME_RATE_3 = 44100
        wf_3.setsampwidth(p.get_sample_size(SAMPLE_VALUE_FORMAT))
        wf_3.setnchannels(CHANNELS_3)
        wf_3.setframerate(FRAME_RATE_3)
        wf_3.writeframes(combined_raw_audio_information_LE_bytes)
    p.terminate()

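As an aside, the same mixing idea can be sketched much more compactly if np.frombuffer() is allowed to interpret whole 16-bit samples at once; this sketch assumes the sample data can be read directly as little-endian signed 16-bit integers (the usual layout for 16-bit PCM WAVE files), and the output file name is just a placeholder:

    import wave
    import numpy as np

    with wave.open("Rhythm.wav", "rb") as wf_1:
        a = np.frombuffer(wf_1.readframes(wf_1.getnframes()), dtype="<i2")
    with wave.open("Tone.wav", "rb") as wf_2:
        b = np.frombuffer(wf_2.readframes(wf_2.getnframes()), dtype="<i2")

    n = min(len(a), len(b))                          # guard against unequal lengths
    mix = (0.5 * a[:n] + 0.5 * b[:n]).astype("<i2")  # scaled sum, back to little-endian int16

    with wave.open("Combined (compact sketch).wav", "wb") as wf_3:
        wf_3.setnchannels(1)       # mono
        wf_3.setsampwidth(2)       # 2 bytes per sample (16 bits)
        wf_3.setframerate(44100)   # 44.1 kHz
        wf_3.writeframes(mix.tobytes())
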
I tried to explain the lines of code that may need attention in the comments, but I'd also like to discuss a few aspects that are important for understanding the algorithm.
The algorithm interprets the frames (samples) read from each WAVE file by using the np.frombuffer() function. This function should be used carefully; otherwise, unwanted results are likely. So, let me explain the issue here. Consider a bytes object of the following form:
b'\xab\x72\x36\x23'
Since the PyAudio package is used, Python's own features come into play. While writing a WAVE file from scratch with that package, we need to inject the little-endian version of the original ADC output values, which are big-endian by default (at least on my computer), because Python's bytearray() and bytes() functions are used to write the bytes that form the WAVE file. Ultimately, those functions require that each element (i.e., each decimal number) be in the range [0, 256), which means the bytes are stored in the unsigned 8-bit format. (This is why one observes decimal values from 0 to 255 when trying to plot raw audio data without doing the little- to big-endian conversion.) Let's head back to our example. If the data type passed to the function is np.uint8, each hex value is distinguished easily, because the function picks up one np.uint8 number at a time, and the output becomes:
[171 114 54 35]
However, if the data type is, say, np.uint16, the function starts to pick up two 8-bit numbers at a time. As a consequence, the output may become the following, depending on how np.frombuffer() is called and on the computer's byte order (it is good practice to check the function's documentation):
[29355 9014]
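If you prefer not to rely on the machine's default byte order, NumPy also accepts dtype strings with an explicit endianness; a small sketch with the same example bytes (the printed values are what I would expect from those dtypes):

    import numpy as np

    raw = b'\xab\x72\x36\x23'

    print(np.frombuffer(raw, dtype=np.uint8))  # [171 114  54  35]
    print(np.frombuffer(raw, dtype='<u2'))     # little-endian 16-bit: [29355  9014]
    print(np.frombuffer(raw, dtype='>u2'))     # big-endian 16-bit:    [43890 13859]
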
Apart from that detail, in the code I've scaled each WAVE file's original ADC output values by 0.5 so that, if two amplitude values ever happen to take an extremum value (−32,768 or +32,512 in this representation), the sum becomes at most one of those extremum values. This, in turn, avoids clipped and distorted analog audio signals at the output of the DAC.
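A quick numeric check of that worst case (just a sketch; the np.clip() call is an extra guard that the script above does not use):

    import numpy as np

    a = -32768                   # worst-case sample from the first file
    b = -32768                   # worst-case sample from the second file

    unscaled = a + b             # -65536: outside the signed 16-bit range, would clip or wrap
    scaled = 0.5 * a + 0.5 * b   # -32768.0: still representable as int16

    safe = np.int16(np.clip(scaled, -32768, 32767))  # optional guard for arbitrary inputs
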
The graphs of Rhythm.wav and Tone.wav are given below, in that order:


And the graph of the mixed file, Combined.wav, looks like this. You can see that, because of destructive interference, there are some attenuations with respect to each individual graph. Additionally, the sine tone appears as one huge, wide band because I wanted to capture the whole temporal extent.

In spite of that satisfactory result, I attempted to create different combinations, like mixing whistles and speech, but the result wasn't as good, since I didn't use advanced DSP techniques to get something professional-sounding. So I decided to add another method for playing individual sound files at the same time: the stereo arrangement. The following code accomplishes that task.

    import wave
    import pyaudio
    import numpy as np


    def ltbe_int16(data: np.uint8) -> np.int16:
        # Little- to big-endian conversion (see the comments in the first script).
        data = int(data)  # work with a plain Python integer so the shift below cannot overflow an 8-bit type
        big_endian_sign_test_mask = 0x80
        if data & big_endian_sign_test_mask == 0:
            result = (data << 8) & 0xFF00
            return result
        else:
            intermediate = (data << 8) & 0xFF00
            result = -1 * ((intermediate ^ 0xFFFF) + 1)
            return result


    def btle_int16(data: np.int16) -> np.uint8:
        # Big- to little-endian conversion: keep only the high byte of the signed 16-bit value.
        result = (int(data) >> 8) & 0xFF
        return result


    with wave.open("Rhythm.wav", "rb") as wf_1:
        FRAMES_1 = wf_1.getnframes()
        CHANNELS_1 = wf_1.getnchannels()
        FRAME_RATE_1 = wf_1.getframerate()
        raw_audio_information_1 = wf_1.readframes(FRAMES_1)

    with wave.open("Tone.wav", "rb") as wf_2:
        FRAMES_2 = wf_2.getnframes()
        CHANNELS_2 = wf_2.getnchannels()
        FRAME_RATE_2 = wf_2.getframerate()
        raw_audio_information_2 = wf_2.readframes(FRAMES_2)

    raw_audio_information_1_array_LE = np.frombuffer(raw_audio_information_1, np.uint8)
    raw_audio_information_2_array_LE = np.frombuffer(raw_audio_information_2, np.uint8)

    raw_audio_information_1_array_BE = np.zeros(len(raw_audio_information_1_array_LE), np.int16)
    raw_audio_information_2_array_BE = np.zeros(len(raw_audio_information_2_array_LE), np.int16)

    for i in range(len(raw_audio_information_1_array_BE)):
        raw_audio_information_1_array_BE[i] = raw_audio_information_1_array_BE[i] + ltbe_int16(raw_audio_information_1_array_LE[i])
        raw_audio_information_2_array_BE[i] = raw_audio_information_2_array_BE[i] + ltbe_int16(raw_audio_information_2_array_LE[i])

    # The stereo stream holds twice as many bytes: in every frame, the two bytes of the
    # left channel (Rhythm) are followed by the two bytes of the right channel (Tone).
    combined_raw_audio_information_array_BE = np.zeros(2 * len(raw_audio_information_1_array_BE), np.int16)
    combined_raw_audio_information_array_LE = np.zeros(len(combined_raw_audio_information_array_BE), np.uint8)

    j = 0
    while j < len(combined_raw_audio_information_array_BE):
        # Left channel, first byte of the current Rhythm sample.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + raw_audio_information_1_array_BE[int(j / 2)]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1
        # Left channel, second byte of the current Rhythm sample.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + raw_audio_information_1_array_BE[int((j + 1) / 2)]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1
        # Right channel, first byte of the current Tone sample.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + raw_audio_information_2_array_BE[int((j - 2) / 2)]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1
        # Right channel, second byte of the current Tone sample.
        combined_raw_audio_information_array_BE[j] = combined_raw_audio_information_array_BE[j] + raw_audio_information_2_array_BE[int((j - 1) / 2)]
        combined_raw_audio_information_array_LE[j] = combined_raw_audio_information_array_LE[j] + btle_int16(combined_raw_audio_information_array_BE[j])
        j = j + 1

    combined_raw_audio_information_LE_list = combined_raw_audio_information_array_LE.tolist()
    combined_raw_audio_information_LE_byte_array = bytearray(combined_raw_audio_information_LE_list)
    combined_raw_audio_information_LE_bytes = bytes(combined_raw_audio_information_LE_byte_array)

    p = pyaudio.PyAudio()  # only needed to translate the sample format into a byte width
    with wave.open("Combined (Rhythm and Tone in the Stereo Arrangement).wav", "wb") as wf_3:
        SAMPLE_VALUE_FORMAT = pyaudio.paInt16
        CHANNELS_3 = 2
        FRAME_RATE_3 = 44100
        wf_3.setsampwidth(p.get_sample_size(SAMPLE_VALUE_FORMAT))
        wf_3.setnchannels(CHANNELS_3)
        wf_3.setframerate(FRAME_RATE_3)
        wf_3.writeframes(combined_raw_audio_information_LE_bytes)
    p.terminate()

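The byte-level bookkeeping in the loop above boils down to the standard interleaving of stereo WAVE frames: one left-channel sample followed by one right-channel sample, repeated. If the two channels were already available as int16 arrays, the same interleaving could be sketched as follows (the array names are placeholders; both channels are assumed to be equally long):

    import numpy as np

    def interleave_stereo(left: np.ndarray, right: np.ndarray) -> bytes:
        # One row per frame: column 0 is the left channel, column 1 is the right channel.
        frames = np.empty((len(left), 2), dtype="<i2")
        frames[:, 0] = left
        frames[:, 1] = right
        return frames.tobytes()  # row-major flattening gives L0 R0 L1 R1 ... as little-endian bytes
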
The resulting file gives the impression that the individual tracks are played independently and from different locations, which imitates the real-life scenario I discussed in the first paragraphs of this post. The graph of this file is given below.

Obviously, there are many more things I could say about my source code and algorithm that would clarify the process even further.