music-detector-demo.mp4
Real-time music detection running from the microphone: predictions update every second with confidence scores.
This repository showcases the capabilities of DeepAudioX, a lightweight PyTorch-based library that my team and I at MagCIL have developed. DeepAudioX provides simple APIs for training, evaluating, and running inference with audio models, using pretrained models as feature extractors!
In this project we train a binary classifier that detects Music vs Non-Music excerpts. We use a small sample of the AudioSet dataset that you can download from Kaggle (see the instructions below).
Let's get started!
```
music-detector/
├── configs/
│   ├── training_config.yaml            # Training hyperparameters and dataset paths
│   ├── online_inference_config.yaml    # Real-time inference settings
│   ├── music_non_music_map.json        # Train set: YouTube ID -> class label mapping
│   └── valid_music_non_music_map.json  # Validation set: YouTube ID -> class label mapping
├── scripts/
│   ├── dataset.py                      # Dataset construction utilities
│   ├── train.py                        # Training and evaluation script
│   └── online-inference.py             # Real-time microphone inference
├── pretrained_models/                  # Trained model checkpoints (.pt files)
├── misc/                               # Demo videos and media assets
├── data/                               # AudioSet audio files (not tracked in git)
├── pyproject.toml
└── README.md
```

- Python 3.13+
- uv package manager
- PortAudio (required by `sounddevice` for microphone access)
```
git clone https://github.com/ChrisNick92/music-detector.git
cd music-detector
uv sync
```

Download the AudioSet train and validation sets from Kaggle:
- Train set: https://www.kaggle.com/datasets/zfturbo/audioset
- Validation set: https://www.kaggle.com/datasets/zfturbo/audioset-valid
Place the downloaded data so the directory structure looks like:
```
data/
├── audio_set_train/
│   └── train_wav/   # .wav files
└── audio_set_valid/
    └── valid_wav/   # .wav files
```

The class label mappings (`music_non_music_map.json` and `valid_music_non_music_map.json`) are already included in the `configs/` directory. By default, `training_config.yaml` expects them under `data/`: either copy them there or update the `train_mapping` / `valid_mapping` paths in the config to point to `configs/` instead.
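A quick way to sanity-check a label mapping after downloading the data. This assumes the mapping files are flat JSON objects of the form `{"<youtube_id>": "Music" | "Non-Music"}`, which is an assumption about their layout, not a documented guarantee:

```python
import json
from collections import Counter

def summarize_mapping(path):
    """Load a YouTube-ID -> label mapping and count examples per class."""
    with open(path) as f:
        mapping = json.load(f)
    return Counter(mapping.values())

# Example with an in-memory mapping of the assumed shape:
mapping = {"dQw4w9WgXcQ": "Music", "abc123def45": "Non-Music", "xyz987uvw65": "Music"}
print(Counter(mapping.values()))  # Counter({'Music': 2, 'Non-Music': 1})
```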
Train a music detection model using a pretrained backbone:
```
uv run python scripts/train.py
```

Or specify a custom config:

```
uv run python scripts/train.py --config configs/training_config.yaml
```

Edit `configs/training_config.yaml` to customize the training pipeline:
| Section | Key | Description |
|---|---|---|
| dataset | train_data_dir | Path to the directory with training .wav files |
| dataset | train_mapping | Path to JSON mapping YouTube IDs to "Music" / "Non-Music" |
| dataset | valid_data_dir | Path to the directory with validation .wav files |
| dataset | valid_mapping | Path to JSON mapping for validation set |
| dataset | sample_rate | Audio sampling rate in Hz (default: 16000) |
| dataset | segment_duration | Segment length in seconds; null for full files (default: 10.0) |
| model | backbone | Pretrained backbone: beats, passt, mobilenet_05_as, mobilenet_10_as, mobilenet_40_as |
| model | pooling | Pooling method: gap, simpool, ep, or null (default: ep) |
| model | freeze_backbone | Freeze backbone weights during training (default: true) |
| model | classifier_hidden_layers | List of hidden layer sizes for the MLP head, e.g. [256] or [] |
| model | activation | Activation function: relu, gelu, tanh, leakyrelu |
| model | pretrained | Load pretrained backbone weights (default: true) |
| training | learning_rate | Initial learning rate (default: 0.001) |
| training | epochs | Maximum training epochs (default: 100) |
| training | patience | Early stopping patience (default: 15) |
| training | batch_size | Batch size (default: 16) |
| training | num_workers | DataLoader workers (default: 4) |
| training | checkpoint_path | Where to save the best model checkpoint |
| training | device_index | GPU index to use; omit or null for CPU |
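Putting the keys above together, a `training_config.yaml` might look like the sketch below. The values are the defaults from the table (plus paths matching the directory layout shown earlier); the exact YAML nesting is an assumption, so treat this as illustrative rather than the canonical file:

```yaml
dataset:
  train_data_dir: data/audio_set_train/train_wav
  train_mapping: data/music_non_music_map.json
  valid_data_dir: data/audio_set_valid/valid_wav
  valid_mapping: data/valid_music_non_music_map.json
  sample_rate: 16000
  segment_duration: 10.0

model:
  backbone: mobilenet_05_as
  pooling: ep
  freeze_backbone: true
  classifier_hidden_layers: [256]
  activation: relu
  pretrained: true

training:
  learning_rate: 0.001
  epochs: 100
  patience: 15
  batch_size: 16
  num_workers: 4
  checkpoint_path: pretrained_models/mobilenet_05.pt
  device_index: null
```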
After training completes, the script automatically evaluates the best checkpoint on the validation set and prints a classification report.
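The patience-based early stopping configured above can be sketched as follows. This is an illustrative stdlib-only helper, not DeepAudioX's actual implementation:

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss          # new best: save a checkpoint here
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Example: the loss plateaus after epoch 2, so training stops `patience` epochs later.
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([0.9, 0.7, 0.6, 0.6, 0.61, 0.62]):
    if stopper.step(loss):
        print(f"early stop at epoch {epoch}")  # prints: early stop at epoch 5
        break
```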
Run music detection from your microphone in real time:
```
uv run python scripts/online-inference.py
```

Or with a custom config:

```
uv run python scripts/online-inference.py --config configs/online_inference_config.yaml
```

The script captures audio segments from your microphone and prints predictions continuously:
```
==========================================================
 Music Detector - Real-Time Inference
 Segment: 1s | Sample rate: 16000 Hz
==========================================================
Press Ctrl+C to stop.

2025-02-28 14:34:56  ♫ Music      [||||||||||||||||....]  82.3%
2025-02-28 14:34:57  - Non-Music  [||||||||||||||......]  71.5%
2025-02-28 14:34:58  ♫ Music      [||||||||||||||||||||]  97.1%
```
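The bracketed bars in the output above are just a fixed-width confidence gauge. A formatter along these lines reproduces them (hypothetical helper, not the script's actual code):

```python
def confidence_bar(confidence, width=20):
    """Render a confidence in [0, 1] as a fixed-width bar of '|' characters."""
    filled = round(confidence * width)
    return "[" + "|" * filled + "." * (width - filled) + "]"

def format_prediction(timestamp, label, confidence):
    """One timestamped prediction line, mirroring the sample output above."""
    return f"{timestamp}  {label:<10} {confidence_bar(confidence)} {confidence:>5.1%}"

print(format_prediction("2025-02-28 14:34:56", "Music", 0.823))
# 2025-02-28 14:34:56  Music      [||||||||||||||||....] 82.3%
```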
Edit configs/online_inference_config.yaml:
| Section | Key | Description |
|---|---|---|
| model | backbone | Must match the backbone used during training |
| model | pooling | Must match the pooling used during training |
| model | classifier_hidden_layers | Must match the architecture used during training |
| model | checkpoint_path | Path to the trained .pt checkpoint |
| inference | sample_rate | Must match the sample rate used during training |
| inference | segment_duration | Duration of each audio segment to classify (in seconds) |
| inference | device_index | GPU index to use; omit or null for CPU |
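For reference, an `online_inference_config.yaml` matching the bundled checkpoint might look like the sketch below. As with the training config, the exact nesting is assumed; the 1-second segment mirrors the demo output shown earlier:

```yaml
model:
  backbone: mobilenet_05_as        # must match training
  pooling: ep                      # must match training
  classifier_hidden_layers: [256]  # must match training
  checkpoint_path: pretrained_models/mobilenet_05.pt

inference:
  sample_rate: 16000               # must match training
  segment_duration: 1.0            # seconds of audio per prediction
  device_index: null               # null = CPU
```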
Provides build_music_detection_dataset(), which constructs a DeepAudioX AudioClassificationDataset from a directory of .wav files and a JSON class mapping. It is used by the training script and can also be imported for custom workflows.
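Conceptually, dataset construction pairs each .wav file with its label from the JSON mapping. A stdlib-only sketch of that pairing step is below; it assumes files are named `<youtube_id>.wav` (an assumption about the Kaggle layout) and is not DeepAudioX's implementation:

```python
import json
from pathlib import Path

def collect_labeled_files(data_dir, mapping_path):
    """Pair each <youtube_id>.wav under data_dir with its label from the mapping.

    Files with no entry in the mapping are skipped.
    """
    with open(mapping_path) as f:
        mapping = json.load(f)
    return [
        (wav, mapping[wav.stem])
        for wav in sorted(Path(data_dir).glob("*.wav"))
        if wav.stem in mapping
    ]
```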
End-to-end training pipeline. Loads config, builds datasets, constructs an AudioClassifier with a pretrained backbone, trains with early stopping, and evaluates the best checkpoint. All configuration is driven by YAML β no code changes needed to experiment.
Real-time inference from microphone input. Loads a trained checkpoint, captures audio in fixed-duration segments, and prints timestamped predictions with confidence scores.
A pretrained checkpoint (pretrained_models/mobilenet_05.pt) is included in the repo so you can test the music detector right away, no training required!
Just run:
```
uv run python scripts/online-inference.py
```

Then play some music from your phone, speakers, or any source near your microphone and watch the predictions roll in.