| BEATs: Audio Pre-Training with Acoustic Tokenizers | Iterative audio pre-training framework for learning bidirectional encoder representations from audio transformers | 98.1% | chen22 | 📜 |
| Masked Autoencoders that Listen | Image-based MAE for audio spectrograms | 97.4% | huang2022 | 📜 |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | 📜 |
| PaSST: Efficient Training of Audio Transformers with Patchout | Drops out some of the input patches during AST training | 96.8% | koutini22 | 📜 |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained with natural language supervision | 96.70% | elizalde2022 | 📜 |
| AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet | 95.70% | gong2021 | 📜 |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | Transformer model pretrained with visual image supervision | 95.70% | zhao2022 | 📜 |
| A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from AudioSet | 94.10% | kumar2020 | |
| Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
| Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | 📜 |
| An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | 📜 |
| Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | 📜 |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 85.95% | wu2021 | 📜 |
| AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
| On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with OpenL3 embeddings | 85.00% | wilkinghoff2020 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
| Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | 📜 |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
| Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
| 🎧 Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | 📜 |
| Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
| Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
| Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
| Soundnet: Learning sound representations from unlabeled video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | 📜 |
| Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | 📜 |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization | 72.40% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
| Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | Combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | 📜 |
| Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | 📜 |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | 📜 |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
| Soundnet: Learning sound representations from unlabeled video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | 📜 |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
| Soundnet: Learning sound representations from unlabeled video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | 📜 |
| 📊 Environmental Sound Classification with Convolutional Neural Networks - CNN baseline | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | 📜 |
| auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | 📜 |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | 📜 |
| Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | 📜 |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
| Soundnet: Learning sound representations from unlabeled video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | 📜 |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
| 📊 Baseline - random forest | Baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | 📜 |
| Soundnet: Learning sound representations from unlabeled video | Convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | 📜 |
| 📊 Baseline - SVM | Baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | 📜 |
| 📊 Baseline - k-NN | Baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | 📜 |