ga642381
🎯
Focusing

Sponsoring

@voidful

Highlights

  • Pro

Starred repositories

Slap your MacBook, it yells back. Uses Apple Silicon accelerometer via IOKit HID.

Go 3,390 147 Updated Mar 17, 2026

Official code for the CVPR 2024 paper: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

Jupyter Notebook 87 13 Updated Jun 12, 2024

Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Python 9 2 Updated Mar 2, 2026

Official implementation of "The Mind's Transformer" (ICLR 2026).

Shell 7 Updated Mar 1, 2026

A real-time and multilingual speech translation model

Python 223 21 Updated Feb 13, 2026

fd-sds

Python 13 Updated Feb 2, 2026

Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…

Python 9,886 1,248 Updated Mar 17, 2026

Train transformer language models with reinforcement learning.

Python 17,760 2,582 Updated Mar 23, 2026

Pixio: a capable vision encoder dedicated to dense prediction, simply by pixel reconstruction

Python 362 10 Updated Jan 22, 2026

State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!

Jupyter Notebook 2,206 148 Updated Mar 12, 2026

The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…

Python 3,403 300 Updated Jan 5, 2026

Code and resources from Seeing is Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms (IJCNLP-AACL, 2025)

2 Updated Oct 26, 2025

[ASRU 2025] Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Python 44 1 Updated Nov 21, 2025

Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...)…

Python 13,303 1,292 Updated Mar 23, 2026

Code for the blog "Neural audio codecs: how to get audio into LLMs"

Python 159 4 Updated Oct 20, 2025

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Python 158,305 32,585 Updated Mar 23, 2026

Python 13 1 Updated Mar 18, 2026

Fixes AI pixel art or sprite web uploads

Python 385 29 Updated Mar 20, 2026

A method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage.

Python 115 13 Updated Sep 3, 2025

The official repo of "WhiStress: Enriching Transcriptions with Sentence Stress Detection" (Interspeech 2025)

Python 37 12 Updated Jul 24, 2025

EMO-SUPERB submission

Python 51 1 Updated Oct 13, 2025

gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI

Python 19,930 2,065 Updated Jan 13, 2026

Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

Python 1,372 101 Updated Mar 16, 2026

Text-audio foundation model from Boson AI

Python 7,991 615 Updated Jan 18, 2026

Kimi K2 is the large language model series developed by Moonshot AI team

10,541 798 Updated Jan 21, 2026

Code for DeSTA2.5-Audio, general-purpose LALM

Python 130 7 Updated Feb 4, 2026

Foundation Models and Data for Human-Human and Human-AI interactions.

Python 363 28 Updated Dec 13, 2025

SoTA open-source TTS

Python 23,905 3,175 Updated Mar 18, 2026

Collection of works for evaluating (and analyzing) large audio-language models (LALMs)

40 1 Updated Aug 11, 2025