- Cambridge, USA
- kwchang.org
Starred repositories
Slap your MacBook and it yells back. Uses the Apple Silicon accelerometer via IOKit HID.
Official code for the CVPR 2024 Paper: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
Official implementation of "The Mind's Transformer" (ICLR 2026).
A real-time and multilingual speech translation model
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…
Train transformer language models with reinforcement learning.
Pixio: a capable vision encoder dedicated to dense prediction, trained simply by pixel reconstruction
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
Code and resources from Seeing is Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms (IJCNLP-AACL, 2025)
[ASRU 2025] Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...)…
kyutai-labs / nanoGPTaudio (forked from karpathy/nanoGPT)
Code for the blog "Neural audio codecs: how to get audio into LLMs"
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal domains, for both inference and training (see the usage sketch after this list).
Fixes AI pixel art or sprite web uploads
A method that directly addresses the modality gap by aligning speech tokens with their corresponding text transcriptions during the tokenization stage.
The official repo of "WhiStress: Enriching Transcriptions with Sentence Stress Detection" (Interspeech 2025)
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Text-audio foundation model from Boson AI
Kimi K2 is the large language model series developed by the Moonshot AI team
Code for DeSTA2.5-Audio, a general-purpose large audio-language model (LALM)
Foundation Models and Data for Human-Human and Human-AI interactions.
Collection of works for evaluating (and analyzing) large audio-language models (LALMs)
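
As referenced in the 🤗 Transformers entry above, here is a minimal usage sketch of its `pipeline` API, in the audio spirit of this list. It is an illustration under stated assumptions, not a recipe from any starred repository: the checkpoint `openai/whisper-tiny` and the file name `sample.wav` are examples only.

```python
# Minimal sketch, assuming `transformers` and `torch` are installed and the
# checkpoint can be fetched from the Hugging Face Hub (decoding a local audio
# file also requires ffmpeg on the PATH).
from transformers import pipeline

# "openai/whisper-tiny" is an example checkpoint; any compatible
# automatic-speech-recognition model can be substituted.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# "sample.wav" is a hypothetical local file; the pipeline also accepts URLs
# and raw numpy arrays of audio samples.
result = asr("sample.wav")
print(result["text"])
```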