Organizations

@twtstudio @Ko-oK-OS @HMUniversity @TJUCS @NSCSCC-2022-TJU @raspberrypi-embedded @KuangjuX-Archived @HeliosXCore @TiledTensor

KuangjuX/README.md

Chengxiang Qi (齐呈祥)

🏠 Homepage · 📝 Zhihu · 💻 GitHub

M.Eng. Student in Computer Technology @ University of Chinese Academy of Sciences (UCAS)
B.Eng. in Computer Science @ Tianjin University (Outstanding Thesis Award)


About Me

I am a final-year master's student at the University of Chinese Academy of Sciences, focusing on ML Systems, Deep Learning Compilers, and GPU Programming. Previously, I worked extensively with the Rust programming language on systems-level projects including operating systems and hypervisors.

Research Interests

Machine Learning Systems · Deep Learning Compilers · GPU Kernel Optimization · CUDA Programming · Operating Systems · Virtualization


Experience

WeChat — LLM Infra Team · ML System Intern · June 2025 – Present

  • Implemented Light-DuoAttention using CuTeDSL for efficient long-context inference, integrated and running within SGLang.
  • Explored NVSHMEM & DeepEP; built NVSHMEM-Tutorial with hybrid CUDA IPC / RDMA communication for internal technical sharing.
  • Implemented Ring Attention Forward based on ThunderKittens using the LCF template, outperforming ring-flash-attention on short sequences. Implemented Flash Attention Backward based on LCF.
  • Performed performance analysis on MagiAttention, ZigZag Ring Attention, and ZigZag Flex Attention.
  • Investigated DSL design on NVIDIA Hopper architecture.

Microsoft Research Asia — System & Network Group · Research Intern · Feb 2024 – May 2025

  • Using the FractalTensor programming model, optimized GEMM, Back-to-Back GEMMs, Stacked/Dilated LSTM, and FlashAttention-2 with CUTLASS. Achieved up to a 5.45× speedup over SOTA on NVIDIA A100 (2.14× on average).
  • As a core designer and developer, built TileFusion: an efficient C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
  • Mentored by Dr. Ying Cao. Co-first authored a paper published at SOSP'24.

Tsinghua University — OS Laboratory · Research Intern · May 2023 – July 2023

  • Wrote an Intel 82599 NIC driver in Rust (referencing DPDK for optimization) and integrated it into ArceOS. Performed network performance benchmarking and optimization.
  • Developed a Type-2 hypervisor based on ArceOS capable of booting Linux; built Hypercraft as a standalone VMM library.

Selected Projects

| Project | Description |
| --- | --- |
| microsoft/TileFusion | C++ macro kernel template library for tile processing across the GPU memory hierarchy, with TensorCore support |
| microsoft/FractalTensor | Programming framework for organizing DNN data as nested statically-shaped tensors with automatic compiler analysis |
| NVSHMEM-Tutorial | Build a DeepEP-like GPU communication buffer with NVSHMEM (hybrid CUDA IPC / RDMA) |
| xv6-rust | Reimplementation of MIT xv6-riscv in Rust; reference implementation for OSCOMP |
| arceos | Experimental modular OS in Rust; contributed hypervisor, ixgbe NIC driver, and network optimization |
| Hypercraft | VMM library in Rust for RISC-V / AArch64 virtualization, capable of booting Linux |
| hypocaust-2 | Hardware-assisted RISC-V hypervisor using the H Extension; boots rCore, RT-Thread, and Linux |

Publications

  • Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor. Siran Liu*, Chengxiang Qi*, Ying Cao, Chao Yang, Weifang Hu, Xuanhua Shi, Fan Yang, Mao Yang. ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP'24) · (*equal contribution) [Paper] [Code]

  • Design and Implementation of a RISC-V-Based Type-1 Hypervisor (基于 RISC-V 的 Type-1 Hypervisor 的设计与实现). Chengxiang Qi. Bachelor Thesis, Tianjin University · (Outstanding Thesis Award) [Code]


Talks

  • Hypocaust: a RISC-V Type-1 Hypervisor Written in Rust. OS2ATC 2022, Beijing (March 2023). Presentation on the design and implementation of a RISC-V Type-1 hypervisor, showcasing virtualization techniques and system-level Rust programming.

Pinned

  1. TiledTensor/TiledCUDA (Public)

    We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstra…

    C++ · 193 stars · 11 forks

  2. microsoft/TileFusion (Public)

    TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.

    Cuda · 106 stars · 6 forks

  3. Ko-oK-OS/xv6-rust (Public)

    🦀️ Reimplement xv6-riscv in Rust!

    Rust · 357 stars · 36 forks

  4. arceos-org/arceos (Public)

    An experimental modular OS written in Rust.

    Rust · 742 stars · 427 forks

  5. hypercraft (Public)

    hypercraft is a VMM library written in Rust.

    Rust · 54 stars · 17 forks

  6. NVSHMEM-Tutorial (Public)

    NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer

    Cuda · 172 stars · 14 forks