
Cactus Blog

Deep dives into on-device AI, inference optimization, and the engineering behind Cactus.

Latest
Transcription | Hybrid AI

Sub-150ms Transcription with Cloud-Level Accuracy: Why We Built a Hybrid Engine

How Cactus combines on-device and cloud inference for real-time speech transcription to achieve sub-150ms latency and handle noisy audio.

Roman Shemet | February 27, 2026 | 5 min read
Models | Applications

The Sweet Spot for Mac Code Use: Reviewing LFM2 24B MoE A2B with Cactus

LFM2-24B-A2B features 24B total parameters but only activates 2B during inference. We break down the MoE architecture, GQA, gated convolutions, and show how to run it locally with Cactus.

Noah Cylich & Henry Ndubuaku | February 24, 2026 | 10 min read

Hybrid inference for modern applications.


© 2026 Cactus Compute. All rights reserved.