Cactus

A low-latency AI engine for mobile devices & wearables. Main features:

Fast: fastest inference on ARM CPU
Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
Multimodal: one SDK for speech, vision, and language models
Cloud fallback: automatically route requests to cloud models if needed
Energy-efficient: NPU-accelerated prefill

┌─────────────────┐ │ Cactus Engine │ ←── OpenAI-compatible APIs for all major languages └─────────────────┘ Chat, vision, STT, RAG, tool call, cloud handoff │ ┌─────────────────┐ │ Cactus Graph │ ←── Zero-copy computation graph (PyTorch for mobile) └─────────────────┘ Custom models, optimised for RAM & quantisation │ ┌─────────────────┐ │ Cactus Kernels │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc) └─────────────────┘ Custom attention, KV-cache quant, chunked prefill

Quick Demo (Mac)

Step 1: brew install cactus-compute/cactus/cactus
Step 2: cactus transcribe or cactus run

Cactus Engine

#include "cactus.h" cactus_model_t model = cactus_init( "path/to/weight/folder", "path to txt or dir of txts for auto-rag", false ); const char* messages = R"([  {"role": "system", "content": "You are a helpful assistant."},  {"role": "user", "content": "My name is Henry Ndubuaku"} ])"; const char* options = R"({  "max_tokens": 50,  "stop_sequences": ["<|im_end|>"] })"; char response[4096]; int result = cactus_complete( model, // model handle messages, // JSON chat messages response, // response buffer sizeof(response), // buffer size options, // generation options nullptr, // tools JSON nullptr, // streaming callback nullptr // user data );

Example response from Gemma3-270m

{ "success": true, // generation succeeded "error": null, // error details if failed "cloud_handoff": false, // true if cloud model used "response": "Hi there!", "function_calls": [], // parsed tool calls "confidence": 0.8193, // model confidence "time_to_first_token_ms": 45.23, "total_time_ms": 163.67, "prefill_tps": 1621.89, "decode_tps": 168.42, "ram_usage_mb": 245.67, "prefill_tokens": 28, "decode_tokens": 50, "total_tokens": 78 }

Cactus Graph

#include "cactus.h" CactusGraph graph; auto a = graph.input({2, 3}, Precision::FP16); auto b = graph.input({3, 4}, Precision::INT8); auto x1 = graph.matmul(a, b, false); auto x2 = graph.transpose(x1); auto result = graph.matmul(b, x2, true); float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f}; float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}; graph.set_input(a, a_data, Precision::FP16); graph.set_input(b, b_data, Precision::INT8); graph.execute(); void* output_data = graph.get_output(result); graph.hard_reset();

API & SDK References

Reference	Language	Description
Engine API	C	Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff
Graph API	C++	Tensor operations, matrix multiplication, attention, normalization, activation functions
Python SDK	Python	Mac, Linux
Swift SDK	Swift	iOS, macOS, tvOS, watchOS, Android
Kotlin SDK	Kotlin	Android, iOS (via KMP)
Flutter SDK	Dart	iOS, macOS, Android
Rust SDK	Rust	Mac, Linux
React Native	JavaScript	iOS, Android

Model weights: Pre-converted weights for all supported models at huggingface.co/Cactus-Compute.

Benchmarks (CPU-only, no GPU)

All weights INT4 quantised
LFM: 1k-prefill / 100-decode, values are prefill tps / decode tps
LFM-VL: 256px input, values are latency / decode tps
Parakeet: 20s audio input, values are latency / decode tps
Missing latency = no NPU support yet

Device	LFM 1.2B	LFMVL 1.6B	Parakeet 1.1B	VL RAM Usage
Mac M4 Pro	582/100	0.2s/98	0.1s/900k+	76MB
iPad/Mac M3	350/60	0.3s/69	0.3s/800k+	70MB
iPhone 17 Pro	327/48	0.3s/48	0.3s/300k+	108MB
iPhone 13 Mini	148/34	0.3s/35	0.7s/90k+	1GB
Galaxy S25 Ultra	255/37	-/34	-/250k+	1.5GB
Pixel 6a	70/15	-/15	-/17k+	1GB
Galaxy A17 5G	32/10	-/11	-/40k+	727MB
CMF Phone 2 Pro	-	-	-	-
Raspberry Pi 5	69/11	13.3s/11	4.5s/180k+	869MB

Supported Transcription Model

STT: 20s audio input on Macbook Air M3 chip
Benchmark dataset: internal evals with production users

Model	Params	End2End ms	Latency ms	Decode toks/sec	NPU	RTF	WER
UsefulSensors/moonshine-base	61M	361	182	262	yes	0.0180	0.1395
openai/whisper-tiny	39M	232	137	581	yes	0.0116	0.1860
openai/whisper-base	74M	329	178	358	yes	0.0164	0.1628
openai/whisper-small	244M	856	332	108	yes	0.0428	0.0930
openai/whisper-medium	769M	2085	923	49	yes	0.1041	0.0930
nvidia/parakeet-ctc-0.6b	600M	201	201	5214285	yes	0.0101	0.0930
nvidia/parakeet-tdt-0.6b-v3	600M	718	718	3583333	yes	0.0359	0.0465
nvidia/parakeet-ctc-1.1b	1.1B	279	278	4562500	yes	0.0139	0.1628
snakers4/silero-vad	-	-	-	-	-	-	-

Supported LLMs

Gemma weights are often gated on HuggingFace, needs tokens
Run huggingface-cli login and input your huggingface token

Model	Features
google/gemma-3-270m-it	completion
google/functiongemma-270m-it	tools
google/gemma-3-1b-it	completion, gated
google/gemma-3n-E2B-it	completion, tools
google/gemma-3n-E4B-it	completion, tools
Qwen/Qwen3-0.6B	completion, tools, embed
Qwen/Qwen3-Embedding-0.6B	embed
Qwen/Qwen3.5-0.8B	vision, completion, tools, embed
Qwen/Qwen3-1.7B	completion, tools, embed
Qwen/Qwen3.5-2B	vision, completion, tools, embed
LiquidAI/LFM2-350M	completion, tools, embed
LiquidAI/LFM2-700M	completion, tools, embed
LiquidAI/LFM2-8B-A1B	completion, tools, embed
LiquidAI/LFM2.5-1.2B-Thinking	completion, tools, embed
LiquidAI/LFM2.5-1.2B-Instruct	completion, tools, embed
LiquidAI/LFM2-2.6B	completion, tools, embed
LiquidAI/LFM2-VL-450M	vision, txt & img embed, Apple NPU
LiquidAI/LFM2.5-VL-1.6B	vision, txt & img embed, Apple NPU
tencent/Youtu-LLM-2B	completion, tools, embed
nomic-ai/nomic-embed-text-v2-moe	embed

Roadmap

Date	Status	Milestone
Sep 2025	Done	Released v1
Oct 2025	Done	Chunked prefill, KVCache Quant (2x prefill)
Nov 2025	Done	Cactus Attention (10 & 1k prefill = same decode)
Dec 2025	Done	Team grows to +6 Research Engineers
Jan 2026	Done	Apple NPU/RAM, 5-11x faster iOS/Mac
Feb 2026	Done	Hybrid inference, INT4, lossless Quant (1.5x)
Mar 2026	Coming	Qualcomm/Google NPUs, 5-11x faster Android
Apr 2026	Coming	Mediatek/Exynos NPUs, Cactus@ICLR
May 2026	Coming	Wearables & custom chips optimisations
Jun 2026	Coming	Torch/JAX model transpilers

Using this repo

┌──────────────────────────────────────────────────────────────────────────────┐ │ │ │ Step 0: if on Linux (Ubuntu/Debian) │ │ sudo apt-get install python3 python3-venv python3-pip cmake │ │ build-essential libcurl4-openssl-dev │ │ │ │ Step 1: clone and setup │ │ git clone https://github.com/cactus-compute/cactus && cd cactus │ │ source ./setup │ │ │ │ Step 2: use the commands │ │──────────────────────────────────────────────────────────────────────────────│ │ │ │ cactus auth manage Cloud API key │ │ --status show key status │ │ --clear remove saved key │ │ │ │ cactus run <model> opens playground (auto downloads) │ │ --precision INT4|INT8|FP16 quantization (default: INT4) │ │ --token <token> HF token (gated models) │ │ --reconvert force reconversion from source │ │ │ │ cactus transcribe [model] live mic transcription (parakeet-1.1b) │ │ --file <audio.wav> transcribe file instead of mic │ │ --precision INT4|INT8|FP16 quantization (default: INT4) │ │ --token <token> HF token (gated models) │ │ --reconvert force reconversion from source │ │ │ │ cactus download <model> downloads model to ./weights │ │ --precision INT4|INT8|FP16 quantization (default: INT4) │ │ --token <token> HuggingFace API token │ │ --reconvert force reconversion from source │ │ │ │ cactus convert <model> [dir] convert model, supports LoRA merge │ │ --precision INT4|INT8|FP16 quantization (default: INT4) │ │ --lora <path> LoRA adapter to merge │ │ --token <token> HuggingFace API token │ │ │ │ cactus build build for ARM → build/libcactus.a │ │ --apple Apple (iOS/macOS) │ │ --android Android │ │ --flutter Flutter (all platforms) │ │ --python shared lib for Python FFI │ │ │ │ cactus test run unit tests and benchmarks │ │ --model <model> default: LFM2-VL-450M │ │ --transcribe_model <model> default: moonshine-base │ │ --benchmark use larger models │ │ --precision INT4|INT8|FP16 regenerate weights with precision │ │ --reconvert force reconversion from source │ │ --no-rebuild skip building library │ │ --llm / --stt / --performance run specific test suite │ │ --ios run on connected iPhone │ │ --android run on connected Android │ │ │ │ cactus clean remove all build artifacts │ │ cactus --help show all commands and flags │ │ │ └──────────────────────────────────────────────────────────────────────────────┘

Maintaining Organisations

Citation

If you use Cactus in your research, please cite it as follows:

@software{cactus, title = {Cactus: AI Inference Engine for Phones & Wearables}, author = {Ndubuaku, Henry and Cactus Team}, url = {https://github.com/cactus-compute/cactus}, year = {2025} }

N/B: Scroll all the way up and click the shields link for resources!

Name		Name	Last commit message	Last commit date
Latest commit History 668 Commits
.githooks		.githooks
.github		.github
android		android
apple		apple
assets		assets
blog		blog
cactus		cactus
docs		docs
flutter		flutter
libs		libs
python		python
rust		rust
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
CACTUS_VERSION		CACTUS_VERSION
CONTRIBUTING.md		CONTRIBUTING.md
DCO.md		DCO.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
llms.txt		llms.txt
mkdocs.yml		mkdocs.yml
models.json		models.json
setup		setup

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cactus

Quick Demo (Mac)

Cactus Engine

Cactus Graph

API & SDK References

Benchmarks (CPU-only, no GPU)

Supported Transcription Model

Supported LLMs

Roadmap

Using this repo

Maintaining Organisations

Citation

About

Uh oh!

Releases 14

Packages

Uh oh!

Contributors 51

Languages

Folders and files

Latest commit

History

Repository files navigation

Cactus

Quick Demo (Mac)

Cactus Engine

Cactus Graph

API & SDK References

Benchmarks (CPU-only, no GPU)

Supported Transcription Model

Supported LLMs

Roadmap

Using this repo

Maintaining Organisations

Citation

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 14

Packages 0

Uh oh!

Contributors 51

Languages

Packages