cactus

Cross-platform & energy-efficient kernels, runtime and AI inference engine for mobile devices.

┌─────────────────┐
│   Cactus FFI    │ ←── OpenAI-compatible C API for integration (tools, RAG, cloud handoff)
└─────────────────┘
         │
┌─────────────────┐
│  Cactus Engine  │ ←── High-level transformer engine (NPU support, INT4/INT8/FP16/MIXED)
└─────────────────┘
         │
┌─────────────────┐
│  Cactus Models  │ ←── Implements SOTA models using Cactus Graphs
└─────────────────┘
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Unified zero-copy computation graph (think NumPy for mobile)
└─────────────────┘
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── Low-level ARM-specific SIMD operations (think CUDA for mobile)
└─────────────────┘

Cactus Graph & Kernel

#include "cactus.h"

CactusGraph graph;

// Declare graph inputs by shape and precision.
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

// Build the computation graph.
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

// Bind concrete buffers to the declared inputs.
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

// Execute, read the result, then reset the graph.
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();
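Because the graph is declared once and data is bound separately via set_input, the same graph can serve repeated workloads. Below is a minimal sketch of that reuse pattern; it assumes set_input/execute may be called repeatedly before hard_reset and uses only the calls shown above, so treat it as an illustration rather than the definitive API contract.

#include "cactus.h"

int main() {
    CactusGraph graph;
    auto a = graph.input({2, 3}, Precision::FP16);
    auto b = graph.input({3, 4}, Precision::INT8);
    auto c = graph.matmul(a, b, false);          // (2x3)·(3x4) -> (2x4)

    float a_data[6]  = {0};
    float b_data[12] = {0};
    for (int step = 0; step < 4; ++step) {
        a_data[0] = static_cast<float>(step);    // fresh data, same graph
        graph.set_input(a, a_data, Precision::FP16);
        graph.set_input(b, b_data, Precision::INT8);
        graph.execute();
        void* out = graph.get_output(c);         // zero-copy view of the result
        (void)out;                               // consume per your precision/layout
    }
    graph.hard_reset();                          // tear down when done
    return 0;
}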

Cactus Engine & FFI

#include "cactus.h"

cactus_set_pro_key("");  // email founders@cactuscompute.com for optional key

cactus_model_t model = cactus_init(
    "path/to/weight/folder",           // see the section on generating weights below
    "txt/or/md/file/or/dir/with/many"  // nullptr if none; cactus does automatic fast RAG
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,            // model handle from cactus_init
    messages,         // JSON array of chat messages
    response,         // buffer to store response JSON
    sizeof(response), // size of response buffer
    options,          // optional: generation options (nullptr for defaults)
    nullptr,          // optional: tools JSON for function calling
    nullptr,          // optional: streaming callback fn(token, id, user_data)
    nullptr           // optional: user data passed to callback
);
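The last two arguments enable token streaming. Below is a minimal sketch of a streaming call, assuming the callback signature matches the fn(token, id, user_data) comment above with the token delivered as a C string; check the FFI header for the exact typedef.

#include <cstdio>

// Assumed callback shape per the fn(token, id, user_data) comment above.
void on_token(const char* token, int id, void* user_data) {
    (void)id; (void)user_data;
    fputs(token, stdout);   // print each token as it is decoded
    fflush(stdout);
}

// Same call as above, but with the callback wired in.
int streamed = cactus_complete(
    model, messages, response, sizeof(response),
    options,
    nullptr,    // no tools
    on_token,   // streaming callback
    nullptr     // user data forwarded to on_token
);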

Example response from Gemma3-270m

{ "success": true, // when successfully generated locally "error": null, // returns specific errors if success = false "cloud_handoff": false, // true when model is unconfident, simply route to cloud "response": "Hi there!", // null when error is not null or cloud_handoff = true "function_calls": [], // parsed to [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}] "confidence": 0.8193, // how confident the model is with its response "time_to_first_token_ms": 45.23, // latency (time to first token) "total_time_ms": 163.67, // total execution time "prefill_tps": 1621.89, // prefill tokens per second "decode_tps": 168.42, // decode tokens per second "ram_usage_mb": 245.67, // current process RAM usage in MB "prefill_tokens": 28, "decode_tokens": 50, "total_tokens": 78 }

Performance

  • Models: LFM2-VL-450m & Whisper-Small
  • Precision: Cactus smartly blends INT4, INT8 and FP16 across all weights.
  • Decode = tokens/sec; P/D = prefill/decode; VLM = 256×256 image; STT = 30s audio
  • Cactus Pro: uses the NPU for realtime and large-context workloads (Apple for now); NPU scores are marked with *
| Device           | Short Decode | 4k-P/D  | VLM-TTFT   | VLM-Dec | STT-TTFT   | STT-Dec |
|------------------|--------------|---------|------------|---------|------------|---------|
| Mac M4 Pro       | 170          | 989/150 | 0.2s/0.1s* | 168     | 0.9s/0.2s* | 92      |
| Mac M3 Pro       | 140          | 890/123 | 0.3s/0.1s* | 149     | 1.5s/0.4s* | 81      |
| iPad/Mac M4      | 134          | 603/106 | 0.3s/0.1s* | 129     | 1.8s/0.3s* | 70      |
| iPad/Mac M3      | 117          | 525/93  | 0.4s/0.1s* | 111     | 2.8s/0.7s* | 61      |
| iPhone 17 Pro    | 126          | 428/84  | 0.5s/0.1s* | 120     | 3.0s/0.6s* | 80      |
| iPhone 16 Pro    | 106          | 380/81  | 0.6s/0.2s* | 101     | 4.3s/0.7s* | 75      |
| iPhone 15 Pro    | 90           | 330/75  | 0.7s/0.3s* | 92      | 4.5s/0.8s* | 70      |
| Galaxy S25 Ultra | 80           | 355/52  | 0.7s       | 70      | 3.6s/-     | 32      |
| Nothing 3        | 56           | 320/46  | 0.8s       | 54      | 4.5s       | 55      |
| Pixel 6a         | 25           | 108/24  | 2.3s       | 25      | 9.6s       | 15      |
| Raspberry Pi 5   | 20           | 292/18  | 1.7s       | 23      | 15s        | 16      |

Supported models

  • Cactus smartly and compactly blends INT4, INT8 and FP16 across all weights.
  • You can still quantize everything to a single precision, but mixed is optimal.
| Model                            | Zipped Size | Completion | Tools | Vision | Embed | Speech | Pro   |
|----------------------------------|-------------|------------|-------|--------|-------|--------|-------|
| google/gemma-3-270m-it           | 252MB       |            |       |        |       |        |       |
| google/functiongemma-270m-it     | 252MB       |            |       |        |       |        |       |
| openai/whisper-small             | 283MB       |            |       |        |       |        | Apple |
| LiquidAI/LFM2-350M               | 244MB       |            |       |        |       |        |       |
| LiquidAI/LFM2-VL-450M            | 448MB       |            |       |        |       |        | Apple |
| nomic-ai/nomic-embed-text-v2-moe | 451MB       |            |       |        |       |        |       |
| Qwen/Qwen3-0.6B                  | 514MB       |            |       |        |       |        |       |
| Qwen/Qwen3-Embedding-0.6B        | 514MB       |            |       |        |       |        |       |
| LiquidAI/LFM2-700M               | 498MB       |            |       |        |       |        |       |
| google/gemma-3-1b-it             | 642MB       |            |       |        |       |        |       |
| LiquidAI/LFM2.5-1.2B-Instruct    | 474MB       |            |       |        |       |        |       |
| LiquidAI/LFM2-1.2B-RAG           | 474MB       |            |       |        |       |        |       |
| LiquidAI/LFM2-1.2B-Tool          | 474MB       |            |       |        |       |        |       |
| openai/whisper-medium            | 658MB       |            |       |        |       |        | Apple |
| LiquidAI/LFM2.5-VL-1.6B          | 954MB       |            |       |        |       |        | Apple |
| Qwen/Qwen3-1.7B                  | 749MB       |            |       |        |       |        |       |
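
Any model in the table plugs into the cactus_init call shown earlier once its weights are on disk. Below is a minimal sketch, assuming the weight folder is wherever cactus download placed it (the CLI section below defaults to ./weights; the exact subfolder name here is hypothetical):

#include "cactus.h"

// Hypothetical path: adjust to wherever `cactus download` put the weights.
cactus_model_t model = cactus_init(
    "./weights/gemma-3-270m-it",  // weight folder for a model from the table
    nullptr                       // no RAG corpus
);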

Using this repo on a Mac

git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
  • [model] is a HuggingFace name from the table above (default: google/gemma-3-270m-it)
  • Common flags: --precision INT4|INT8|FP16 (default: INT4), --token <hf_token>
  • Always run source ./setup in any new terminal.
| Command                      | Description                                           |
|------------------------------|-------------------------------------------------------|
| cactus run [model]           | Opens playground (auto-downloads model)               |
| cactus download [model]      | Downloads model to ./weights                          |
| cactus convert [model] [dir] | Converts model; supports LoRA merging (--lora <path>) |
| cactus build                 | Builds for ARM (--apple or --android)                 |
| cactus test                  | Runs tests (--ios / --android, --model [name/path])   |
| cactus clean                 | Removes build artifacts                               |
| cactus --help                | Shows all commands and flags                          |

Using in your apps

Try demo apps
