Python bindings for the Cactus Engine via FFI. Auto-installed when you run `source ./setup`.
```shell
# Setup environment
source ./setup

# Build shared library for Python
cactus build --python

# Download models
cactus download LiquidAI/LFM2-VL-450M
cactus download openai/whisper-small
```

```python
from cactus import cactus_init, cactus_complete, cactus_destroy
import json

model = cactus_init("weights/lfm2-vl-450m")
messages = [{"role": "user", "content": "What is 2+2?"}]
response = json.loads(cactus_complete(model, messages))
print(response["response"])
cactus_destroy(model)
```

### cactus_init

Initialize a model and return its handle.
| Parameter | Type | Description |
|---|---|---|
| model_path | str | Path to model weights directory |
| corpus_dir | str | Optional path to RAG corpus directory for document Q&A |
```python
model = cactus_init("weights/lfm2-vl-450m")
rag_model = cactus_init("weights/lfm2-rag", corpus_dir="./documents")
```

### cactus_complete

Run a chat completion. Returns a JSON string with the response and metrics.
| Parameter | Type | Description |
|---|---|---|
| model | handle | Model handle from cactus_init |
| messages | list \| str | List of message dicts or JSON string |
| tools | list | Optional tool definitions for function calling |
| temperature | float | Sampling temperature |
| top_p | float | Top-p sampling |
| top_k | int | Top-k sampling |
| max_tokens | int | Maximum tokens to generate |
| stop_sequences | list | Stop sequences |
| force_tools | bool | Constrain output to tool call format |
| tool_rag_top_k | int | Select top-k relevant tools via Tool RAG (default: 2, 0 = use all tools) |
| confidence_threshold | float | Minimum confidence for local generation (default: 0.7; triggers cloud_handoff when below) |
| callback | fn | Streaming callback fn(token, token_id, user_data) |
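The tool-selection and sampling parameters above can be combined in a single call. A minimal sketch of one such combination — the wrapper name is ours, and `complete_fn` stands in for `cactus_complete` so the sketch can be exercised without a model handle:

```python
def complete_conservatively(complete_fn, model, messages, tools):
    """Sketch: combine Tool RAG, low-temperature sampling, and a
    confidence threshold in one completion call. In real use, pass
    cactus_complete as complete_fn."""
    return complete_fn(
        model,
        messages,
        tools=tools,
        tool_rag_top_k=2,          # pass only the 2 most relevant tools
        temperature=0.2,           # conservative sampling for tool calling
        max_tokens=128,
        confidence_threshold=0.7,  # below this, the result sets cloud_handoff
    )
```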
```python
# Basic completion
messages = [{"role": "user", "content": "Hello!"}]
response = cactus_complete(model, messages, max_tokens=100)
print(json.loads(response)["response"])

# With tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
    }
}]
response = cactus_complete(model, messages, tools=tools)

# Streaming
def on_token(token, token_id, user_data):
    print(token, end="", flush=True)

cactus_complete(model, messages, callback=on_token)
```

Response format (all fields always present):
```json
{
  "success": true,
  "error": null,
  "cloud_handoff": false,
  "response": "Hello! How can I help?",
  "function_calls": [],
  "confidence": 0.85,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 163.7,
  "prefill_tps": 619.5,
  "decode_tps": 168.4,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}
```

Cloud handoff response (when the model detects low confidence):
```json
{
  "success": false,
  "error": null,
  "cloud_handoff": true,
  "response": null,
  "function_calls": [],
  "confidence": 0.18,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 45.2,
  "prefill_tps": 619.5,
  "decode_tps": 0.0,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 0,
  "total_tokens": 28
}
```

When `cloud_handoff` is `true`, the model's confidence dropped below `confidence_threshold` (default: 0.7), and it recommends deferring to a cloud-based model for better results. Handle this in your application:
```python
result = json.loads(cactus_complete(model, messages))
if result["cloud_handoff"]:
    # Defer to a cloud API (e.g., OpenAI, Anthropic)
    response = call_cloud_api(messages)
else:
    response = result["response"]
```

### cactus_transcribe

Transcribe audio using a Whisper model. Returns a JSON string.
| Parameter | Type | Description |
|---|---|---|
| model | handle | Whisper model handle |
| audio_path | str | Path to audio file (WAV) |
| prompt | str | Whisper prompt for language/task |
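The prompt string is built from Whisper's special control tokens. A small helper can assemble it — the helper itself is ours, not part of the cactus API:

```python
def whisper_prompt(language="en", task="transcribe", timestamps=False):
    """Assemble a Whisper control prompt, e.g. for English transcription."""
    parts = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        parts.append("<|notimestamps|>")
    return "".join(parts)

print(whisper_prompt())
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|>
```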
```python
whisper = cactus_init("weights/whisper-small")
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
response = cactus_transcribe(whisper, "audio.wav", prompt=prompt)
print(json.loads(response)["response"])
cactus_destroy(whisper)
```

### cactus_embed

Get text embeddings. Returns a list of floats.
| Parameter | Type | Description |
|---|---|---|
| model | handle | Model handle |
| text | str | Text to embed |
| normalize | bool | L2-normalize embeddings (default: False) |
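Embedding vectors are typically compared with cosine similarity. A minimal helper (not part of the cactus API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# e.g. compare cactus_embed(model, "cat") with cactus_embed(model, "dog")
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```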
```python
embedding = cactus_embed(model, "Hello world")
print(f"Dimension: {len(embedding)}")
```

### cactus_image_embed

Get image embeddings from a VLM. Returns a list of floats.
```python
embedding = cactus_image_embed(model, "image.png")
```

### cactus_audio_embed

Get audio embeddings from a Whisper model. Returns a list of floats.
```python
embedding = cactus_audio_embed(whisper, "audio.wav")
```

### cactus_reset

Reset the model state (clear the KV cache). Call between unrelated conversations.
```python
cactus_reset(model)
```

### cactus_stop

Stop an ongoing generation (useful with streaming callbacks).
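For example, a streaming callback can enforce a token budget by requesting a stop once enough tokens have arrived. A sketch — the factory is ours, and `stop_fn` stands in for `lambda: cactus_stop(model)` so the logic can be exercised without a model:

```python
def make_budget_callback(stop_fn, limit):
    """Streaming callback that requests a stop after `limit` tokens."""
    state = {"count": 0}

    def on_token(token, token_id, user_data):
        print(token, end="", flush=True)
        state["count"] += 1
        if state["count"] >= limit:
            stop_fn()  # in real use: cactus_stop(model)

    return on_token

# Real use (sketch):
# cactus_complete(model, messages,
#                 callback=make_budget_callback(lambda: cactus_stop(model), 50))
```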
```python
cactus_stop(model)
```

### cactus_destroy

Free model memory. Always call when done.
```python
cactus_destroy(model)
```

### cactus_get_last_error

Get the last error message, or None if no error.
```python
error = cactus_get_last_error()
if error:
    print(f"Error: {error}")
```

### cactus_set_pro_key

Set a Cactus Pro key for NPU acceleration (Apple devices).
```python
cactus_set_pro_key("your-key")  # email founders@cactuscompute.com
```

### cactus_tokenize

Tokenize text. Returns a list of token IDs.
```python
tokens = cactus_tokenize(model, "Hello world")
print(tokens)  # [1234, 5678, ...]
```

### cactus_rag_query

Query the RAG corpus for relevant text chunks. Requires a model initialized with `corpus_dir`.
| Parameter | Type | Description |
|---|---|---|
| model | handle | Model handle (must have corpus_dir set) |
| query | str | Query text |
| top_k | int | Number of chunks to retrieve (default: 5) |
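Retrieved chunks are usually folded into the prompt before calling `cactus_complete`. A sketch of that glue — the template is illustrative, and chunks are assumed to be dicts with `text` and `score` keys as in the example below:

```python
def build_rag_messages(question, chunks):
    """Turn cactus_rag_query results into a context-augmented message list."""
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
    return [{"role": "user", "content": prompt}]

# Real use (sketch):
# chunks = cactus_rag_query(model, "What is ML?", top_k=3)
# response = cactus_complete(model, build_rag_messages("What is ML?", chunks))
```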
```python
model = cactus_init("weights/lfm2-rag", corpus_dir="./documents")
chunks = cactus_rag_query(model, "What is machine learning?", top_k=3)
for chunk in chunks:
    print(f"Score: {chunk['score']:.2f} - {chunk['text'][:100]}...")
```

### Vision

Pass images in the messages for vision-language models:
```python
vlm = cactus_init("weights/lfm2-vl-450m")
messages = [{
    "role": "user",
    "content": "Describe this image",
    "images": ["path/to/image.png"]
}]
response = cactus_complete(vlm, messages)
print(json.loads(response)["response"])
```

### Examples

See `python/example.py` for a complete example covering:
- Text completion
- Text/image/audio embeddings
- Vision (VLM)
- Speech transcription
```shell
python python/example.py
```