Axion is a high-performance LLM serving platform built with Rust that provides OpenAI-compatible APIs for chat completions, embeddings, and reranking. Designed for production environments, Axion targets high throughput and low latency through request caching, continuous batching, and hardware acceleration.
- Dual Backend System: Automatically uses MAX serve for supported models and falls back to Candle for the rest
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints with full compatibility
- Streaming Support: Real-time streaming responses using Server-Sent Events (SSE)
- Request Caching: LRU cache system for faster repeated requests
- Continuous Batching: Efficient request batching for improved throughput
- Multi-Model Support: Extensive support for Llama, Qwen3, Gemma, Mistral, GLM4, Granite, Olmo, and other architectures
- Hardware Acceleration: Automatic GPU detection and utilization
- Memory Efficiency: Optimized memory management and KV-cache reuse
- Concurrent Processing: High-throughput request handling
- Adaptive Batching: Dynamic batch formation based on request patterns
- Endpoint: `POST /v1/chat/completions`
- Features: Streaming and non-streaming responses, full OpenAI parameter compatibility
- Example:

```bash
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "stream": false
  }'
```
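For streaming, the same endpoint can be called with `"stream": true`; a minimal sketch, assuming OpenAI-style SSE `data:` chunks are emitted (the `-N` flag keeps curl from buffering the stream):

```bash
curl -N -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a haiku about Rust."}
    ],
    "stream": true
  }'
```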
- Endpoint: `POST /v1/embeddings`
- Features: High-performance embeddings using fastembed
- Example:

```bash
curl -X POST http://localhost:3000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, World!",
    "model": "BAAI/bge-small-en-v1.5"
  }'
```
- Endpoint: `POST /v1/rerank`
- Features: Semantic reranking using fastembed for improved search results
- Example:

```bash
curl -X POST http://localhost:3000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is a panda?",
    "documents": ["A bear species", "A software library", "An animal"],
    "model": "BAAI/bge-reranker-base",
    "top_n": 3
  }'
```
- Endpoint: `GET /health`
- Returns: Server status, backend availability, loaded model information, and system metrics
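A quick way to verify the server is up; the exact fields in the response depend on which backend is active and which model is loaded:

```bash
curl http://localhost:3000/health
```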
```
┌─────────────┐
│   Client    │
└─────┬───────┘
      │ HTTPS
      ▼
┌─────────────────────────┐
│      Axion Server       │
│   (Axum + Tower HTTP)   │
└─────┬───────────────────┘
      │
      ├──► Cache Layer (LRU)
      │
      ├──► Continuous Batcher
      │
      ▼
┌─────────────────────────┐
│    Inference Engine     │
│     (Smart Routing)     │
└─────┬───────────────────┘
      │
      ├──► MAX Client ──────► max serve (OpenAI API)
      │                            │
      │                            ▼
      │                      Model Process
      │
      └──► Candle Backend ──► Native Inference
                              (Llama, Qwen, etc.)
                                   │
                                   └──► GPU/CPU Execution
```

- Rust: Latest stable version (1.70+)
- Git LFS: For large model files
- MAX CLI (Optional): For MAX backend support
- CUDA (Optional): For GPU acceleration
- Clone the repository:

```bash
git clone <repository-url>
cd axion
git lfs install
git lfs pull
```

- Build the project:

```bash
cargo build --release
```

- Run with default settings:

```bash
# Use default model
cargo run --release

# Specify a model
MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" cargo run --release

# With custom configuration
MODEL_NAME="microsoft/Phi-3-mini-4k-instruct" \
SERVER_PORT=8080 \
RUST_LOG=axion=info \
cargo run --release
```

- `MODEL_NAME`: Primary model to serve (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `MAX_SEQ_LEN`: Maximum sequence length (default: `4096`)
- `SERVER_HOST`: Server host address (default: `0.0.0.0`)
- `SERVER_PORT`: Server port (default: `3000`)
- `MAX_CONNECTIONS`: Maximum concurrent connections (default: `100`)
- `CACHE_CAPACITY`: Number of cached responses (default: `1000`)
- `BATCH_TIMEOUT_MS`: Batching timeout in milliseconds (default: `50`)
- `MAX_BATCH_SIZE`: Maximum batch size (default: `8`)
- `CONCURRENT_REQUESTS`: Maximum concurrent requests (default: `10`)
- `RUST_LOG`: Logging level (default: `axion=info,tower_http=info`)
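As a sketch, a custom deployment might export several of these variables before launch; the values below are illustrative, not tuned recommendations:

```bash
export MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct"
export MAX_SEQ_LEN=8192
export SERVER_HOST="0.0.0.0"
export SERVER_PORT=8080
export CACHE_CAPACITY=2000
export MAX_BATCH_SIZE=16
export CONCURRENT_REQUESTS=32
export RUST_LOG="axion=info,tower_http=info"

cargo run --release
```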
When a model is supported by MAX, Axion automatically:
- Spawns `max serve --model {model_name}`
- Waits for MAX to become ready
- Routes all requests to MAX's OpenAI-compatible endpoint
- Monitors health and manages process lifecycle
MAX supports:
- Llama models (Llama, Llama2, Llama3, Llama3.1, Llama3.2)
- Mistral models (Mistral, MistralNeMo, Mixtral)
- Qwen models (Qwen, Qwen2, Qwen3)
- Gemma models (Gemma, Gemma2)
- Phi models (Phi, Phi2, Phi3)
- DeepSeek models
- And other HuggingFace transformers
If MAX is unavailable or the model is unsupported, Axion falls back to Candle:
- Loads model using model-specific implementation
- Performs native inference with Candle framework
- Automatically uses GPU if available
- Applies model-specific optimizations
Candle supports:
- Llama family models
- Qwen3 and quantized variants
- Gemma family models
- Mistral family models
- GLM4 family models
- IBM Granite models
- OLMo models
- And other transformer architectures
- LRU cache with configurable capacity (default: 1000 entries)
- Caches non-streaming chat completions based on model parameters
- Cache key includes model, messages, temperature, and other relevant parameters
- Thread-safe implementation for concurrent access
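One way to observe the cache in action, assuming identical non-streaming requests hit the same cache entry: send the same request twice and compare wall-clock times (the request body is illustrative):

```bash
BODY='{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'

# The first call populates the cache; the second should return noticeably faster.
time curl -s -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" -d "$BODY" > /dev/null
time curl -s -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" -d "$BODY" > /dev/null
```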
- Dynamic batch formation with configurable timeout
- Configurable maximum batch size
- Reduces computational overhead for concurrent requests
- Maintains low latency through intelligent batching
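As a quick smoke test (not a benchmark), firing a handful of concurrent requests makes overlapping requests candidates for the same batch under the default `BATCH_TIMEOUT_MS` and `MAX_BATCH_SIZE`:

```bash
# Launch 8 requests in parallel and wait for all of them to finish.
for i in $(seq 1 8); do
  curl -s -X POST http://localhost:3000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Count to five."}], "stream": false}' &
done
wait
```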
- Automatic GPU detection and utilization
- CUDA support for NVIDIA GPUs
- Optimized memory management for GPU inference
- CPU optimization with SIMD instructions
```
src/
├── main.rs               # Server entry point and HTTP handlers
├── api_types.rs          # OpenAI-compatible API type definitions
├── inference_engine.rs   # Main inference coordinator and backend routing
├── max_client.rs         # MAX serve integration and process management
├── candle_inference.rs   # Native Candle backend implementation
├── embedding_service.rs  # Embedding generation service
├── rerank_service.rs     # Document reranking service
├── cache.rs              # LRU cache implementation
├── batching.rs           # Continuous batching system
├── embed.rs              # Example embedding code
├── rerank.rs             # Example reranking code
└── models/               # Model-specific Candle implementations
    ├── llama.rs          # Llama architecture implementation
    ├── qwen3.rs          # Qwen3 architecture implementation
    ├── gemma.rs          # Gemma architecture implementation
    ├── mistral.rs        # Mistral architecture implementation
    ├── glm4.rs           # GLM4 architecture implementation
    ├── granite.rs        # Granite architecture implementation
    ├── olmo.rs           # OLMo architecture implementation
    └── quant_qwen3.rs    # Quantized Qwen3 implementation
```

Models newly supported by MAX work automatically; simply use the model identifier.
To add support for a new transformer architecture:
- Create a model implementation in `src/models/{architecture_name}.rs`
- Add a variant to the `ModelBackend` enum in `src/candle_inference.rs`
- Implement model loading and generation methods
- Update configuration parsing if needed
```bash
# Run all tests
cargo test

# Run tests with detailed output
cargo test -- --nocapture

# Format code
cargo fmt

# Run linter
cargo clippy

# Run performance tests
cargo test --release -- --ignored performance
```

Complete documentation is available in the Docs/ directory, covering all aspects of the system:
- Architecture overview
- API reference
- Model-specific implementations
- Configuration guides
- Performance optimization
Typical performance characteristics:
- Throughput: 5-50+ requests per second depending on model and configuration
- Latency: 10ms-2s+ depending on request type and model
- Memory Usage: Model-dependent + runtime overhead
- GPU Utilization: 30-90% with proper batch sizing
```dockerfile
# Example Dockerfile
FROM rust:latest as builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bullseye-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/axion /usr/local/bin/axion
EXPOSE 3000
CMD ["axion"]
```

- Supports Kubernetes deployments
- Configurable resource limits
- Health check endpoints for liveness/readiness probes
- Environment variable configuration for different environments
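Building and running the image from the example Dockerfile above might look like the following; the image tag is arbitrary and the environment variables are the ones documented under configuration:

```bash
# Build the image from the repository root
docker build -t axion:latest .

# Run the server, publishing the default port and passing configuration via env vars
docker run --rm -p 3000:3000 \
  -e MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" \
  -e RUST_LOG="axion=info" \
  axion:latest
```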
- Strict validation of all API parameters
- Size limits for input and output
- Sanitization of model identifiers
- Protection against injection attacks
- Optional API key authentication
- Rate limiting capabilities
- Network access controls
- Model access restrictions
This project is licensed under the terms specified in the LICENSE file.
We welcome contributions! Please see our contribution guidelines for details:
- Fork the repository
- Create a feature branch for your changes
- Add tests for new functionality
- Update documentation as needed
- Submit a pull request with a clear description
- Follow Rust coding standards and best practices
- Write comprehensive tests for new features
- Document public APIs thoroughly
- Maintain performance and security standards
- Enhanced monitoring and metrics
- Model quantization support
- Advanced caching strategies
- Improved error handling and recovery
- Multi-GPU support
- Model hot-swapping
- Custom backend plugins
- Advanced batching algorithms
- Distributed inference
- Enhanced security features
