A high-performance inference server for large language models with OpenAI-compatible API endpoints. Now available for Windows, Linux, and macOS!
- 🪟 Windows: Full support with Visual Studio and MSVC
- 🐧 Linux: Native support with GCC/Clang
- 🍎 macOS: Native support with Apple Clang (Metal enabled automatically on Apple Silicon)
- 🎮 GPU Acceleration: NVIDIA CUDA and Vulkan support
- 📦 Easy Installation: Direct binary installation or build from source
- 🚀 Fast Inference: Built with llama.cpp for optimized model inference
- 🔗 OpenAI Compatible: Drop-in replacement for OpenAI API endpoints (see the client sketch just after this list)
- 📡 Streaming Support: Real-time streaming responses for chat completions
- 🎛️ Multi-Model Management: Load and manage multiple models simultaneously
- 📊 Real-time Metrics: Monitor completion performance with TPS, TTFT, and success rates
- ⚙️ Lazy Loading: Defer model loading until first request with `load_immediately=false`
- 🔧 Configurable: Flexible model loading parameters and inference settings
- 🔒 Authentication: API key and rate limiting support
- 🌐 Cross-Platform: Windows, Linux, and macOS native builds
- 📚 RAG Retrieval: Native FAISS vector store (default) with optional Qdrant backend
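Since the API mirrors OpenAI's endpoints, existing OpenAI client libraries can talk to the server by overriding the base URL. Here is a minimal Python sketch using the official `openai` package (v1+); the engine ID `my-model` and the default port 8080 are assumptions carried over from the examples later in this README:

```python
from openai import OpenAI

# Any placeholder API key works unless API-key auth is enabled in the server config.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-placeholder")

# Basic chat completion against an engine registered as "my-model".
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello, how are you today?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming: tokens arrive as server-sent events and are yielded chunk by chunk.
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Tell me a short story about a robot."}],
    stream=True,
    max_tokens=150,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```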
Linux System Requirements:
- Ubuntu 20.04+ or equivalent Linux distribution (CentOS 8+, Fedora 32+, Arch Linux)
- GCC 9+ or Clang 10+
- CMake 3.14+
- Git with submodule support
- At least 4GB RAM (8GB+ recommended for larger models)
- CUDA Toolkit 11.0+ (optional, for NVIDIA GPU acceleration)
- Vulkan SDK (optional, for alternative GPU acceleration)
Ubuntu/Debian:
```bash
# Update package list
sudo apt update

# Install essential build tools
sudo apt install -y build-essential cmake git pkg-config

# Install required libraries
sudo apt install -y libcurl4-openssl-dev libyaml-cpp-dev

# Optional: Install PoDoFo dependencies for PDF support
sudo apt install -y libfreetype6-dev libjpeg-dev libpng-dev libtiff-dev libxml2-dev libfontconfig1-dev

# Optional: Install FAISS dependencies
sudo apt install libopenblas-dev liblapack-dev

# Optional: Install CUDA for GPU support
# Follow NVIDIA's official installation guide for your distribution
```

CentOS/RHEL/Fedora:
```bash
# For CentOS/RHEL 8+
sudo dnf groupinstall "Development Tools"
sudo dnf install cmake git curl-devel yaml-cpp-devel

# For Fedora
sudo dnf install gcc-c++ cmake git libcurl-devel yaml-cpp-devel

# Optional: Install PoDoFo dependencies for PDF support (Fedora)
sudo dnf install freetype-devel libjpeg-devel libpng-devel libtiff-devel libxml2-devel fontconfig-devel

# Optional: Install FAISS dependencies
sudo dnf install openblas-devel lapack-devel
```

Arch Linux:
```bash
sudo pacman -S base-devel cmake git curl yaml-cpp

# Optional: Install PoDoFo dependencies for PDF support
sudo pacman -S freetype2 libjpeg-turbo libpng libtiff libxml2 fontconfig

# Optional: Install FAISS dependencies
sudo pacman -S openblas lapack
```

Note: Package manager installation will be available in future releases. For now, use the build-from-source method below.

1. Clone the Repository with Submodules:

FAISS is bundled as a submodule in `external/faiss`. If you don't clone with `--recursive`, FAISS will be disabled (the build creates a stub); re-run the submodule command, then reconfigure.
```bash
git clone https://github.com/kolosalai/kolosal-server.git --recursive
cd kolosal-server
```

2. Create Build Directory:
```bash
mkdir build && cd build
```

3. Configure Build:
Standard Build (CPU-only):
```bash
cmake -DCMAKE_BUILD_TYPE=Release ..
```

With CUDA Support:
```bash
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON ..
```

With Vulkan Support:
```bash
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON ..
```

With PoDoFo PDF Support (requires dependencies installed):
```bash
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_PODOFO=ON ..
```

With FAISS Support (requires dependencies installed):
```bash
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_FAISS=ON ..
```

Combined Options:
```bash
# CUDA + PoDoFo
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON -DUSE_PODOFO=ON ..

# Vulkan + PoDoFo
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON -DUSE_PODOFO=ON ..
```

Debug Build:
```bash
cmake -DCMAKE_BUILD_TYPE=Debug ..
```

4. Build the Project:
```bash
# Use all available CPU cores
make -j

# Or specify number of cores manually
make -j4
```

5. Verify Build:
```bash
# Check if the executable was created
cd Release && ls -la kolosal-server

# Test basic functionality
./kolosal-server --help
```

6. Install to System Path (Optional):
```bash
# Install binary to /usr/local/bin
sudo cp build/Release/kolosal-server /usr/local/bin/

# Make it executable
sudo chmod +x /usr/local/bin/kolosal-server

# Now you can run from anywhere
kolosal-server --help
```

Start the Server:
```bash
# From the build/Release directory
./kolosal-server

# Check where the config file is
./kolosal-server --config

# Or specify a config file
./kolosal-server --config ../config.yaml
```

Background Service:
```bash
# Run in the background
nohup ./kolosal-server > server.log 2>&1 &

# Check if running
ps aux | grep kolosal-server
```

Check Server Status:
```bash
# Test if the server is responding
curl http://localhost:8080/v1/health
```

Create Service File:
```bash
sudo tee /etc/systemd/system/kolosal-server.service > /dev/null << EOF
[Unit]
Description=Kolosal Server - LLM Inference Server
After=network.target

[Service]
Type=simple
User=kolosal
Group=kolosal
WorkingDirectory=/opt/kolosal-server
ExecStart=/opt/kolosal-server/kolosal-server --config /etc/kolosal-server/config.yaml
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF
```

Enable and Start Service:
```bash
# Create a dedicated user for the service
sudo useradd -r -s /bin/false kolosal

# Install binary and config
sudo mkdir -p /opt/kolosal-server /etc/kolosal-server
sudo cp build/Release/kolosal-server /opt/kolosal-server/
sudo cp config.example.yaml /etc/kolosal-server/config.yaml
sudo chown -R kolosal:kolosal /opt/kolosal-server

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable kolosal-server
sudo systemctl start kolosal-server

# Check status
sudo systemctl status kolosal-server
```

Common Build Issues:
- Missing dependencies:

```bash
# Check for missing shared libraries
ldd build/Release/kolosal-server

# Install missing development packages
sudo apt install -y libssl-dev libcurl4-openssl-dev
```

- CMake version too old:

```bash
# Install newer CMake from the Kitware APT repository
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main'  # replace 'focal' with your release codename
sudo apt update && sudo apt install cmake
```

- CUDA compilation errors:

```bash
# Verify CUDA installation
nvcc --version
nvidia-smi

# Set CUDA environment variables if needed
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
```

- Permission issues:

```bash
# Fix ownership
sudo chown -R $USER:$USER ./build

# Make the binary executable
chmod +x build/Release/kolosal-server
```
Performance Optimization:
- CPU Optimization:

```bash
# Build with native optimizations
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native" ..
```

- Memory Settings:

```bash
# For systems with limited RAM, reduce parallel build jobs
make -j2

# Set memory limits in the config
echo "server.max_memory_mb: 4096" >> config.yaml
```

- GPU Memory:

```bash
# Monitor GPU usage
watch nvidia-smi

# Adjust GPU layers in the model config:
# reduce n_gpu_layers if running out of VRAM
```
macOS System Requirements:

- macOS 10.15 (Catalina) or later
- Xcode Command Line Tools or Xcode
- CMake 3.14+
- Homebrew (recommended for dependency management)
Install Dependencies:

```bash
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install build tools and required libraries
brew install cmake git curl yaml-cpp

# Optional: Install PoDoFo dependencies for PDF support
brew install freetype jpeg libpng libtiff libxml2
```

Building:
```bash
git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server
git submodule update --init --recursive
mkdir build && cd build

# Standard build
cmake -DCMAKE_BUILD_TYPE=Release ..

# With PoDoFo PDF support (if dependencies are installed)
# cmake -DCMAKE_BUILD_TYPE=Release -DUSE_PODOFO=ON ..

# Metal acceleration is enabled automatically on macOS (Apple Silicon)

make -j$(sysctl -n hw.ncpu)
```

Running the Server:
```bash
./kolosal-server
```

Windows System Requirements:

- Windows 10/11
- Visual Studio 2019 or later
- CMake 3.20+
- vcpkg
- CUDA Toolkit (optional, for GPU acceleration)
1. Clone the repository with submodules:

```powershell
git clone https://github.com/kolosalai/kolosal-server.git --recursive
cd kolosal-server
```

2. Create a `vcpkg.json` file at the root of the project:

```json
{
  "name": "kolosal-server",
  "version-string": "1.0.0",
  "dependencies": [
    "curl",
    "fontconfig",
    "freetype",
    "libjpeg-turbo",
    "libpng",
    "openssl",
    "libxml2",
    "tiff",
    "zlib",
    "openblas",
    "lapack-reference"
  ]
}
```

3. Configure the build from the project root (`-B build` creates the build directory, so no separate `mkdir` step is needed; note the PowerShell backtick line continuations):

```powershell
cmake -S . -B build -G "Visual Studio 17 2022" -A x64 `
    -DCMAKE_TOOLCHAIN_FILE="$env:VCPKG_ROOT\scripts\buildsystems\vcpkg.cmake" `
    -DVCPKG_TARGET_TRIPLET=x64-windows `
    -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreadedDLL `
    -DCMAKE_POLICY_DEFAULT_CMP0091=NEW
```

4. Build the Release target:

```powershell
cmake --build build --config Release --target kolosal_server_exe
```
5. Run the server:

```powershell
.\build\Release\kolosal-server.exe
```

The server will start on http://localhost:8080 by default.
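Regardless of platform, a quick way to confirm the server is up is to poll the health endpoint. A minimal Python sketch, assuming the default port:

```python
import time

import requests

# Poll the health endpoint until the server responds (or give up after ~10s).
for _ in range(10):
    try:
        resp = requests.get("http://localhost:8080/v1/health", timeout=2)
        print(resp.status_code, resp.text)
        break
    except requests.ConnectionError:
        time.sleep(1)
```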
Kolosal Server supports configuration through JSON and YAML files for advanced setup including authentication, logging, model preloading, and server parameters.
The retrieval endpoints (/add_documents, /retrieve, /remove_documents, /list_documents, /info_documents) use a pluggable vector store:
- FAISS (default, in-process, zero external dependencies)
- Qdrant (optional external service)
If database.vector_database is omitted, FAISS is selected automatically.
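As a quick illustration of the retrieval workflow, here is a hypothetical Python sketch. The endpoint paths are the documented ones, but the payload field names are assumptions; consult the API Specification for the authoritative schema:

```python
import requests

BASE_URL = "http://localhost:8080"

# Add documents to the active vector store (FAISS by default).
# NOTE: this payload shape is illustrative, not authoritative.
docs = {"documents": [{"id": "doc-1", "text": "Kolosal Server supports native FAISS retrieval."}]}
print(requests.post(f"{BASE_URL}/add_documents", json=docs).json())

# Retrieve the documents most similar to a query.
query = {"query": "Which vector stores are supported?", "top_k": 3}
print(requests.post(f"{BASE_URL}/retrieve", json=query).json())
```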
The backend is selected in the server configuration:

```yaml
database:
  vector_database: faiss  # or qdrant
  faiss:
    index_type: Flat
    index_path: ./data/faiss_index
    dimensions: 1536
    normalize_vectors: true
    metric_type: IP  # IP + normalization approximates cosine
  qdrant:
    enabled: true
    host: localhost
    port: 6333
    collection_name: documents
    default_embedding_model: text-embedding-3-small
```

FAISS build notes:
- Controlled by the CMake option `USE_FAISS` (ON by default)
- GPU acceleration toggles automatically if CUDA is found and `USE_CUDA` is enabled
- Disable with `-DUSE_FAISS=OFF`
Example build enabling CUDA + FAISS:
```bash
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON -DUSE_FAISS=ON ..
```

Minimal configuration:

```yaml
server:
  port: "8080"

models:
  - id: "my-model"
    path: "./models/model.gguf"
    load_immediately: true
```

A fuller example with authentication, GPU offload, and metrics enabled:

```yaml
server:
  port: "8080"
  max_connections: 500
  worker_threads: 8

auth:
  enabled: true
  require_api_key: true
  api_keys:
    - "sk-your-api-key-here"

models:
  - id: "gpt-3.5-turbo"
    path: "./models/gpt-3.5-turbo.gguf"
    load_immediately: true
    main_gpu_id: 0
    load_params:
      n_ctx: 4096
      n_gpu_layers: 50

features:
  metrics: true  # Enable /metrics and /completion-metrics
```

For complete configuration documentation including all parameters, authentication setup, CORS configuration, and more examples, see the Configuration Guide.
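Before starting the server, it can be handy to sanity-check the YAML programmatically. A small Python sketch, assuming PyYAML and the config shape shown above:

```python
import yaml  # pip install pyyaml

# Load the config and print the keys the examples above rely on.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print("port:", cfg["server"]["port"])
for model in cfg.get("models", []):
    print(f"model {model['id']}: path={model['path']}, "
          f"load_immediately={model.get('load_immediately', True)}")
```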
Before using chat completions, you need to add a model engine:
```bash
curl -X POST http://localhost:8080/engines \
  -H "Content-Type: application/json" \
  -d '{
    "engine_id": "my-model",
    "model_path": "path/to/your/model.gguf",
    "load_immediately": true,
    "n_ctx": 2048,
    "n_gpu_layers": 0,
    "main_gpu_id": 0
  }'
```

For faster startup times, you can defer model loading until first use:
```bash
curl -X POST http://localhost:8080/engines \
  -H "Content-Type: application/json" \
  -d '{
    "engine_id": "my-model",
    "model_path": "https://huggingface.co/model-repo/model.gguf",
    "load_immediately": false,
    "n_ctx": 4096,
    "n_gpu_layers": 30,
    "main_gpu_id": 0
  }'
```

Basic chat completion:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you today?"
      }
    ],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

Response:
{ "choices": [ { "finish_reason": "stop", "index": 0, "message": { "content": "Hello! I'm doing well, thank you for asking. How can I help you today?", "role": "assistant" } } ], "created": 1749981228, "id": "chatcmpl-80HTkM01z7aaaThFbuALkbTu", "model": "my-model", "object": "chat.completion", "system_fingerprint": "fp_4d29efe704", "usage": { "completion_tokens": 15, "prompt_tokens": 9, "total_tokens": 24 } }curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Accept: text/event-stream" \ -d '{ "model": "my-model", "messages": [ { "role": "user", "content": "Tell me a short story about a robot." } ], "stream": true, "temperature": 0.8, "max_tokens": 150 }'Response (Server-Sent Events):
data: {"choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"} data: {"choices":[{"delta":{"content":"Once"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"} data: {"choices":[{"delta":{"content":" upon"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"} data: {"choices":[{"delta":{"content":""},"finish_reason":"stop","index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"} data: [DONE] curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "my-model", "messages": [ { "role": "system", "content": "You are a helpful programming assistant." }, { "role": "user", "content": "How do I create a simple HTTP server in Python?" }, { "role": "assistant", "content": "You can create a simple HTTP server in Python using the built-in http.server module..." }, { "role": "user", "content": "Can you show me the code?" } ], "stream": false, "temperature": 0.7, "max_tokens": 200 }'curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "my-model", "messages": [ { "role": "user", "content": "What is the capital of France?" } ], "stream": false, "temperature": 0.1, "top_p": 0.9, "max_tokens": 50, "seed": 42, "presence_penalty": 0.0, "frequency_penalty": 0.0 }'curl -X POST http://localhost:8080/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "my-model", "prompt": "The future of artificial intelligence is", "stream": false, "temperature": 0.7, "max_tokens": 100 }'Response:
{ "choices": [ { "finish_reason": "stop", "index": 0, "text": " bright and full of possibilities. As we continue to advance in machine learning and deep learning technologies, we can expect to see significant improvements in various fields..." } ], "created": 1749981288, "id": "cmpl-80HTkM01z7aaaThFbuALkbTu", "model": "my-model", "object": "text_completion", "usage": { "completion_tokens": 25, "prompt_tokens": 8, "total_tokens": 33 } }curl -X POST http://localhost:8080/v1/completions \ -H "Content-Type: application/json" \ -H "Accept: text/event-stream" \ -d '{ "model": "my-model", "prompt": "Write a haiku about programming:", "stream": true, "temperature": 0.8, "max_tokens": 50 }'Response (Server-Sent Events):
data: {"choices":[{"finish_reason":"","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"} data: {"choices":[{"finish_reason":"","index":0,"text":"Code"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"} data: {"choices":[{"finish_reason":"","index":0,"text":" flows"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"} data: {"choices":[{"finish_reason":"stop","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"} data: [DONE] curl -X POST http://localhost:8080/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "my-model", "prompt": [ "The weather today is", "In other news," ], "stream": false, "temperature": 0.5, "max_tokens": 30 }'curl -X POST http://localhost:8080/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "my-model", "prompt": "Explain quantum computing:", "stream": false, "temperature": 0.2, "top_p": 0.9, "max_tokens": 100, "seed": 123, "presence_penalty": 0.0, "frequency_penalty": 0.1 }'curl -X GET http://localhost:8080/v1/enginescurl -X GET http://localhost:8080/engines/my-model/statuscurl -X DELETE http://localhost:8080/engines/my-modelThe server provides real-time completion metrics for monitoring performance and usage:
```bash
curl -X GET http://localhost:8080/completion-metrics
```

Response:
{ "completion_metrics": { "summary": { "total_requests": 15, "completed_requests": 14, "failed_requests": 1, "success_rate_percent": 93.33, "total_input_tokens": 120, "total_output_tokens": 350, "avg_turnaround_time_ms": 1250.5, "avg_tps": 12.8, "avg_output_tps": 8.4, "avg_ttft_ms": 245.2, "avg_rps": 0.85 }, "per_engine": [ { "model_name": "my-model", "engine_id": "default", "total_requests": 15, "completed_requests": 14, "failed_requests": 1, "total_input_tokens": 120, "total_output_tokens": 350, "tps": 12.8, "output_tps": 8.4, "avg_ttft": 245.2, "rps": 0.85, "last_updated": "2025-06-16T17:04:12.123Z" } ], "timestamp": "2025-06-16T17:04:12.123Z" } }Alternative endpoints:
```bash
# OpenAI-style endpoint
curl -X GET http://localhost:8080/v1/completion-metrics

# Alternative path
curl -X GET http://localhost:8080/completion/metrics
```

| Metric | Description |
|---|---|
| `total_requests` | Total number of completion requests received |
| `completed_requests` | Number of successfully completed requests |
| `failed_requests` | Number of requests that failed |
| `success_rate_percent` | Success rate as a percentage |
| `total_input_tokens` | Total input tokens processed |
| `total_output_tokens` | Total output tokens generated |
| `avg_turnaround_time_ms` | Average time from request to completion (ms) |
| `avg_tps` | Average tokens per second (input + output) |
| `avg_output_tps` | Average output tokens per second |
| `avg_ttft_ms` | Average time to first token (ms) |
| `avg_rps` | Average requests per second |
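For continuous monitoring, the metrics endpoint can be polled programmatically. A minimal Python sketch using the field names from the response shown above:

```python
import time

import requests

METRICS_URL = "http://localhost:8080/completion-metrics"

# Print a one-line summary every 5 seconds.
while True:
    summary = requests.get(METRICS_URL).json()["completion_metrics"]["summary"]
    print(
        f"requests={summary['total_requests']} "
        f"success={summary['success_rate_percent']}% "
        f"tps={summary['avg_tps']} ttft={summary['avg_ttft_ms']}ms"
    )
    time.sleep(5)
```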
PowerShell:

```powershell
# Get completion metrics
$metrics = Invoke-RestMethod -Uri "http://localhost:8080/completion-metrics" -Method GET
Write-Output "Success Rate: $($metrics.completion_metrics.summary.success_rate_percent)%"
Write-Output "Average TPS: $($metrics.completion_metrics.summary.avg_tps)"
```

Health check:

```bash
curl -X GET http://localhost:8080/v1/health
```

For Windows users, here are PowerShell equivalents:
Create an engine:

```powershell
$body = @{
    engine_id = "my-model"
    model_path = "C:\path\to\model.gguf"
    load_immediately = $true
    n_ctx = 2048
    n_gpu_layers = 0
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://localhost:8080/engines" -Method POST -Body $body -ContentType "application/json"
```

Chat completion:

```powershell
$body = @{
    model = "my-model"
    messages = @(
        @{
            role = "user"
            content = "Hello, how are you?"
        }
    )
    stream = $false
    temperature = 0.7
    max_tokens = 100
} | ConvertTo-Json -Depth 3

Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"
```

Text completion:

```powershell
$body = @{
    model = "my-model"
    prompt = "The future of AI is"
    stream = $false
    temperature = 0.7
    max_tokens = 50
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://localhost:8080/v1/completions" -Method POST -Body $body -ContentType "application/json"
```

Chat Completion Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | The ID of the model to use |
| `messages` | array | required | List of message objects |
| `stream` | boolean | false | Whether to stream responses |
| `temperature` | number | 1.0 | Sampling temperature (0.0-2.0) |
| `top_p` | number | 1.0 | Nucleus sampling parameter |
| `max_tokens` | integer | 128 | Maximum tokens to generate |
| `seed` | integer | random | Random seed for reproducible outputs |
| `presence_penalty` | number | 0.0 | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | number | 0.0 | Frequency penalty (-2.0 to 2.0) |
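To make the `seed` parameter concrete: two requests with identical payloads and a fixed seed should return the same text, assuming the engine samples deterministically for a given seed. A small Python sketch:

```python
import requests

URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Name three prime numbers."}],
    "temperature": 0.1,
    "max_tokens": 50,
    "seed": 42,  # fixed seed -> reproducible sampling
    "stream": False,
}

# Send the same request twice and compare the outputs.
first = requests.post(URL, json=payload).json()["choices"][0]["message"]["content"]
second = requests.post(URL, json=payload).json()["choices"][0]["message"]["content"]
print(first == second, first)
```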
Text Completion Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | The ID of the model to use |
| `prompt` | string/array | required | Text prompt or array of prompts |
| `stream` | boolean | false | Whether to stream responses |
| `temperature` | number | 1.0 | Sampling temperature (0.0-2.0) |
| `top_p` | number | 1.0 | Nucleus sampling parameter |
| `max_tokens` | integer | 16 | Maximum tokens to generate |
| `seed` | integer | random | Random seed for reproducible outputs |
| `presence_penalty` | number | 0.0 | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | number | 0.0 | Frequency penalty (-2.0 to 2.0) |
Message Object:

| Field | Type | Description |
|---|---|---|
| `role` | string | Role: "system", "user", or "assistant" |
| `content` | string | The content of the message |
Engine Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `engine_id` | string | required | Unique identifier for the engine |
| `model_path` | string | required | Path to the GGUF model file or URL |
| `load_immediately` | boolean | true | Whether to load the model immediately or defer until first use |
| `n_ctx` | integer | 4096 | Context window size |
| `n_gpu_layers` | integer | 100 | Number of layers to offload to GPU |
| `main_gpu_id` | integer | 0 | Primary GPU device ID |
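Putting the engine parameters together, here is a Python sketch that registers a lazily loaded engine and then polls the status endpoint. The status response format is not documented here, so the sketch simply prints whatever the server returns:

```python
import time

import requests

BASE_URL = "http://localhost:8080"

# Register an engine without loading the weights up front.
engine = {
    "engine_id": "my-model",
    "model_path": "./models/model.gguf",
    "load_immediately": False,
    "n_ctx": 4096,
    "n_gpu_layers": 0,
    "main_gpu_id": 0,
}
requests.post(f"{BASE_URL}/engines", json=engine).raise_for_status()

# Poll the documented status endpoint a few times.
for _ in range(5):
    print(requests.get(f"{BASE_URL}/engines/my-model/status").json())
    time.sleep(2)
```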
The server returns standard HTTP status codes and JSON error responses:
{ "error": { "message": "Model 'non-existent-model' not found or could not be loaded", "type": "invalid_request_error", "param": null, "code": null } }Common error codes:
- `400` - Bad Request (invalid JSON, missing parameters)
- `404` - Not Found (model/engine not found)
- `500` - Internal Server Error (inference failures)
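A minimal Python sketch of client-side error handling, based on the error shape shown above:

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "non-existent-model",
          "messages": [{"role": "user", "content": "Hi"}]},
)

if resp.ok:
    print(resp.json()["choices"][0]["message"]["content"])
else:
    # Error responses carry an OpenAI-style "error" object.
    err = resp.json()["error"]
    print(f"HTTP {resp.status_code} ({err['type']}): {err['message']}")
```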
For developers looking to contribute to or extend Kolosal Server, comprehensive documentation is available in the docs/ directory:
- Developer Guide - Complete setup, architecture, and development workflows
- Configuration Guide - Complete server configuration in JSON and YAML formats
- Architecture Overview - Detailed system design and component relationships
- Adding New Routes - Step-by-step guide for implementing API endpoints
- Adding New Models - Guide for creating data models and JSON handling
- API Specification - Complete API reference with examples
- Documentation Index - Complete documentation overview
- Project Structure - Understanding the codebase
- Contributing Guidelines - How to contribute
Kolosal Server is built on top of excellent open-source projects and we want to acknowledge their contributions:
This project is powered by llama.cpp, developed by Georgi Gerganov and the ggml-org community. llama.cpp provides the high-performance inference engine that makes Kolosal Server possible.
- Project: https://github.com/ggml-org/llama.cpp
- License: MIT License
- Description: Inference of Meta's LLaMA model (and others) in pure C/C++
We extend our gratitude to the llama.cpp team for their incredible work on optimized LLM inference, which forms the foundation of our server's performance capabilities.
- yaml-cpp: YAML parsing and emitting library
- nlohmann/json: JSON library for Modern C++
- libcurl: Client-side URL transfer library
- prometheus-cpp: Prometheus metrics library for C++
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
We welcome contributions! Please see our Developer Documentation for detailed guides on:
- Getting Started: Developer Guide
- Understanding the System: Architecture Overview
- Adding Features: Route and Model guides
- API Changes: API Specification
- Fork the repository
- Follow the Developer Guide for setup
- Create a feature branch
- Implement your changes following our guides
- Add tests and update documentation
- Submit a Pull Request
- Issues: Report bugs and feature requests on GitHub Issues
- Documentation: Check the docs/ directory for comprehensive guides
- Discussions: Join Kolosal AI Discord for questions and community support