Llama Stack


Quick Start | Documentation | OpenAI API Compatibility | Discord

Open-source agentic API server for building AI applications. OpenAI-compatible. Any model, any infrastructure.

Llama Stack is a drop-in replacement for the OpenAI API that you can run anywhere — your laptop, your datacenter, or the cloud. Use any OpenAI-compatible client or agentic framework. Swap between Llama, GPT, Gemini, Mistral, or any model without changing your application code.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)

What you get

  • Chat Completions & Embeddings — standard /v1/chat/completions, /v1/completions, and /v1/embeddings endpoints, compatible with any OpenAI client
  • Responses API — server-side agentic orchestration with tool calling, MCP server integration, and built-in file search (RAG) in a single API call (learn more)
  • Vector Stores & Files — /v1/vector_stores and /v1/files for managed document storage and search
  • Batches — /v1/batches for offline batch processing
  • Open Responses conformant — the Responses API implementation passes the Open Responses conformance test suite
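The Responses API bullet above can be sketched as a single request body. This is an illustrative sketch only: the payload shape follows the OpenAI Responses API convention, and the model name, vector store ID, and helper function are hypothetical placeholders, not values from this repository.

```python
# Illustrative sketch of one /v1/responses call that combines a model with
# built-in file search (RAG). All identifiers (model name, vector store ID)
# are hypothetical placeholders.
def build_responses_request(model: str, prompt: str, vector_store_id: str) -> dict:
    """Assemble a request body for POST /v1/responses with file search enabled."""
    return {
        "model": model,
        "input": prompt,
        "tools": [
            {"type": "file_search", "vector_store_ids": [vector_store_id]},
        ],
    }

body = build_responses_request(
    "llama-3.3-70b", "Summarize our design docs", "vs_abc123"
)
```

The server resolves the tool call and the document search in one round trip, so the client does not orchestrate retrieval itself.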

Use any model, use any infrastructure

Llama Stack has a pluggable provider architecture. Develop locally with Ollama, deploy to production with vLLM, or connect to a managed service — the API stays the same.
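To make the "same API" point concrete, here is a minimal sketch of switching providers purely through configuration. The endpoint URLs and model names are made-up examples, not values shipped with Llama Stack.

```python
# Minimal sketch: application code never changes between environments;
# only connection settings do. URLs and model names are made-up examples.
ENVIRONMENTS = {
    "dev": {"base_url": "http://localhost:8321/v1", "model": "llama3.2:3b"},
    "prod": {"base_url": "https://inference.example.com/v1", "model": "llama-3.3-70b"},
}

def client_settings(env: str) -> dict:
    """Return the OpenAI-client connection settings for the chosen environment."""
    return ENVIRONMENTS[env]
```

Pass the returned settings to any OpenAI-compatible client constructor; everything downstream of the client stays identical.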

┌─────────────────────────────────────────────────────────────────────────┐
│                           Llama Stack Server                            │
│                 (same API, same code, any environment)                  │
│                                                                         │
│  /v1/chat/completions  /v1/responses  /v1/vector_stores  /v1/files      │
│  /v1/embeddings  /v1/batches  /v1/models  /v1/connectors                │
├───────────────────┬──────────────────┬──────────────────────────────────┤
│ Inference         │ Vector stores    │ Tools & connectors               │
│ Ollama            │ FAISS            │ MCP servers                      │
│ vLLM, TGI         │ Milvus           │ Brave, Tavily (web search)       │
│ AWS Bedrock       │ Qdrant           │ File search (built-in RAG)       │
│ Azure OpenAI      │ PGVector         │                                  │
│ Fireworks         │ ChromaDB         │ File storage & processing        │
│ Together          │ Weaviate         │ Local filesystem, S3             │
│ ...15+ more       │ Elasticsearch    │ PDF, HTML (file processors)      │
│                   │ SQLite-vec       │                                  │
└───────────────────┴──────────────────┴──────────────────────────────────┘

See the provider documentation for the full list.

Get started

Install and run a Llama Stack server:

# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via uv
uv pip install llama-stack

# Start the server (uses the starter distribution with Ollama)
llama stack run

Then connect with any OpenAI client — Python, TypeScript, curl, or any framework that speaks the OpenAI API.

See the Quick Start guide for detailed setup.

Resources

Client SDKs:

Language     SDK
Python       llama-stack-client-python
TypeScript   llama-stack-client-typescript

Community

We hold regular community calls every Thursday at 09:00 AM PST — see the Community Event on Discord for details.


Thanks to all our amazing contributors!
