Sending Requests#

This notebook is a quick-start guide to sending chat-completion requests to SGLang after installation. Once your server is running, interactive API documentation is available at http://localhost:30000/docs (Swagger UI), http://localhost:30000/redoc (ReDoc), and http://localhost:30000/openapi.json (OpenAPI spec, useful for AI agents). Replace 30000 with your port if you are using a different one.
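The OpenAPI spec can also be consumed programmatically, e.g. to discover which endpoints a server exposes. A minimal sketch; the fetch is shown commented out (it assumes a server on port 30000), and the helper is demonstrated offline on a hypothetical two-endpoint spec:

```python
def list_endpoints(spec: dict) -> list[str]:
    """Return the HTTP paths declared in an OpenAPI spec dict."""
    return sorted(spec.get("paths", {}).keys())


# Against a live server (port 30000 assumed), you could do:
#   import requests
#   spec = requests.get("http://localhost:30000/openapi.json").json()
#   print(list_endpoints(spec))

# Offline demonstration with a minimal, made-up spec:
sample_spec = {"paths": {"/v1/chat/completions": {}, "/generate": {}}}
print(list_endpoints(sample_spec))  # ['/generate', '/v1/chat/completions']
```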

Launch A Server#

[1]: 
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

# This is equivalent to running the following command in your terminal
# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \
 --host 0.0.0.0 --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}", process=server_process)
[2026-03-23 03:00:21] INFO utils.py:148: Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-03-23 03:00:21] INFO utils.py:151: Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-03-23 03:00:21] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-03-23 03:00:26] INFO utils.py:148: Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-03-23 03:00:26] INFO utils.py:151: Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-03-23 03:00:26] INFO utils.py:164: NumExpr defaulting to 16 threads.
/actions-runner/_work/sglang/sglang/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint. Example: sglang serve --model-path <model> [options]
  warnings.warn(
[2026-03-23 03:00:28] WARNING model_config.py:1098: Transformers version 5.3.0 is used for model type qwen2. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-23 03:00:28] INFO server_args.py:2239: Attention backend not specified. Use fa3 backend by default.
[2026-03-23 03:00:28] INFO server_args.py:3522: Set soft_watchdog_timeout since in CI
/actions-runner/_work/sglang/sglang/python/sglang/srt/entrypoints/http_server.py:175: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  from sglang.srt.utils.json_response import (
[2026-03-23 03:00:29] Transformers version 5.3.0 is used for model type qwen2. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-23 03:00:33] INFO utils.py:148: Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-03-23 03:00:33] INFO utils.py:151: Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-03-23 03:00:33] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-03-23 03:00:33] INFO utils.py:148: Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-03-23 03:00:33] INFO utils.py:151: Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-03-23 03:00:33] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-03-23 03:00:36] Transformers version 5.3.0 is used for model type qwen2. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-23 03:00:37] Transformers version 5.3.0 is used for model type qwen2. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.25it/s]
Compiling num tokens (num_tokens=4): 100%|██████████| 58/58 [00:03<00:00, 16.49it/s]
Capturing num tokens (num_tokens=4 avail_mem=117.21 GB): 100%|██████████| 58/58 [00:02<00:00, 22.89it/s]
/usr/local/lib/python3.10/dist-packages/fastapi/routing.py:116: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  response = await f(request)


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
These notebooks run in a CI environment, so the throughput shown is not representative of actual performance.

Using cURL#

[2]: 
import subprocess, json

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{{"model": "qwen/qwen2.5-0.5b-instruct", "messages": [{{"role": "user", "content": "What is the capital of France?"}}]}}'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)
{'id': '244ffb6e317340a3842d3f8cb32464a2', 'object': 'chat.completion', 'created': 1774234856, 'model': 'qwen/qwen2.5-0.5b-instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The capital of France is Paris.', 'reasoning_content': None, 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 151645}], 'usage': {'prompt_tokens': 36, 'total_tokens': 44, 'completion_tokens': 8, 'prompt_tokens_details': None, 'reasoning_tokens': 0}, 'metadata': {'weight_version': 'default'}}
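The JSON above follows the OpenAI chat-completion schema, so the reply and token usage can be pulled out with plain dict access. A small sketch using the fields from the response shown above (abbreviated to what we actually read):

```python
# Response dict as returned above, abbreviated to the fields we use.
response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The capital of France is Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 36, "completion_tokens": 8, "total_tokens": 44},
}

# The assistant's reply lives under choices[0].message.content.
answer = response["choices"][0]["message"]["content"]
usage = response["usage"]

print(answer)                      # The capital of France is Paris.
print(usage["completion_tokens"])  # 8
```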

Using Python Requests#

[3]: 
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())
{'id': '9d8dc55078bc4447b071aed65ec32ef0', 'object': 'chat.completion', 'created': 1774234856, 'model': 'qwen/qwen2.5-0.5b-instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The capital of France is Paris.', 'reasoning_content': None, 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 151645}], 'usage': {'prompt_tokens': 36, 'total_tokens': 44, 'completion_tokens': 8, 'prompt_tokens_details': None, 'reasoning_tokens': 0}, 'metadata': {'weight_version': 'default'}}

Using OpenAI Python Client#

[4]: 
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response)
ChatCompletion(id='aebf68023f234162af33da55d3f1a06f', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure, here are three countries and their respective capitals:\n\n1. **United States** - Washington, D.C.\n2. **Canada** - Ottawa\n3. **Australia** - Canberra', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=151645)], created=1774234857, model='qwen/qwen2.5-0.5b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=37, total_tokens=76, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

Streaming#

[5]: 
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Use stream=True for streaming responses
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

# Handle the streaming output
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Sure, here are three countries and their respective capitals:

1. **United States** - Washington, D.C.
2. **Canada** - Ottawa
3. **Australia** - Canberra
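In streaming mode each chunk carries only a delta, so a common pattern is to accumulate the deltas into the full reply while printing them. A minimal sketch using simulated delta strings standing in for chunk.choices[0].delta.content (the first and last chunks of a real stream may carry None):

```python
# Simulated delta contents, standing in for chunk.choices[0].delta.content.
deltas = [
    None,  # role-only chunk at the start of a real stream
    "Sure, here are three countries",
    " and their respective capitals:",
    None,  # final chunk carrying only finish_reason
]

full_text = ""
for delta in deltas:
    if delta:  # skip chunks with no content
        print(delta, end="", flush=True)
        full_text += delta

print()
print(repr(full_text))
```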

Using Native Generation APIs#

You can also use the native /generate endpoint with requests, which provides more flexibility. An API reference is available at Sampling Parameters.

[6]: 
import requests

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)

print_highlight(response.json())
{'text': ' Paris. It is the largest city in Europe and the second largest city in the world. It is located in the south of France, on the banks of the', 'output_ids': [12095, 13, 1084, 374, 279, 7772, 3283, 304, 4505, 323, 279, 2086, 7772, 3283, 304, 279, 1879, 13, 1084, 374, 7407, 304, 279, 9806, 315, 9625, 11, 389, 279, 13959, 315, 279], 'meta_info': {'id': '2ac8186071764d2085b25bf3cbe6dbd9', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 5, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 32, 'cached_tokens': 2, 'cached_tokens_details': None, 'dp_rank': None, 'e2e_latency': 0.2410253812558949, 'response_sent_to_client_ts': 1774234857.6178513}}
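Unlike the OpenAI-style endpoints, /generate reports generation metadata under meta_info. A sketch of checking why generation stopped, using field names taken from the response above (values abbreviated):

```python
# Subset of the /generate response shown above.
result = {
    "text": " Paris. It is the largest city in Europe ...",
    "meta_info": {
        "finish_reason": {"type": "length", "length": 32},
        "prompt_tokens": 5,
        "completion_tokens": 32,
        "cached_tokens": 2,
    },
}

reason = result["meta_info"]["finish_reason"]
if reason["type"] == "length":
    # Generation hit max_new_tokens, so the text may be cut off mid-sentence.
    print(f"Hit max_new_tokens ({reason['length']}); output may be truncated.")
elif reason["type"] == "stop":
    print("Model stopped naturally.")
```

Here finish_reason.type is "length" because max_new_tokens=32 was reached; raising the limit or adding stop strings changes when and why generation ends.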

Streaming#

[7]: 
import requests, json

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)
 Paris. It is the largest city in Europe and the second largest city in the world. It is located in the south of France, on the banks of the 
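The prev bookkeeping above can be factored into a reusable helper. A sketch that turns an iterable of raw SSE lines (as produced by response.iter_lines()) into incremental text pieces, demonstrated here against simulated lines rather than a live server; each event's "text" field holds the full output so far, so the helper emits only the new suffix:

```python
import json


def iter_deltas(lines):
    """Yield newly generated text from /generate SSE lines.

    Each "data:" event carries the cumulative text so far, so we
    emit only the suffix beyond what was already yielded and stop
    at the [DONE] sentinel.
    """
    prev = 0
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue
        if line == "data: [DONE]":
            break
        payload = json.loads(line[5:].strip())
        text = payload["text"]
        yield text[prev:]
        prev = len(text)


# Simulated SSE stream (bytes, like response.iter_lines() yields):
fake_stream = [
    b'data: {"text": " Paris"}',
    b'data: {"text": " Paris. It is"}',
    b"data: [DONE]",
]
print("".join(iter_deltas(fake_stream)))
```

With a live server you would pass response.iter_lines(decode_unicode=False) directly as the lines argument.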
[8]: 
terminate_process(server_process)