Structured Generation with Reasoning Parser in offline mode. #17638
Replies: 8 comments
-
Qwen3 uses the `<think>` tag to control whether to output its reasoning process. I guess the chat template that vLLM uses when running Qwen3 doesn't automatically add the tag. You could try using the
-
Is there any update on this on your end? @psych0v0yager
-
Loading an offline model with a reasoning_parser works (checked on v0.10.1.1 and v0.11.0), but I also hit the same problem that
-
Good question! The constraint is that structured generation applies to the entire output, which conflicts with freeform thinking.

Workaround: two-stage generation

- Stage 1: generate the thinking freeform, with a stop token at the end of the think block.
- Stage 2: append the thinking to the prompt and generate the structured answer under the JSON constraint.

This works because vLLM caches the KV state, so stage 2 reuses the thinking context.

Alternative: post-process extraction. Let the model generate freely, then regex-extract the JSON portion from the output. Anything beyond these workarounds would need backend changes.

The two-stage approach adds one extra generation pass but works reliably. We use similar patterns for synthetic data generation at Revolution AI.
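A sketch of that two-stage flow, written against a hypothetical `complete(prompt, **opts)` callable so the staging logic stands alone (the backend, prompt wording, and option names here are illustrative assumptions, not vLLM's API; with vLLM offline, stage 1 would roughly map to `SamplingParams(stop=[...])` and stage 2 to a guided-decoding JSON constraint):

```python
from typing import Callable

def two_stage_generate(complete: Callable[..., str], question: str) -> tuple[str, str]:
    """Stage 1: freeform thinking, stopped at </think>.
    Stage 2: structured answer, conditioned on the captured thinking.

    `complete` is a hypothetical backend callable (prompt, **opts) -> text;
    with vLLM offline it would wrap llm.generate(...).
    """
    prompt = f"Question: {question}\n<think>"
    # Stage 1: unconstrained decoding until the model closes its think block.
    thinking = complete(prompt, stop=["</think>"])
    # Stage 2: re-feed the full context (prefix caching makes this cheap)
    # and decode the answer under a JSON constraint.
    full_prefix = prompt + thinking + "</think>\n"
    answer_json = complete(full_prefix, json_schema={"type": "object"})
    return thinking, answer_json

# Toy backend standing in for a real LLM, just to exercise the flow.
def fake_backend(prompt: str, stop=None, json_schema=None) -> str:
    if stop == ["</think>"]:
        return "Austin is the capital of Texas."
    return '{"output": "Austin"}'

think, ans = two_stage_generate(fake_backend, "What is the capital of Texas?")
```

The key design point is that stage 2's prompt contains the entire stage-1 context verbatim, so the KV cache for the shared prefix can be reused.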
-
Structured generation with reasoning is powerful! At RevolutionAI (https://revolutionai.io) we use this pattern.

Offline mode approach:

```python
from vllm import LLM, SamplingParams
from pydantic import BaseModel

class ReasonedOutput(BaseModel):
    reasoning: str
    answer: str
    confidence: float

llm = LLM(model="...")
params = SamplingParams(temperature=0.7, max_tokens=1000)

# Two-stage: reason then structure
prompt = """Think step by step, then provide structured output.
Question: {question}
Reasoning:"""

output = llm.generate(prompt, params)
# Parse reasoning, then generate structured answer
```

Alternative: Outlines integration:

```python
from outlines import models, generate

model = models.VLLM("...")
gen = generate.json(model, ReasonedOutput)
```

The key is separating reasoning from structured output!
-
From my point of view, the clean mental model is that reasoning and constrained JSON generation are two different decoding regimes. Once you ask for both in one offline pass, the real requirement becomes grammar switching or staged decoding rather than a small configuration tweak. A two-phase path that preserves cached context between freeform reasoning and structured output feels like the practical workaround today, while native support would likely require backend changes around mid-generation control.
-
The limitation comes down to how vLLM handles token streaming and generation logic in offline mode. As of now, vLLM doesn't natively support multi-phase generation (e.g., reasoning followed by structured outputs) like you're describing. The reasoning parser and structured output are tightly integrated with the token streaming mechanism, which is more dynamic in online inference. Offline mode, however, prioritizes static batch token generation for performance, so intermediate reasoning (like `<think>` content) isn't handled as a separate phase.

A potential workaround could be splitting the task into two separate inference steps in offline mode. For instance:

1. Run an unconstrained pass to let the model produce its freeform reasoning.
2. Feed that reasoning back in a second pass that applies the structured JSON constraint to the final answer.

This approach mimics multi-phase generation manually. If you're generating large datasets, it's feasible to batch these requests programmatically. To implement this natively, backend modifications would be necessary to allow vLLM to alternate between freeform and structured generation in a single run, essentially a custom decoding strategy. You'd need to hook into the token streaming process to detect when the `</think>` tag closes and switch the active constraint at that point.
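The detection half of that hook can be prototyped outside vLLM. A minimal sketch, assuming tokens arrive as decoded text chunks (tag splitting across chunk boundaries is the tricky part; actually swapping the logits constraint mid-generation is what would need real backend support):

```python
class ThinkBlockWatcher:
    """Watches a stream of decoded text chunks and reports when the
    </think> tag has been fully emitted, even if split across chunks."""

    TAG = "</think>"

    def __init__(self):
        self._buffer = ""
        self.closed = False

    def feed(self, chunk: str) -> bool:
        """Feed one decoded chunk; returns True once </think> has appeared."""
        if self.closed:
            return True
        self._buffer += chunk
        if self.TAG in self._buffer:
            self.closed = True
        else:
            # Keep only a tail short enough to still contain a split tag.
            self._buffer = self._buffer[-(len(self.TAG) - 1):]
        return self.closed

watcher = ThinkBlockWatcher()
stream = ["Let me think", " about it...</th", "ink>", '{"output":']
switch_points = [watcher.feed(c) for c in stream]
# switch_points flips to True on the chunk that completes the tag
```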
-
Hey, I've been digging into structured generation with vLLM and Qwen models for a while now, so I'm glad you brought this up. The core issue with offline mode not supporting the reasoning parser or structured outputs with thinking steps is tied to how vLLM handles token generation and post-processing in batched offline scenarios. The reasoning parser, which extracts `<think>` content in online serving, isn't wired into the offline generation path.

A workaround in the current vLLM version might be to split the process into two phases manually. First, run an unconstrained generation pass to capture the raw thinking process (including the `<think>` tags), then post-process the output to separate the reasoning from the final answer.

If you're looking for a code snippet to start with, here's a rough idea for the post-processing step:

```python
import json

def extract_think_and_structure(raw_output):
    """Split a raw completion into its <think> content and a trailing JSON object."""
    think_start = raw_output.find("<think>")
    think_end = raw_output.find("</think>")
    has_block = think_start != -1 and think_end != -1
    think_content = raw_output[think_start + len("<think>"):think_end] if has_block else ""
    response_part = raw_output[think_end + len("</think>"):] if think_end != -1 else raw_output
    # Trim any prose around the JSON object before parsing.
    start, end = response_part.find("{"), response_part.rfind("}")
    structured_output = json.loads(response_part[start:end + 1]) if start != -1 and end > start else {}
    return think_content, structured_output
```

This is clunky and error-prone, so I'd say a proper backend fix in vLLM to support hybrid freeform-then-structured generation in offline mode is needed. Have you experimented with any multi-step generation pipelines like this? I'm curious if you've hit token limit issues or other bottlenecks with Qwen 3.
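A failure mode of find-based splitting is surrounding prose or nested objects confusing the slice boundaries. A brace-balancing scan is somewhat more robust; this is a small self-contained sketch (not a vLLM API), which ignores braces inside JSON strings while counting depth:

```python
import json

def extract_first_json_object(text: str):
    """Return the first balanced {...} object parsed from `text`, or None.

    Walks the string counting brace depth (skipping braces inside quoted
    strings) so nested objects and surrounding prose don't break the parse.
    """
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    return None

result = extract_first_json_object('Sure! {"output": {"city": "Austin"}} hope that helps')
```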
-
According to the Qwen Docs
https://qwen.readthedocs.io/en/latest/deployment/vllm.html
And the vLLM docs
https://docs.vllm.ai/en/latest/features/reasoning_outputs.html
It is currently not possible to use the reasoning parser and structured generation in offline mode.
What is currently blocking this feature? I would like to use the latest Qwen 3 to generate some synthetic data. Ideally, Qwen 3 would reason about the request, then output its response in structured JSON. Currently, when I apply structured JSON in offline mode, it does not generate any thinking. Likewise, there is currently no reasoning parser in vLLM's offline generation.
It would be nice to do the following:
Question: What is the capital of Texas?
Raw Response:
```
<think>
generated thinking
</think>
{"output": "Austin"}
```
TL;DR: apply freeform generation for the thinking phase, then structured generation for the final response. Can this be implemented with clever workarounds in the current version of vLLM, or will it require some backend modification?