Structured Generation with Reasoning Parser in offline mode. #17638
Replies: 8 comments
-
Qwen3 uses the `<think>` tag to control whether to output its reasoning process. I guess the chat template that vLLM uses when running Qwen3 doesn't automatically add the tag. You could try using the
-
Is there any update on this on your end? @psych0v0yager
-
Loading an offline model with a reasoning_parser works (checked on v0.10.1.1 and v0.11.0), but I also hit the same problem that
-
Good question! The constraint is that structured generation applies to the entire output, which conflicts with freeform thinking.

Workaround: two-stage generation

- Stage 1: generate the thinking freeform, with a stop token at the end of the think block.
- Stage 2: append the thinking to the prompt and generate the structured answer under the JSON constraint.

This works because vLLM caches the KV state, so stage 2 reuses the thinking context.

Alternative: post-process extraction. Let the model generate freely, then regex-extract the JSON portion from the output. Anything beyond these workarounds would need backend changes.

The two-stage approach adds one extra generation pass but works reliably. We use similar patterns for synthetic data generation at Revolution AI.
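A sketch of that two-stage flow, written against a hypothetical `complete(prompt, **opts)` callable so the staging logic stands alone (the backend, prompt wording, and option names here are illustrative assumptions, not vLLM's API; with vLLM offline, stage 1 would roughly map to `SamplingParams(stop=[...])` and stage 2 to a guided-decoding JSON constraint):

```python
from typing import Callable

def two_stage_generate(complete: Callable[..., str], question: str) -> tuple[str, str]:
    """Stage 1: freeform thinking, stopped at </think>.
    Stage 2: structured answer, conditioned on the captured thinking.

    `complete` is a hypothetical backend callable (prompt, **opts) -> text;
    with vLLM offline it would wrap llm.generate(...).
    """
    prompt = f"Question: {question}\n<think>"
    # Stage 1: unconstrained decoding until the model closes its think block.
    thinking = complete(prompt, stop=["</think>"])
    # Stage 2: re-feed the full context (prefix caching makes this cheap)
    # and decode the answer under a JSON constraint.
    full_prefix = prompt + thinking + "</think>\n"
    answer_json = complete(full_prefix, json_schema={"type": "object"})
    return thinking, answer_json

# Toy backend standing in for a real LLM, just to exercise the flow.
def fake_backend(prompt: str, stop=None, json_schema=None) -> str:
    if stop == ["</think>"]:
        return "Austin is the capital of Texas."
    return '{"output": "Austin"}'

think, ans = two_stage_generate(fake_backend, "What is the capital of Texas?")
```

The key design point is that stage 2's prompt contains the entire stage-1 context verbatim, so the KV cache for the shared prefix can be reused.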
-
Structured generation with reasoning is powerful! At RevolutionAI (https://revolutionai.io) we use this pattern.

Offline mode approach:

```python
from vllm import LLM, SamplingParams
from pydantic import BaseModel

class ReasonedOutput(BaseModel):
    reasoning: str
    answer: str
    confidence: float

llm = LLM(model="...")
params = SamplingParams(temperature=0.7, max_tokens=1000)

# Two-stage: reason then structure
prompt = """Think step by step, then provide structured output.
Question: {question}
Reasoning:"""

output = llm.generate(prompt, params)
# Parse reasoning, then generate structured answer
```

Alternative: Outlines integration:

```python
from outlines import models, generate

model = models.VLLM("...")
gen = generate.json(model, ReasonedOutput)
```

The key is separating reasoning from structured output!
-
From my point of view, the clean mental model is that reasoning and constrained JSON generation are two different decoding regimes. Once you ask for both in one offline pass, the real requirement becomes grammar switching or staged decoding rather than a small configuration tweak. A two-phase path that preserves cached context between freeform reasoning and structured output feels like the practical workaround today, while native support would likely require backend changes around mid-generation control.
-
The limitation comes down to how vLLM handles token streaming and generation logic in offline mode. As of now, vLLM doesn't natively support multi-phase generation (e.g., reasoning followed by structured outputs) like you're describing. The reasoning parser and structured output are tightly integrated with the token streaming mechanism, which is more dynamic in online inference. Offline mode, however, prioritizes static batch token generation for performance, so intermediate reasoning (like `<think>` content) isn't handled as a separate phase.

A potential workaround could be splitting the task into two separate inference steps in offline mode. For instance:

1. Run an unconstrained pass to let the model produce its freeform reasoning.
2. Feed that reasoning back in a second pass that applies the structured JSON constraint to the final answer.

This approach mimics multi-phase generation manually. If you're generating large datasets, it's feasible to batch these requests programmatically. To implement this natively, backend modifications would be necessary to allow vLLM to alternate between freeform and structured generation in a single run, essentially a custom decoding strategy. You'd need to hook into the token streaming process to detect when the `</think>` tag closes and switch the active constraint at that point.
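The detection half of that hook can be prototyped outside vLLM. A minimal sketch, assuming tokens arrive as decoded text chunks (tag splitting across chunk boundaries is the tricky part; actually swapping the logits constraint mid-generation is what would need real backend support):

```python
class ThinkBlockWatcher:
    """Watches a stream of decoded text chunks and reports when the
    </think> tag has been fully emitted, even if split across chunks."""

    TAG = "</think>"

    def __init__(self):
        self._buffer = ""
        self.closed = False

    def feed(self, chunk: str) -> bool:
        """Feed one decoded chunk; returns True once </think> has appeared."""
        if self.closed:
            return True
        self._buffer += chunk
        if self.TAG in self._buffer:
            self.closed = True
        else:
            # Keep only a tail short enough to still contain a split tag.
            self._buffer = self._buffer[-(len(self.TAG) - 1):]
        return self.closed

watcher = ThinkBlockWatcher()
stream = ["Let me think", " about it...</th", "ink>", '{"output":']
switch_points = [watcher.feed(c) for c in stream]
# switch_points flips to True on the chunk that completes the tag
```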
-
Hey, I've been digging into structured generation with vLLM and Qwen models for a while now, so I'm glad you brought this up. The core issue with offline mode not supporting the reasoning parser or structured outputs with thinking steps is tied to how vLLM handles token generation and post-processing in batched offline scenarios. The reasoning parser, which extracts `<think>` content in online serving, isn't wired into the offline generation path.

A workaround in the current vLLM version might be to split the process into two phases manually. First, run an unconstrained generation pass to capture the raw thinking process (including the `<think>` tags), then post-process the output to separate the reasoning from the final answer.

If you're looking for a code snippet to start with, here's a rough idea for the post-processing step:

```python
import json

def extract_think_and_structure(raw_output):
    """Split a raw completion into its <think> content and a trailing JSON object."""
    think_start = raw_output.find("<think>")
    think_end = raw_output.find("</think>")
    has_block = think_start != -1 and think_end != -1
    think_content = raw_output[think_start + len("<think>"):think_end] if has_block else ""
    response_part = raw_output[think_end + len("</think>"):] if think_end != -1 else raw_output
    # Trim any prose around the JSON object before parsing.
    start, end = response_part.find("{"), response_part.rfind("}")
    structured_output = json.loads(response_part[start:end + 1]) if start != -1 and end > start else {}
    return think_content, structured_output
```

This is clunky and error-prone, so I'd say a proper backend fix in vLLM to support hybrid freeform-then-structured generation in offline mode is needed. Have you experimented with any multi-step generation pipelines like this? I'm curious if you've hit token limit issues or other bottlenecks with Qwen 3.
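A failure mode of find-based splitting is surrounding prose or nested objects confusing the slice boundaries. A brace-balancing scan is somewhat more robust; this is a small self-contained sketch (not a vLLM API), which ignores braces inside JSON strings while counting depth:

```python
import json

def extract_first_json_object(text: str):
    """Return the first balanced {...} object parsed from `text`, or None.

    Walks the string counting brace depth (skipping braces inside quoted
    strings) so nested objects and surrounding prose don't break the parse.
    """
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    return None

result = extract_first_json_object('Sure! {"output": {"city": "Austin"}} hope that helps')
```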
-
According to the Qwen Docs
https://qwen.readthedocs.io/en/latest/deployment/vllm.html
And the vLLM docs
https://docs.vllm.ai/en/latest/features/reasoning_outputs.html
It is currently not possible to use the reasoning parser and structured generation in offline mode.
What is currently blocking this feature? I would like to use the latest Qwen 3 to generate some synthetic data. Ideally, Qwen 3 would reason about the request, then output its response in structured JSON. Currently, when I apply structured JSON in offline mode, it does not generate any thinking. Likewise, there is currently no reasoning parser in vLLM's offline generation.
It would be nice to do the following:
Question: What is the capital of Texas?
Raw Response:
```
<think>
generated thinking
</think>
{"output": "Austin"}
```
TL;DR: apply freeform generation for the thinking phase, then structured generation for the final response. Can this be implemented with clever workarounds in the current version of vLLM, or will it require some backend modification?