Your AI returns broken JSON? Put this in between.
Works with any AI model: ChatGPT, Claude, Gemini, Llama. Zero setup.
Automatically extracts JSON from markdown/text, repairs common AI mistakes, validates structure. Returns clean data when successful, detailed feedback for retries when not.
Quick Links: Fast Track (2 min) • Why This Tool? • Code Example • Install • Full Documentation ↓
Want to start using this right away? Here's how:
- Download the `ai_json_cleanroom.py` file to your project
- Tell your AI coding assistant: "When I receive JSON from an AI response, process it through `validate_ai_json()` from `ai_json_cleanroom.py` first"
- Done. Your AI assistant (ChatGPT, Claude, Copilot, Cursor) will handle the integration
Ready in 2 minutes. Works immediately.
Show me the code → • Why do I need this? →
The situation: You request JSON from your AI. Sometimes you receive:
| What you get | What breaks |
|---|---|
| `Sure! Here's the JSON: {"name": "Alice"}` | Extra text crashes `json.loads()` |
| `{'name': 'Alice'}` | Python quotes instead of JSON |
| `{"users": [{"id": 1}, {"i` | Truncated mid-response (token limit) |
Current solution: Try/catch blocks, regex patterns, manual fixes, repeated API calls.
This tool: Handles all cases automatically. One function call.
```bash
# Standard installation
git clone https://github.com/jordicor/ai-json-cleanroom.git
cd ai-json-cleanroom
pip install -e .

# Optional: 3.6x faster parsing
pip install orjson
```

Ready. Import and use:

```python
from ai_json_cleanroom import validate_ai_json
```
```python
from ai_json_cleanroom import validate_ai_json

# Anything your AI returns (messy, wrapped, incomplete)
ai_response = "Here's your data:\n```json\n{'name': 'Alice', age: 30} // Python-style syntax\n```\n"

# One line to clean and validate
result = validate_ai_json(ai_response)

if result.json_valid:
    print(result.data)    # Clean: {'name': 'Alice', 'age': 30}
else:
    print(result.errors)  # Detailed error information
```

Done. No configuration needed. It works out of the box.
Check result.warnings to see what was fixed automatically.
The cleaner automatically:
- Found the JSON inside markdown code fence
- Fixed single quotes to double quotes
- Added quotes to the unquoted key `age`
- Removed the inline comment
- Validated the final structure
Processing time: ~1ms. Zero configuration required.
Useful tip: Check result.likely_truncated to detect when the AI hit its token limit. This saves unnecessary retry API calls.
That's everything you need. The tool works immediately with smart defaults.
Everything below is optional documentation for:
- Understanding how the tool works internally
- Advanced configuration options
- Framework integrations (LangChain, Instructor, etc.)
- Your AI assistant to read and understand the full API
For most users: The sections above are sufficient. Start building.
Want to learn more? Continue reading below.
💡 Found this useful? Star the repo ⭐ to help others discover it!
If you've worked with AI models, you know the frustration. You ask for JSON, and what do you get?
Sometimes it's wrapped in a friendly explanation. Sometimes it has Python-style single quotes. Sometimes it just... stops mid-array because it hit the token limit. And your json.loads() crashes. Again.
This is a common scenario when working with AI models. That's why AI JSON Cleanroom exists: to handle the messy reality of AI outputs so you can focus on building.
Here's what actually happens when you ask an AI model for JSON:
| Your Request | What You Expect | What You Actually Get | Why It Happens |
|---|---|---|---|
| "Return user data as JSON" | `{"name": "Alice"}` | `Sure! Here's the JSON: {"name": "Alice"}` | AI models are trained to be helpful and conversational |
| "Give me valid JSON only" | `{"active": true}` | `{'active': True}` | Model confusion between Python and JSON syntax |
| "Return a large dataset" | Complete JSON | `{"data": [{"id": 1}, {"id": 2}, {"i` | Token limit reached mid-generation |
| "Format as JSON object" | `{"text": "He said \"hi\""}` | `{"text": "He said "hi""}` | Improper quote escaping |
| "Output JSON with comments" | Valid JSON | `{name: "Alice", age: 30}` | JavaScript object literal syntax |
| "Generate configuration" | Clean JSON | `{"items": [1, 2, 3,]}` | Trailing commas (valid in JS/Python, not JSON) |
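To see these failure modes concretely, here is a small stdlib-only snippet (independent of this library) showing that `json.loads()` rejects every row of the table above:

```python
import json

# Each sample mirrors a row of the table above; all are invalid JSON.
samples = [
    'Sure! Here\'s the JSON: {"name": "Alice"}',  # conversational wrapper
    "{'active': True}",                           # Python quotes/constants
    '{"data": [{"id": 1}, {"i',                   # truncated mid-generation
    '{"text": "He said "hi""}',                   # unescaped inner quotes
    '{name: "Alice", age: 30}',                   # unquoted keys
    '{"items": [1, 2, 3,]}',                      # trailing comma
]

for sample in samples:
    try:
        json.loads(sample)
        print("parsed OK")
    except json.JSONDecodeError as e:
        print(f"rejected: {e.msg}")
```

All six samples raise `json.JSONDecodeError`, which is exactly the gap this library fills.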
Why existing solutions fall short:
- `json.loads()`: Throws exceptions on malformed input, no context provided
- LangChain parsers: Validate structure but don't repair common AI mistakes
- Instructor/Pydantic: Excellent for type mapping, but require clean JSON first
- Custom regex: Brittle, incomplete, and maintenance-heavy
AI JSON Cleanroom is a production-ready, zero-dependency (stdlib only) JSON cleaner designed specifically for AI outputs. It acts as a post-processing layer that extracts, repairs, validates, and provides structured feedback.
Key Benefits:
- Smart Extraction - Automatically finds JSON in markdown, code blocks, or mixed text
- Conservative Repair - Fixes common AI mistakes without corrupting your data
- Truncation Detection - Knows when output was cut off (saves you API calls!)
- Schema Validation - Validate structure with JSON Schema or simple path rules
- Non-Throwing API - Always returns a result, never crashes your pipeline
- Performance - 3.6x faster parsing with optional `orjson` (but works fine without it)
- Granular Control - Choose exactly which repairs to apply (or use smart defaults)
Pro Tip: Start with default options - they handle 95% of AI quirks. Only customize when you hit specific issues.
Your code crashes when parsing the AI's response
- You ask the AI for JSON data
- The AI returns something that looks like JSON
- But `json.loads()` throws an error and your script stops
- You're tired of try-except blocks that don't tell you what went wrong
The AI wraps the JSON in extra text
- Instead of just `{"name": "Alice"}`, you get: `Here's the data you requested: {"name": "Alice"} Let me know if you need anything else!`
- Your parser fails because there's text before/after the actual JSON
- You've tried telling the AI "return ONLY JSON" but it keeps adding explanations anyway
- You don't want to write regex patterns to extract the JSON part
The response is incomplete and you don't know why
- Sometimes the JSON just... stops: `{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "a`
- Your code crashes with confusing error messages
- You retry multiple times, same problem
- The issue: AIs have a maximum response length (the token limit). When they hit it, they stop mid-sentence
- This tool detects when JSON was cut off and tells you immediately - so you know to ask for a shorter response or increase the limit
The JSON "looks right" but still fails to parse
- You can see the data structure clearly
- But Python complains about "invalid syntax" or "expecting property name"
- Common hidden issues:
  - Mixed quote styles: `{'name': "Alice"}` (Python allows both, JSON only allows double quotes)
  - Trailing commas: `[1, 2, 3,]` (JavaScript/Python allow this, JSON doesn't)
  - Comments: `{"name": "Alice" // her name}` (many languages allow comments, JSON doesn't)
  - Python booleans: `{"active": True}` instead of `{"active": true}`
- These are easy mistakes for AIs to make, and hard to spot by eye
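A quick way to see why these slips are so natural: the same text is often a perfectly valid Python literal while being invalid JSON. A stdlib-only illustration (unrelated to this library's internals):

```python
import ast
import json

# Valid Python literal syntax, but invalid JSON:
python_style = "{'name': 'Alice', 'active': True, 'parent': None}"

# ast.literal_eval accepts it...
data = ast.literal_eval(python_style)
print(data["active"])  # True

# ...but json.loads rejects the very same text.
try:
    json.loads(python_style)
except json.JSONDecodeError as e:
    print(f"json.loads fails: {e.msg}")
```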
You're working with different AI models or providers
- GPT has different quirks than Claude, which has different quirks than Llama
- Each one "breaks" JSON in its own special way
- You don't want to write different parsing logic for each model
- You want one solution that handles them all
You need to know what was changed
- When something gets fixed automatically, you want to know what was fixed
- Not just "it works now" without explanation
- You might need to log the changes or decide if they're acceptable
- Every fix this tool makes is reported, so you stay in control
You want something that just works
- No spending hours reading documentation to set it up
- No installing a dozen dependencies that might conflict with your other packages
- You just want to fix your JSON parsing problem and move on with your project
- Single file, drop it in your project, import it, done
It takes the messy response from an AI and:
- Finds the JSON part (even if wrapped in text or markdown code blocks)
- Fixes common issues (quotes, commas, Python vs JSON syntax)
- Tells you if the response was cut off (so you don't waste time retrying)
- Reports everything it changed (so you know what happened)
- Validates the structure (optional - you can define rules for what fields should exist)
Think of it as a safety net between the AI's response and your code. The AI does its best, but when it messes up, this catches it.
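As a rough idea of how the "find the JSON part" step can work, here is a minimal stdlib sketch using `json.JSONDecoder.raw_decode`. The library's real extractor is more robust (it also handles code fences, repairs, and truncation); this is only an illustration:

```python
import json

def extract_first_json(text):
    """Scan for the first '{' or '[' from which a complete JSON value parses."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue  # false start; keep scanning
    return None

wrapped = 'Sure! Here is the data: {"name": "Alice"} Hope that helps!'
print(extract_first_json(wrapped))  # {'name': 'Alice'}
```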
Automatically extracts JSON from various formats:
```python
from ai_json_cleanroom import validate_ai_json

# From markdown code fence (AI models often wrap JSON in ```json blocks)
markdown_output = 'Here is the data:\n```json\n{"status": "success"}\n```\n'
result = validate_ai_json(markdown_output)
# Extracted: {"status": "success"}

# From mixed text
mixed_output = 'The result is {"status": "success"} as requested.'
result = validate_ai_json(mixed_output)
# Extracted: {"status": "success"}

# From generic code fence
generic_fence = '```\n{"status": "success"}\n```'
result = validate_ai_json(generic_fence)
# Extracted: {"status": "success"}
```

Fixes common AI mistakes with configurable safeguards:
```python
from ai_json_cleanroom import validate_ai_json, ValidateOptions

# Single quotes → double quotes
result = validate_ai_json("{'name': 'Alice'}")
# Repaired: {"name": "Alice"}

# Python constants → JSON
result = validate_ai_json('{"active": True, "value": None}')
# Repaired: {"active": true, "value": null}

# Unquoted keys → quoted keys
result = validate_ai_json('{name: "Alice", age: 30}')
# Repaired: {"name": "Alice", "age": 30}

# Comments removal
result = validate_ai_json('''
{
    "name": "Alice",  // user name
    /* age field */
    "age": 30
}
''')
# Repaired: {"name": "Alice", "age": 30}

# Trailing commas
result = validate_ai_json('{"items": [1, 2, 3,]}')
# Repaired: {"items": [1, 2, 3]}

# Inner unescaped quotes
result = validate_ai_json('{"text": "She said "hello" to me"}')
# Repaired: {"text": "She said \"hello\" to me"}
```

Safeguards:
- Maximum modifications limit (default: 200 changes or 2% of input size)
- Disabled if truncation detected
- Incremental parse-check after each repair pass
- Detailed repair metadata in `result.info`
Identifies incomplete outputs before wasting retries:
```python
from ai_json_cleanroom import validate_ai_json

truncated = '{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age":'
result = validate_ai_json(truncated)

print(result.likely_truncated)   # True
print(result.errors[0].message)  # "No JSON payload found in input."
print(result.errors[0].detail)
# {'truncation_reasons': ['unclosed_braces_or_brackets', 'suspicious_trailing_character']}
```

Detection signals:
- Unclosed strings
- Unbalanced braces/brackets
- Suspicious trailing characters (`,`, `:`, `{`, `[`)
- Ellipsis at end (`...`)
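The signals above can be approximated in a few lines of plain Python. This is an illustrative sketch of the heuristics, not the library's actual implementation (which, among other things, tracks whether characters are inside strings):

```python
def looks_truncated(s):
    """Heuristic check mirroring the detection signals listed above."""
    s = s.rstrip()
    reasons = []
    # Unbalanced braces/brackets (string-aware scanning omitted for brevity)
    if s.count("{") != s.count("}") or s.count("[") != s.count("]"):
        reasons.append("unclosed_braces_or_brackets")
    # Unclosed string: odd number of unescaped double quotes
    if s.replace('\\"', "").count('"') % 2 == 1:
        reasons.append("unterminated_string")
    # Suspicious trailing characters
    if s and s[-1] in ',:{[':
        reasons.append("suspicious_trailing_character")
    # Ellipsis at end
    if s.endswith("..."):
        reasons.append("trailing_ellipsis")
    return bool(reasons), reasons

print(looks_truncated('{"users": [{"name": "Alice"}, {"name": "Bob", "age":'))
# (True, ['unclosed_braces_or_brackets', 'suspicious_trailing_character'])
print(looks_truncated('{"ok": true}'))  # (False, [])
```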
Validate against JSON Schema subset:
```python
from ai_json_cleanroom import validate_ai_json

schema = {
    "type": "object",
    "required": ["name", "email"],
    "properties": {
        "name": {"type": "string", "minLength": 1, "maxLength": 100},
        "email": {"type": "string", "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
        "tags": {"type": "array", "minItems": 1, "items": {"type": "string"}}
    },
    "additionalProperties": False
}

result = validate_ai_json(ai_output, schema=schema)

if not result.json_valid:
    for error in result.errors:
        print(f"{error.code}: {error.message} at {error.path}")
```

Supported schema keywords:
- Types: `object`, `array`, `string`, `number`, `integer`, `boolean`, `null`
- Object: `required`, `properties`, `patternProperties`, `additionalProperties`
- Array: `items`, `additionalItems`, `minItems`, `maxItems`, `uniqueItems`
- String: `minLength`, `maxLength`, `pattern`
- Number: `minimum`, `maximum`, `exclusiveMinimum`, `exclusiveMaximum`, `multipleOf`
- Combinators: `anyOf`, `oneOf`, `allOf`
- Constraints: `enum`, `const`, `allow_empty`
Validate specific paths with wildcard support:
```python
from ai_json_cleanroom import validate_ai_json

expectations = [
    {
        "path": "users[*].email",
        "required": True,
        "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"
    },
    {
        "path": "users[*].status",
        "required": True,
        "in": ["active", "pending", "inactive"]
    },
    {
        "path": "metadata.version",
        "required": True,
        "type": "string",
        "pattern": r"^\d+\.\d+\.\d+$"
    },
    {
        "path": "items[*].price",
        "minimum": 0,
        "type": "number"
    }
]

result = validate_ai_json(ai_output, expectations=expectations)
```

Expectation options:
- `path`: JSONPath-like with wildcards (`[*]` for arrays, `*` for object values)
- `required`: Whether path must exist (default: `True`)
- `type`: Expected type(s)
- `equals`: Exact value match
- `in`: Value must be in list
- `pattern`: Regex pattern for strings
- `min_length`, `max_length`: String length constraints
- `min_items`, `max_items`: Array size constraints
- `minimum`, `maximum`: Numeric bounds
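To picture how a wildcard path like `users[*].email` resolves against parsed data, here is a simplified stdlib sketch (the library's actual path engine is more complete; this only covers dotted keys and the `[*]` array wildcard):

```python
def resolve_path(data, path):
    """Return all values matching a dotted path with [*] array wildcards."""
    matches = [data]
    for part in path.split("."):
        next_matches = []
        if part.endswith("[*]"):
            # Fan out over every element of the named array
            key = part[:-3]
            for m in matches:
                if isinstance(m, dict) and isinstance(m.get(key), list):
                    next_matches.extend(m[key])
        else:
            # Plain key lookup
            for m in matches:
                if isinstance(m, dict) and part in m:
                    next_matches.append(m[part])
        matches = next_matches
    return matches

doc = {"users": [{"email": "a@x.com"}, {"email": "b@x.com"}, {"name": "noreply"}]}
print(resolve_path(doc, "users[*].email"))  # ['a@x.com', 'b@x.com']
```

A `required` expectation on `users[*].email` would then flag the third user, whose match is missing.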
Always returns a ValidationResult - never crashes:
```python
from ai_json_cleanroom import validate_ai_json

result = validate_ai_json(any_input)

# Always safe to access
print(f"Valid: {result.json_valid}")
print(f"Truncated: {result.likely_truncated}")
print(f"Errors: {len(result.errors)}")
print(f"Warnings: {len(result.warnings)}")
print(f"Data: {result.data}")  # None if invalid
print(f"Info: {result.info}")  # Extraction/parsing metadata

# Structured error handling
for error in result.errors:
    print(f"Code: {error.code}")
    print(f"Path: {error.path}")
    print(f"Message: {error.message}")
    print(f"Detail: {error.detail}")
```

Not sure which options to enable? This guide explains each repair strategy with practical examples.
What it does: Converts Python-style single quotes 'text' to JSON double quotes "text"
When to keep it ON:
- Working with GPT models (they often mix Python/JSON syntax)
- Processing outputs from code-generation models
- General use - this is safe and commonly needed
When to turn it OFF:
- Your AI model never uses single quotes (rare)
- You're processing pure JSON from a non-AI source
Example scenario:
```python
# GPT often returns this mix:
input = "{'name': 'Alice', \"age\": 30}"  # Mixed quotes

# With fix_single_quotes=True:
# ✅ Becomes: {"name": "Alice", "age": 30}

# With fix_single_quotes=False:
# ❌ Parse fails on single quotes
```

What it does: Adds quotes to JavaScript-style unquoted object keys
When to keep it ON:
- Working with models trained on JavaScript/TypeScript code
- Processing outputs that might include object literals
- Claude models (sometimes output JS-style objects)
When to turn it OFF:
- Strict JSON-only environment
- You want to detect and reject JS-style syntax
Real-world example:
```python
# Claude sometimes returns:
input = "{name: 'Alice', age: 30, active: true}"

# With quote_unquoted_keys=True:
# ✅ Becomes: {"name": "Alice", "age": 30, "active": true}
```

What it does: Converts Python/JS constants (`True`/`False`/`None`) to JSON (`true`/`false`/`null`)
When to keep it ON:
- Always, unless you have a specific reason not to
- Essential for Python-trained models
Example:
```python
# Models often mix languages:
input = '{"active": True, "deleted": False, "parent": None}'

# With replace_constants=True:
# ✅ Becomes: {"active": true, "deleted": false, "parent": null}
```

What it does: Removes JavaScript-style comments (`//` and `/* */`)
When to keep it ON:
- Models that explain their JSON with comments
- When processing configuration-style outputs
Example:
```python
input = '''
{
    "name": "Alice",  // user name
    /* age field */
    "age": 30
}
'''
# ✅ Comments are safely removed
```

What it does: Handles smart/typographic quotes that break JSON parsing
Options:
- `"always"` - Convert smart quotes before parsing (safest)
- `"auto"` - Only convert if initial parse fails (balanced approach)
- `"never"` - Keep smart quotes as-is (when you want to preserve them)
When to use each:
- `"always"`: Default choice, handles copy-paste from documents
- `"auto"`: When performance matters and smart quotes are rare
- `"never"`: When processing content where quote style matters
Example:
```python
# From copy-paste or models trained on web text:
input = '{"text": "She said "hello" to me"}'  # Smart quotes

# With normalize_curly_quotes="always":
# ✅ Becomes: {"text": "She said \"hello\" to me"}
```

What it does: Master toggle for all repair strategies
When to turn OFF:
- You want to validate only, not repair
- Debugging to see raw parsing errors
- You have your own repair logic
What they do: Safety limits to prevent over-correction
When to increase:
- Very messy outputs from older models
- Known high-error scenarios
When to decrease:
- You want stricter validation
- Suspicious of too many modifications
Example configuration:
```python
from ai_json_cleanroom import validate_ai_json, ValidateOptions

# For very messy outputs:
options = ValidateOptions(
    max_total_repairs=500,    # Allow more fixes
    max_repairs_percent=0.05  # Allow 5% of content to be modified
)

# For strict validation:
options = ValidateOptions(
    max_total_repairs=10,      # Minimal fixes only
    max_repairs_percent=0.001  # Less than 0.1% modifications
)
```

📝 Note: Start with defaults. They're battle-tested on thousands of real AI outputs. Only adjust if you have specific issues.
Even with OpenAI's JSON mode, you're not guaranteed clean JSON. Why? The model might still:
- Wrap JSON in markdown code fences (happens ~15% of the time)
- Get truncated if your request is too large
- Add "helpful" explanatory text before or after
Use Cleanroom as a safety net - it adds virtually no overhead when JSON is clean, but saves you when it's not:
```python
from openai import OpenAI
from ai_json_cleanroom import validate_ai_json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5.1-2025-11-13",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that outputs JSON."},
        {"role": "user", "content": "Generate user profile for Alice Johnson, age 30"}
    ],
    response_format={"type": "json_object"}
)

# Clean and validate the response
result = validate_ai_json(
    response.choices[0].message.content,
    schema={
        "type": "object",
        "required": ["name", "age"],
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer", "minimum": 0}
        }
    }
)

if result.json_valid:
    user_data = result.data
    print(f"User: {user_data['name']}, Age: {user_data['age']}")
else:
    # Use structured feedback for retry
    error_msg = "\n".join([e.message for e in result.errors])
    print(f"Validation failed:\n{error_msg}")
    # Optionally retry with feedback
```

Token Limit Handling: If you get truncation, Cleanroom tells you immediately - no need to waste API calls trying to parse incomplete JSON
Retry Strategy: Use the specific error messages for targeted retry prompts
Cost Savings: Check result.likely_truncated before retrying with higher token limits
Claude loves to be helpful. It often:
- Wraps JSON in markdown code fences with explanations
- Adds conversational text before and after
- Uses varied quote styles depending on context
Cleanroom handles Claude's chattiness automatically:
```python
import anthropic
from ai_json_cleanroom import validate_ai_json

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON object with user info for Alice, age 30"
        }
    ]
)

# Claude might return:
# "Here's the user data:\n```json\n{\"name\": \"Alice\", \"age\": 30}\n```\nLet me know if you need anything else!"

result = validate_ai_json(message.content[0].text)

if result.json_valid:
    print(f"Extracted data: {result.data}")
    print(f"Extraction source: {result.info['source']}")  # 'code_fence'
else:
    if result.likely_truncated:
        print("Response was truncated, increasing max_tokens...")
    else:
        print(f"Validation errors: {result.errors}")
```

Use Cleanroom as a pre-processor before LangChain's parsers:
```python
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers import JsonOutputParser
from ai_json_cleanroom import validate_ai_json, ValidateOptions
import json

# Initialize LangChain LLM
llm = ChatOpenAI(model="gpt-5.1", temperature=0)

# Create prompt
prompt = ChatPromptTemplate.from_template(
    "Generate a JSON object with information about {topic}. Return only valid JSON."
)

# Get LLM response
chain = prompt | llm
response = chain.invoke({"topic": "Python programming"})

# Step 1: Clean with ai-json-cleanroom
cleaned = validate_ai_json(
    response.content,
    options=ValidateOptions(
        enable_safe_repairs=True,
        extract_json=True
    )
)

if cleaned.json_valid:
    # Step 2: Pass to LangChain parser if needed
    parser = JsonOutputParser()
    # Convert back to string for LangChain parser
    structured = parser.parse(json.dumps(cleaned.data))
    print(structured)

    # Or use cleaned.data directly
    print(cleaned.data)
else:
    print(f"Cleaning failed: {cleaned.errors}")
    if cleaned.likely_truncated:
        print("Retry with higher max_tokens")
```

Cleanroom and Instructor work perfectly together - clean first, then map to Pydantic models:
```python
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI
from ai_json_cleanroom import validate_ai_json

# Define Pydantic model
class User(BaseModel):
    name: str = Field(description="User's full name")
    age: int = Field(ge=0, le=150, description="User's age")
    email: str = Field(pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")
    tags: list[str] = Field(default_factory=list)

# Get raw AI output (without Instructor's patching)
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5.1",
    messages=[
        {"role": "user", "content": "Generate user info for Alice, 30 years old"}
    ]
)
raw_output = response.choices[0].message.content

# Step 1: Clean with ai-json-cleanroom
result = validate_ai_json(raw_output)

if result.json_valid:
    # Step 2: Map to Pydantic with Instructor
    try:
        user = User(**result.data)
        print(f"User: {user.name}, Age: {user.age}, Email: {user.email}")
    except Exception as e:
        print(f"Pydantic validation failed: {e}")
else:
    print(f"JSON cleaning failed: {result.errors}")
```

Alternative with Instructor's client:
```python
import instructor
from openai import OpenAI
from ai_json_cleanroom import validate_ai_json

# Patch OpenAI client with Instructor
client = instructor.from_openai(OpenAI())

# If Instructor fails, fallback to Cleanroom
try:
    user = client.chat.completions.create(
        model="gpt-5.1",
        response_model=User,
        messages=[{"role": "user", "content": "Generate user info"}]
    )
except Exception as e:
    # Fallback: get raw response and clean manually
    raw_response = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": "Generate user info as JSON"}]
    )
    result = validate_ai_json(raw_response.choices[0].message.content)
    if result.json_valid:
        user = User(**result.data)
```

Handle streaming responses by collecting chunks first:
```python
from openai import OpenAI
from ai_json_cleanroom import validate_ai_json

client = OpenAI()

# Collect streaming chunks
chunks = []
stream = client.chat.completions.create(
    model="gpt-5.1",
    messages=[{"role": "user", "content": "Generate user JSON"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        chunks.append(chunk.choices[0].delta.content)

# Validate complete output
full_output = ''.join(chunks)
result = validate_ai_json(full_output)

if result.likely_truncated:
    print("Stream was truncated, consider retrying with higher limits")
    print(f"Truncation reasons: {result.errors[0].detail.get('truncation_reasons')}")
elif result.json_valid:
    print(f"Valid JSON received: {result.data}")
else:
    print(f"Validation errors: {result.errors}")
```

Use validation errors to provide specific feedback for retries:
```python
from ai_json_cleanroom import validate_ai_json
import openai

def generate_with_retry(prompt, schema, max_retries=3):
    """Generate JSON with automatic retry on validation failure."""
    client = openai.OpenAI()

    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-5.1",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that returns valid JSON."},
                {"role": "user", "content": prompt}
            ]
        )

        result = validate_ai_json(
            response.choices[0].message.content,
            schema=schema
        )

        if result.json_valid:
            return result.data

        # Build feedback for retry
        if result.likely_truncated:
            prompt = f"{prompt}\n\nIMPORTANT: Your previous response was truncated. Please ensure the complete JSON is returned."
        else:
            error_messages = [f"- {e.path}: {e.message}" for e in result.errors]
            feedback = "\n".join(error_messages)
            prompt = f"{prompt}\n\nYour previous JSON had these issues:\n{feedback}\n\nPlease fix these and return valid JSON."

    raise ValueError(f"Failed to generate valid JSON after {max_retries} attempts")

# Usage
schema = {
    "type": "object",
    "required": ["name", "email", "age"],
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string", "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"},
        "age": {"type": "integer", "minimum": 0}
    }
}

user_data = generate_with_retry(
    "Generate a user profile for Alice Johnson",
    schema=schema
)
print(user_data)
```

The Problem: You explicitly ask for JSON only, but get:
```
I'll help you with that! Here's the JSON data:

{"status": "success"}

Let me know if you need anything else!
```

The Solution:
```python
# Cleanroom automatically extracts the JSON part
result = validate_ai_json(chatty_response)
print(result.data)            # Just the JSON: {"status": "success"}
print(result.info['source'])  # Tells you where it found it: 'balanced_block'
```

The Problem: Large responses get truncated:
```
{"users": [{"id": 1, "name": "Alice"}, {"id": 2, "na
```

The Solution:
```python
result = validate_ai_json(truncated_response)

if result.likely_truncated:
    # You know exactly what happened
    print("Response truncated - reasons:", result.errors[0].detail['truncation_reasons'])
    # Output: ['unclosed_braces_or_brackets', 'unterminated_string']

    # Smart retry with higher token limit
    retry_with_higher_limit()
```

The Problem: Your AI model mixes Python and JSON syntax:
```python
output = "{'users': [\"Alice\", \"Bob\"], 'count': 2}"
```

The Solution:
```python
result = validate_ai_json(output)
# Automatically fixes to: {"users": ["Alice", "Bob"], "count": 2}
```

The Problem: You need certain fields but don't want full schema validation.
The Solution: Use path expectations:
```python
expectations = [
    {"path": "users[*].email", "required": True},
    {"path": "metadata.version", "pattern": r"^\d+\.\d+\.\d+$"}
]

result = validate_ai_json(ai_output, expectations=expectations)
# Validates that all users have emails and version is semver
```

The Problem: AI model adds helpful comments that contain important context:
```
{
  "temperature": 0.7, // Higher for creativity
  "max_tokens": 100   // Keep responses concise
}
```

The Solution:
```python
# First, extract with comments preserved to see them
raw_response = ai_output

# Clean for parsing
result = validate_ai_json(raw_response)

# The comments are removed for valid JSON
print(result.data)  # {"temperature": 0.7, "max_tokens": 100}

# If you need the comments, parse them separately from raw_response
```

The Problem: GPT uses Python syntax, Claude wraps in markdown, Gemini truncates.
The Solution: One configuration handles all:
```python
from ai_json_cleanroom import validate_ai_json

# Same code for ALL models
def clean_any_ai_output(output):
    result = validate_ai_json(output)  # Default options handle everything

    if result.json_valid:
        return result.data
    elif result.likely_truncated:
        raise ValueError("Output truncated - increase token limit")
    else:
        raise ValueError(f"Could not parse: {result.errors}")

# Works with GPT, Claude, Gemini, Llama, etc.
```
⚠️ Important: Truncation detection always runs first. If JSON is truncated, repairs are skipped to avoid corrupting partial data.
Possible causes and solutions:
- Truncation detected
  - Cleanroom disables repairs for truncated input (safety measure)
  - Solution: Get complete output first, then retry
- Repair limit reached
  - Default limit: 200 changes or 2% of input size
  - Solution: Increase limits if needed:

    ```python
    options = ValidateOptions(
        max_total_repairs=500,    # Raise limit
        max_repairs_percent=0.05  # Allow 5% modifications
    )
    ```

- Specific repair disabled
  - Check your options - maybe `fix_single_quotes=False`?
  - Solution: Enable the specific repair you need
Common hidden issues:
- Invisible Unicode characters (zero-width spaces, etc.)
- Smart quotes from copy-paste: `“text”` vs `"text"`
- Line breaks inside strings without proper escaping
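Zero-width characters in particular are easy to reproduce and to strip. A minimal stdlib illustration (this is not the library's code, just a demonstration of the failure and the fix):

```python
import json

# Zero-width and BOM characters that often sneak into copied AI output
INVISIBLES = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

raw = '{"name": "Alice"}\u200b'  # zero-width space appended
cleaned = raw.translate(INVISIBLES)

try:
    json.loads(raw)
except json.JSONDecodeError:
    print("raw input fails to parse")  # zero-width space is not JSON whitespace

print(json.loads(cleaned))  # {'name': 'Alice'}
```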
Diagnosis:
```python
result = validate_ai_json(your_input, options=ValidateOptions(
    normalize_curly_quotes="always"  # Fixes smart quotes
))
print(result.errors)  # See specific character positions
```

Issue: Different models have different quirks.
Solution: Check the extraction source:
```python
result = validate_ai_json(claude_output)
print(f"Found JSON in: {result.info['source']}")
# 'code_fence'     = markdown block
# 'balanced_block' = found in text
# 'raw'            = was already clean
```

Solutions:
- Install orjson: `pip install orjson` for a 3.6x speedup
- Disable unnecessary repairs:

  ```python
  options = ValidateOptions(
      strip_js_comments=False,        # If you never have comments
      normalize_curly_quotes="never"  # If you never have smart quotes
  )
  ```
Solution: Check warnings and info:
```python
result = validate_ai_json(messy_json)

# See all repairs applied
for warning in result.warnings:
    if warning.code == "repaired":
        print(f"Repairs applied: {warning.detail['applied']}")
        print(f"Number of changes: {warning.detail['counts']}")

# See extraction details
print(f"Extraction method: {result.info['source']}")
print(f"Parser used: {result.info['parse_backend']}")
```

Common issues:
- Pattern escaping: Remember to use raw strings for regex: `r"^\d+$"`
- Type mismatches: JSON numbers include floats - use `"type": "number"` not `"integer"` unless you're sure
- Required fields: Double-check field names are exact matches
Debug approach:
```python
# Start without schema to see actual structure
result = validate_ai_json(output)
print(json.dumps(result.data, indent=2))

# Then add schema gradually
schema = {"type": "object"}  # Start simple
# Add requirements one by one
```

Main validation function with comprehensive options.
```python
def validate_ai_json(
    input_data: Union[str, bytes, Dict, List],
    schema: Optional[Dict[str, Any]] = None,
    expectations: Optional[List[Dict[str, Any]]] = None,
    options: Optional[ValidateOptions] = None
) -> ValidationResult:
    """
    Validate AI output against JSON parseability, schema, and expectations.

    Args:
        input_data: String, bytes, or already-parsed dict/list
        schema: JSON Schema subset for validation
        expectations: List of path-based validation rules
        options: Configuration for parsing, extraction, and repair

    Returns:
        ValidationResult with json_valid, errors, warnings, data, and info
    """
```

Result object returned by `validate_ai_json()`.
```python
@dataclass
class ValidationResult:
    json_valid: bool                   # True if parsing and validation succeeded
    likely_truncated: bool             # True if input appears truncated
    errors: List[ValidationIssue]      # Validation errors
    warnings: List[ValidationIssue]    # Non-blocking warnings
    data: Optional[Union[Dict, List]]  # Parsed JSON if valid, else None
    info: Dict[str, Any]               # Extraction/parsing metadata

    def to_dict(self) -> Dict[str, Any]:
        """Convert result to dictionary."""
```

Metadata in `info`:
- `source`: How JSON was found (`"raw"`, `"code_fence"`, `"balanced_block"`, `"object"`)
- `extraction`: Details about extraction process
- `parse_backend`: Which parser was used (`"orjson"` or `"json"`)
- `curly_quotes_normalization_used`: Whether typographic quotes were normalized
- `repair`: Details about applied repairs (if any)
Individual validation error or warning.
```python
@dataclass
class ValidationIssue:
    code: ErrorCode                   # Error type (enum)
    path: str                         # JSONPath where error occurred
    message: str                      # Human-readable description
    detail: Optional[Dict[str, Any]]  # Additional context

    def to_dict(self) -> Dict[str, Any]:
        """Convert issue to dictionary."""
```

Configuration for validation behavior.
```python
@dataclass
class ValidateOptions:
    # Extraction options
    strict: bool = False
    extract_json: bool = True
    allow_json_in_code_fences: bool = True
    allow_bare_top_level_scalars: bool = False
    tolerate_trailing_commas: bool = True
    stop_on_first_error: bool = False

    # Repair options
    enable_safe_repairs: bool = True
    allow_json5_like: bool = True         # Master toggle for JSON5-like repairs
    replace_constants: bool = True        # True/False/None → true/false/null
    replace_nans_infinities: bool = True  # NaN/Infinity → null
    max_total_repairs: int = 200
    max_repairs_percent: float = 0.02     # 2% of input size

    # Granular repair control (new in v1.1)
    normalize_curly_quotes: str = "always"  # "always"|"auto"|"never"
    fix_single_quotes: bool = True
    quote_unquoted_keys: bool = True
    strip_js_comments: bool = True

    # Custom repair hooks (new in v1.1)
    custom_repair_hooks: Optional[List[Callable]] = None
```

Curly quotes normalization modes:
"always"(default): Normalize typographic quotes before parsing"auto": Try parsing first; only normalize if parse fails"never": Never normalize (preserves typographic quotes as-is)
Enumeration of validation error types.
```python
class ErrorCode(str, Enum):
    PARSE_ERROR = "parse_error"
    TRUNCATED = "truncated"
    MISSING_REQUIRED = "missing_required"
    TYPE_MISMATCH = "type_mismatch"
    ENUM_MISMATCH = "enum_mismatch"
    CONST_MISMATCH = "const_mismatch"
    NOT_ALLOWED_EMPTY = "not_allowed_empty"
    ADDITIONAL_PROPERTY = "additional_property"
    PATTERN_MISMATCH = "pattern_mismatch"
    MIN_LENGTH = "min_length"
    MAX_LENGTH = "max_length"
    MIN_ITEMS = "min_items"
    MAX_ITEMS = "max_items"
    MINIMUM = "minimum"
    MAXIMUM = "maximum"
    REPAIRED = "repaired"  # Warning: a repair was applied
    # ... and more
```

```python
def extract_json_payload(
    text: str,
    options: Optional[ValidateOptions] = None
) -> Tuple[Optional[str], Dict[str, Any]]:
    """
    Extract a JSON string from raw text.

    Returns:
        (payload, info) - payload is the raw JSON string, or None
    """

def detect_truncation(s: str) -> Tuple[bool, List[str]]:
    """
    Heuristic truncation detector.

    Returns:
        (likely_truncated, reasons)
    """
```

The CLI isn't just for automation: it's also perfect for debugging AI outputs during development.
```bash
# Got weird output from your AI model? Test it immediately:
echo '{"name": "test"}' | python ai_json_cleanroom.py --input -

# Testing a saved AI response:
python ai_json_cleanroom.py --input gpt_response.txt

# See exactly what gets fixed:
python ai_json_cleanroom.py --input messy.json --verbose
# Output shows:
#   Fixed 3 single-quoted strings
#   Quoted 2 unquoted keys
#   Normalized 4 curly quotes
#   Removed 2 trailing commas

# Validate inline text
python ai_json_cleanroom.py --input '{"name": "Alice", "age": 30}'

# With JSON Schema validation
python ai_json_cleanroom.py --input response.txt --schema schema.json

# With expectations
python ai_json_cleanroom.py --input response.txt --expectations expectations.json
```

```bash
# See what would be fixed without actually fixing:
python ai_json_cleanroom.py --input data.json --dry-run

# Test different repair strategies:
python ai_json_cleanroom.py --input data.json \
    --normalize-curly-quotes never \
    --no-fix-single-quotes \
    --verbose  # See the difference

# Process multiple files:
for file in responses/*.json; do
    echo "Processing $file..."
    python ai_json_cleanroom.py --input "$file" --output-clean
done
```

```bash
# Disable extraction (input must be pure JSON)
python ai_json_cleanroom.py --input data.json --no-extract

# Disable the repair stage
python ai_json_cleanroom.py --input response.txt --no-repair

# Disable specific repair passes
python ai_json_cleanroom.py --input response.txt \
    --no-fix-single-quotes \
    --no-quote-unquoted-keys \
    --no-strip-comments

# Control curly quotes normalization
python ai_json_cleanroom.py --input response.txt \
    --normalize-curly-quotes auto  # always|auto|never

# Adjust repair limits
python ai_json_cleanroom.py --input response.txt \
    --max-repairs 500 \
    --repairs-percent 0.05

# Strict mode (stop on first error)
python ai_json_cleanroom.py --input response.txt --strict

# Control output format
python ai_json_cleanroom.py --input response.txt \
    --indent 4 \
    --ensure-ascii
```

The CLI outputs a JSON result with validation details:
{ "json_valid": true, "likely_truncated": false, "errors": [], "warnings": [ { "code": "repaired", "path": "$", "message": "Input JSON was repaired by conservative heuristics.", "detail": { "applied": ["single_quoted_to_double_quoted", "replace_constants"], "counts": { "single_quoted_strings_converted": 3, "replace_constants": {"true_false_none": 2} } } } ], "data": { "name": "Alice", "age": 30, "active": true }, "info": { "source": "code_fence", "parse_backend": "orjson", "curly_quotes_normalization_used": true } }Fine-tune which repair strategies to apply:
```python
from ai_json_cleanroom import validate_ai_json, ValidateOptions

options = ValidateOptions(
    # Master toggle (backward compatibility)
    allow_json5_like=True,

    # Individual repair passes (new in v1.1)
    fix_single_quotes=True,        # 'foo' → "foo"
    quote_unquoted_keys=True,      # {foo: 1} → {"foo": 1}
    strip_js_comments=True,        # Remove // and /* */ comments

    # Python/JS constants
    replace_constants=True,        # True/False/None → true/false/null
    replace_nans_infinities=True,  # NaN/Infinity → null

    # Curly quotes handling
    normalize_curly_quotes="auto",  # "always"|"auto"|"never"

    # Safety limits
    max_total_repairs=200,
    max_repairs_percent=0.02
)

result = validate_ai_json(ai_output, options=options)
```

Add domain-specific repair logic:
```python
from ai_json_cleanroom import validate_ai_json, ValidateOptions

def my_custom_repair(text: str, options: ValidateOptions):
    """
    Custom repair function.

    Args:
        text: Current text being repaired
        options: ValidateOptions instance

    Returns:
        (modified_text, changes_count, metadata_dict)
    """
    modified = text
    changes = 0
    metadata = {}

    # Example: replace specific domain patterns
    if "UNDEFINED" in modified:
        modified = modified.replace("UNDEFINED", "null")
        changes += modified.count("null") - text.count("null")
        metadata["undefined_replacements"] = changes

    return modified, changes, metadata

# Use the custom hook
options = ValidateOptions(
    custom_repair_hooks=[my_custom_repair]
)

result = validate_ai_json(ai_output, options=options)

# Check whether the custom hook was applied
if "custom_hook:my_custom_repair" in result.info.get("repair", {}).get("applied", []):
    print("Custom repair was applied")
```

Validate complex nested structures:
```python
from ai_json_cleanroom import validate_ai_json

# Validate API response structure
expectations = [
    # All users must have an email matching this pattern
    {
        "path": "data.users[*].email",
        "required": True,
        "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"
    },
    # All users must have a valid status
    {
        "path": "data.users[*].status",
        "required": True,
        "in": ["active", "pending", "inactive", "banned"]
    },
    # Nested array validation
    {
        "path": "data.users[*].orders[*].total",
        "type": "number",
        "minimum": 0
    },
    # Metadata version must be semver
    {
        "path": "metadata.api_version",
        "required": True,
        "pattern": r"^\d+\.\d+\.\d+$"
    },
    # Optional field with constraints when present
    {
        "path": "data.users[*].phone",
        "required": False,
        "pattern": r"^\+?[\d\s\-\(\)]+$"
    }
]

result = validate_ai_json(ai_output, expectations=expectations)

# Inspect path-specific errors
for error in result.errors:
    if error.code == "path_not_found":
        print(f"Missing required path: {error.path}")
    elif error.code == "pattern_mismatch":
        print(f"Pattern mismatch at {error.path}: {error.message}")
```

Track exactly what was repaired:
```python
from ai_json_cleanroom import validate_ai_json

result = validate_ai_json(messy_ai_output)

if result.warnings:
    for warning in result.warnings:
        if warning.code == "repaired":
            applied_repairs = warning.detail.get("applied", [])
            counts = warning.detail.get("counts", {})

            print("Applied repairs:")
            for repair in applied_repairs:
                print(f"  - {repair}")

            print("\nRepair counts:")
            for repair_type, count in counts.items():
                print(f"  - {repair_type}: {count}")

# Check repair info
if "repair" in result.info:
    repair_info = result.info["repair"]
    print(f"Total repairs applied: {len(repair_info.get('applied', []))}")
    if "skipped" in repair_info:
        print(f"Repairs skipped: {repair_info['skipped']}")
```

Performance comparison of stdlib `json` vs `orjson`:
| Operation | stdlib json | orjson | Speedup |
|---|---|---|---|
| Parse (simple) | 1.8 µs | 0.4 µs | 4.35x |
| Parse (complex) | 20.2 µs | 8.3 µs | 2.42x |
| Dump (simple) | 2.1 µs | 0.2 µs | 10.83x |
| Dump (complex) | 23.8 µs | 2.2 µs | 11.04x |
Benchmarks on Python 3.11.5, Intel Core i9-11900K @ 3.50GHz
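If you want to sanity-check numbers like these on your own hardware, here is a rough stdlib `timeit` sketch (absolute times will differ by machine and Python version; these are not the project's official benchmark scripts):

```python
import json
import timeit

simple = '{"name": "Alice", "age": 30, "active": true}'
N = 100_000

# Average per-call parse time for stdlib json, in microseconds
stdlib_us = timeit.timeit(lambda: json.loads(simple), number=N) / N * 1e6
print(f"stdlib json parse (simple): {stdlib_us:.2f} us")

# Compare orjson the same way, if it is installed
try:
    import orjson
    orjson_us = timeit.timeit(lambda: orjson.loads(simple), number=N) / N * 1e6
    print(f"orjson parse (simple): {orjson_us:.2f} us ({stdlib_us / orjson_us:.1f}x faster)")
except ImportError:
    print("orjson not installed; skipping comparison")
```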
Repair operations add minimal overhead:
| Scenario | Time | Notes |
|---|---|---|
| Clean JSON (no repairs) | ~1 µs | Direct parse with orjson |
| Markdown extraction + parse | 84 µs | Full validation pipeline |
| Multiple repairs + parse | 218 µs | Fix quotes, constants, trailing commas |
Average times from validation pipeline benchmarks (1000 iterations)
Install orjson for production use when:
- Processing high volumes of AI outputs
- Latency matters (API endpoints, real-time systems)
- Large JSON payloads (>10KB)
```bash
pip install orjson
```

The library automatically uses orjson when available, with a transparent fallback to stdlib `json`.
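The transparent fallback is the usual try-import pattern; here is a minimal sketch of the same idea (illustration only, not the library's actual code):

```python
# Prefer orjson when present, otherwise fall back to stdlib json
try:
    import orjson

    def parse(s: str):
        return orjson.loads(s)

    PARSE_BACKEND = "orjson"
except ImportError:
    import json

    def parse(s: str):
        return json.loads(s)

    PARSE_BACKEND = "json"

print(PARSE_BACKEND, parse('{"ok": true}'))
```

The backend actually chosen at runtime is reported in `result.info["parse_backend"]`.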
Use AI JSON Cleanroom if you:
- Work with any AI model (GPT, Claude, Gemini, Llama)
- Receive JSON wrapped in explanations or markdown
- Face token limit truncations
- Need detailed error messages for retries
- Want one solution for all AI quirks
- Value zero required dependencies (orjson is optional)
You might not need it if you:
- Only work with clean, guaranteed JSON
- Control token generation completely (using guidance, lm-format-enforcer)
- Never hit token limits
- Your AI model never adds explanatory text
Instead of a complex feature matrix, here's what matters:
Your Current Approach → With Cleanroom
- `try: json.loads()` → Always get a result, never crashes
- Regex extraction → Automatic markdown/fence detection
- Custom retry logic → Structured errors for targeted retries
- "Is it truncated?" → Immediate truncation detection with reasons
- Multiple fix attempts → One call handles everything
- Scattered error handling → Unified validation pipeline
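The "structured errors for targeted retries" point deserves a sketch. Below is a generic retry loop with stdlib `json` standing in as the validator so the snippet is self-contained; in practice you would plug in `validate_ai_json` and format `result.errors` (plus `result.likely_truncated`) into the feedback string. `ask` and `validate` are stand-in names, not part of the library:

```python
import json

def retry_with_feedback(ask, validate, prompt, max_attempts=3):
    """Re-ask the model, feeding validation feedback back into the prompt."""
    for _ in range(max_attempts):
        raw = ask(prompt)
        ok, data, feedback = validate(raw)
        if ok:
            return data
        prompt = f"{prompt}\n\nPrevious output was invalid: {feedback}. Return only valid JSON."
    raise ValueError("No valid JSON after retries")

# Demo with stand-ins (no model call): the second attempt succeeds
attempts = iter(['{"broken', '{"name": "Alice"}'])

def ask(_prompt):
    return next(attempts)

def validate(raw):
    try:
        return True, json.loads(raw), ""
    except json.JSONDecodeError as exc:
        return False, None, str(exc)

print(retry_with_feedback(ask, validate, "Give me user JSON"))  # {'name': 'Alice'}
```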
Cleanroom + Instructor (Pydantic):
```python
# 1. Clean with Cleanroom
result = validate_ai_json(ai_output)

# 2. Map to a Pydantic model
if result.json_valid:
    user = UserModel(**result.data)
```

Cleanroom + LangChain:
```python
# Use as a pre-processor before LangChain parsers
cleaned = validate_ai_json(response.content)
if cleaned.json_valid:
    chain_result = parser.parse(json.dumps(cleaned.data))
```

Cleanroom + Your Custom Logic:
```python
# Get clean data, then apply your business rules
result = validate_ai_json(ai_response)
if result.json_valid:
    your_custom_processor(result.data)
```

If you've ever written code like this:
```python
import json
import re

# This is a common scenario...
try:
    data = json.loads(ai_output)
except Exception:
    # Try to extract JSON with a regex
    match = re.search(r'\{.*\}', ai_output, re.DOTALL)
    if match:
        try:
            # Maybe fix the quotes?
            fixed = match.group().replace("'", '"')
            data = json.loads(fixed)
        except Exception:
            # Give up
            raise ValueError("Can't parse AI output")
```

Then yes, you need this tool. It handles all of that (and much more) in one line:
```python
result = validate_ai_json(ai_output)  # Done.
```

MIT License
Copyright (c) 2025
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you find this tool useful, please consider starring the repo ⭐ to help others find it!