Your AI returns broken JSON? Put this in between.
Works with any AI model: ChatGPT, Claude, Gemini, Llama. Zero setup.
Automatically extracts JSON from markdown/text, repairs common AI mistakes, validates structure. Returns clean data when successful, detailed feedback for retries when not.
Quick Links: Fast Track (2 min) • Why This Tool? • Code Example • Install • Full Documentation ↓
Want to start using this right away? Here's how:
- Download the `ai_json_cleanroom.py` file to your project
- Tell your AI coding assistant: "When I receive JSON from an AI response, process it through `validate_ai_json()` from `ai_json_cleanroom.py` first"
- Done. Your AI assistant (ChatGPT, Claude, Copilot, Cursor) will handle the integration
Ready in 2 minutes. Works immediately.
Show me the code → • Why do I need this? →
The situation: You request JSON from your AI. Sometimes you receive:
| What you get | What breaks |
|---|---|
| `Sure! Here's the JSON: {"name": "Alice"}` | Extra text crashes `json.loads()` |
| `{'name': 'Alice'}` | Python quotes instead of JSON |
| `{"users": [{"id": 1}, {"i` | Truncated mid-response (token limit) |
Current solution: Try/catch blocks, regex patterns, manual fixes, repeated API calls.
This tool: Handles all cases automatically. One function call.
```bash
# Standard installation
git clone https://github.com/jordicor/ai-json-cleanroom.git
cd ai-json-cleanroom
pip install -e .

# Optional: 3.6x faster parsing
pip install orjson
```

Ready. Import and use:

```python
from ai_json_cleanroom import validate_ai_json
```
```python
from ai_json_cleanroom import validate_ai_json

# Anything your AI returns (messy, wrapped, incomplete)
ai_response = "Here's your data:\n```json\n{'name': 'Alice', age: 30} // Python-style syntax\n```\n"

# One line to clean and validate
result = validate_ai_json(ai_response)

if result.json_valid:
    print(result.data)    # Clean: {'name': 'Alice', 'age': 30}
else:
    print(result.errors)  # Detailed error information
```

Done. No configuration needed. It works out of the box.
Check result.warnings to see what was fixed automatically.
The cleaner automatically:
- Found the JSON inside markdown code fence
- Fixed single quotes to double quotes
- Added quotes to the unquoted key `age`
- Removed the inline comment
- Validated the final structure
Processing time: ~1ms. Zero configuration required.
Useful tip: Check result.likely_truncated to detect when the AI hit its token limit. This saves unnecessary retry API calls.
That's everything you need. The tool works immediately with smart defaults.
Everything below is optional documentation for:
- Understanding how the tool works internally
- Advanced configuration options
- Framework integrations (LangChain, Instructor, etc.)
- Your AI assistant to read and understand the full API
For most users: The sections above are sufficient. Start building.
Want to learn more? Continue reading below.
💡 Found this useful? Star the repo ⭐ to help others discover it!
If you've worked with AI models, you know the frustration. You ask for JSON, and what do you get?
Sometimes it's wrapped in a friendly explanation. Sometimes it has Python-style single quotes. Sometimes it just... stops mid-array because it hit the token limit. And your json.loads() crashes. Again.
This is a common scenario when working with AI models. That's why AI JSON Cleanroom exists: to handle the messy reality of AI outputs so you can focus on building.
Here's what actually happens when you ask an AI model for JSON:
| Your Request | What You Expect | What You Actually Get | Why It Happens |
|---|---|---|---|
| "Return user data as JSON" | `{"name": "Alice"}` | `Sure! Here's the JSON: {"name": "Alice"}` | AI models are trained to be helpful and conversational |
| "Give me valid JSON only" | `{"active": true}` | `{'active': True}` | Model confusion between Python and JSON syntax |
| "Return a large dataset" | Complete JSON | `{"data": [{"id": 1}, {"id": 2}, {"i` | Token limit reached mid-generation |
| "Format as JSON object" | `{"text": "He said \"hi\""}` | `{"text": "He said "hi""}` | Improper quote escaping |
| "Output JSON with comments" | Valid JSON | `{name: "Alice", age: 30}` | JavaScript object literal syntax |
| "Generate configuration" | Clean JSON | `{"items": [1, 2, 3,]}` | Trailing commas (valid in JS/Python, not JSON) |
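To see these failure modes concretely, here is a small stdlib-only snippet (independent of this library) showing that `json.loads()` rejects every row of the table above:

```python
import json

# Each sample mirrors a row of the table above; all are invalid JSON.
samples = [
    'Sure! Here\'s the JSON: {"name": "Alice"}',  # conversational wrapper
    "{'active': True}",                           # Python quotes/constants
    '{"data": [{"id": 1}, {"i',                   # truncated mid-generation
    '{"text": "He said "hi""}',                   # unescaped inner quotes
    '{name: "Alice", age: 30}',                   # unquoted keys
    '{"items": [1, 2, 3,]}',                      # trailing comma
]

for sample in samples:
    try:
        json.loads(sample)
        print("parsed OK")
    except json.JSONDecodeError as e:
        print(f"rejected: {e.msg}")
```

All six samples raise `json.JSONDecodeError`, which is exactly the gap this library fills.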
Why existing solutions fall short:
- `json.loads()`: Throws exceptions on malformed input, no context provided
- LangChain parsers: Validate structure but don't repair common AI mistakes
- Instructor/Pydantic: Excellent for type mapping, but require clean JSON first
- Custom regex: Brittle, incomplete, and maintenance-heavy
AI JSON Cleanroom is a production-ready, zero-dependency (stdlib only) JSON cleaner designed specifically for AI outputs. It acts as a post-processing layer that extracts, repairs, validates, and provides structured feedback.
Key Benefits:
- Smart Extraction - Automatically finds JSON in markdown, code blocks, or mixed text
- Conservative Repair - Fixes common AI mistakes without corrupting your data
- Truncation Detection - Knows when output was cut off (saves you API calls!)
- Schema Validation - Validate structure with JSON Schema or simple path rules
- Non-Throwing API - Always returns a result, never crashes your pipeline
- Performance - 3.6x faster parsing with optional `orjson` (but works fine without it)
- Granular Control - Choose exactly which repairs to apply (or use smart defaults)
Pro Tip: Start with default options - they handle 95% of AI quirks. Only customize when you hit specific issues.
Your code crashes when parsing the AI's response
- You ask the AI for JSON data
- The AI returns something that looks like JSON
- But `json.loads()` throws an error and your script stops
- You're tired of try-except blocks that don't tell you what went wrong
The AI wraps the JSON in extra text
- Instead of just `{"name": "Alice"}`, you get: `Here's the data you requested: {"name": "Alice"} Let me know if you need anything else!`
- Your parser fails because there's text before/after the actual JSON
- You've tried telling the AI "return ONLY JSON" but it keeps adding explanations anyway
- You don't want to write regex patterns to extract the JSON part
The response is incomplete and you don't know why
- Sometimes the JSON just... stops: `{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "a`
- Your code crashes with confusing error messages
- You retry multiple times, same problem
- The issue: AIs have a maximum response length (the token limit). When they hit it, they stop mid-sentence
- This tool detects when JSON was cut off and tells you immediately - so you know to ask for a shorter response or increase the limit
The JSON "looks right" but still fails to parse
- You can see the data structure clearly
- But Python complains about "invalid syntax" or "expecting property name"
- Common hidden issues:
  - Mixed quote styles: `{'name': "Alice"}` (Python allows both, JSON only allows double quotes)
  - Trailing commas: `[1, 2, 3,]` (JavaScript/Python allow this, JSON doesn't)
  - Comments: `{"name": "Alice" // her name}` (many languages allow comments, JSON doesn't)
  - Python booleans: `{"active": True}` instead of `{"active": true}`
- These are easy mistakes for AIs to make, and hard to spot by eye
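A quick way to see why these slips are so natural: the same text is often a perfectly valid Python literal while being invalid JSON. A stdlib-only illustration (unrelated to this library's internals):

```python
import ast
import json

# Valid Python literal syntax, but invalid JSON:
python_style = "{'name': 'Alice', 'active': True, 'parent': None}"

# ast.literal_eval accepts it...
data = ast.literal_eval(python_style)
print(data["active"])  # True

# ...but json.loads rejects the very same text.
try:
    json.loads(python_style)
except json.JSONDecodeError as e:
    print(f"json.loads fails: {e.msg}")
```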
You're working with different AI models or providers
- GPT has different quirks than Claude, which has different quirks than Llama
- Each one "breaks" JSON in its own special way
- You don't want to write different parsing logic for each model
- You want one solution that handles them all
You need to know what was changed
- When something gets fixed automatically, you want to know what was fixed
- Not just "it works now" without explanation
- You might need to log the changes or decide if they're acceptable
- Every fix this tool makes is reported, so you stay in control
You want something that just works
- No spending hours reading documentation to set it up
- No installing a dozen dependencies that might conflict with your other packages
- You just want to fix your JSON parsing problem and move on with your project
- Single file, drop it in your project, import it, done
It takes the messy response from an AI and:
- Finds the JSON part (even if wrapped in text or markdown code blocks)
- Fixes common issues (quotes, commas, Python vs JSON syntax)
- Tells you if the response was cut off (so you don't waste time retrying)
- Reports everything it changed (so you know what happened)
- Validates the structure (optional - you can define rules for what fields should exist)
Think of it as a safety net between the AI's response and your code. The AI does its best, but when it messes up, this catches it.
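As a rough idea of how the "find the JSON part" step can work, here is a minimal stdlib sketch using `json.JSONDecoder.raw_decode`. The library's real extractor is more robust (it also handles code fences, repairs, and truncation); this is only an illustration:

```python
import json

def extract_first_json(text):
    """Scan for the first '{' or '[' from which a complete JSON value parses."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue  # false start; keep scanning
    return None

wrapped = 'Sure! Here is the data: {"name": "Alice"} Hope that helps!'
print(extract_first_json(wrapped))  # {'name': 'Alice'}
```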
Automatically extracts JSON from various formats:
```python
from ai_json_cleanroom import validate_ai_json

# From markdown code fence (AI models often wrap JSON in ```json blocks)
markdown_output = 'Here is the data:\n```json\n{"status": "success"}\n```\n'
result = validate_ai_json(markdown_output)
# Extracted: {"status": "success"}

# From mixed text
mixed_output = 'The result is {"status": "success"} as requested.'
result = validate_ai_json(mixed_output)
# Extracted: {"status": "success"}

# From generic code fence
generic_fence = '```\n{"status": "success"}\n```'
result = validate_ai_json(generic_fence)
# Extracted: {"status": "success"}
```

Fixes common AI mistakes with configurable safeguards:
```python
from ai_json_cleanroom import validate_ai_json, ValidateOptions

# Single quotes → double quotes
result = validate_ai_json("{'name': 'Alice'}")
# Repaired: {"name": "Alice"}

# Python constants → JSON
result = validate_ai_json('{"active": True, "value": None}')
# Repaired: {"active": true, "value": null}

# Unquoted keys → quoted keys
result = validate_ai_json('{name: "Alice", age: 30}')
# Repaired: {"name": "Alice", "age": 30}

# Comments removal
result = validate_ai_json('''
{
    "name": "Alice",  // user name
    /* age field */
    "age": 30
}
''')
# Repaired: {"name": "Alice", "age": 30}

# Trailing commas
result = validate_ai_json('{"items": [1, 2, 3,]}')
# Repaired: {"items": [1, 2, 3]}

# Inner unescaped quotes
result = validate_ai_json('{"text": "She said "hello" to me"}')
# Repaired: {"text": "She said \"hello\" to me"}
```

Safeguards:
- Maximum modifications limit (default: 200 changes or 2% of input size)
- Disabled if truncation detected
- Incremental parse-check after each repair pass
- Detailed repair metadata in `result.info`
Identifies incomplete outputs before wasting retries:
```python
from ai_json_cleanroom import validate_ai_json

truncated = '{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age":'
result = validate_ai_json(truncated)

print(result.likely_truncated)   # True
print(result.errors[0].message)  # "No JSON payload found in input."
print(result.errors[0].detail)
# {'truncation_reasons': ['unclosed_braces_or_brackets', 'suspicious_trailing_character']}
```

Detection signals:
- Unclosed strings
- Unbalanced braces/brackets
- Suspicious trailing characters (`,`, `:`, `{`, `[`)
- Ellipsis at end (`...`)
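The signals above can be approximated in a few lines of plain Python. This is an illustrative sketch of the heuristics, not the library's actual implementation (which, among other things, tracks whether characters are inside strings):

```python
def looks_truncated(s):
    """Heuristic check mirroring the detection signals listed above."""
    s = s.rstrip()
    reasons = []
    # Unbalanced braces/brackets (string-aware scanning omitted for brevity)
    if s.count("{") != s.count("}") or s.count("[") != s.count("]"):
        reasons.append("unclosed_braces_or_brackets")
    # Unclosed string: odd number of unescaped double quotes
    if s.replace('\\"', "").count('"') % 2 == 1:
        reasons.append("unterminated_string")
    # Suspicious trailing characters
    if s and s[-1] in ',:{[':
        reasons.append("suspicious_trailing_character")
    # Ellipsis at end
    if s.endswith("..."):
        reasons.append("trailing_ellipsis")
    return bool(reasons), reasons

print(looks_truncated('{"users": [{"name": "Alice"}, {"name": "Bob", "age":'))
# (True, ['unclosed_braces_or_brackets', 'suspicious_trailing_character'])
print(looks_truncated('{"ok": true}'))  # (False, [])
```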
Validate against JSON Schema subset:
```python
from ai_json_cleanroom import validate_ai_json

schema = {
    "type": "object",
    "required": ["name", "email"],
    "properties": {
        "name": {"type": "string", "minLength": 1, "maxLength": 100},
        "email": {"type": "string", "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
        "tags": {"type": "array", "minItems": 1, "items": {"type": "string"}}
    },
    "additionalProperties": False
}

result = validate_ai_json(ai_output, schema=schema)

if not result.json_valid:
    for error in result.errors:
        print(f"{error.code}: {error.message} at {error.path}")
```

Supported schema keywords:
- Types: `object`, `array`, `string`, `number`, `integer`, `boolean`, `null`
- Object: `required`, `properties`, `patternProperties`, `additionalProperties`
- Array: `items`, `additionalItems`, `minItems`, `maxItems`, `uniqueItems`
- String: `minLength`, `maxLength`, `pattern`
- Number: `minimum`, `maximum`, `exclusiveMinimum`, `exclusiveMaximum`, `multipleOf`
- Combinators: `anyOf`, `oneOf`, `allOf`
- Constraints: `enum`, `const`, `allow_empty`
Validate specific paths with wildcard support:
```python
from ai_json_cleanroom import validate_ai_json

expectations = [
    {
        "path": "users[*].email",
        "required": True,
        "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"
    },
    {
        "path": "users[*].status",
        "required": True,
        "in": ["active", "pending", "inactive"]
    },
    {
        "path": "metadata.version",
        "required": True,
        "type": "string",
        "pattern": r"^\d+\.\d+\.\d+$"
    },
    {
        "path": "items[*].price",
        "minimum": 0,
        "type": "number"
    }
]

result = validate_ai_json(ai_output, expectations=expectations)
```

Expectation options:
- `path`: JSONPath-like with wildcards (`[*]` for arrays, `*` for object values)
- `required`: Whether path must exist (default: `True`)
- `type`: Expected type(s)
- `equals`: Exact value match
- `in`: Value must be in list
- `pattern`: Regex pattern for strings
- `min_length`, `max_length`: String length constraints
- `min_items`, `max_items`: Array size constraints
- `minimum`, `maximum`: Numeric bounds
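To picture how a wildcard path like `users[*].email` resolves against parsed data, here is a simplified stdlib sketch (the library's actual path engine is more complete; this only covers dotted keys and the `[*]` array wildcard):

```python
def resolve_path(data, path):
    """Return all values matching a dotted path with [*] array wildcards."""
    matches = [data]
    for part in path.split("."):
        next_matches = []
        if part.endswith("[*]"):
            # Fan out over every element of the named array
            key = part[:-3]
            for m in matches:
                if isinstance(m, dict) and isinstance(m.get(key), list):
                    next_matches.extend(m[key])
        else:
            # Plain key lookup
            for m in matches:
                if isinstance(m, dict) and part in m:
                    next_matches.append(m[part])
        matches = next_matches
    return matches

doc = {"users": [{"email": "a@x.com"}, {"email": "b@x.com"}, {"name": "noreply"}]}
print(resolve_path(doc, "users[*].email"))  # ['a@x.com', 'b@x.com']
```

A `required` expectation on `users[*].email` would then flag the third user, whose match is missing.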
Always returns a ValidationResult - never crashes:
```python
from ai_json_cleanroom import validate_ai_json

result = validate_ai_json(any_input)

# Always safe to access
print(f"Valid: {result.json_valid}")
print(f"Truncated: {result.likely_truncated}")
print(f"Errors: {len(result.errors)}")
print(f"Warnings: {len(result.warnings)}")
print(f"Data: {result.data}")  # None if invalid
print(f"Info: {result.info}")  # Extraction/parsing metadata

# Structured error handling
for error in result.errors:
    print(f"Code: {error.code}")
    print(f"Path: {error.path}")
    print(f"Message: {error.message}")
    print(f"Detail: {error.detail}")
```

Not sure which options to enable? This guide explains each repair strategy with practical examples.
What it does: Converts Python-style single quotes 'text' to JSON double quotes "text"
When to keep it ON:
- Working with GPT models (they often mix Python/JSON syntax)
- Processing outputs from code-generation models
- General use - this is safe and commonly needed
When to turn it OFF:
- Your AI model never uses single quotes (rare)
- You're processing pure JSON from a non-AI source
Example scenario:
```python
# GPT often returns this mix:
input = "{'name': 'Alice', \"age\": 30}"  # Mixed quotes

# With fix_single_quotes=True:
# ✅ Becomes: {"name": "Alice", "age": 30}

# With fix_single_quotes=False:
# ❌ Parse fails on single quotes
```

What it does: Adds quotes to JavaScript-style unquoted object keys
When to keep it ON:
- Working with models trained on JavaScript/TypeScript code
- Processing outputs that might include object literals
- Claude models (sometimes output JS-style objects)
When to turn it OFF:
- Strict JSON-only environment
- You want to detect and reject JS-style syntax
Real-world example:
```python
# Claude sometimes returns:
input = "{name: 'Alice', age: 30, active: true}"

# With quote_unquoted_keys=True:
# ✅ Becomes: {"name": "Alice", "age": 30, "active": true}
```

What it does: Converts Python/JS constants (`True`/`False`/`None`) to JSON (`true`/`false`/`null`)
When to keep it ON:
- Always, unless you have a specific reason not to
- Essential for Python-trained models
Example:
```python
# Models often mix languages:
input = '{"active": True, "deleted": False, "parent": None}'

# With replace_constants=True:
# ✅ Becomes: {"active": true, "deleted": false, "parent": null}
```

What it does: Removes JavaScript-style comments (`//` and `/* */`)
When to keep it ON:
- Models that explain their JSON with comments
- When processing configuration-style outputs
Example:
```python
input = '''
{
    "name": "Alice",  // user name
    /* age field */
    "age": 30
}
'''
# ✅ Comments are safely removed
```

What it does: Handles smart/typographic quotes that break JSON parsing
Options:
- `"always"` - Convert smart quotes before parsing (safest)
- `"auto"` - Only convert if initial parse fails (balanced approach)
- `"never"` - Keep smart quotes as-is (when you want to preserve them)
When to use each:
- `"always"`: Default choice, handles copy-paste from documents
- `"auto"`: When performance matters and smart quotes are rare
- `"never"`: When processing content where quote style matters
Example:
```python
# From copy-paste or models trained on web text:
input = '{"text": "She said "hello" to me"}'  # Smart quotes

# With normalize_curly_quotes="always":
# ✅ Becomes: {"text": "She said \"hello\" to me"}
```

What it does: Master toggle for all repair strategies
When to turn OFF:
- You want to validate only, not repair
- Debugging to see raw parsing errors
- You have your own repair logic
What they do: Safety limits to prevent over-correction
When to increase:
- Very messy outputs from older models
- Known high-error scenarios
When to decrease:
- You want stricter validation
- Suspicious of too many modifications
Example configuration:
```python
from ai_json_cleanroom import validate_ai_json, ValidateOptions

# For very messy outputs:
options = ValidateOptions(
    max_total_repairs=500,    # Allow more fixes
    max_repairs_percent=0.05  # Allow 5% of content to be modified
)

# For strict validation:
options = ValidateOptions(
    max_total_repairs=10,      # Minimal fixes only
    max_repairs_percent=0.001  # Less than 0.1% modifications
)
```

📝 Note: Start with defaults. They're battle-tested on thousands of real AI outputs. Only adjust if you have specific issues.
Even with OpenAI's JSON mode, you're not guaranteed clean JSON. Why? The model might still:
- Wrap JSON in markdown code fences (happens ~15% of the time)
- Get truncated if your request is too large
- Add "helpful" explanatory text before or after
Use Cleanroom as a safety net - it adds virtually no overhead when JSON is clean, but saves you when it's not:
```python
from openai import OpenAI
from ai_json_cleanroom import validate_ai_json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5.1-2025-11-13",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that outputs JSON."},
        {"role": "user", "content": "Generate user profile for Alice Johnson, age 30"}
    ],
    response_format={"type": "json_object"}
)

# Clean and validate the response
result = validate_ai_json(
    response.choices[0].message.content,
    schema={
        "type": "object",
        "required": ["name", "age"],
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer", "minimum": 0}
        }
    }
)

if result.json_valid:
    user_data = result.data
    print(f"User: {user_data['name']}, Age: {user_data['age']}")
else:
    # Use structured feedback for retry
    error_msg = "\n".join([e.message for e in result.errors])
    print(f"Validation failed:\n{error_msg}")
    # Optionally retry with feedback
```

Token Limit Handling: If you get truncation, Cleanroom tells you immediately - no need to waste API calls trying to parse incomplete JSON
Retry Strategy: Use the specific error messages for targeted retry prompts
Cost Savings: Check result.likely_truncated before retrying with higher token limits
Claude loves to be helpful. It often:
- Wraps JSON in markdown code fences with explanations
- Adds conversational text before and after
- Uses varied quote styles depending on context
Cleanroom handles Claude's chattiness automatically:
```python
import anthropic
from ai_json_cleanroom import validate_ai_json

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON object with user info for Alice, age 30"
        }
    ]
)

# Claude might return:
# "Here's the user data:\n```json\n{\"name\": \"Alice\", \"age\": 30}\n```\nLet me know if you need anything else!"

result = validate_ai_json(message.content[0].text)

if result.json_valid:
    print(f"Extracted data: {result.data}")
    print(f"Extraction source: {result.info['source']}")  # 'code_fence'
else:
    if result.likely_truncated:
        print("Response was truncated, increasing max_tokens...")
    else:
        print(f"Validation errors: {result.errors}")
```

Use Cleanroom as a pre-processor before LangChain's parsers:
```python
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers import JsonOutputParser
from ai_json_cleanroom import validate_ai_json, ValidateOptions
import json

# Initialize LangChain LLM
llm = ChatOpenAI(model="gpt-5.1", temperature=0)

# Create prompt
prompt = ChatPromptTemplate.from_template(
    "Generate a JSON object with information about {topic}. Return only valid JSON."
)

# Get LLM response
chain = prompt | llm
response = chain.invoke({"topic": "Python programming"})

# Step 1: Clean with ai-json-cleanroom
cleaned = validate_ai_json(
    response.content,
    options=ValidateOptions(
        enable_safe_repairs=True,
        extract_json=True
    )
)

if cleaned.json_valid:
    # Step 2: Pass to LangChain parser if needed
    parser = JsonOutputParser()
    # Convert back to string for LangChain parser
    structured = parser.parse(json.dumps(cleaned.data))
    print(structured)

    # Or use cleaned.data directly
    print(cleaned.data)
else:
    print(f"Cleaning failed: {cleaned.errors}")
    if cleaned.likely_truncated:
        print("Retry with higher max_tokens")
```

Cleanroom and Instructor work perfectly together - clean first, then map to Pydantic models:
```python
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI
from ai_json_cleanroom import validate_ai_json

# Define Pydantic model
class User(BaseModel):
    name: str = Field(description="User's full name")
    age: int = Field(ge=0, le=150, description="User's age")
    email: str = Field(pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")
    tags: list[str] = Field(default_factory=list)

# Get raw AI output (without Instructor's patching)
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5.1",
    messages=[
        {"role": "user", "content": "Generate user info for Alice, 30 years old"}
    ]
)
raw_output = response.choices[0].message.content

# Step 1: Clean with ai-json-cleanroom
result = validate_ai_json(raw_output)

if result.json_valid:
    # Step 2: Map to Pydantic with Instructor
    try:
        user = User(**result.data)
        print(f"User: {user.name}, Age: {user.age}, Email: {user.email}")
    except Exception as e:
        print(f"Pydantic validation failed: {e}")
else:
    print(f"JSON cleaning failed: {result.errors}")
```

Alternative with Instructor's client:
```python
import instructor
from openai import OpenAI
from ai_json_cleanroom import validate_ai_json

# Patch OpenAI client with Instructor
client = instructor.from_openai(OpenAI())

# If Instructor fails, fallback to Cleanroom
try:
    user = client.chat.completions.create(
        model="gpt-5.1",
        response_model=User,
        messages=[{"role": "user", "content": "Generate user info"}]
    )
except Exception as e:
    # Fallback: get raw response and clean manually
    raw_response = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": "Generate user info as JSON"}]
    )
    result = validate_ai_json(raw_response.choices[0].message.content)
    if result.json_valid:
        user = User(**result.data)
```

Handle streaming responses by collecting chunks first:
```python
from openai import OpenAI
from ai_json_cleanroom import validate_ai_json

client = OpenAI()

# Collect streaming chunks
chunks = []
stream = client.chat.completions.create(
    model="gpt-5.1",
    messages=[{"role": "user", "content": "Generate user JSON"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        chunks.append(chunk.choices[0].delta.content)

# Validate complete output
full_output = ''.join(chunks)
result = validate_ai_json(full_output)

if result.likely_truncated:
    print("Stream was truncated, consider retrying with higher limits")
    print(f"Truncation reasons: {result.errors[0].detail.get('truncation_reasons')}")
elif result.json_valid:
    print(f"Valid JSON received: {result.data}")
else:
    print(f"Validation errors: {result.errors}")
```

Use validation errors to provide specific feedback for retries:
```python
from ai_json_cleanroom import validate_ai_json
import openai

def generate_with_retry(prompt, schema, max_retries=3):
    """Generate JSON with automatic retry on validation failure."""
    client = openai.OpenAI()

    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-5.1",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that returns valid JSON."},
                {"role": "user", "content": prompt}
            ]
        )

        result = validate_ai_json(
            response.choices[0].message.content,
            schema=schema
        )

        if result.json_valid:
            return result.data

        # Build feedback for retry
        if result.likely_truncated:
            prompt = f"{prompt}\n\nIMPORTANT: Your previous response was truncated. Please ensure the complete JSON is returned."
        else:
            error_messages = [f"- {e.path}: {e.message}" for e in result.errors]
            feedback = "\n".join(error_messages)
            prompt = f"{prompt}\n\nYour previous JSON had these issues:\n{feedback}\n\nPlease fix these and return valid JSON."

    raise ValueError(f"Failed to generate valid JSON after {max_retries} attempts")

# Usage
schema = {
    "type": "object",
    "required": ["name", "email", "age"],
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string", "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"},
        "age": {"type": "integer", "minimum": 0}
    }
}

user_data = generate_with_retry(
    "Generate a user profile for Alice Johnson",
    schema=schema
)
print(user_data)
```

The Problem: You explicitly ask for JSON only, but get:
```
I'll help you with that! Here's the JSON data:

{"status": "success"}

Let me know if you need anything else!
```

The Solution:
```python
# Cleanroom automatically extracts the JSON part
result = validate_ai_json(chatty_response)
print(result.data)            # Just the JSON: {"status": "success"}
print(result.info['source'])  # Tells you where it found it: 'balanced_block'
```

The Problem: Large responses get truncated:
```
{"users": [{"id": 1, "name": "Alice"}, {"id": 2, "na
```

The Solution:
```python
result = validate_ai_json(truncated_response)

if result.likely_truncated:
    # You know exactly what happened
    print("Response truncated - reasons:", result.errors[0].detail['truncation_reasons'])
    # Output: ['unclosed_braces_or_brackets', 'unterminated_string']

    # Smart retry with higher token limit
    retry_with_higher_limit()
```

The Problem: Your AI model mixes Python and JSON syntax:
```python
output = "{'users': [\"Alice\", \"Bob\"], 'count': 2}"
```

The Solution:
```python
result = validate_ai_json(output)
# Automatically fixes to: {"users": ["Alice", "Bob"], "count": 2}
```

The Problem: You need certain fields but don't want full schema validation.
The Solution: Use path expectations:
```python
expectations = [
    {"path": "users[*].email", "required": True},
    {"path": "metadata.version", "pattern": r"^\d+\.\d+\.\d+$"}
]

result = validate_ai_json(ai_output, expectations=expectations)
# Validates that all users have emails and version is semver
```

The Problem: AI model adds helpful comments that contain important context:
```
{
  "temperature": 0.7, // Higher for creativity
  "max_tokens": 100   // Keep responses concise
}
```

The Solution:
```python
# First, extract with comments preserved to see them
raw_response = ai_output

# Clean for parsing
result = validate_ai_json(raw_response)

# The comments are removed for valid JSON
print(result.data)  # {"temperature": 0.7, "max_tokens": 100}

# If you need the comments, parse them separately from raw_response
```

The Problem: GPT uses Python syntax, Claude wraps in markdown, Gemini truncates.
The Solution: One configuration handles all:
```python
from ai_json_cleanroom import validate_ai_json

# Same code for ALL models
def clean_any_ai_output(output):
    result = validate_ai_json(output)  # Default options handle everything

    if result.json_valid:
        return result.data
    elif result.likely_truncated:
        raise ValueError("Output truncated - increase token limit")
    else:
        raise ValueError(f"Could not parse: {result.errors}")

# Works with GPT, Claude, Gemini, Llama, etc.
```
⚠️ Important: Truncation detection always runs first. If JSON is truncated, repairs are skipped to avoid corrupting partial data.
Possible causes and solutions:
- Truncation detected
  - Cleanroom disables repairs for truncated input (safety measure)
  - Solution: Get complete output first, then retry
- Repair limit reached
  - Default limit: 200 changes or 2% of input size
  - Solution: Increase limits if needed:

    ```python
    options = ValidateOptions(
        max_total_repairs=500,    # Raise limit
        max_repairs_percent=0.05  # Allow 5% modifications
    )
    ```

- Specific repair disabled
  - Check your options - maybe `fix_single_quotes=False`?
  - Solution: Enable the specific repair you need
Common hidden issues:
- Invisible Unicode characters (zero-width spaces, etc.)
- Smart quotes from copy-paste: `“text”` vs `"text"`
- Line breaks inside strings without proper escaping
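Zero-width characters in particular are easy to reproduce and to strip. A minimal stdlib illustration (this is not the library's code, just a demonstration of the failure and the fix):

```python
import json

# Zero-width and BOM characters that often sneak into copied AI output
INVISIBLES = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

raw = '{"name": "Alice"}\u200b'  # zero-width space appended
cleaned = raw.translate(INVISIBLES)

try:
    json.loads(raw)
except json.JSONDecodeError:
    print("raw input fails to parse")  # zero-width space is not JSON whitespace

print(json.loads(cleaned))  # {'name': 'Alice'}
```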
Diagnosis:
```python
result = validate_ai_json(your_input, options=ValidateOptions(
    normalize_curly_quotes="always"  # Fixes smart quotes
))
print(result.errors)  # See specific character positions
```

Issue: Different models have different quirks.
Solution: Check the extraction source:
```python
result = validate_ai_json(claude_output)
print(f"Found JSON in: {result.info['source']}")
# 'code_fence'     = markdown block
# 'balanced_block' = found in text
# 'raw'            = was already clean
```

Solutions:
- Install orjson: `pip install orjson` for a 3.6x speedup
- Disable unnecessary repairs:

  ```python
  options = ValidateOptions(
      strip_js_comments=False,        # If you never have comments
      normalize_curly_quotes="never"  # If you never have smart quotes
  )
  ```
Solution: Check warnings and info:
```python
result = validate_ai_json(messy_json)

# See all repairs applied
for warning in result.warnings:
    if warning.code == "repaired":
        print(f"Repairs applied: {warning.detail['applied']}")
        print(f"Number of changes: {warning.detail['counts']}")

# See extraction details
print(f"Extraction method: {result.info['source']}")
print(f"Parser used: {result.info['parse_backend']}")
```

Common issues:
- Pattern escaping: Remember to use raw strings for regex: `r"^\d+$"`
- Type mismatches: JSON numbers include floats - use `"type": "number"` not `"integer"` unless you're sure
- Required fields: Double-check field names are exact matches
Debug approach:
```python
# Start without schema to see actual structure
result = validate_ai_json(output)
print(json.dumps(result.data, indent=2))

# Then add schema gradually
schema = {"type": "object"}  # Start simple
# Add requirements one by one
```

Main validation function with comprehensive options.
```python
def validate_ai_json(
    input_data: Union[str, bytes, Dict, List],
    schema: Optional[Dict[str, Any]] = None,
    expectations: Optional[List[Dict[str, Any]]] = None,
    options: Optional[ValidateOptions] = None
) -> ValidationResult:
    """
    Validate AI output against JSON parseability, schema, and expectations.

    Args:
        input_data: String, bytes, or already-parsed dict/list
        schema: JSON Schema subset for validation
        expectations: List of path-based validation rules
        options: Configuration for parsing, extraction, and repair

    Returns:
        ValidationResult with json_valid, errors, warnings, data, and info
    """
```

Result object returned by `validate_ai_json()`.
```python
@dataclass
class ValidationResult:
    json_valid: bool                   # True if parsing and validation succeeded
    likely_truncated: bool             # True if input appears truncated
    errors: List[ValidationIssue]      # Validation errors
    warnings: List[ValidationIssue]    # Non-blocking warnings
    data: Optional[Union[Dict, List]]  # Parsed JSON if valid, else None
    info: Dict[str, Any]               # Extraction/parsing metadata

    def to_dict(self) -> Dict[str, Any]:
        """Convert result to dictionary."""
```

Metadata in `info`:
- `source`: How JSON was found (`"raw"`, `"code_fence"`, `"balanced_block"`, `"object"`)
- `extraction`: Details about extraction process
- `parse_backend`: Which parser was used (`"orjson"` or `"json"`)
- `curly_quotes_normalization_used`: Whether typographic quotes were normalized
- `repair`: Details about applied repairs (if any)
Individual validation error or warning.
```python
@dataclass
class ValidationIssue:
    code: ErrorCode                   # Error type (enum)
    path: str                         # JSONPath where error occurred
    message: str                      # Human-readable description
    detail: Optional[Dict[str, Any]]  # Additional context

    def to_dict(self) -> Dict[str, Any]:
        """Convert issue to dictionary."""
```

Configuration for validation behavior.
```python
@dataclass
class ValidateOptions:
    # Extraction options
    strict: bool = False
    extract_json: bool = True
    allow_json_in_code_fences: bool = True
    allow_bare_top_level_scalars: bool = False
    tolerate_trailing_commas: bool = True
    stop_on_first_error: bool = False

    # Repair options
    enable_safe_repairs: bool = True
    allow_json5_like: bool = True         # Master toggle for JSON5-like repairs
    replace_constants: bool = True        # True/False/None → true/false/null
    replace_nans_infinities: bool = True  # NaN/Infinity → null
    max_total_repairs: int = 200
    max_repairs_percent: float = 0.02     # 2% of input size

    # Granular repair control (new in v1.1)
    normalize_curly_quotes: str = "always"  # "always"|"auto"|"never"
    fix_single_quotes: bool = True
    quote_unquoted_keys: bool = True
    strip_js_comments: bool = True

    # Custom repair hooks (new in v1.1)
    custom_repair_hooks: Optional[List[Callable]] = None
```

Curly quotes normalization modes:
"always"(default): Normalize typographic quotes before parsing"auto": Try parsing first; only normalize if parse fails"never": Never normalize (preserves typographic quotes as-is)
Enumeration of validation error types.
```python
class ErrorCode(str, Enum):
    PARSE_ERROR = "parse_error"
    TRUNCATED = "truncated"
    MISSING_REQUIRED = "missing_required"
    TYPE_MISMATCH = "type_mismatch"
    ENUM_MISMATCH = "enum_mismatch"
    CONST_MISMATCH = "const_mismatch"
    NOT_ALLOWED_EMPTY = "not_allowed_empty"
    ADDITIONAL_PROPERTY = "additional_property"
    PATTERN_MISMATCH = "pattern_mismatch"
    MIN_LENGTH = "min_length"
    MAX_LENGTH = "max_length"
    MIN_ITEMS = "min_items"
    MAX_ITEMS = "max_items"
    MINIMUM = "minimum"
    MAXIMUM = "maximum"
    REPAIRED = "repaired"  # Warning: a repair was applied
    # ... and more
```

```python
def extract_json_payload(
    text: str,
    options: Optional[ValidateOptions] = None
) -> Tuple[Optional[str], Dict[str, Any]]:
    """
    Extract a JSON string from raw text.

    Returns:
        (payload, info) - payload is the raw JSON string, or None
    """

def detect_truncation(s: str) -> Tuple[bool, List[str]]:
    """
    Heuristic truncation detector.

    Returns:
        (likely_truncated, reasons)
    """
```

The CLI isn't just for automation: it's also perfect for debugging AI outputs during development.
```bash
# Got weird output from your AI model? Test it immediately:
echo '{"name": "test"}' | python ai_json_cleanroom.py --input -

# Testing a saved AI response:
python ai_json_cleanroom.py --input gpt_response.txt

# See exactly what gets fixed:
python ai_json_cleanroom.py --input messy.json --verbose
# Output shows:
#   Fixed 3 single-quoted strings
#   Quoted 2 unquoted keys
#   Normalized 4 curly quotes
#   Removed 2 trailing commas

# Validate inline text
python ai_json_cleanroom.py --input '{"name": "Alice", "age": 30}'

# With JSON Schema validation
python ai_json_cleanroom.py --input response.txt --schema schema.json

# With expectations
python ai_json_cleanroom.py --input response.txt --expectations expectations.json
```

```bash
# See what would be fixed without actually fixing:
python ai_json_cleanroom.py --input data.json --dry-run

# Test different repair strategies:
python ai_json_cleanroom.py --input data.json \
    --normalize-curly-quotes never \
    --no-fix-single-quotes \
    --verbose  # See the difference

# Process multiple files:
for file in responses/*.json; do
    echo "Processing $file..."
    python ai_json_cleanroom.py --input "$file" --output-clean
done
```

```bash
# Disable extraction (input must be pure JSON)
python ai_json_cleanroom.py --input data.json --no-extract

# Disable the repair stage
python ai_json_cleanroom.py --input response.txt --no-repair

# Disable specific repair passes
python ai_json_cleanroom.py --input response.txt \
    --no-fix-single-quotes \
    --no-quote-unquoted-keys \
    --no-strip-comments

# Control curly quotes normalization
python ai_json_cleanroom.py --input response.txt \
    --normalize-curly-quotes auto  # always|auto|never

# Adjust repair limits
python ai_json_cleanroom.py --input response.txt \
    --max-repairs 500 \
    --repairs-percent 0.05

# Strict mode (stop on first error)
python ai_json_cleanroom.py --input response.txt --strict

# Control output format
python ai_json_cleanroom.py --input response.txt \
    --indent 4 \
    --ensure-ascii
```

The CLI outputs a JSON result with validation details:
{ "json_valid": true, "likely_truncated": false, "errors": [], "warnings": [ { "code": "repaired", "path": "$", "message": "Input JSON was repaired by conservative heuristics.", "detail": { "applied": ["single_quoted_to_double_quoted", "replace_constants"], "counts": { "single_quoted_strings_converted": 3, "replace_constants": {"true_false_none": 2} } } } ], "data": { "name": "Alice", "age": 30, "active": true }, "info": { "source": "code_fence", "parse_backend": "orjson", "curly_quotes_normalization_used": true } }Fine-tune which repair strategies to apply:
```python
from ai_json_cleanroom import validate_ai_json, ValidateOptions

options = ValidateOptions(
    # Master toggle (backward compatibility)
    allow_json5_like=True,

    # Individual repair passes (new in v1.1)
    fix_single_quotes=True,        # 'foo' → "foo"
    quote_unquoted_keys=True,      # {foo: 1} → {"foo": 1}
    strip_js_comments=True,        # Remove // and /* */ comments

    # Python/JS constants
    replace_constants=True,        # True/False/None → true/false/null
    replace_nans_infinities=True,  # NaN/Infinity → null

    # Curly quotes handling
    normalize_curly_quotes="auto",  # "always"|"auto"|"never"

    # Safety limits
    max_total_repairs=200,
    max_repairs_percent=0.02
)

result = validate_ai_json(ai_output, options=options)
```

Add domain-specific repair logic:
```python
from ai_json_cleanroom import validate_ai_json, ValidateOptions

def my_custom_repair(text: str, options: ValidateOptions):
    """
    Custom repair function.

    Args:
        text: Current text being repaired
        options: ValidateOptions instance

    Returns:
        (modified_text, changes_count, metadata_dict)
    """
    modified = text
    changes = 0
    metadata = {}

    # Example: replace specific domain patterns
    if "UNDEFINED" in modified:
        modified = modified.replace("UNDEFINED", "null")
        changes += modified.count("null") - text.count("null")
        metadata["undefined_replacements"] = changes

    return modified, changes, metadata

# Use the custom hook
options = ValidateOptions(
    custom_repair_hooks=[my_custom_repair]
)

result = validate_ai_json(ai_output, options=options)

# Check whether the custom hook was applied
if "custom_hook:my_custom_repair" in result.info.get("repair", {}).get("applied", []):
    print("Custom repair was applied")
```

Validate complex nested structures:
```python
from ai_json_cleanroom import validate_ai_json

# Validate API response structure
expectations = [
    # All users must have an email matching this pattern
    {
        "path": "data.users[*].email",
        "required": True,
        "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"
    },
    # All users must have a valid status
    {
        "path": "data.users[*].status",
        "required": True,
        "in": ["active", "pending", "inactive", "banned"]
    },
    # Nested array validation
    {
        "path": "data.users[*].orders[*].total",
        "type": "number",
        "minimum": 0
    },
    # Metadata version must be semver
    {
        "path": "metadata.api_version",
        "required": True,
        "pattern": r"^\d+\.\d+\.\d+$"
    },
    # Optional field with constraints when present
    {
        "path": "data.users[*].phone",
        "required": False,
        "pattern": r"^\+?[\d\s\-\(\)]+$"
    }
]

result = validate_ai_json(ai_output, expectations=expectations)

# Inspect path-specific errors
for error in result.errors:
    if error.code == "path_not_found":
        print(f"Missing required path: {error.path}")
    elif error.code == "pattern_mismatch":
        print(f"Pattern mismatch at {error.path}: {error.message}")
```

Track exactly what was repaired:
```python
from ai_json_cleanroom import validate_ai_json

result = validate_ai_json(messy_ai_output)

if result.warnings:
    for warning in result.warnings:
        if warning.code == "repaired":
            applied_repairs = warning.detail.get("applied", [])
            counts = warning.detail.get("counts", {})

            print("Applied repairs:")
            for repair in applied_repairs:
                print(f"  - {repair}")

            print("\nRepair counts:")
            for repair_type, count in counts.items():
                print(f"  - {repair_type}: {count}")

# Check repair info
if "repair" in result.info:
    repair_info = result.info["repair"]
    print(f"Total repairs applied: {len(repair_info.get('applied', []))}")
    if "skipped" in repair_info:
        print(f"Repairs skipped: {repair_info['skipped']}")
```

Performance comparison of stdlib `json` vs `orjson`:
| Operation | stdlib json | orjson | Speedup |
|---|---|---|---|
| Parse (simple) | 1.8 µs | 0.4 µs | 4.35x |
| Parse (complex) | 20.2 µs | 8.3 µs | 2.42x |
| Dump (simple) | 2.1 µs | 0.2 µs | 10.83x |
| Dump (complex) | 23.8 µs | 2.2 µs | 11.04x |
Benchmarks on Python 3.11.5, Intel Core i9-11900K @ 3.50GHz
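If you want to sanity-check numbers like these on your own hardware, here is a rough stdlib `timeit` sketch (absolute times will differ by machine and Python version; these are not the project's official benchmark scripts):

```python
import json
import timeit

simple = '{"name": "Alice", "age": 30, "active": true}'
N = 100_000

# Average per-call parse time for stdlib json, in microseconds
stdlib_us = timeit.timeit(lambda: json.loads(simple), number=N) / N * 1e6
print(f"stdlib json parse (simple): {stdlib_us:.2f} us")

# Compare orjson the same way, if it is installed
try:
    import orjson
    orjson_us = timeit.timeit(lambda: orjson.loads(simple), number=N) / N * 1e6
    print(f"orjson parse (simple): {orjson_us:.2f} us ({stdlib_us / orjson_us:.1f}x faster)")
except ImportError:
    print("orjson not installed; skipping comparison")
```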
Repair operations add minimal overhead:
| Scenario | Time | Notes |
|---|---|---|
| Clean JSON (no repairs) | ~1 µs | Direct parse with orjson |
| Markdown extraction + parse | 84 µs | Full validation pipeline |
| Multiple repairs + parse | 218 µs | Fix quotes, constants, trailing commas |
Average times from validation pipeline benchmarks (1000 iterations)
Install orjson for production use when:
- Processing high volumes of AI outputs
- Latency matters (API endpoints, real-time systems)
- Large JSON payloads (>10KB)
```bash
pip install orjson
```

The library automatically uses orjson when available, with a transparent fallback to stdlib `json`.
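The transparent fallback is the usual try-import pattern; here is a minimal sketch of the same idea (illustration only, not the library's actual code):

```python
# Prefer orjson when present, otherwise fall back to stdlib json
try:
    import orjson

    def parse(s: str):
        return orjson.loads(s)

    PARSE_BACKEND = "orjson"
except ImportError:
    import json

    def parse(s: str):
        return json.loads(s)

    PARSE_BACKEND = "json"

print(PARSE_BACKEND, parse('{"ok": true}'))
```

The backend actually chosen at runtime is reported in `result.info["parse_backend"]`.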
Use AI JSON Cleanroom if you:
- Work with any AI model (GPT, Claude, Gemini, Llama)
- Receive JSON wrapped in explanations or markdown
- Face token limit truncations
- Need detailed error messages for retries
- Want one solution for all AI quirks
- Value zero required dependencies (orjson is optional)
You might not need it if you:
- Only work with clean, guaranteed JSON
- Control token generation completely (using guidance, lm-format-enforcer)
- Never hit token limits
- Your AI model never adds explanatory text
Instead of a complex feature matrix, here's what matters:
Your Current Approach → With Cleanroom
- `try: json.loads()` → Always get a result, never crashes
- Regex extraction → Automatic markdown/fence detection
- Custom retry logic → Structured errors for targeted retries
- "Is it truncated?" → Immediate truncation detection with reasons
- Multiple fix attempts → One call handles everything
- Scattered error handling → Unified validation pipeline
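The "structured errors for targeted retries" point deserves a sketch. Below is a generic retry loop with stdlib `json` standing in as the validator so the snippet is self-contained; in practice you would plug in `validate_ai_json` and format `result.errors` (plus `result.likely_truncated`) into the feedback string. `ask` and `validate` are stand-in names, not part of the library:

```python
import json

def retry_with_feedback(ask, validate, prompt, max_attempts=3):
    """Re-ask the model, feeding validation feedback back into the prompt."""
    for _ in range(max_attempts):
        raw = ask(prompt)
        ok, data, feedback = validate(raw)
        if ok:
            return data
        prompt = f"{prompt}\n\nPrevious output was invalid: {feedback}. Return only valid JSON."
    raise ValueError("No valid JSON after retries")

# Demo with stand-ins (no model call): the second attempt succeeds
attempts = iter(['{"broken', '{"name": "Alice"}'])

def ask(_prompt):
    return next(attempts)

def validate(raw):
    try:
        return True, json.loads(raw), ""
    except json.JSONDecodeError as exc:
        return False, None, str(exc)

print(retry_with_feedback(ask, validate, "Give me user JSON"))  # {'name': 'Alice'}
```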
Cleanroom + Instructor (Pydantic):
```python
# 1. Clean with Cleanroom
result = validate_ai_json(ai_output)

# 2. Map to a Pydantic model
if result.json_valid:
    user = UserModel(**result.data)
```

Cleanroom + LangChain:
```python
# Use as a pre-processor before LangChain parsers
cleaned = validate_ai_json(response.content)
if cleaned.json_valid:
    chain_result = parser.parse(json.dumps(cleaned.data))
```

Cleanroom + Your Custom Logic:
```python
# Get clean data, then apply your business rules
result = validate_ai_json(ai_response)
if result.json_valid:
    your_custom_processor(result.data)
```

If you've ever written code like this:
```python
import json
import re

# This is a common scenario...
try:
    data = json.loads(ai_output)
except Exception:
    # Try to extract JSON with a regex
    match = re.search(r'\{.*\}', ai_output, re.DOTALL)
    if match:
        try:
            # Maybe fix the quotes?
            fixed = match.group().replace("'", '"')
            data = json.loads(fixed)
        except Exception:
            # Give up
            raise ValueError("Can't parse AI output")
```

Then yes, you need this tool. It handles all of that (and much more) in one line:
```python
result = validate_ai_json(ai_output)  # Done.
```

MIT License
Copyright (c) 2025
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you find this tool useful, please consider starring the repo ⭐ to help others find it!