AutoEvaluator is a powerful Python library that accelerates LLM output quality control through automated evaluation. Using LLMs to evaluate LLMs, it provides a simple, transparent, and developer-friendly API to identify True Positives (TP), False Positives (FP), and False Negatives (FN) in generated content against ground truth.
- Automated Evaluation: Compare LLM outputs against ground truth with precision
- Multi-Provider Support: Works with AWS Bedrock, OpenAI, Anthropic, and Google Gemini
- Comprehensive Metrics: Automatically calculates Precision, Recall, and F1 Score
- Async-First Design: Built for high-performance concurrent evaluations
- Structured Outputs: Leverages Instructor for type-safe, validated responses
- Sentence-Level Granularity: Evaluates claims at the sentence level for detailed insights
- Python 3.9 or higher
- An API key for at least one supported LLM provider
Install from PyPI:

```bash
pip install autoevaluator
```

Or install from source:

```bash
git clone https://github.com/yourusername/autoevaluator.git
cd autoevaluator
pip install -e .
```

```python
import asyncio
from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client

async def main():
    # Set up a client for your preferred provider
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")

    # Define the claim to evaluate
    claim = "Feynman was born in 1918 in Malaysia"

    # Define the ground truth
    ground_truth = "Feynman was born in 1918 in America."

    # Evaluate the claim
    result = await evaluate(
        claim=claim,
        ground_truth=ground_truth,
        client=client,
        model_name="gpt-4o-mini"
    )
    print(result)

# Run the async function
asyncio.run(main())
```

Output:
```python
{
    'TP': ['Feynman was born in 1918.'],
    'FP': ['Feynman was born in Malaysia.'],
    'FN': ['Feynman was born in America.'],
    'precision': 0.5,
    'recall': 0.5,
    'f1_score': 0.5
}
```

AutoEvaluator supports multiple LLM providers out of the box:
| Provider | Models | Environment Variables |
|---|---|---|
| AWS Bedrock | Claude Sonnet 4.5 | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION |
| OpenAI | GPT-4o, GPT-4o-mini, etc. | OPENAI_API_KEY |
| Anthropic | Claude Sonnet 4, etc. | ANTHROPIC_API_KEY |
| Google Gemini | Gemini 2.0 Flash, etc. | GOOGLE_API_KEY |
Create a .env file in your project root:
```bash
# OpenAI
OPENAI_API_KEY=your_openai_api_key

# AWS Bedrock
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=ap-southeast-1

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_api_key

# Google Gemini
GOOGLE_API_KEY=your_google_api_key
```

Alternatively, set the environment variables programmatically:

```python
import os

# Set environment variables programmatically
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret_key"
```

Using OpenAI:

```python
import asyncio
from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_openai():
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")

    claim = "The Earth is flat and the moon landing was in 1969."
    ground_truth = "The Earth is round. The moon landing was in 1969."

    result = await evaluate(claim, ground_truth, client=client, model_name="gpt-4o-mini")

    print(f"True Positives: {result['TP']}")
    print(f"False Positives: {result['FP']}")
    print(f"False Negatives: {result['FN']}")
    print(f"Precision: {result['precision']:.2f}")
    print(f"Recall: {result['recall']:.2f}")
    print(f"F1 Score: {result['f1_score']:.2f}")

asyncio.run(evaluate_with_openai())
```

Using AWS Bedrock:

```python
import asyncio
from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_bedrock():
    client = get_instructor_client(provider="bedrock")

    claim = "Python was created by Guido van Rossum in 1991."
    ground_truth = "Python was created by Guido van Rossum in 1991."

    result = await evaluate(claim, ground_truth, client=client, model_name="bedrock-claude")
    return result

result = asyncio.run(evaluate_with_bedrock())
print(f"Perfect match! F1 Score: {result['f1_score']}")
```

Using Anthropic:

```python
import asyncio
from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_anthropic():
    client = get_instructor_client(
        provider="anthropic",
        model="claude-sonnet-4-20250514"
    )

    claim = "Water boils at 100°C at sea level."
    ground_truth = "Water boils at 100°C at sea level."

    result = await evaluate(claim, ground_truth, client=client, model_name="claude-sonnet-4-20250514")
    return result

result = asyncio.run(evaluate_with_anthropic())
```

Using Google Gemini:

```python
import asyncio
from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_gemini():
    client = get_instructor_client(
        provider="gemini",
        model="gemini-2.0-flash-exp"
    )

    claim = "The speed of light is approximately 300,000 km/s in a vacuum."
    ground_truth = "The speed of light is approximately 300,000 kilometers per second in a vacuum."

    result = await evaluate(
        claim=claim,
        ground_truth=ground_truth,
        client=client,
        model_name="gemini-2.0-flash-exp"
    )

    print(f"True Positives: {result['TP']}")
    print(f"False Positives: {result['FP']}")
    print(f"False Negatives: {result['FN']}")
    print(f"F1 Score: {result['f1_score']:.2f}")
    return result

result = asyncio.run(evaluate_with_gemini())
```

Batch evaluation with concurrent requests:

```python
import asyncio
from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client

async def batch_evaluate():
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")

    test_cases = [
        {
            "claim": "Einstein developed the theory of relativity.",
            "ground_truth": "Einstein developed the theory of relativity."
        },
        {
            "claim": "The capital of France is London.",
            "ground_truth": "The capital of France is Paris."
        },
        {
            "claim": "Water is composed of hydrogen and oxygen.",
            "ground_truth": "Water is composed of hydrogen and oxygen."
        }
    ]

    tasks = [
        evaluate(tc["claim"], tc["ground_truth"], client=client, model_name="gpt-4o-mini")
        for tc in test_cases
    ]
    results = await asyncio.gather(*tasks)

    for i, result in enumerate(results, 1):
        print(f"\n--- Test Case {i} ---")
        print(f"F1 Score: {result['f1_score']:.2f}")
        print(f"Precision: {result['precision']:.2f}")
        print(f"Recall: {result['recall']:.2f}")

asyncio.run(batch_evaluate())
```

`evaluate()`: Evaluates a claim against ground truth and returns detailed metrics.
```python
async def evaluate(
    claim: str,
    ground_truth: str,
    client: instructor.AsyncInstructor,
    model_name: str = "gpt-4o-mini"
) -> Dict[str, Any]
```

Parameters:
- `claim` (str): The text to be evaluated
- `ground_truth` (str): The reference text to compare against
- `client` (instructor.AsyncInstructor): Instructor-wrapped async client
- `model_name` (str): Model identifier to use
Returns:
Dictionary containing:
- `TP` (List[str]): List of true positive sentences
- `FP` (List[str]): List of false positive sentences
- `FN` (List[str]): List of false negative sentences
- `precision` (float): Precision score (0.0 to 1.0)
- `recall` (float): Recall score (0.0 to 1.0)
- `f1_score` (float): F1 score (0.0 to 1.0)
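The three scores follow the standard definitions over sentence counts. As a minimal sketch of how they relate to the returned lists (the helper name `compute_metrics` is illustrative, not part of the library's API):

```python
from typing import List, Tuple

def compute_metrics(tp: List[str], fp: List[str], fn: List[str]) -> Tuple[float, float, float]:
    """Standard precision/recall/F1 over sentence counts (illustrative helper)."""
    precision = len(tp) / (len(tp) + len(fp)) if (tp or fp) else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if (tp or fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Matches the quick-start example: one TP, one FP, one FN
print(compute_metrics(["Feynman was born in 1918."],
                      ["Feynman was born in Malaysia."],
                      ["Feynman was born in America."]))  # (0.5, 0.5, 0.5)
```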
`get_instructor_client()`: Creates an Instructor-wrapped client for the specified LLM provider.
```python
def get_instructor_client(
    provider: Literal["bedrock", "openai", "anthropic", "gemini"] = "bedrock",
    model: Optional[str] = None,
    api_key: Optional[str] = None,
    mode: instructor.Mode = instructor.Mode.JSON,
    **kwargs
) -> instructor.AsyncInstructor
```

Parameters:
- `provider` (str): LLM provider to use ("bedrock", "openai", "anthropic", "gemini")
- `model` (Optional[str]): Model name (uses provider default if None)
- `api_key` (Optional[str]): API key (falls back to environment variables)
- `mode` (instructor.Mode): Instructor parsing mode
- `**kwargs`: Additional provider-specific arguments
Returns:
An Instructor-wrapped async client ready for use.
`text_simplifier()`: Breaks down complex text into simple, single-clause sentences.
```python
async def text_simplifier(
    text: str,
    model_name: str,
    client: instructor.AsyncInstructor
) -> TextSimplify
```

AutoEvaluator uses a multi-step process to evaluate claims:
- Text Simplification: Complex sentences are broken down into simple, atomic claims
- Question Generation: Each simplified sentence is converted into a fact-checking question
- Bidirectional Verification: Questions are checked against both the claim and ground truth
- Classification: Sentences are classified as TP, FP, or FN based on verification results
- Metrics Calculation: Precision, Recall, and F1 scores are computed from the classifications
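The bidirectional verification and classification steps can be pictured as a small truth table: a fact supported by both texts is a TP, one supported only by the claim is an FP, and one supported only by the ground truth is an FN. The sketch below is illustrative only and is not the library's internal API:

```python
from typing import Optional

def classify(supported_by_claim: bool, supported_by_ground_truth: bool) -> Optional[str]:
    """Illustrative TP/FP/FN assignment from bidirectional verification."""
    if supported_by_claim and supported_by_ground_truth:
        return "TP"   # claim states it and the ground truth confirms it
    if supported_by_claim:
        return "FP"   # claim states it but the ground truth does not support it
    if supported_by_ground_truth:
        return "FN"   # ground truth states it but the claim omits it
    return None       # fact appears in neither text

print(classify(True, False))  # FP
```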
```
Input Claim & Ground Truth
  → Text Simplifier     (breaks into atomic sentences)
  → Question Generator  (creates fact-check questions)
  → Question Checker    (verifies against ground truth)
  → Classification      (TP/FP/FN assignment)
  → Metrics Calculation (Precision, Recall, F1)
  → Structured Output
```

```python
import asyncio
from autoevaluator import text_simplifier, get_instructor_client

async def simplify_text():
    client = get_instructor_client(provider="openai")

    complex_text = """Although the weather was bad and it was raining heavily,
    we decided to go hiking because we had planned it for weeks."""

    result = await text_simplifier(
        text=complex_text,
        model_name="gpt-4o-mini",
        client=client
    )

    print("Simplified sentences:")
    for sentence in result.simplified_sentences:
        print(f"- {sentence}")

asyncio.run(simplify_text())
```

Provider-specific clients are also available directly:

```python
from autoevaluator.client import (
    get_openai_instructor_client,
    get_bedrock_instructor_client,
    get_anthropic_instructor_client,
    get_gemini_instructor_client
)

# OpenAI
openai_client = get_openai_instructor_client(model="gpt-4o")

# Bedrock
bedrock_client = get_bedrock_instructor_client()

# Anthropic
anthropic_client = get_anthropic_instructor_client()

# Gemini
gemini_client = get_gemini_instructor_client(model="gemini-2.0-flash")
```

Error handling:

```python
import asyncio
from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client

async def safe_evaluate():
    try:
        client = get_instructor_client(provider="openai")
        result = await evaluate(
            claim="Some claim",
            ground_truth="Some truth",
            client=client,
            model_name="gpt-4o-mini"
        )
        return result
    except ValueError as e:
        print(f"Configuration error: {e}")
    except Exception as e:
        print(f"Evaluation error: {e}")

asyncio.run(safe_evaluate())
```

- Async by Default: All operations are asynchronous for better performance
- Batch Processing: Use `asyncio.gather()` for concurrent evaluations
- Rate Limiting: Be mindful of provider rate limits when running batch evaluations
- Caching: Consider caching results for repeated evaluations
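For batch runs against rate-limited providers, an `asyncio.Semaphore` is a simple way to cap concurrency while still using `asyncio.gather()`. This is a sketch under stated assumptions, not part of AutoEvaluator; the `bounded_gather` and `demo_worker` names are illustrative, and in practice the worker would call `evaluate(...)` with your client:

```python
import asyncio

async def bounded_gather(worker, arg_pairs, limit: int = 5):
    """Run worker(claim, ground_truth) for each pair, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(claim: str, truth: str):
        async with sem:  # at most `limit` workers hold the semaphore at once
            return await worker(claim, truth)

    return await asyncio.gather(*(run(c, t) for c, t in arg_pairs))

# Dummy worker standing in for evaluate(claim, ground_truth, client=..., model_name=...)
async def demo_worker(claim: str, truth: str) -> bool:
    await asyncio.sleep(0)
    return claim == truth

results = asyncio.run(bounded_gather(demo_worker, [("a", "a"), ("a", "b")], limit=2))
print(results)  # [True, False]
```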
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License. See the LICENSE file for details.
- Built with Instructor for structured outputs
- Supports multiple LLM providers through unified interfaces
- Inspired by the need for automated, reliable LLM evaluation
Darveen Vijayan
- LinkedIn: darveenvijayan
- Twitter: @DarveenVijayan
- Medium: LLMs: A Calculator for Words
- Multi-provider support (OpenAI, Bedrock, Anthropic, Gemini)
- Async-first architecture
- Improved text simplification
- Enhanced error handling
Made with ❤️ by Darveen Vijayan