
AutoEvaluator: LLM-Based Evaluation Framework

AutoEvaluator is a powerful Python library that accelerates LLM output quality control through automated evaluation. Using LLMs to evaluate LLMs, it provides a simple, transparent, and developer-friendly API to identify True Positives (TP), False Positives (FP), and False Negatives (FN) in generated content against ground truth.

🚀 Features

  • Automated Evaluation: Compare LLM outputs against ground truth with precision
  • Multi-Provider Support: Works with AWS Bedrock, OpenAI, Anthropic, and Google Gemini
  • Comprehensive Metrics: Automatically calculates Precision, Recall, and F1 Score
  • Async-First Design: Built for high-performance concurrent evaluations
  • Structured Outputs: Leverages Instructor for type-safe, validated responses
  • Sentence-Level Granularity: Evaluates claims at the sentence level for detailed insights

🔧 Installation

Requirements

  • Python 3.9 or higher
  • An API key for at least one supported LLM provider

Install via pip

```bash
pip install autoevaluator
```

Install from source

```bash
git clone https://github.com/darveenvijayan/autoevaluator.git
cd autoevaluator
pip install -e .
```

⚡ Quick Start

```python
import asyncio

from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client


async def main():
    # Setup client for your preferred provider
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")

    # Define the claim to evaluate
    claim = "Feynman was born in 1918 in Malaysia"

    # Define the ground truth
    ground_truth = "Feynman was born in 1918 in America."

    # Evaluate the claim
    result = await evaluate(
        claim=claim,
        ground_truth=ground_truth,
        client=client,
        model_name="gpt-4o-mini"
    )
    print(result)


# Run the async function
asyncio.run(main())
```

Output:

```python
{
    'TP': ['Feynman was born in 1918.'],
    'FP': ['Feynman was born in Malaysia.'],
    'FN': ['Feynman was born in America.'],
    'precision': 0.5,
    'recall': 0.5,
    'f1_score': 0.5
}
```

🔌 Supported Providers

AutoEvaluator supports multiple LLM providers out of the box:

| Provider | Models | Environment Variables |
|---|---|---|
| AWS Bedrock | Claude Sonnet 4.5 | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` |
| OpenAI | GPT-4o, GPT-4o-mini, etc. | `OPENAI_API_KEY` |
| Anthropic | Claude Sonnet 4, etc. | `ANTHROPIC_API_KEY` |
| Google Gemini | Gemini 2.0 Flash, etc. | `GOOGLE_API_KEY` |
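
A quick pre-flight check can confirm that the variables listed above are set before you construct a client. The helper below is illustrative and not part of the library; the variable names come from the table:

```python
import os

# Environment variables required per provider (from the table above)
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GOOGLE_API_KEY"],
}


def missing_env(provider: str) -> list[str]:
    """Return the required variables that are not set for the given provider."""
    return [var for var in REQUIRED_ENV[provider] if not os.getenv(var)]
```

Call `missing_env("openai")` (or any other provider key) at startup and fail fast if the returned list is non-empty.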

βš™οΈ Configuration

Environment Variables

Create a .env file in your project root:

```bash
# OpenAI
OPENAI_API_KEY=your_openai_api_key

# AWS Bedrock
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=ap-southeast-1

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_api_key

# Google Gemini
GOOGLE_API_KEY=your_google_api_key
```

Python Configuration

```python
import os

# Set environment variables programmatically
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret_key"
```

💡 Usage Examples

Example 1: Using OpenAI

```python
import asyncio

from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client


async def evaluate_with_openai():
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")

    claim = "The Earth is flat and the moon landing was in 1969."
    ground_truth = "The Earth is round. The moon landing was in 1969."

    result = await evaluate(claim, ground_truth, client=client, model_name="gpt-4o-mini")

    print(f"True Positives: {result['TP']}")
    print(f"False Positives: {result['FP']}")
    print(f"False Negatives: {result['FN']}")
    print(f"Precision: {result['precision']:.2f}")
    print(f"Recall: {result['recall']:.2f}")
    print(f"F1 Score: {result['f1_score']:.2f}")


asyncio.run(evaluate_with_openai())
```

Example 2: Using AWS Bedrock

```python
import asyncio

from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client


async def evaluate_with_bedrock():
    client = get_instructor_client(provider="bedrock")

    claim = "Python was created by Guido van Rossum in 1991."
    ground_truth = "Python was created by Guido van Rossum in 1991."

    result = await evaluate(claim, ground_truth, client=client, model_name="bedrock-claude")
    return result


result = asyncio.run(evaluate_with_bedrock())
print(f"Perfect match! F1 Score: {result['f1_score']}")
```

Example 3: Using Anthropic

```python
import asyncio

from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client


async def evaluate_with_anthropic():
    client = get_instructor_client(
        provider="anthropic",
        model="claude-sonnet-4-20250514"
    )

    claim = "Water boils at 100°C at sea level."
    ground_truth = "Water boils at 100°C at sea level."

    result = await evaluate(claim, ground_truth, client=client, model_name="claude-sonnet-4-20250514")
    return result


result = asyncio.run(evaluate_with_anthropic())
```

Example 4: Using Google Gemini

```python
import asyncio

from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client


async def evaluate_with_gemini():
    client = get_instructor_client(
        provider="gemini",
        model="gemini-2.0-flash-exp"
    )

    claim = "The speed of light is approximately 300,000 km/s in a vacuum."
    ground_truth = "The speed of light is approximately 300,000 kilometers per second in a vacuum."

    result = await evaluate(
        claim=claim,
        ground_truth=ground_truth,
        client=client,
        model_name="gemini-2.0-flash-exp"
    )

    print(f"True Positives: {result['TP']}")
    print(f"False Positives: {result['FP']}")
    print(f"False Negatives: {result['FN']}")
    print(f"F1 Score: {result['f1_score']:.2f}")
    return result


result = asyncio.run(evaluate_with_gemini())
```

Example 5: Batch Evaluation

```python
import asyncio

from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client


async def batch_evaluate():
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")

    test_cases = [
        {
            "claim": "Einstein developed the theory of relativity.",
            "ground_truth": "Einstein developed the theory of relativity."
        },
        {
            "claim": "The capital of France is London.",
            "ground_truth": "The capital of France is Paris."
        },
        {
            "claim": "Water is composed of hydrogen and oxygen.",
            "ground_truth": "Water is composed of hydrogen and oxygen."
        }
    ]

    tasks = [
        evaluate(tc["claim"], tc["ground_truth"], client=client, model_name="gpt-4o-mini")
        for tc in test_cases
    ]
    results = await asyncio.gather(*tasks)

    for i, result in enumerate(results, 1):
        print(f"\n--- Test Case {i} ---")
        print(f"F1 Score: {result['f1_score']:.2f}")
        print(f"Precision: {result['precision']:.2f}")
        print(f"Recall: {result['recall']:.2f}")


asyncio.run(batch_evaluate())
```

📚 API Reference

evaluate()

Evaluates a claim against ground truth and returns detailed metrics.

```python
async def evaluate(
    claim: str,
    ground_truth: str,
    client: instructor.AsyncInstructor,
    model_name: str = "gpt-4o-mini"
) -> Dict[str, Any]
```

Parameters:

  • claim (str): The text to be evaluated
  • ground_truth (str): The reference text to compare against
  • client (instructor.AsyncInstructor): Instructor-wrapped async client
  • model_name (str): Model identifier to use

Returns:

Dictionary containing:

  • TP (List[str]): List of true positive sentences
  • FP (List[str]): List of false positive sentences
  • FN (List[str]): List of false negative sentences
  • precision (float): Precision score (0.0 to 1.0)
  • recall (float): Recall score (0.0 to 1.0)
  • f1_score (float): F1 score (0.0 to 1.0)

get_instructor_client()

Creates an Instructor-wrapped client for the specified LLM provider.

```python
def get_instructor_client(
    provider: Literal["bedrock", "openai", "anthropic", "gemini"] = "bedrock",
    model: Optional[str] = None,
    api_key: Optional[str] = None,
    mode: instructor.Mode = instructor.Mode.JSON,
    **kwargs
) -> instructor.AsyncInstructor
```

Parameters:

  • provider (str): LLM provider to use ("bedrock", "openai", "anthropic", "gemini")
  • model (Optional[str]): Model name (uses provider default if None)
  • api_key (Optional[str]): API key (falls back to environment variables)
  • mode (instructor.Mode): Instructor parsing mode
  • **kwargs: Additional provider-specific arguments

Returns:

An Instructor-wrapped async client ready for use.

text_simplifier()

Breaks down complex text into simple, single-clause sentences.

```python
async def text_simplifier(
    text: str,
    model_name: str,
    client: instructor.AsyncInstructor
) -> TextSimplify
```

πŸ” How It Works

AutoEvaluator uses a sophisticated multi-step process to evaluate claims:

  1. Text Simplification: Complex sentences are broken down into simple, atomic claims
  2. Question Generation: Each simplified sentence is converted into a fact-checking question
  3. Bidirectional Verification: Questions are checked against both the claim and ground truth
  4. Classification: Sentences are classified as TP, FP, or FN based on verification results
  5. Metrics Calculation: Precision, Recall, and F1 scores are computed from the classifications
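
The metrics step follows the standard precision/recall/F1 definitions applied to the sentence counts. As a minimal sketch (the `compute_metrics` helper below is illustrative, not the library's internal code):

```python
def compute_metrics(tp: list, fp: list, fn: list) -> dict:
    """Standard precision, recall, and F1 from sentence-level classifications."""
    precision = len(tp) / (len(tp) + len(fp)) if (tp or fp) else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if (tp or fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}
```

For the Quick Start example (one TP, one FP, one FN) this yields 0.5 for all three metrics, matching the output shown above.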

Architecture

```
Input Claim & Ground Truth
            ↓
Text Simplifier (breaks into atomic sentences)
            ↓
Question Generator (creates fact-check questions)
            ↓
Question Checker (verifies against ground truth)
            ↓
Classification (TP/FP/FN assignment)
            ↓
Metrics Calculation (Precision, Recall, F1)
            ↓
Structured Output
```

🎯 Advanced Usage

Custom Text Simplification

```python
import asyncio

from autoevaluator import text_simplifier, get_instructor_client


async def simplify_text():
    client = get_instructor_client(provider="openai")

    complex_text = """Although the weather was bad and it was raining heavily,
    we decided to go hiking because we had planned it for weeks."""

    result = await text_simplifier(
        text=complex_text,
        model_name="gpt-4o-mini",
        client=client
    )

    print("Simplified sentences:")
    for sentence in result.simplified_sentences:
        print(f"- {sentence}")


asyncio.run(simplify_text())
```

Using Provider-Specific Convenience Functions

```python
from autoevaluator.client import (
    get_openai_instructor_client,
    get_bedrock_instructor_client,
    get_anthropic_instructor_client,
    get_gemini_instructor_client
)

# OpenAI
openai_client = get_openai_instructor_client(model="gpt-4o")

# Bedrock
bedrock_client = get_bedrock_instructor_client()

# Anthropic
anthropic_client = get_anthropic_instructor_client()

# Gemini
gemini_client = get_gemini_instructor_client(model="gemini-2.0-flash")
```

Error Handling

```python
import asyncio

from dotenv import load_dotenv

load_dotenv()  # Load env variables BEFORE importing autoevaluator

from autoevaluator import evaluate, get_instructor_client


async def safe_evaluate():
    try:
        client = get_instructor_client(provider="openai")
        result = await evaluate(
            claim="Some claim",
            ground_truth="Some truth",
            client=client,
            model_name="gpt-4o-mini"
        )
        return result
    except ValueError as e:
        print(f"Configuration error: {e}")
    except Exception as e:
        print(f"Evaluation error: {e}")


asyncio.run(safe_evaluate())
```

📊 Performance Considerations

  • Async by Default: All operations are asynchronous for better performance
  • Batch Processing: Use asyncio.gather() for concurrent evaluations
  • Rate Limiting: Be mindful of provider rate limits when running batch evaluations
  • Caching: Consider caching results for repeated evaluations
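
One way to respect provider rate limits in batch runs is to bound concurrency with a semaphore before handing coroutines to `asyncio.gather()`. This is a sketch; `gather_with_limit` is our name for an illustrative helper, not part of the library:

```python
import asyncio


async def gather_with_limit(coros, max_concurrent: int = 5):
    """Run coroutines concurrently, keeping at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def _bounded(coro):
        async with sem:
            return await coro

    # gather() preserves input order in its results
    return await asyncio.gather(*(_bounded(c) for c in coros))
```

Pass it the same `evaluate(...)` coroutines you would otherwise give directly to `asyncio.gather()`, tuning `max_concurrent` to your provider's limits.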

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

πŸ™ Acknowledgments

  • Built with Instructor for structured outputs
  • Supports multiple LLM providers through unified interfaces
  • Inspired by the need for automated, reliable LLM evaluation

📧 Contact

Darveen Vijayan

📈 Changelog

Version 1.1.0

  • Multi-provider support (OpenAI, Bedrock, Anthropic, Gemini)
  • Async-first architecture
  • Improved text simplification
  • Enhanced error handling

Made with ❤️ by Darveen Vijayan
