
Conversation

@nina-kollman
Contributor

@nina-kollman nina-kollman commented Dec 4, 2025

Fixes TLP-1296

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Important

Add experiment templates and evaluator factories for various evaluations, standardize task output, and enhance evaluator configuration handling.

  • New Features:
    • Added experiment templates for advanced_quality_exp.py, content_safety_exp.py, metrics_exp.py, quality_exp.py, security_exp.py, and validation_exp.py to demonstrate various evaluators.
    • Introduced EvaluatorMadeByTraceloop in evaluators_made_by_traceloop.py for creating evaluators with configurations like pii_detector, toxicity_detector, regex_validator, etc.
  • Bug Fixes:
    • Standardized task output field name to text in run_research_experiment.py.
  • Models:
    • Added EvaluatorDetails in config.py for evaluator configuration.
    • Updated ExecuteEvaluatorRequest in model.py to include evaluator_config.
  • Misc:
    • Updated run_experiment_evaluator() and trigger_experiment_evaluator() in evaluator.py to handle evaluator_config.
    • Modified run() and _run_locally() in experiment.py to support EvaluatorSpec with configuration.

This description was created by Ellipsis for cbe5abc. You can customize this summary. It will automatically update as commits are pushed.


Summary by CodeRabbit

  • New Features

    • Added experiment templates for quality, advanced quality, content safety, security, validation, and metrics evaluations.
    • Added many pre-built evaluator presets (profanity, toxicity, PII, regex/JSON/SQL validators, semantic similarity, perplexity, etc.).
  • SDK Changes

    • Experiments now accept per-evaluator configuration and propagate evaluator-specific options through the SDK.
    • Introduced a unified evaluator spec type for flexible evaluator declarations.
  • Bug Fixes

    • Standardized task output field name to "text".

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai bot commented Dec 4, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds evaluator configuration support across the SDK and many sample experiment modules. Introduces a Pydantic EvaluatorDetails model and an EvaluatorMadeByTraceloop factory, expands evaluator request payloads to include evaluator_config, and updates the Experiment APIs/types to accept evaluator specs with configs.
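For orientation, a minimal usage sketch of that surface, assuming the public import paths mirror the file layout listed below; the evaluator slugs, dataset slug, and task shown here are illustrative, not taken from the PR.

    import asyncio

    from traceloop.sdk import Traceloop
    from traceloop.sdk.evaluator import EvaluatorDetails, EvaluatorMadeByTraceloop

    client = Traceloop.init()

    # Evaluators can be given as a bare slug string or as a configured
    # EvaluatorDetails (built by hand or via the factory). Slugs are illustrative.
    evaluators = [
        "word-count",
        EvaluatorMadeByTraceloop.pii_detector(probability_threshold=0.8),
        EvaluatorDetails(slug="regex-validator", config={"pattern": r"^\d{4}$"}),
    ]

    async def demo_task(row: dict) -> dict:
        # A real task would call an LLM; evaluators read fields such as "text".
        return {"text": row.get("question", "")}

    async def main() -> None:
        results, errors = await client.experiment.run(
            dataset_slug="my-dataset",   # hypothetical dataset slug
            dataset_version="v1",
            task=demo_task,
            evaluators=evaluators,
            experiment_slug="demo-exp",
            wait_for_results=True,
        )
        print(results, errors)

    if __name__ == "__main__":
        asyncio.run(main())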

Changes

• Sample experiments (made_by_traceloop)
  Files: packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py, packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py, packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py, packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py, packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py, packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py
  Summary: Added multiple sample experiment modules. Each introduces async response generators, task functions, and run_..._experiment entry points demonstrating different evaluator groups (quality, safety, metrics, security, validation, advanced quality).

• Sample task output change
  Files: packages/sample-app/sample_app/experiment/run_research_experiment.py
  Summary: Changed research_task output key from "sentence" to "text".

• Evaluator model (config)
  Files: packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
  Summary: New Pydantic model EvaluatorDetails with fields slug: str, version: Optional[str] = None, and config: Optional[Dict[str, Any]] = None.

• Evaluator factory
  Files: packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  Summary: New EvaluatorMadeByTraceloop factory class with static methods returning configured EvaluatorDetails for many evaluators (pii_detector, toxicity, regex/json/sql validators, metrics, perplexity, semantic_similarity, topic_adherence, etc.).

• Evaluator exports
  Files: packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  Summary: Exported EvaluatorDetails and EvaluatorMadeByTraceloop by updating imports and __all__.

• Evaluator request model
  Files: packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
  Summary: Added optional evaluator_config: Optional[Dict[str, Any]] = None to ExecuteEvaluatorRequest.

• Evaluator execution logic
  Files: packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  Summary: API surface updated: _build_evaluator_request, run_experiment_evaluator, and trigger_experiment_evaluator accept evaluator_config and include it in constructed requests.

• Experiment integration & types
  Files: packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py, packages/traceloop-sdk/traceloop/sdk/experiment/model.py
  Summary: Introduced EvaluatorSpec = Union[str, EvaluatorDetails]; changed Experiment.run/_run_locally/_run_in_github signatures to accept List[EvaluatorSpec]; normalized evaluators to (slug, version, config) triples and propagated evaluator_config into evaluator execution calls.
sequenceDiagram
    participant User as Caller
    participant Experiment as Experiment module
    participant Client as Traceloop Client
    participant EvaluatorSDK as Evaluator API layer
    participant EvalSvc as Evaluator service
    Note over User,Experiment: User provides evaluators (string or EvaluatorDetails)
    User->>Experiment: run(evaluators: List[EvaluatorSpec], ...)
    Experiment->>Experiment: normalize -> (slug, version, config)
    Experiment->>Client: create experiment run
    loop per task / evaluator
        Experiment->>EvaluatorSDK: run_experiment_evaluator(slug, input, evaluator_version, evaluator_config)
        EvaluatorSDK->>EvalSvc: ExecuteEvaluatorRequest {slug, version, input, evaluator_config}
        EvalSvc-->>EvaluatorSDK: ExecutionResponse
        EvaluatorSDK-->>Experiment: result
    end
    Experiment->>User: aggregated results
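The normalize step in the diagram reduces each spec to a (slug, version, config) triple. A self-contained sketch of that step, assuming the tuple shape described above; the helper name is illustrative, not the SDK's.

    from typing import Any, Dict, List, Optional, Tuple, Union

    from traceloop.sdk.evaluator import EvaluatorDetails

    # Matches the alias introduced in experiment/model.py.
    EvaluatorSpec = Union[str, EvaluatorDetails]

    def normalize_evaluators(
        evaluators: List[EvaluatorSpec],
    ) -> List[Tuple[str, Optional[str], Optional[Dict[str, Any]]]]:
        """Reduce each spec to a (slug, version, config) triple."""
        normalized: List[Tuple[str, Optional[str], Optional[Dict[str, Any]]]] = []
        for spec in evaluators:
            if isinstance(spec, str):
                normalized.append((spec, None, None))  # bare slug: no version/config
            else:
                normalized.append((spec.slug, spec.version, spec.config))
        return normalized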

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to:
    • Correct propagation of evaluator_config from Experiment.run through _build_evaluator_request into ExecuteEvaluatorRequest (a sketch of the expected flow follows this list).
    • Mapping and shapes in EvaluatorMadeByTraceloop methods versus backend evaluator expectations.
    • Type changes: EvaluatorSpec usage and removal of previous local evaluator type aliases.
    • Consistency of sample experiment task outputs expected by evaluators.
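A stand-in sketch of the propagation path named in the first bullet above; the dataclass only mirrors the field names shown in the sequence diagram and is not the SDK's actual Pydantic model, and the builder name/arguments are illustrative.

    from dataclasses import dataclass
    from typing import Any, Dict, Optional

    @dataclass
    class ExecuteEvaluatorRequestSketch:
        # Field names taken from the sequence diagram; the real model lives in
        # packages/traceloop-sdk/traceloop/sdk/evaluator/model.py.
        slug: str
        input: Dict[str, Any]
        version: Optional[str] = None
        evaluator_config: Optional[Dict[str, Any]] = None

    def build_evaluator_request(
        slug: str,
        evaluator_input: Dict[str, Any],
        evaluator_version: Optional[str] = None,
        evaluator_config: Optional[Dict[str, Any]] = None,
    ) -> ExecuteEvaluatorRequestSketch:
        # The config is forwarded untouched; None preserves the pre-change behavior.
        return ExecuteEvaluatorRequestSketch(
            slug=slug,
            input=evaluator_input,
            version=evaluator_version,
            evaluator_config=evaluator_config,
        )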

Possibly related PRs

Suggested reviewers

  • doronkopit5

Poem

🐰 I hopped in code with eager paws,
Slugs and configs, tiny laws,
Experiments bloom, evaluators sing,
I twitch my nose — CI takes wing,
A carrot for each passing test! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Out of Scope Changes check (⚠️ Warning): The PR includes a minor out-of-scope fix (renaming 'sentence' to 'text' in run_research_experiment.py), which is unrelated to the primary objective of adding Traceloop-made evaluators with configuration support. Resolution: Consider extracting the 'sentence' to 'text' key rename in run_research_experiment.py into a separate fix PR to keep changes focused on the main evaluator configuration objective.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'feat(exp): Add made by traceloop evaluators' accurately describes the primary change: introducing Traceloop-made evaluators with configuration support.
  • Linked Issues check (✅ Passed): The PR implements evaluator configuration support (EvaluatorDetails, EvaluatorMadeByTraceloop factory class) and updates SDK methods to accept evaluator_config, aligning with TLP-1296's objective of supporting MBT evaluators with configuration.
  • Docstring Coverage (✅ Passed): Docstring coverage is 98.04%, which is above the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch nk/add_eval_config

Comment @coderabbitai help to get the list of available commands and usage tips.

@nina-kollman nina-kollman marked this pull request as ready for review December 4, 2025 15:46
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Caution

Changes requested ❌

Reviewed everything up to 9323a3a in 3 minutes and 0 seconds. Click for details.
  • Reviewed 2195 lines of code in 15 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py:58
  • Draft comment:
    The 'advanced_quality_task' returns the response under the key 'completion', but later the code prints using 'text'. This key mismatch may lead to missing output in results. Consider using a consistent key (e.g., 'text' or 'completion').
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%.

    The comment points to a potential issue: the task returns {"completion": ...} but later code accesses result.task_result.get("text", ""). This could be a real bug. However, I don't have visibility into what the Traceloop SDK does with the returned dictionary. It's possible that:
    1. The SDK transforms "completion" to "text"
    2. The SDK preserves the exact dictionary structure
    3. This is intentional and "text" is meant to be empty (note the default value "")

    Since line 133 uses .get("text", "") with an empty string default, if "text" doesn't exist, it will just print an empty string, not cause an error. The code would still run, just potentially not show the response text. Without knowing the Traceloop SDK's behavior, I cannot definitively say this is a bug. The comment is speculative.

    I'm assuming this is a bug without knowing how the Traceloop SDK works. The code uses .get() with a default value, so it won't crash; it will just print empty strings if the key doesn't exist. This might be intentional behavior, or the SDK might transform the keys. Without access to the SDK documentation or other files, I cannot be certain this is actually a problem.

    The comment is speculative about SDK behavior I cannot verify. The rule states "Do NOT make speculative comments" and "you must see STRONG EVIDENCE that the comment is correct." Since the code won't crash (it uses .get() with defaults) and I don't know if the SDK transforms the data, there's no strong evidence this is actually wrong.

    This comment is speculative about how the Traceloop SDK handles the returned dictionary. Without being able to verify the SDK's behavior or see evidence that this causes actual problems, and given that the code uses safe .get() calls with defaults, I should delete this comment as it lacks strong evidence of being correct.

Workflow ID: wflow_ifsFxGjCKMoecYl9

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 4fbcb23 in 43 seconds. Click for details.
  • Reviewed 20 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:67
  • Draft comment:
    Removed print. Consider using a logging framework (e.g., logger.debug) to log environment-specific messages.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:75
  • Draft comment:
    Removed print statement for local execution. Consider using a centralized logging library if logging is needed.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None

Workflow ID: wflow_2Qqoj4H9k13BNrls

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

♻️ Duplicate comments (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)

131-137: Static analysis alert is a false positive.

CodeQL flags line 136 for "clear-text logging of sensitive information," but this is incorrect. The code prints security_issues['secrets_detected'], which is an integer counter (e.g., "Secrets detected: 0 task(s)"), not actual secret values. The variable name happens to contain "secrets", but the value is just a count.

🧹 Nitpick comments (7)
packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py (1)

229-278: run_validation_examples() is defined but not invoked.

This function is never called in the main entry point. If it's intended as an optional demo utility, consider documenting this or providing a command-line flag to invoke it. Otherwise, remove it to reduce dead code.
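One way such a flag could look if the demo is kept, as a __main__ block for validation_exp.py (purely illustrative; the entry-point names and their async signatures are assumptions about this sample, not its current code):

    import argparse
    import asyncio

    def main() -> None:
        parser = argparse.ArgumentParser(description="Validation evaluators demo")
        parser.add_argument(
            "--examples",
            action="store_true",
            help="also run run_validation_examples() after the experiment",
        )
        args = parser.parse_args()

        # Assumed entry points of validation_exp.py.
        asyncio.run(run_validation_experiment())
        if args.examples:
            asyncio.run(run_validation_examples())

    if __name__ == "__main__":
        main()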

packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)

19-20: Module-level initialization may raise on missing environment variables.

Traceloop.init() at module load time means any import of this module will trigger initialization. For a sample app this is acceptable, but be aware that if TRACELOOP_API_KEY is not set, this may cause issues at import time rather than at runtime.

Consider moving initialization inside run_security_experiment() if lazy initialization is preferred:

-# Initialize Traceloop
-client = Traceloop.init()
+# Initialize Traceloop lazily
+client = None

 async def run_security_experiment():
+    global client
+    if client is None:
+        client = Traceloop.init()
packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py (1)

166-219: Remove extraneous f-string prefixes.

Multiple print statements use f-string prefixes but contain no placeholders. Consider removing the f prefix for clarity.

As per coding guidelines (Flake8 rule F541), apply this diff:

 if all_metrics["perplexities"]: avg_perplexity = sum(all_metrics["perplexities"]) / len(all_metrics["perplexities"]) min_perplexity = min(all_metrics["perplexities"]) max_perplexity = max(all_metrics["perplexities"]) - print(f"Perplexity Statistics:") + print("Perplexity Statistics:") print(f" Average: {avg_perplexity:.2f}") print(f" Range: {min_perplexity:.2f} - {max_perplexity:.2f}") if avg_perplexity < 10: - print(f" Assessment: Excellent fluency and predictability") + print(" Assessment: Excellent fluency and predictability") elif avg_perplexity < 50: - print(f" Assessment: Good fluency") + print(" Assessment: Good fluency") else: - print(f" Assessment: Higher perplexity - may indicate complexity or uncertainty") + print(" Assessment: Higher perplexity - may indicate complexity or uncertainty") print() if all_metrics["goal_accuracies"]: avg_goal_accuracy = sum(all_metrics["goal_accuracies"]) / len(all_metrics["goal_accuracies"]) min_goal = min(all_metrics["goal_accuracies"]) max_goal = max(all_metrics["goal_accuracies"]) - print(f"Agent Goal Accuracy Statistics:") + print("Agent Goal Accuracy Statistics:") print(f" Average: {avg_goal_accuracy:.2%}") print(f" Range: {min_goal:.2%} - {max_goal:.2%}") if avg_goal_accuracy >= 0.8: - print(f" Assessment: Strong goal achievement") + print(" Assessment: Strong goal achievement") elif avg_goal_accuracy >= 0.6: - print(f" Assessment: Moderate goal achievement") + print(" Assessment: Moderate goal achievement") else: - print(f" Assessment: Needs improvement in goal achievement") + print(" Assessment: Needs improvement in goal achievement") print() if all_metrics["semantic_similarities"]: avg_similarity = sum(all_metrics["semantic_similarities"]) / len(all_metrics["semantic_similarities"]) min_sim = min(all_metrics["semantic_similarities"]) max_sim = max(all_metrics["semantic_similarities"]) - print(f"Semantic Similarity Statistics:") + print("Semantic Similarity Statistics:") print(f" Average: {avg_similarity:.2%}") print(f" Range: {min_sim:.2%} - {max_sim:.2%}") if avg_similarity >= 0.8: - print(f" Assessment: High semantic alignment with reference") + print(" Assessment: High semantic alignment with reference") elif avg_similarity >= 0.6: - print(f" Assessment: Moderate semantic alignment") + print(" Assessment: Moderate semantic alignment") else: - print(f" Assessment: Low semantic alignment - responses differ significantly") + print(" Assessment: Low semantic alignment - responses differ significantly") print() if all_metrics["topic_adherences"]: avg_adherence = sum(all_metrics["topic_adherences"]) / len(all_metrics["topic_adherences"]) min_adh = min(all_metrics["topic_adherences"]) max_adh = max(all_metrics["topic_adherences"]) - print(f"Topic Adherence Statistics:") + print("Topic Adherence Statistics:") print(f" Average: {avg_adherence:.2%}") print(f" Range: {min_adh:.2%} - {max_adh:.2%}") if avg_adherence >= 0.8: - print(f" Assessment: Excellent topic consistency") + print(" Assessment: Excellent topic consistency") elif avg_adherence >= 0.6: - print(f" Assessment: Moderate topic focus") + print(" Assessment: Moderate topic focus") else: - print(f" Assessment: Responses frequently drift off-topic") + print(" Assessment: Responses frequently drift off-topic") print()
packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py (1)

151-177: Remove extraneous f-string prefixes.

Several print statements use f-string prefixes without placeholders. Consider removing the f prefix for clarity.

As per coding guidelines (Flake8 rule F541), apply this diff:

 if all_metrics["char_counts"]: avg_chars = sum(all_metrics["char_counts"]) / len(all_metrics["char_counts"]) min_chars = min(all_metrics["char_counts"]) max_chars = max(all_metrics["char_counts"]) - print(f"Character Count Statistics:") + print("Character Count Statistics:") print(f" Average: {avg_chars:.0f} chars") print(f" Range: {min_chars} - {max_chars} chars\n") if all_metrics["word_counts"]: avg_words = sum(all_metrics["word_counts"]) / len(all_metrics["word_counts"]) min_words = min(all_metrics["word_counts"]) max_words = max(all_metrics["word_counts"]) - print(f"Word Count Statistics:") + print("Word Count Statistics:") print(f" Average: {avg_words:.0f} words") print(f" Range: {min_words} - {max_words} words\n") if all_metrics["char_ratios"]: avg_char_ratio = sum(all_metrics["char_ratios"]) / len(all_metrics["char_ratios"]) - print(f"Character Ratio (Response/Reference):") + print("Character Ratio (Response/Reference):") print(f" Average: {avg_char_ratio:.2f}x") if avg_char_ratio > 1.5: print(f" Note: Responses are {avg_char_ratio:.1f}x longer than reference\n") elif avg_char_ratio < 0.5: print(f" Note: Responses are {avg_char_ratio:.1f}x shorter than reference\n") else: - print(f" Note: Response length is well-matched to reference\n") + print(" Note: Response length is well-matched to reference\n") if all_metrics["word_ratios"]: avg_word_ratio = sum(all_metrics["word_ratios"]) / len(all_metrics["word_ratios"]) - print(f"Word Ratio (Response/Reference):") + print("Word Ratio (Response/Reference):") print(f" Average: {avg_word_ratio:.2f}x")
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (3)

184-196: Fix method signature indentation.

The closing parenthesis should align with the def keyword per PEP 8 continuation line conventions.

As per coding guidelines (Flake8 rule E125), apply this diff:

 @staticmethod
 def word_count(
-    ) -> EvaluatorDetails:
+) -> EvaluatorDetails:
     """
     Word count evaluator - counts the number of words in text.

     Returns:
         EvaluatorDetails configured for word counting
     """
     config: Dict[str, Any] = {}

     return EvaluatorDetails(slug="word-count", version=None, config=config)

236-248: Fix method signature indentation.

The closing parenthesis should align with the def keyword per PEP 8 continuation line conventions.

As per coding guidelines (Flake8 rule E125), apply this diff:

 @staticmethod
 def profanity_detector(
-    ) -> EvaluatorDetails:
+) -> EvaluatorDetails:
     """
     Profanity detector evaluator - flags inappropriate language.

     Returns:
         EvaluatorDetails configured for profanity detection
     """
     config: Dict[str, Any] = {}

     return EvaluatorDetails(slug="profanity-detector", version=None, config=config)

266-276: Use consistent config initialization.

This method returns config=None while all other similar methods (e.g., char_count, word_count, profanity_detector) initialize config as an empty dict.

Apply this diff for consistency:

 @staticmethod
 def secrets_detector(
 ) -> EvaluatorDetails:
     """
     Secrets detector evaluator - monitors for credential and key leaks.

     Returns:
         EvaluatorDetails configured for secrets detection
     """
-    return EvaluatorDetails(slug="secrets-detector", version=None, config=None)
+    config: Dict[str, Any] = {}
+
+    return EvaluatorDetails(slug="secrets-detector", version=None, config=config)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f782741 and 4fbcb23.

📒 Files selected for processing (15)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/rag_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/run_research_experiment.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (6 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (8 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
  • packages/sample-app/sample_app/experiment/run_research_experiment.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/rag_exp.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/model.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
🧬 Code graph analysis (8)
packages/sample-app/sample_app/experiment/made_by_traceloop/rag_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
  • agent_goal_accuracy (304-314)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-20)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (2)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
  • Traceloop (36-267)
  • init (48-198)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
  • answer_relevancy (211-221)
  • faithfulness (224-234)
packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
  • char_count (159-169)
  • word_count (185-195)
  • char_count_ratio (172-182)
  • word_count_ratio (198-208)
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
  • Traceloop (36-267)
  • init (48-198)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-20)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (4)
  • RunInGithubRequest (87-97)
  • RunInGithubResponse (100-105)
  • TaskResult (70-75)
  • GithubContext (78-84)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-20)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1)
  • ExecutionResponse (40-44)
🪛 Flake8 (7.3.0)
packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py

[error] 151-151: f-string is missing placeholders

(F541)


[error] 159-159: f-string is missing placeholders

(F541)


[error] 165-165: f-string is missing placeholders

(F541)


[error] 172-172: f-string is missing placeholders

(F541)


[error] 176-176: f-string is missing placeholders

(F541)

packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py

[error] 166-166: f-string is missing placeholders

(F541)


[error] 170-170: f-string is missing placeholders

(F541)


[error] 172-172: f-string is missing placeholders

(F541)


[error] 174-174: f-string is missing placeholders

(F541)


[error] 181-181: f-string is missing placeholders

(F541)


[error] 185-185: f-string is missing placeholders

(F541)


[error] 187-187: f-string is missing placeholders

(F541)


[error] 189-189: f-string is missing placeholders

(F541)


[error] 196-196: f-string is missing placeholders

(F541)


[error] 200-200: f-string is missing placeholders

(F541)


[error] 202-202: f-string is missing placeholders

(F541)


[error] 204-204: f-string is missing placeholders

(F541)


[error] 211-211: f-string is missing placeholders

(F541)


[error] 215-215: f-string is missing placeholders

(F541)


[error] 217-217: f-string is missing placeholders

(F541)


[error] 219-219: f-string is missing placeholders

(F541)

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py

[error] 186-186: continuation line with same indent as next logical line

(E125)


[error] 238-238: continuation line with same indent as next logical line

(E125)

🪛 GitHub Check: CodeQL
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py

[failure] 136-136: Clear-text logging of sensitive information
This expression logs sensitive data (secret) as clear text.

🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py

22-22: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py

151-151: f-string without any placeholders

Remove extraneous f prefix

(F541)


159-159: f-string without any placeholders

Remove extraneous f prefix

(F541)


165-165: f-string without any placeholders

Remove extraneous f prefix

(F541)


172-172: f-string without any placeholders

Remove extraneous f prefix

(F541)


176-176: f-string without any placeholders

Remove extraneous f prefix

(F541)

packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py

166-166: f-string without any placeholders

Remove extraneous f prefix

(F541)


170-170: f-string without any placeholders

Remove extraneous f prefix

(F541)


172-172: f-string without any placeholders

Remove extraneous f prefix

(F541)


174-174: f-string without any placeholders

Remove extraneous f prefix

(F541)


181-181: f-string without any placeholders

Remove extraneous f prefix

(F541)


185-185: f-string without any placeholders

Remove extraneous f prefix

(F541)


187-187: f-string without any placeholders

Remove extraneous f prefix

(F541)


189-189: f-string without any placeholders

Remove extraneous f prefix

(F541)


196-196: f-string without any placeholders

Remove extraneous f prefix

(F541)


200-200: f-string without any placeholders

Remove extraneous f prefix

(F541)


202-202: f-string without any placeholders

Remove extraneous f prefix

(F541)


204-204: f-string without any placeholders

Remove extraneous f prefix

(F541)


211-211: f-string without any placeholders

Remove extraneous f prefix

(F541)


215-215: f-string without any placeholders

Remove extraneous f prefix

(F541)


217-217: f-string without any placeholders

Remove extraneous f prefix

(F541)


219-219: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (15)
packages/sample-app/sample_app/experiment/run_research_experiment.py (2)

49-53: Redundant data: text duplicates completion.

Both completion and text contain the same answer value. If this duplication is intentional for evaluator compatibility (e.g., different evaluators expecting different field names), consider adding a brief inline comment explaining why both fields are needed.
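For instance, the suggested inline note could look like this; the task signature and the answer variable are illustrative, only the field names come from the PR:

    def research_task(row: dict) -> dict:
        answer = ...  # produced earlier in the task (elided)
        return {
            "completion": answer,
            # "text" intentionally mirrors "completion": some Traceloop-made
            # evaluators read the response from the "text" field.
            "text": answer,
        }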


17-21: LGTM!

API keys are correctly retrieved from environment variables, following the coding guidelines for secure credential handling.

packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)

5-20: LGTM!

Clean Pydantic model with proper type hints, sensible defaults, and clear documentation including usage examples.

packages/sample-app/sample_app/experiment/made_by_traceloop/rag_exp.py (1)

43-65: LGTM!

OpenAI API key is securely retrieved from environment variable, and the async client usage is correct.

packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py (1)

31-32: LGTM!

API key is correctly retrieved from environment variable, adhering to the coding guidelines for secure credential handling.

packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1)

16-23: LGTM!

The new evaluator_config field is appropriately typed as Optional[Dict[str, Any]] with a None default, maintaining backward compatibility with existing code.

packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)

1-9: LGTM!

Clean package initialization that properly exposes the new EvaluatorDetails and EvaluatorMadeByTraceloop classes alongside the existing Evaluator. The __all__ list is correctly updated to reflect the public API.

packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)

2-6: LGTM!

The EvaluatorSpec type alias provides a clean, flexible API that accepts either a simple string slug or a full EvaluatorDetails configuration object. This is a sensible design that maintains backward compatibility while enabling richer evaluator configurations.
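As a usage note, both forms can coexist in one evaluators list; the slugs and config keys below are illustrative, not confirmed against the backend:

    from traceloop.sdk.evaluator import EvaluatorDetails

    # EvaluatorSpec = Union[str, EvaluatorDetails], so a bare slug and a
    # version-pinned, configured evaluator can be mixed freely.
    evaluators = [
        "char-count",
        EvaluatorDetails(
            slug="regex-validator",
            version="v2",                      # pin an evaluator version
            config={"pattern": r"^[A-Z]{3}-\d+$"},
        ),
    ]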

packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)

23-34: LGTM!

The generate_response function correctly retrieves the OpenAI API key from environment variables via os.getenv("OPENAI_API_KEY"), adhering to the coding guidelines for secure secret handling.

packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)

23-37: LGTM!

The API key is correctly retrieved from the environment variable, and the function properly handles the optional context by conditionally adding a system message.

packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py (1)

23-34: LGTM!

The generate_response function correctly retrieves the OpenAI API key from the environment variable, following secure secret handling practices.

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

130-148: Well-structured evaluator configuration support.

The implementation cleanly handles both string slugs and EvaluatorDetails objects, converting them to normalized tuples for consistent processing throughout the experiment execution.


175-200: Proper threading of evaluator configuration.

The evaluator_config is correctly unpacked from the tuple and passed to both run_experiment_evaluator and trigger_experiment_evaluator, maintaining consistency across synchronous and fire-and-forget execution paths.

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)

24-44: Clean extension of evaluator request builder.

The evaluator_config parameter is properly added to the request builder signature and passed through to ExecuteEvaluatorRequest construction.


67-107: Consistent evaluator configuration support in execution path.

The evaluator_config parameter is correctly threaded through the synchronous execution path, with appropriate documentation updates.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed d3eac37 in 2 minutes and 16 seconds. Click for details.
  • Reviewed 120 lines of code in 6 files
  • Skipped 0 files when reviewing.
  • Skipped posting 12 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py:105
  • Draft comment:
    Typo in the comment: 'ddataset' should be 'dataset'.
  • Reason this comment was not posted:
    Comment was on unchanged code.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py:90
  • Draft comment:
    Typo in the comment: 'ddataset' should be corrected to 'dataset'.
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py:94
  • Draft comment:
    The comment still has a typo ('ddataset'); consider fixing it to 'dataset'.
  • Reason this comment was not posted:
    Comment was on unchanged code.
4. packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py:95
  • Draft comment:
    Fix the typo in the comment ('ddataset' → 'dataset').
  • Reason this comment was not posted:
    Comment was on unchanged code.
5. packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py:85
  • Draft comment:
    The comment still uses 'ddataset' instead of 'dataset'—please correct it.
  • Reason this comment was not posted:
    Comment was on unchanged code.
6. packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py:121
  • Draft comment:
    Removal of the detailed evaluator configuration print statements reduces debug visibility. Consider adding a conditional verbose flag if these logs are useful for troubleshooting.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50%.

    The comment is suggesting that the PR author "consider" adding a verbose flag, which is speculative and not a clear code change requirement. The comment uses the word "consider", which makes it a suggestion rather than identifying a definite issue. The removed logging code appears to be debug/informational output, and the PR author intentionally removed it, presumably because they didn't find it necessary. The comment is essentially second-guessing the author's decision without strong evidence that this is wrong. This violates the rule about not making speculative comments and not asking the author to confirm their intention.

    The removed logging might have been genuinely useful for debugging, and the comment could be providing valuable feedback about maintaining observability in the codebase. Perhaps there's a team standard about logging that I'm not aware of.

    Even if logging is valuable, the comment uses "consider", which makes it a suggestion rather than identifying a clear issue. The rules explicitly state not to make speculative comments or ask the author to confirm their intention. If the logging was critical, it would be caught in testing or code review by humans who understand the context better.

    This comment should be deleted. It's speculative (uses "consider"), doesn't identify a definite issue, and second-guesses an intentional removal by the author without strong evidence that it's wrong. The author clearly decided this logging wasn't necessary.
7. packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py:105
  • Draft comment:
    Typographical error: consider changing 'ddataset' to 'dataset' in the comment.
  • Reason this comment was not posted:
    Comment was on unchanged code.
8. packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py:90
  • Draft comment:
    Typo in comment: "Set a ddataset slug that exists in the traceloop platform" should be "Set a dataset slug that exists in the traceloop platform".
  • Reason this comment was not posted:
    Comment was on unchanged code.
9. packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py:94
  • Draft comment:
    Typographical error: The comment says 'ddataset' which is likely a typo. Consider changing it to 'dataset'.
  • Reason this comment was not posted:
    Comment was on unchanged code.
10. packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py:95
  • Draft comment:
    Typo: "ddataset" on line 95 should be "dataset".
  • Reason this comment was not posted:
    Comment was on unchanged code.
11. packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py:85
  • Draft comment:
    Typo: "ddataset" should be spelled as "dataset" in the comment.
  • Reason this comment was not posted:
    Comment was on unchanged code.
12. packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py:128
  • Draft comment:
    Typographical error: "ddataset" appears in the comment. It should be "dataset".
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_JGAvzK2L4LPcWCDK

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (5)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (2)

22-22: Use explicit type hint for optional parameter.

The context parameter defaults to None but lacks an explicit Optional or union type hint, violating PEP 484.

This issue was previously flagged. Apply this diff:

-async def generate_response(prompt: str, context: str = None) -> str:
+async def generate_response(prompt: str, context: str | None = None) -> str:

45-48: Pass context to generate_response for faithfulness evaluation.

The context is extracted from the row but not passed to generate_response, which accepts it as an optional parameter. This omission affects the faithfulness evaluator's ability to assess whether the response is grounded in the provided context.

This issue was previously flagged. Apply this diff:

     question = row.get("question", "This is a demo question")
     context = row.get("context", "This is a demo context")

     # Generate response
-    completion = await generate_response(question)
+    completion = await generate_response(question, context)
packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py (1)

79-84: Guard against None config when accessing .items().

The EvaluatorDetails.config field is typed as Optional[Dict[str, Any]]. While current factory methods set config, accessing .items() on None would raise an AttributeError.

This issue was previously flagged. Apply this diff:

 print("Running experiment with content safety evaluators:") for evaluator in evaluators: - config_str = ", ".join(f"{k}={v}" for k, v in evaluator.config.items() if k != "description") + config_str = "" + if evaluator.config: + config_str = ", ".join(f"{k}={v}" for k, v in evaluator.config.items() if k != "description") print(f" - {evaluator.slug}") if config_str: print(f" Config: {config_str}")
packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py (2)

24-36: Extract and return logprobs data for perplexity evaluation.

The function enables logprobs=True but returns only the message content without extracting response.choices[0].logprobs. The perplexity evaluator requires the actual logprobs data structure. Additionally, the response access lacks error handling.

This issue was previously flagged. Apply this diff:

-async def generate_response(prompt: str, max_tokens: int = 300) -> str:
-    """Generate a response using OpenAI with logprobs for perplexity measurement"""
+from typing import Tuple, Any
+
+async def generate_response(prompt: str, max_tokens: int = 300) -> Tuple[str, Any]:
+    """Generate a response using OpenAI with logprobs for perplexity measurement
+
+    Returns:
+        Tuple of (content, logprobs) where logprobs is the raw logprobs data structure
+    """
     openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
     response = await openai_client.chat.completions.create(
         model="gpt-3.5-turbo",
         messages=[{"role": "user", "content": prompt}],
         temperature=0.7,
         max_tokens=max_tokens,
         logprobs=True,  # Enable logprobs for perplexity calculation
     )
-    return response.choices[0].message.content
+    if not response.choices:
+        raise ValueError("No response generated by OpenAI")
+
+    choice = response.choices[0]
+    content = choice.message.content
+    if content is None:
+        raise ValueError("OpenAI returned empty content")
+
+    return content, choice.logprobs

53-59: Use actual logprobs data instead of completion text.

Currently, the "logprobs" key is assigned the completion text, but the perplexity evaluator expects the actual logprobs data structure from the OpenAI response. This depends on fixing generate_response() to return logprobs.

This issue was previously flagged. After fixing generate_response(), apply this diff:

     # Generate response
-    completion = await generate_response(question)
+    completion, logprobs_data = await generate_response(question)

     # Return data for evaluation
     # Different evaluators expect different fields
     return {
-        "logprobs": completion,  # For perplexity
+        "logprobs": logprobs_data,  # For perplexity
         "question": question,
         "completion": completion,  # Standard completion field
         "reference": reference_answer,  # For semantic similarity comparison
         "reference_topics": topic,  # For topic adherence
     }
🧹 Nitpick comments (12)
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (2)

84-84: Prefix unused variables with underscore.

The variables results and errors are unpacked but never used. Per Python conventions and static analysis recommendations, prefix them with underscores to indicate they are intentionally unused.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(
         dataset_slug="security",  # Set a ddataset slug that exists in the traceloop platform
         dataset_version="v1",
         task=security_task,
         evaluators=evaluators,
         experiment_slug="security-evaluators-exp",
         stop_on_error=False,
         wait_for_results=True,
     )

85-85: Fix typo in comment.

The comment contains "ddataset" which should be "dataset".

Apply this diff:

-        dataset_slug="security",  # Set a ddataset slug that exists in the traceloop platform
+        dataset_slug="security",  # Set a dataset slug that exists in the traceloop platform
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (2)

94-94: Prefix unused variables with underscore.

The variables results and errors are unpacked but never used. Prefix them with underscores to indicate they are intentionally unused.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(

95-95: Fix typo in comment.

The comment contains "ddataset" which should be "dataset".

Apply this diff:

-        dataset_slug="quality",  # Set a ddataset slug that exists in the traceloop platform
+        dataset_slug="quality",  # Set a dataset slug that exists in the traceloop platform
packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py (2)

89-89: Prefix unused variables with underscore.

The variables results and errors are unpacked but never used. Prefix them with underscores.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(

90-90: Fix typo in comment.

The comment contains "ddataset" which should be "dataset".

Apply this diff:

-        dataset_slug="content-safety",  # Set a ddataset slug that exists in the traceloop platform
+        dataset_slug="content-safety",  # Set a dataset slug that exists in the traceloop platform
packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py (2)

104-104: Prefix unused variables with underscore.

The variables results and errors are unpacked but never used. Prefix them with underscores.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(

105-105: Fix typo in comment.

The comment contains "ddataset" which should be "dataset".

Apply this diff:

-        dataset_slug="quality",  # Set a ddataset slug that exists in the traceloop platform
+        dataset_slug="quality",  # Set a dataset slug that exists in the traceloop platform
packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py (2)

93-93: Prefix unused variables with underscore.

The variables results and errors are unpacked but never used. Prefix them with underscores.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(

94-94: Fix typo in comment.

The comment contains "ddataset" which should be "dataset".

Apply this diff:

-        dataset_slug="metrics",  # Set a ddataset slug that exists in the traceloop platform
+        dataset_slug="metrics",  # Set a dataset slug that exists in the traceloop platform
packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py (2)

127-127: Prefix unused variables with underscore.

The variables results and errors are unpacked but never used. Prefix them with underscores.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(

128-128: Fix typo in comment.

The comment contains "ddataset" which should be "dataset".

Apply this diff:

-        dataset_slug="validation",  # Set a ddataset slug that exists in the traceloop platform
+        dataset_slug="validation",  # Set a dataset slug that exists in the traceloop platform
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4fbcb23 and d3eac37.

📒 Files selected for processing (6)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py
🧬 Code graph analysis (4)
packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
  • EvaluatorMadeByTraceloop (5-340)
  • profanity_detector (237-247)
  • toxicity_detector (38-52)
  • sexism_detector (250-264)
packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
  • perplexity (330-340)
  • agent_goal_accuracy (304-314)
  • semantic_similarity (291-301)
  • topic_adherence (317-327)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
  • answer_relevancy (211-221)
  • faithfulness (224-234)
packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
  • char_count (159-169)
  • word_count (185-195)
  • char_count_ratio (172-182)
  • word_count_ratio (198-208)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/content_safety_exp.py

89-89: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/advanced_quality_exp.py

104-104: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


104-104: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py

22-22: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


94-94: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


94-94: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/metrics_exp.py

93-93: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


93-93: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/validation_exp.py

127-127: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


127-127: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py

84-84: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


84-84: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Build Packages (3.11)
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed fd05222 in 1 minute and 5 seconds. Click for details.
  • Reviewed 37 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py:48
  • Draft comment:
    Passes 'context' to generate_response to include additional context. Confirm that generate_response properly handles the context parameter.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:185
  • Draft comment:
    Refactor: Simplified word_count method signature by removing the unnecessary line break within the parentheses.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:237
  • Draft comment:
    Refactor: Simplified profanity_detector method signature by removing the unnecessary line break within the parentheses.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_bO5dVH9Zl9BfiODT

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (2)

22-22: PEP 484: Use explicit type union for optional parameters.

The context parameter defaults to None but the type hint is str. This violates PEP 484 and should explicitly indicate the parameter can be None.

Apply this diff:

-async def generate_response(prompt: str, context: str = None) -> str:
+async def generate_response(prompt: str, context: str | None = None) -> str:

30-37: Add error handling for OpenAI API response.

The function accesses response.choices[0].message.content without validating the response structure, which could raise IndexError or AttributeError at runtime if the API returns an unexpected response.

Apply this diff:

     response = await openai_client.chat.completions.create(
         model="gpt-3.5-turbo",
         messages=messages,
         temperature=0.7,
         max_tokens=200,
     )
-    return response.choices[0].message.content
+    if not response.choices:
+        raise ValueError("No response generated by OpenAI")
+
+    content = response.choices[0].message.content
+    if content is None:
+        raise ValueError("OpenAI returned empty content")
+
+    return content
🧹 Nitpick comments (3)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)

94-94: Consider prefixing unused variables with underscore.

The unpacked variables results and errors are not used after the experiment run. If they're intentionally unused, prefix them with underscores to indicate this explicitly.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)

12-18: Remove unused import from docstring example.

The example imports Predefined on line 13 but never uses it, which may confuse developers.

Apply this diff to remove the unused import:

     Example:
-        >>> from traceloop.sdk.evaluator import Predefined
-        >>>
         >>> evaluators = [
         ...     EvaluatorMadeByTraceloop.pii_detector(probability_threshold=0.8),
         ...     EvaluatorMadeByTraceloop.toxicity_detector(threshold=0.7),

21-68: Consider adding threshold validation.

The pii_detector, toxicity_detector, and prompt_injection methods document that thresholds should be in the range 0.0-1.0, but no validation enforces this constraint. Invalid values could cause runtime errors or unexpected behavior downstream.

Consider adding validation like:

    @staticmethod
    def pii_detector(
        probability_threshold: float = 0.5,
    ) -> EvaluatorDetails:
        """
        PII (Personally Identifiable Information) detector evaluator.

        Args:
            probability_threshold: Minimum probability threshold for detecting PII (0.0-1.0)

        Returns:
            EvaluatorDetails configured for PII detection
        """
        if not 0.0 <= probability_threshold <= 1.0:
            raise ValueError(
                f"probability_threshold must be between 0.0 and 1.0, got {probability_threshold}"
            )
        config: Dict[str, Any] = {"probability_threshold": probability_threshold}
        return EvaluatorDetails(slug="pii-detector", version=None, config=config)

Apply similar validation to toxicity_detector (line 38), prompt_injection (line 55), and sexism_detector (line 248).
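For instance, a minimal sketch of the same guard applied to toxicity_detector, assuming it keeps a single threshold parameter and a "toxicity-detector" slug (both inferred from the usage examples above; adjust to the actual signature and slug in the file):

    @staticmethod
    def toxicity_detector(threshold: float = 0.5) -> EvaluatorDetails:
        """Toxicity detector evaluator with the same range guard as pii_detector."""
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold must be between 0.0 and 1.0, got {threshold}")
        # Slug is assumed to match the existing method; keep whatever slug the file already uses.
        config: Dict[str, Any] = {"threshold": threshold}
        return EvaluatorDetails(slug="toxicity-detector", version=None, config=config)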

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d3eac37 and fd05222.

📒 Files selected for processing (2)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py
🧬 Code graph analysis (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
  • answer_relevancy (210-220)
  • faithfulness (223-233)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py

22-22: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


94-94: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


94-94: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.10)
🔇 Additional comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

1-2: LGTM!

The imports are appropriate and all are used throughout the file.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed cbe5abc in 56 seconds. Click for details.
  • Reviewed 27 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:273
  • Draft comment:
    Using an empty dict for config in secrets_detector ensures consistency with other evaluator methods. This is a good practice.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that using an empty dict for config ensures consistency, which does not align with the rules for useful comments.
2. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:128
  • Draft comment:
    Adding a detailed type annotation for evaluator_details improves clarity and consistency in code.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it praises the addition of a detailed type annotation without providing any actionable feedback or suggestions. It does not align with the rules for good comments, which should offer specific suggestions or highlight potential issues.

Workflow ID: wflow_Lt8OyqSWdiTp1m0X

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

96-103: Avoid silently ignoring unsupported evaluator specs in _run_locally

The new normalization only handles str and EvaluatorDetails. If a caller accidentally passes any other type (including a legacy (slug, version) tuple), that entry is silently dropped: no evaluator will run and there’s no error, which is hard to debug and could be a behavioral regression.

It would be safer to fail fast on unsupported types so misconfigurations are obvious to users.

@@
-        evaluators: Optional[List[EvaluatorSpec]] = None,
+        evaluators: Optional[List[EvaluatorSpec]] = None,
@@
-        # Convert evaluators to tuples of (slug, version, config)
+        # Convert evaluators to tuples of (slug, version, config)
         evaluator_details: Optional[List[Tuple[str, Optional[str], Optional[Dict[str, Any]]]]] = None
         if evaluators:
             evaluator_details = []
             for evaluator in evaluators:
                 if isinstance(evaluator, str):
                     # Simple string slug
                     evaluator_details.append((evaluator, None, None))
                 elif isinstance(evaluator, EvaluatorDetails):
                     # EvaluatorDetails object with config
                     evaluator_details.append((evaluator.slug, evaluator.version, evaluator.config))
+                else:
+                    raise TypeError(
+                        f"Unsupported evaluator spec {evaluator!r}; "
+                        "expected str or EvaluatorDetails."
+                    )

Optionally (if you still want to support legacy (slug, version) tuples) you could extend the branch instead of raising, but in that case you should also update EvaluatorSpec to include the tuple form.
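A rough sketch of that alternative branch, assuming the legacy form is a (slug, version) tuple and that EvaluatorSpec would be widened to include it (both assumptions, not existing code in this PR):

                elif isinstance(evaluator, tuple) and len(evaluator) == 2:
                    # Legacy (slug, version) tuple; no per-evaluator config is available in this form
                    slug, version = evaluator
                    evaluator_details.append((slug, version, None))
                else:
                    raise TypeError(
                        f"Unsupported evaluator spec {evaluator!r}; "
                        "expected str, EvaluatorDetails, or a (slug, version) tuple."
                    )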

Also applies to: 130-148, 173-196

♻️ Duplicate comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

101-122: Enforce the documented requirement for schema_string in json_validator.

The docstring says schema_string is "required if enable_schema_validation is True", but the implementation allows enable_schema_validation=True with schema_string=None. That breaks the documented contract and can push the error downstream instead of failing fast.

Add a guard that raises a clear ValueError when enable_schema_validation is True and schema_string is not provided:

     def json_validator(
         enable_schema_validation: bool = False,
         schema_string: Optional[str] = None,
     ) -> EvaluatorDetails:
@@
-        Returns:
-            EvaluatorDetails configured for JSON validation
-        """
-        config: Dict[str, Any] = {
-            "enable_schema_validation": enable_schema_validation,
-        }
-        if schema_string:
-            config["schema_string"] = schema_string
+        Returns:
+            EvaluatorDetails configured for JSON validation
+        """
+        if enable_schema_validation and not schema_string:
+            raise ValueError("schema_string is required when enable_schema_validation is True")
+
+        config: Dict[str, Any] = {
+            "enable_schema_validation": enable_schema_validation,
+        }
+        if schema_string:
+            config["schema_string"] = schema_string
🧹 Nitpick comments (4)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)

12-18: Clean up the example import to avoid confusion.

The docstring example imports Predefined but then uses EvaluatorMadeByTraceloop directly and never uses Predefined. Consider either importing EvaluatorMadeByTraceloop in the example or actually demonstrating how Predefined relates to this class, so the usage pattern is unambiguous.
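One possible cleaned-up docstring example; the import path is assumed from the module location in this PR and may differ if the class is re-exported from the package __init__:

    Example:
        >>> from traceloop.sdk.evaluator.evaluators_made_by_traceloop import EvaluatorMadeByTraceloop
        >>> evaluators = [
        ...     EvaluatorMadeByTraceloop.pii_detector(probability_threshold=0.8),
        ...     EvaluatorMadeByTraceloop.toxicity_detector(threshold=0.7),
        ... ]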


158-339: Optional: DRY up simple “no-config” evaluators.

Many evaluators (char_count, char_count_ratio, word_count, word_count_ratio, answer_relevancy, faithfulness, profanity_detector, secrets_detector, sql_validator, semantic_similarity, agent_goal_accuracy, topic_adherence, perplexity) all build config: Dict[str, Any] = {} and return an EvaluatorDetails with just a different slug. If this list grows, you might consider a tiny helper to reduce repetition and keep slugs centralized, e.g. _simple(slug: str) -> EvaluatorDetails.
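A minimal sketch of such a helper, assuming the EvaluatorDetails constructor shown in config.py and using faithfulness as an illustrative caller (the slug string is assumed; keep whatever slug the existing method uses):

    @staticmethod
    def _simple(slug: str) -> EvaluatorDetails:
        """Build an EvaluatorDetails with an empty config for evaluators that take no options."""
        return EvaluatorDetails(slug=slug, version=None, config={})

    @staticmethod
    def faithfulness() -> EvaluatorDetails:
        """Faithfulness evaluator (no configuration)."""
        return EvaluatorMadeByTraceloop._simple("faithfulness")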

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

38-63: Clarify evaluators parameter to match new EvaluatorSpec contract

The type hint now allows EvaluatorSpec, but the docstring still says “List of evaluator slugs to run”, which hides the support for EvaluatorDetails (with version/config). Updating the docstring will make the new API surface clearer.

@@
-        evaluators: List of evaluator slugs to run
+        evaluators: List of evaluator specs to run. Each item can be a
+            string evaluator slug or an EvaluatorDetails instance with
+            optional version/config.

251-255: Align _run_in_github docstring and behavior with EvaluatorSpec and fail-fast on bad specs

Here, evaluators is also typed as Optional[List[EvaluatorSpec]], and the slug extraction handles only str and EvaluatorDetails. The docstring, however, still claims support for “(slug, version) tuples”, which no longer matches the implementation and may confuse callers. Also, unsupported types are silently ignored, similar to _run_locally.

I recommend (a) updating the docstring to describe the actual supported forms and (b) failing fast on unsupported specs.

@@
-        evaluators: List of evaluator slugs or (slug, version) tuples to run
+        evaluators: List of evaluator specs to run. Each item can be a
+            string evaluator slug or an EvaluatorDetails instance with
+            optional version/config. For GitHub runs, only the slug is
+            currently used.
@@
-        evaluator_slugs = None
-        if evaluators:
-            evaluator_slugs = []
-            for evaluator in evaluators:
-                if isinstance(evaluator, str):
-                    evaluator_slugs.append(evaluator)
-                elif isinstance(evaluator, EvaluatorDetails):
-                    evaluator_slugs.append(evaluator.slug)
+        evaluator_slugs = None
+        if evaluators:
+            evaluator_slugs = []
+            for evaluator in evaluators:
+                if isinstance(evaluator, str):
+                    evaluator_slugs.append(evaluator)
+                elif isinstance(evaluator, EvaluatorDetails):
+                    evaluator_slugs.append(evaluator.slug)
+                else:
+                    raise TypeError(
+                        f"Unsupported evaluator spec {evaluator!r}; "
+                        "expected str or EvaluatorDetails."
+                    )

This makes the GitHub path’s behavior explicit and keeps configuration/version semantics for that path clearly documented (currently config/version are accepted via EvaluatorDetails but effectively ignored here).

Also applies to: 347-355

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fd05222 and cbe5abc.

📒 Files selected for processing (2)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-20)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-20)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
  • run_experiment_evaluator (67-107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.10)