
Add LoongFlow MLE-Bench Results (Open Source) #119

Merged
joe-needham merged 1 commit into openai:main from baidu-baige:feature/loongflow
Feb 20, 2026

Conversation

@hundredwz
Contributor

Hello MLE-Bench team,

We are pleased to submit the evaluation results for LoongFlow, an open-source thinking & learning framework for expert-grade AI agents.

To promote open research and community development, we have made the full framework source code publicly available at: https://github.com/baidu-baige/LoongFlow

Resources per run

  • Model: gemini-3-flash-preview
  • CPU / Memory: 36 vCPUs, 440 GB RAM
  • GPU: 2× NVIDIA A10 or 2× NVIDIA H20 (depending on node)
  • Time Limit: 24 hours

Pull Request Contents

This PR updates the README.md leaderboard and includes detailed grading reports for three independent runs, located at:

  • runs/loongflow_group1
  • runs/loongflow_group2
  • runs/loongflow_group3

Note: Grading was performed using the --pad-missing flag.

We deeply appreciate the MLE-Bench project for providing this valuable evaluation platform for ML engineering agents. We hope our open-source contribution will facilitate technical advancement and community growth in the AI Agent field.

Best regards,

The LoongFlow Team

@hundredwz
Contributor Author

For competitions where medals were achieved (any_medal: true), the ML code generated by our agent has been submitted and is available at:

https://github.com/baidu-baige/LoongFlow/tree/main/agents/ml_agent/examples/mlebench/competitions

@AtrixTang

Subject: Evaluation methodology concern — test-set feedback as data leakage

Congratulations on the results and thanks for open-sourcing the framework — the PES architecture and multi-island evolution are genuinely interesting work.

However, I believe there is a significant methodological concern that should be addressed.

The core issue: the agent optimizes directly on private test-set feedback.

In the current MLE-Bench setup, each iteration evaluates submission.csv against the ground-truth test answers (eval_program.py#L98-L100). Crucially, this test score is fed back as the evolutionary fitness signal that drives the agent's search.

This behavior was previously acknowledged by the team in Issue #36:

[Screenshot: the maintainers' acknowledgment in Issue #36]

Why this matters:
In my view, this constitutes a form of data leakage. The agent is not merely selecting a final submission using the test score — it is running a multi-iteration evolutionary search where the test score is the optimization objective. This is fundamentally equivalent to training on the test set: the longer the agent runs, the more information it extracts from the ground-truth labels to guide its search.

Comparison to the Kaggle Standard:
The standard Kaggle protocol that MLE-Bench is designed to simulate requires participants to make model selection decisions without access to private test scores.

  • Real-world scenario: Engineers must accept the risk of overfitting to the validation set.
  • Current approach: Using the test score as a per-iteration feedback signal gives the agent an omniscient view ("God mode"). This effectively eliminates the possibility of overfitting, because the system automatically discards or deprioritizes any mutation that scores worse on the test set.

A benchmark whose test labels can be iteratively optimized against is no longer measuring generalization — it is measuring search efficiency over a leaked answer key. This runs contrary to the fundamental intent of MLE-Bench and creates an unrealistic scenario that no human competitor or real-world engineer could ever replicate.
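For contrast, a minimal sketch of the leakage-free protocol the Kaggle standard implies (all data, names, and the threshold model here are synthetic assumptions, not anyone's actual pipeline): model selection touches only a held-out validation split, and the private test set is scored exactly once, with no feedback loop:

```python
import random

random.seed(1)

# Hypothetical toy data: binary labels noisily correlated with one feature.
def make_split(n):
    xs = [random.random() for _ in range(n)]
    ys = [1 if x + random.gauss(0, 0.2) > 0.5 else 0 for x in xs]
    return xs, ys

train_x, train_y = make_split(400)
val_x, val_y = make_split(200)    # held out; usable for model selection
test_x, test_y = make_split(200)  # private; scored once, never as feedback

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)

# Model selection uses ONLY the validation split. The engineer accepts the
# risk of overfitting to val -- exactly the risk the "God mode" setup removes.
candidates = [i / 100 for i in range(101)]
best_threshold = max(candidates, key=lambda t: accuracy(t, val_x, val_y))

# One final, feedback-free evaluation on the private test set.
final = accuracy(best_threshold, test_x, test_y)
print(f"threshold={best_threshold:.2f} test_acc={final:.2f}")
```

Under this protocol the test score measures generalization precisely because no optimization decision was ever conditioned on it.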

It seems to me that this approach might not be reasonable. What are your thoughts? @dexhunter @thesofakillers @kibrq

@JunnHuo

JunnHuo commented Feb 12, 2026

I would like to raise a related question: for MLE-Bench submissions where the code has not been made public, it is currently difficult to determine whether the evaluation strictly avoids iterative use of private test-set feedback.

More broadly, when only final scores are reported without releasing code, the evaluation protocol cannot be independently verified. This may affect the comparability of results across submissions.

In this context, would it be reasonable to include explicit indicators on the leaderboard—for example, whether the code is open-sourced and whether the method explicitly avoids using private test-set feedback during optimization? Clear disclosure on these points could help strengthen methodological transparency and consistency.

@AtrixTang

I fully agree that open-sourcing code is vital for transparency, and I appreciate you raising the point about verification.

However, I think we should be careful not to conflate transparency with methodological validity.

The distinction between the development set (train/val) and the test set is a fundamental consensus in Machine Learning. The core purpose of a benchmark like MLE-Bench is to measure generalization capabilities.

When an agent iteratively optimizes using feedback from the private test set, it fundamentally alters the nature of the task—from solving a problem to fitting an answer key. Regardless of whether this behavior is disclosed or flagged on a leaderboard, it diverges from the standard definition of ML evaluation.

So, rather than focusing on how to label these submissions, I believe we should first address the primary issue: Does allowing test-set feedback align with the intent of this benchmark? From a standard ML perspective, this is typically considered data leakage, not a valid strategy.

To be clear, I am advocating for a completely separate leaderboard for these submissions, rather than simply adding a "transparency tag" on the main board. Mixing these results would be misleading, as it conflates generalization capability with oracle search efficiency. Therefore, to preserve the integrity of the main benchmark, submissions that utilize private test-set feedback should be moved to a distinct "Oracle / Skyline" track.

@JunnHuo

JunnHuo commented Feb 12, 2026

One concern I have is that many submissions aren’t very clear about their evaluation setup. When the code isn’t public and only final scores are shared, it’s hard to tell whether private test-set feedback was used during optimization.

That lack of clarity can make it difficult to fairly compare results across submissions. It might help to ask for clearer disclosure of evaluation practices to keep things consistent and fair.

Thanks.

@joe-needham
Collaborator

Congrats on the submission!

Regarding the discussion above about lack of visibility: source code and full methodology details are indeed not mandatory, but we do include "Source Code Available" as a column in the leaderboard.

@joe-needham joe-needham merged commit 30111dc into openai:main Feb 20, 2026
sunmarinup pushed a commit to upgini/mle-bench that referenced this pull request Feb 24, 2026
sunmarinup added a commit to upgini/mle-bench that referenced this pull request Mar 13, 2026
* Add source code link for Leeroo submission (Open Source) (openai#116)
* Add LoongFlow MLE-Bench Results (Open Source) (openai#119)
* Add Disarray MLE-Bench Results (openai#118)
* Add leakage disclaimer to leaderboard (openai#125)
* Add MLEvolve 12h-budget results to MLE-Bench (Open Source) (openai#120)

Co-authored-by: Alireza Mohammadshahi <121011194+alirezamshi@users.noreply.github.com>
Co-authored-by: Hundredwz <sdwangzhuo@gmail.com>
Co-authored-by: Moustafa AbdelBaky <9140797+moustafa-a@users.noreply.github.com>
Co-authored-by: joe-needham <joe.needham@c-openai.com>
Co-authored-by: Shangheng Du <143237960+YoungDubbyDu@users.noreply.github.com>
Co-authored-by: root <root@dsh-mle-97hsx-18790-worker-0.dushangheng.ailab-ma4tool.svc.pjlab.local>