
Add LoongFlow MLE-Bench Results (Open Source) #119

Merged
joe-needham merged 1 commit into openai:main from baidu-baige:feature/loongflow
Feb 20, 2026

Conversation

@hundredwz
Contributor

Hello MLE-Bench team,

We are pleased to submit the evaluation results for LoongFlow, an open-source thinking & learning framework for expert-grade AI agents.

To promote open research and community development, we have made the full framework source code publicly available at: https://github.com/baidu-baige/LoongFlow

Resources per run

  • Model: gemini-3-flash-preview
  • CPU / Memory: 36 vCPUs, 440 GB RAM
  • GPU: 2× NVIDIA A10 or 2× NVIDIA H20 (depending on node)
  • Time Limit: 24 hours

Pull Request Contents

This PR updates the README.md leaderboard and includes detailed grading reports for three independent runs, located at:

  • runs/loongflow_group1
  • runs/loongflow_group2
  • runs/loongflow_group3

Note: Grading was performed using the --pad-missing flag.

We deeply appreciate the MLE-Bench project for providing this valuable evaluation platform for ML engineering agents. We hope our open-source contribution will facilitate technical advancement and community growth in the AI Agent field.

Best regards,

The LoongFlow Team

@hundredwz
Contributor Author

For competitions where medals were achieved (any_medal: true), the ML code generated by our agent has been submitted and is available at:

https://github.com/baidu-baige/LoongFlow/tree/main/agents/ml_agent/examples/mlebench/competitions

@AtrixTang

Subject: Evaluation methodology concern — test-set feedback as data leakage

Congratulations on the results and thanks for open-sourcing the framework — the PES architecture and multi-island evolution are genuinely interesting work.

However, I believe there is a significant methodological concern that should be addressed.

The core issue: the agent optimizes directly on private test-set feedback.

In the current MLE-Bench setup, each iteration evaluates submission.csv against the ground-truth test answers (eval_program.py#L98-L100). Crucially, this test score is fed back as the evolutionary fitness signal that drives the agent's search.

This behavior was previously acknowledged by the team in Issue #36:

[Screenshot: the maintainers' acknowledgment in Issue #36]

Why this matters:
In my view, this constitutes a form of data leakage. The agent is not merely selecting a final submission using the test score — it is running a multi-iteration evolutionary search where the test score is the optimization objective. This is fundamentally equivalent to training on the test set: the longer the agent runs, the more information it extracts from the ground-truth labels to guide its search.

Comparison to the Kaggle Standard:
The standard Kaggle protocol that MLE-Bench is designed to simulate requires participants to make model selection decisions without access to private test scores.

  • Real-world scenario: Engineers must accept the risk of overfitting to the validation set.
  • Current approach: Using the test score as a per-iteration feedback signal gives the agent an omniscient view ("God mode"). This effectively eliminates the possibility of overfitting, because the system automatically discards or deprioritizes any mutation that scores worse on the test set.

A benchmark whose test labels can be iteratively optimized against is no longer measuring generalization — it is measuring search efficiency over a leaked answer key. This runs contrary to the fundamental intent of MLE-Bench and creates an unrealistic scenario that no human competitor or real-world engineer could ever replicate.
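For contrast, a minimal sketch of the leakage-free protocol the Kaggle standard implies (all data, names, and the threshold model here are synthetic assumptions, not anyone's actual pipeline): model selection touches only a held-out validation split, and the private test set is scored exactly once, with no feedback loop:

```python
import random

random.seed(1)

# Hypothetical toy data: binary labels noisily correlated with one feature.
def make_split(n):
    xs = [random.random() for _ in range(n)]
    ys = [1 if x + random.gauss(0, 0.2) > 0.5 else 0 for x in xs]
    return xs, ys

train_x, train_y = make_split(400)
val_x, val_y = make_split(200)    # held out; usable for model selection
test_x, test_y = make_split(200)  # private; scored once, never as feedback

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)

# Model selection uses ONLY the validation split. The engineer accepts the
# risk of overfitting to val -- exactly the risk the "God mode" setup removes.
candidates = [i / 100 for i in range(101)]
best_threshold = max(candidates, key=lambda t: accuracy(t, val_x, val_y))

# One final, feedback-free evaluation on the private test set.
final = accuracy(best_threshold, test_x, test_y)
print(f"threshold={best_threshold:.2f} test_acc={final:.2f}")
```

Under this protocol the test score measures generalization precisely because no optimization decision was ever conditioned on it.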

It seems to me that this approach might not be reasonable. What are your thoughts? @dexhunter @thesofakillers @kibrq

@JunnHuo

JunnHuo commented Feb 12, 2026

I would like to raise a related question: for MLE-Bench submissions where the code has not been made public, it is currently difficult to determine whether the evaluation strictly avoids iterative use of private test-set feedback.

More broadly, when only final scores are reported without releasing code, the evaluation protocol cannot be independently verified. This may affect the comparability of results across submissions.

In this context, would it be reasonable to include explicit indicators on the leaderboard—for example, whether the code is open-sourced and whether the method explicitly avoids using private test-set feedback during optimization? Clear disclosure on these points could help strengthen methodological transparency and consistency.

@AtrixTang

I fully agree that open-sourcing code is vital for transparency, and I appreciate you raising the point about verification.

However, I think we should be careful not to conflate transparency with methodological validity.

The distinction between the development set (train/val) and the test set is a fundamental consensus in Machine Learning. The core purpose of a benchmark like MLE-Bench is to measure generalization capabilities.

When an agent iteratively optimizes using feedback from the private test set, it fundamentally alters the nature of the task—from solving a problem to fitting an answer key. Regardless of whether this behavior is disclosed or flagged on a leaderboard, it diverges from the standard definition of ML evaluation.

So, rather than focusing on how to label these submissions, I believe we should first address the primary issue: Does allowing test-set feedback align with the intent of this benchmark? From a standard ML perspective, this is typically considered data leakage, not a valid strategy.

To be clear, I am advocating for a completely separate leaderboard for these submissions, rather than simply adding a "transparency tag" on the main board. Mixing these results would be misleading, as it conflates generalization capability with oracle search efficiency. Therefore, to preserve the integrity of the main benchmark, submissions that utilize private test-set feedback should be moved to a distinct "Oracle / Skyline" track.

@JunnHuo

JunnHuo commented Feb 12, 2026

One concern I have is that many submissions aren’t very clear about their evaluation setup. When the code isn’t public and only final scores are shared, it’s hard to tell whether private test-set feedback was used during optimization.

That lack of clarity can make it difficult to fairly compare results across submissions. It might help to ask for clearer disclosure of evaluation practices to keep things consistent and fair.

Thanks.

@joe-needham
Collaborator

Congrats on the submission!

Regarding the discussion above about lack of visibility: source code and full methodology details are indeed not mandatory, but we do include "Source Code Available" as a column in the leaderboard.

@joe-needham joe-needham merged commit 30111dc into openai:main Feb 20, 2026
sunmarinup pushed a commit to upgini/mle-bench that referenced this pull request Feb 24, 2026
sunmarinup added a commit to upgini/mle-bench that referenced this pull request Mar 13, 2026
* Add source code link for Leeroo submission (Open Source) (openai#116)
* Add LoongFlow MLE-Bench Results (Open Source) (openai#119)
* Add Disarray MLE-Bench Results (openai#118)
* Add leakage disclaimer to leaderboard (openai#125)
* Add MLEvolve 12h-budget results to MLE-Bench (Open Source) (openai#120)

Co-authored-by: Alireza Mohammadshahi <121011194+alirezamshi@users.noreply.github.com>
Co-authored-by: Hundredwz <sdwangzhuo@gmail.com>
Co-authored-by: Moustafa AbdelBaky <9140797+moustafa-a@users.noreply.github.com>
Co-authored-by: joe-needham <joe.needham@c-openai.com>
Co-authored-by: Shangheng Du <143237960+YoungDubbyDu@users.noreply.github.com>
Co-authored-by: root <root@dsh-mle-97hsx-18790-worker-0.dushangheng.ailab-ma4tool.svc.pjlab.local>