Add LoongFlow MLE-Bench Results (Open Source) #119
Conversation
| For competitions where medals were achieved (any_medal: true), the ML code generated by our agent has been submitted and is available at: https://github.com/baidu-baige/LoongFlow/tree/main/agents/ml_agent/examples/mlebench/competitions |
Subject: Evaluation methodology concern — test-set feedback as data leakage

Congratulations on the results and thanks for open-sourcing the framework — the PES architecture and multi-island evolution are genuinely interesting work. However, I believe there is a significant methodological concern that should be addressed. The core issue: the agent optimizes directly on private test-set feedback. In the current MLE-Bench setup, each iteration evaluates candidate solutions against the private test set and feeds the resulting score back into the agent's search loop.
This behavior was previously acknowledged by the team in Issue #36.

Why this matters: Comparison to the Kaggle Standard:
A benchmark whose test labels can be iteratively optimized against is no longer measuring generalization — it is measuring search efficiency over a leaked answer key. This runs contrary to the fundamental intent of MLE-Bench and creates an unrealistic scenario that no human competitor or real-world engineer could ever replicate. It seems to me that this approach might not be reasonable. What are your thoughts? @dexhunter @thesofakillers @kibrq |
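The "oracle search" point above can be made concrete with a toy simulation (an illustrative sketch only — this is not the LoongFlow agent or the MLE-Bench harness): candidates with zero real skill, when selected by their score on the hidden test set, appear well above chance.

```python
import random

# Toy illustration of why selecting candidates by private test-set score
# inflates the reported metric, even when no candidate has any real skill.
random.seed(0)

N_LABELS = 200       # size of the hidden test set
N_CANDIDATES = 500   # "iterations" of an agent proposing candidate solutions

truth = [random.randint(0, 1) for _ in range(N_LABELS)]

def score(preds, labels):
    """Accuracy of a prediction vector against the hidden labels."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

# Each candidate is a coin-flip predictor: zero generalization ability.
candidates = [[random.randint(0, 1) for _ in range(N_LABELS)]
              for _ in range(N_CANDIDATES)]

# Oracle search: keep whichever candidate scores best on the private test set.
best_leaked = max(score(c, truth) for c in candidates)

# Honest protocol: commit to one candidate without any test-set feedback.
honest = score(candidates[0], truth)

print(f"honest single-shot accuracy: {honest:.2f}")              # ~0.50
print(f"best of {N_CANDIDATES} on leaked test set: {best_leaked:.2f}")  # well above 0.50
```

The gap between the two numbers is pure selection bias from the leaked answer key — exactly the quantity a generalization benchmark is supposed to exclude.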
| I would like to raise a related question: for MLE-Bench submissions where the code has not been made public, it is currently difficult to determine whether the evaluation strictly avoids iterative use of private test-set feedback. More broadly, when only final scores are reported without releasing code, the evaluation protocol cannot be independently verified. This may affect the comparability of results across submissions. In this context, would it be reasonable to include explicit indicators on the leaderboard—for example, whether the code is open-sourced and whether the method explicitly avoids using private test-set feedback during optimization? Clear disclosure on these points could help strengthen methodological transparency and consistency. |
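The disclosure indicators proposed above could be captured as two boolean fields per leaderboard entry. The sketch below is purely illustrative — the class and field names are hypothetical and not part of the MLE-Bench repository:

```python
from dataclasses import dataclass

@dataclass
class LeaderboardEntry:
    """Hypothetical leaderboard row carrying the proposed disclosure flags."""
    agent_name: str
    any_medal_pct: float      # reported headline metric
    code_open_sourced: bool   # can the protocol be independently verified?
    uses_test_feedback: bool  # does optimization see private test-set scores?

    def verifiable_generalization(self) -> bool:
        # Only open-sourced submissions that avoid test-set feedback can be
        # audited as measuring generalization rather than oracle search.
        return self.code_open_sourced and not self.uses_test_feedback

entry = LeaderboardEntry("ExampleAgent", 31.5,
                         code_open_sourced=True, uses_test_feedback=False)
print(entry.verifiable_generalization())  # True
```

Rendering these two columns on the leaderboard would let readers filter results by evaluation protocol without any change to how scores are computed.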
| I fully agree that open-sourcing code is vital for transparency, and I appreciate you raising the point about verification. However, I think we should be careful not to conflate transparency with methodological validity. The distinction between the development set (train/val) and the test set is a fundamental consensus in Machine Learning. The core purpose of a benchmark like MLE-Bench is to measure generalization capabilities. When an agent iteratively optimizes using feedback from the private test set, it fundamentally alters the nature of the task—from solving a problem to fitting an answer key. Regardless of whether this behavior is disclosed or flagged on a leaderboard, it diverges from the standard definition of ML evaluation. So, rather than focusing on how to label these submissions, I believe we should first address the primary issue: Does allowing test-set feedback align with the intent of this benchmark? From a standard ML perspective, this is typically considered data leakage, not a valid strategy. To be clear, I am advocating for a completely separate leaderboard for these submissions, rather than simply adding a "transparency tag" on the main board. Mixing these results would be misleading, as it conflates generalization capability with oracle search efficiency. Therefore, to preserve the integrity of the main benchmark, submissions that utilize private test-set feedback should be moved to a distinct "Oracle / Skyline" track. |
| One concern I have is that many submissions aren’t very clear about their evaluation setup. When the code isn’t public and only final scores are shared, it’s hard to tell whether private test-set feedback was used during optimization. That lack of clarity can make it difficult to fairly compare results across submissions. It might help to ask for clearer disclosure of evaluation practices to keep things consistent and fair. Thanks. |
| Congrats on the submission! Regarding discussion above re lack of visibility, indeed source-code and full methodology details are not mandatory but we do include |
* Add source code link for Leeroo submission (Open Source) (openai#116)

* Add LoongFlow MLE-Bench Results (Open Source) (openai#119)

* Add Disarray MLE-Bench Results (openai#118)

Co-authored-by: Alireza Mohammadshahi <121011194+alirezamshi@users.noreply.github.com>
Co-authored-by: Hundredwz <sdwangzhuo@gmail.com>

* Add leakage disclaimer to leaderboard (openai#125)

* Add MLEvolve 12h-budget results to MLE-Bench (Open Source) (openai#120)

Co-authored-by: root <root@dsh-mle-97hsx-18790-worker-0.dushangheng.ailab-ma4tool.svc.pjlab.local>

---------

Co-authored-by: Alireza Mohammadshahi <121011194+alirezamshi@users.noreply.github.com>
Co-authored-by: Hundredwz <sdwangzhuo@gmail.com>
Co-authored-by: Moustafa AbdelBaky <9140797+moustafa-a@users.noreply.github.com>
Co-authored-by: joe-needham <joe.needham@c-openai.com>
Co-authored-by: Shangheng Du <143237960+YoungDubbyDu@users.noreply.github.com>
Co-authored-by: root <root@dsh-mle-97hsx-18790-worker-0.dushangheng.ailab-ma4tool.svc.pjlab.local>

Hello MLE-Bench team,
We are pleased to submit the evaluation results for LoongFlow, an open-source thinking & learning framework for expert-grade AI agents.
To promote open research and community development, we have made the full framework source code publicly available at: https://github.com/baidu-baige/LoongFlow
Resources per run
Pull Request Contents
This PR updates the `README.md` leaderboard and includes detailed grading reports for three independent runs, located at:

- `runs/loongflow_group1`
- `runs/loongflow_group2`
- `runs/loongflow_group3`

Note: Grading was performed using the `--pad-missing` flag.

We deeply appreciate the MLE-Bench project for providing this valuable evaluation platform for ML engineering agents. We hope our open-source contribution will facilitate technical advancement and community growth in the AI Agent field.
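For readers unfamiliar with pad-missing grading, the sketch below illustrates the general idea of such an aggregation — competitions absent from a run's grading report are counted as failures rather than dropped. This is a conceptual illustration only; the function and competition names are hypothetical, not the mlebench implementation:

```python
# Hypothetical competition IDs for illustration.
COMPETITIONS = ["comp_a", "comp_b", "comp_c", "comp_d"]

def medal_rate(report: dict, pad_missing: bool) -> float:
    """report maps competition id -> any_medal, for competitions actually graded.

    With pad_missing=True, competitions missing from the report are scored
    as no-medal instead of being excluded from the denominator.
    """
    if pad_missing:
        graded = {cid: report.get(cid, False) for cid in COMPETITIONS}
    else:
        graded = dict(report)
    return sum(graded.values()) / len(graded)

run = {"comp_a": True, "comp_b": False, "comp_c": True}  # comp_d not graded

print(medal_rate(run, pad_missing=False))  # 2/3: missing competition dropped
print(medal_rate(run, pad_missing=True))   # 2/4: missing counted as no medal
```

Padding keeps the denominator fixed across runs, so a run that crashed on some competitions cannot report an artificially high medal rate.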
Best regards,
The LoongFlow Team