TY - GEN
T1 - Turning the Tide
T2 - 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
AU - Zhang, Wei
AU - Yang, Jian
AU - Yang, Jiaxi
AU - Wang, Ya
AU - Li, Zhoujun
AU - Cui, Zeyu
AU - Hui, Binyuan
AU - Lin, Junyang
N1 - Publisher Copyright:
©2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, improving development efficiency and accessibility. While benchmarks (e.g. HumanEval/LiveCodeBench) evaluate code generation and real-world relevance, previous works ignore the scenario of modifying code in repositories. Considering challenges remaining in improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. Further, we create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, used to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. The leaderboard evaluates over 40 LLMs to reflect the model performance of repository-based code reflection.
AB - Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, improving development efficiency and accessibility. While benchmarks (e.g. HumanEval/LiveCodeBench) evaluate code generation and real-world relevance, previous works ignore the scenario of modifying code in repositories. Considering challenges remaining in improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. Further, we create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, used to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. The leaderboard evaluates over 40 LLMs to reflect the model performance of repository-based code reflection.
UR - https://www.scopus.com/pages/publications/105028961477
U2 - 10.18653/v1/2025.findings-emnlp.377
DO - 10.18653/v1/2025.findings-emnlp.377
M3 - 会议稿件
AN - SCOPUS:105028961477
T3 - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
SP - 7148
EP - 7164
BT - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
A2 - Christodoulopoulos, Christos
A2 - Chakraborty, Tanmoy
A2 - Rose, Carolyn
A2 - Peng, Violet
PB - Association for Computational Linguistics (ACL)
Y2 - 4 November 2025 through 9 November 2025
ER -