
OMNI-MATH: A UNIVERSAL OLYMPIAD LEVEL MATHEMATIC BENCHMARK FOR LARGE LANGUAGE MODELS

  • Bofei Gao
  • Feifan Song
  • Zhe Yang
  • Zefan Cai
  • Yibo Miao
  • Qingxiu Dong
  • Lei Li
  • Chenghao Ma
  • Liang Chen
  • Runxin Xu
  • Zhengyang Tang
  • Benyou Wang
  • Daoguang Zan
  • Shanghaoran Quan
  • Ge Zhang
  • Lei Sha
  • Yichang Zhang
  • Xuancheng Ren
  • Tianyu Liu
  • Baobao Chang*
  • *Corresponding author for this work
  • Peking University
  • University of Wisconsin-Madison
  • Shanghai Jiao Tong University
  • The University of Hong Kong
  • Engineering Research Center of Information Networks
  • The Chinese University of Hong Kong, Shenzhen
  • CAS - Institute of Software
  • Alibaba Group Holding Ltd.
  • University of Waterloo
  • Zhongguancun Laboratory

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on the MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-level mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.

Original language: English
Title of host publication: 13th International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 98023-98052
Number of pages: 30
ISBN (electronic): 9798331320850
Publication status: Published - 2025
Externally published
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: 24 Apr 2025 → 28 Apr 2025

Publication series

Name: 13th International Conference on Learning Representations, ICLR 2025

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
City: Singapore
Period: 24/04/25 → 28/04/25
