TY - GEN
T1 - Large Language Models Struggle with Unreasonability in Math Problems
AU - Ma, Jingyuan
AU - Dai, Damai
AU - Yuan, Zihang
AU - Li, Rui
AU - Luo, Weilin
AU - Wang, Bin
AU - Liu, Qun
AU - Sha, Lei
AU - Sui, Zhifang
N1 - Publisher Copyright: © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2026
Y1 - 2026
N2 - Large Language Models (LLMs) have shown remarkable success across a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues, the models frequently proceed as if the problems are well posed, producing incorrect answers or overthinking and producing verbose self-corrections. To systematically investigate this overlooked vulnerability, we propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs’ ability to detect and respond to unreasonable math problems. Based on extensive experiments covering 19 LLMs, we find that even state-of-the-art general models such as GPT-4o struggle on UMP. Reasoning models such as DeepSeek-R1 demonstrate higher sensitivity to unreasonable inputs; however, this sensitivity often comes at the cost of generating overly long and meaningless responses that fail to converge. We further find that prompting and fine-tuning enhance the detection of unreasonable inputs with minor and acceptable trade-offs that make them practical solutions in this challenging setting.
UR - https://www.scopus.com/pages/publications/105034603867
U2 - 10.1609/aaai.v40i38.40518
DO - 10.1609/aaai.v40i38.40518
M3 - Conference contribution
AN - SCOPUS:105034603867
SN - 9781577359067
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 32428
EP - 32436
BT - Proceedings of the AAAI Conference on Artificial Intelligence
A2 - Koenig, Sven
A2 - Jenkins, Chad
A2 - Taylor, Matthew E.
PB - Association for the Advancement of Artificial Intelligence
T2 - 40th AAAI Conference on Artificial Intelligence, AAAI 2026
Y2 - 20 January 2026 through 27 January 2026
ER -