TY - JOUR
T1 - TableBench
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
AU - Wu, Xianjie
AU - Yang, Jian
AU - Chai, Linzheng
AU - Zhang, Ge
AU - Liu, Jiaheng
AU - Du, Xeron
AU - Liang, Di
AU - Shu, Daixin
AU - Cheng, Xianfu
AU - Sun, Tianzhen
AU - Li, Tongliang
AU - Li, Zhoujun
AU - Niu, Guanglin
N1 - Publisher Copyright: © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
AB - Recent advancements in large language models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TABLELLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.
UR - https://www.scopus.com/pages/publications/105004001185
U2 - 10.1609/aaai.v39i24.34739
DO - 10.1609/aaai.v39i24.34739
M3 - Conference article
AN - SCOPUS:105004001185
SN - 2159-5399
VL - 39
SP - 25497
EP - 25506
JO - Proceedings of the AAAI Conference on Artificial Intelligence
JF - Proceedings of the AAAI Conference on Artificial Intelligence
IS - 24
Y2 - 25 February 2025 through 4 March 2025
ER -