TY - GEN
T1 - LLM-Forest
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - He, Xinrui
AU - Ban, Yikun
AU - Zou, Jiaru
AU - Wei, Tianxin
AU - Cook, Curtiss B.
AU - He, Jingrui
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Missing data imputation is a critical challenge in various domains, such as healthcare and finance, where data completeness is vital for accurate analysis. Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation, making them a promising tool for data imputation. However, challenges persist in designing effective prompts for a fine-tuning-free process and in mitigating biases and uncertainty in LLM outputs. To address these issues, we propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot learning LLM "trees" whose outputs are aggregated via confidence-based weighted voting grounded in LLM self-assessment, inspired by ensemble learning (Random Forest). The framework is built on a new concept of bipartite information graphs to identify high-quality, relevant neighboring entries at both feature and value granularity. Extensive experiments on 9 real-world datasets demonstrate the effectiveness and efficiency of LLM-Forest. The implementation is available at https://github.com/Xinrui17/LLM-Forest.
AB - Missing data imputation is a critical challenge in various domains, such as healthcare and finance, where data completeness is vital for accurate analysis. Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation, making them a promising tool for data imputation. However, challenges persist in designing effective prompts for a fine-tuning-free process and in mitigating biases and uncertainty in LLM outputs. To address these issues, we propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot learning LLM "trees" whose outputs are aggregated via confidence-based weighted voting grounded in LLM self-assessment, inspired by ensemble learning (Random Forest). The framework is built on a new concept of bipartite information graphs to identify high-quality, relevant neighboring entries at both feature and value granularity. Extensive experiments on 9 real-world datasets demonstrate the effectiveness and efficiency of LLM-Forest. The implementation is available at https://github.com/Xinrui17/LLM-Forest.
UR - https://www.scopus.com/pages/publications/105028635955
U2 - 10.18653/v1/2025.findings-acl.361
DO - 10.18653/v1/2025.findings-acl.361
M3 - Conference contribution
AN - SCOPUS:105028635955
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 6921
EP - 6936
BT - Findings of the Association for Computational Linguistics: ACL 2025
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -