Evaluating Large-Scale and Lightweight Large Language Models for Traditional Chinese Medicine Exam Questions: A Comparative Study

Yizhen Li, Shaohan Huang, Jiaxing Qi, Yao Lu, Lei Quan, Dongran Han, Bin Li, Xincan Liu*, Zhongzhi Luan*
*Corresponding author for this work

  • Henan University of Chinese Medicine
  • Beihang University
  • Beijing University of Chinese Medicine

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Traditional Chinese medicine (TCM), with its knowledge-intensive framework, poses unique challenges for large language models (LLMs). Although TCM-specific benchmarks and models have been developed, the performance of lightweight LLMs remains insufficiently investigated. This study presents a systematic evaluation and comparison of large-scale and lightweight LLMs to assess their capabilities and deployment trade-offs.

Methods: We developed a TCM-related question-answering dataset comprising 801 questions derived from TCM textbooks. Eleven LLMs were evaluated under zero-shot and few-shot prompting conditions in both English and Chinese. Performance was primarily measured by accuracy.

Results: Large-scale LLMs achieved high accuracy on single-choice questions (69.01%–90.92%) and true/false questions (52.34%–59.38%) but performed poorly on multiple-choice questions, with a maximum accuracy of only 8.40%. Lightweight LLMs (2.10%–49.48%) generally lagged behind larger LLMs (6.30%–95.07%). However, Qwen3-1.7B (5.92%–54.20%) stood out and even surpassed the domain-specialized TCMChat-7B (2.10%–36.98%). Few-shot prompting improved performance in 8/11 (72.7%) of the models, and Chinese prompts yielded better results than English prompts in 9/11 (81.8%) of the models. Symptomatic diagnosis emerged as the most challenging reasoning category across all models (16.75%–48.07%).

Conclusion: This study demonstrates that although large-scale LLMs exhibit strong knowledge recall in TCM, their suboptimal performance on multiple-choice questions and substantial computational costs may limit their practical applicability in clinical settings. The robust performance of Qwen3-1.7B indicates that effective model optimization and domain-specific training may offer greater advantages than simply increasing model size. While the current evaluation is based on examination-style tasks and does not involve real-world clinical decision-making, our findings provide insights to support the deployment of optimized models in resource-constrained healthcare environments.
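
The abstract reports accuracy per question type and the share of models that improved under each prompting condition. As an illustration only, the following Python sketch shows one way such figures could be computed. It assumes exact-match scoring, where a multiple-choice item counts as correct only if the full set of selected options matches the key; the abstract does not state the scoring rule, so this is an assumption. The function names `accuracy_by_type` and `share` and the toy records are hypothetical and are not data from the study.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return accuracy (in percent) per question type.

    records: iterable of (question_type, predicted, gold) tuples, where
    answers are strings of option letters. Answers are compared as sets,
    so a multiple-choice item is correct only on an exact match
    (an assumed scoring rule, not one confirmed by the abstract).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, gold in records:
        total[qtype] += 1
        if set(predicted) == set(gold):
            correct[qtype] += 1
    return {q: 100.0 * correct[q] / total[q] for q in total}


def share(n, total):
    """Percentage of n out of total, rounded to one decimal place."""
    return round(100.0 * n / total, 1)


# Hypothetical toy records, not data from the study.
toy = [
    ("single", "A", "A"),
    ("single", "B", "C"),
    ("multiple", "AC", "ACD"),  # partially right, scored as wrong
    ("multiple", "BD", "DB"),   # same option set, scored as right
    ("true_false", "T", "T"),
]

print(accuracy_by_type(toy))
print(share(8, 11), share(9, 11))  # the 8/11 and 9/11 model shares
```

Under this assumed rule, a partially correct multiple-choice answer earns no credit, which is one possible reason accuracy on that question type can be far lower than on single-choice items.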

Original language: English
Article number: e70118
Journal: Journal of Evidence-Based Medicine
Volume: 19
Issue number: 1
State: Published - Mar 2026

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • benchmark evaluation
  • large language models
  • lightweight large language models
  • prompt engineering
  • traditional Chinese medicine
