Abstract
Background: Traditional Chinese medicine (TCM), with its knowledge-intensive framework, poses unique challenges for large language models (LLMs). Although TCM-specific benchmarks and models have been developed, the performance of lightweight LLMs remains insufficiently investigated. This study presents a systematic evaluation and comparison of large-scale and lightweight LLMs to assess their capabilities and deployment trade-offs. Methods: We developed a TCM-related question-answering dataset comprising 801 questions derived from TCM textbooks. Eleven LLMs were evaluated under zero-shot and few-shot prompting conditions in both English and Chinese. Performance was primarily measured by accuracy. Results: Large-scale LLMs achieved high accuracy on single-choice questions (69.01%–90.92%) and true/false questions (52.34%–59.38%) but performed poorly on multiple-choice questions, with a maximum accuracy of only 8.40%. Lightweight LLMs (2.10%–49.48%) generally lagged behind larger LLMs (6.30%–95.07%). However, Qwen3-1.7B (5.92%–54.20%) stood out and even surpassed the domain-specialized TCMChat-7B (2.10%–36.98%). Few-shot prompting improved performance in 8 of the 11 models (72.7%), and Chinese prompts yielded better results than English prompts in 9 of the 11 models (81.8%). Symptomatic diagnosis emerged as the most challenging reasoning category across all models (16.75%–48.07%). Conclusion: This study demonstrates that although large-scale LLMs exhibit strong knowledge recall in TCM, their suboptimal performance on multiple-choice questions and substantial computational costs may limit their practical applicability in clinical settings. The robust performance of Qwen3-1.7B indicates that effective model optimization and domain-specific training may offer greater advantages than simply increasing model size. Although the current evaluation is based on examination-style tasks and does not involve real-world clinical decision-making, our findings provide insights to support the deployment of optimized models in resource-constrained healthcare environments.
| Original language | English |
|---|---|
| Article number | e70118 |
| Journal | Journal of Evidence-Based Medicine |
| Volume | 19 |
| Issue number | 1 |
| DOIs | |
| State | Published - Mar 2026 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Keywords
- benchmark evaluation
- large language models
- lightweight large language models
- prompt engineering
- traditional Chinese medicine
Fingerprint
Dive into the research topics of 'Evaluating Large-Scale and Lightweight Large Language Models for Traditional Chinese Medicine Exam Questions: A Comparative Study'. Together they form a unique fingerprint.