Laboratory Medicine ›› 2025, Vol. 40 ›› Issue (11): 1075-1081.DOI: 10.3969/j.issn.1673-8640.2025.11.008


Performance evaluation of different large language models in interpreting tumor marker determination reports

QI Xinglun, YAO Yifan, SHEN Shushi, YANG Zheng, ZHU Junjie, FAN Lina, YANG Dagan

  1. Department of Clinical Laboratory, the First Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310003, Zhejiang, China
  • Received: 2025-06-04 Revised: 2025-09-22 Online: 2025-11-30 Published: 2025-12-12

Abstract:

Objective To evaluate the ability of different large language models (LLMs) to interpret tumor marker determination reports and to provide a reference for the clinical application of LLMs. Methods Data from patients who underwent tumor marker determination at the First Affiliated Hospital of Zhejiang University School of Medicine in 2024 were collected. Stratified random sampling was performed with the sampling package in R, and 200 determination reports were selected. DeepSeek R1, Qwen 3, KIMI and ChatGPT 4.1 were each used to interpret the 200 reports. Two junior and two senior assessors rated the interpretation quality of the 4 LLMs on a 10-point scale. Differences in interpretation ability among the LLMs were evaluated with the Friedman and Wilcoxon tests. Results For the reports that all 4 LLMs could interpret, the overall scores from high to low were DeepSeek R1 [9 (8, 10) points], Qwen 3 [9 (8, 10) points], KIMI [8 (6, 10) points] and ChatGPT 4.1 [7 (5, 9) points] (P<0.001). The proportions of interpretations scoring ≤5 points were 0.3% for DeepSeek R1, 3.6% for Qwen 3, 19.0% for KIMI and 27.2% for ChatGPT 4.1. Scores given by assessors of different seniority differed significantly for the 3 LLMs other than DeepSeek R1 (P<0.001). The scores for comprehensiveness, accuracy, clarity and relevance of the LLM report interpretations were consistently good (P<0.001). The LLMs' overall interpretation ability and their ability to identify abnormal indicators were superior to their ability to analyze the causes of abnormalities and to make clinical recommendations. The incidences of hallucination in DeepSeek R1 (3.0%) and Qwen 3 (2.5%) were lower than those in KIMI (13.0%) and ChatGPT 4.1 (16.0%). Conclusions LLM report interpretation can serve as an auxiliary tool, but performance differs among models.
In practical applications, strict supervision and continuous improvement are needed to reduce the risk of incorrect report interpretation.
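The statistical comparison described in the Methods (an omnibus Friedman test across the four paired score sets, followed by pairwise Wilcoxon signed-rank tests) can be sketched in Python with SciPy. The study itself used R; the scores below are synthetic placeholders that only loosely mirror the reported medians (9, 9, 8, 7), not the actual study data.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(42)
n_reports = 200  # the study sampled 200 reports

# Synthetic 10-point interpretation scores for the four models
# (illustrative only; ranges chosen to roughly echo the reported medians).
scores = {
    "DeepSeek R1": rng.integers(8, 11, n_reports),
    "Qwen 3":      rng.integers(7, 11, n_reports),
    "KIMI":        rng.integers(5, 11, n_reports),
    "ChatGPT 4.1": rng.integers(4, 10, n_reports),
}

# Omnibus test: do the four paired score distributions differ?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi2={stat:.1f}, p={p:.2e}")

# Post-hoc pairwise Wilcoxon signed-rank tests (scores paired by report).
names = list(scores)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        w, pw = wilcoxon(scores[names[i]], scores[names[j]])
        print(f"{names[i]} vs {names[j]}: p={pw:.2e}")
```

In a full analysis the pairwise p-values would also need a multiple-comparison correction (e.g. Bonferroni), which the sketch omits.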

Key words: Large language model, Tumor marker, Report interpretation
