检验医学 (Laboratory Medicine) ›› 2025, Vol. 40 ›› Issue (11): 1075-1081. DOI: 10.3969/j.issn.1673-8640.2025.11.008

• Original Articles •

Performance evaluation of different large language models in interpreting tumor marker determination reports

QI Xinglun, YAO Yifan, SHEN Shushi, YANG Zheng, ZHU Junjie, FAN Lina, YANG Dagan

  1. Department of Clinical Laboratory, the First Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310003, Zhejiang, China
  • Received: 2025-06-04  Revised: 2025-09-22  Online: 2025-11-30  Published: 2025-12-12
  • Corresponding author: YANG Dagan, E-mail: yangdagan@zju.edu.cn
  • About the author: QI Xinglun, female, born in 1995, bachelor's degree, technologist-in-charge, mainly engaged in basic clinical laboratory testing and data science research.
  • Funding: National Key Research and Development Program of China (2022YFC3602302)

Abstract:

Objective To evaluate the ability of different large language models (LLMs) to interpret tumor marker determination reports and to provide a reference for the clinical application of LLMs. Methods Data on patients who underwent tumor marker determination at the First Affiliated Hospital of Zhejiang University School of Medicine in 2024 were collected. Stratified random sampling was performed with the sampling package of R software, and 200 determination reports were randomly selected. DeepSeek R1, Qwen 3, KIMI and ChatGPT 4.1 were each used to interpret the 200 reports. Two junior and 2 senior assessors were selected to evaluate the interpretation quality of the 4 LLMs on a 10-point scale. Differences in interpretation ability among the LLMs were analyzed using the Friedman and Wilcoxon tests. Results All 4 LLMs were able to interpret the tumor marker reports. The overall scores from high to low were DeepSeek R1 [9 (8, 10) points], Qwen 3 [9 (8, 10) points], KIMI [8 (6, 10) points] and ChatGPT 4.1 [7 (5, 9) points], and the differences among the 4 LLMs were statistically significant (P<0.001). Scores of ≤5 points accounted for 0.3% of DeepSeek R1, 3.6% of Qwen 3, 19.0% of KIMI and 27.2% of ChatGPT 4.1 interpretations. The scores given by assessors of different levels differed with statistical significance for the 3 LLMs other than DeepSeek R1 (P<0.001). The scores for the comprehensiveness, accuracy, clarity and relevance of the LLM report interpretations were consistently good (P<0.001). The ability of the LLMs to interpret reports and identify abnormal indicators was superior to their ability to analyze the causes of abnormalities and to make clinical recommendations. The incidences of hallucination in DeepSeek R1 and Qwen 3 were 3.0% and 2.5%, lower than those in KIMI (13.0%) and ChatGPT 4.1 (16.0%). Conclusions LLMs can serve as auxiliary tools for report interpretation, but interpretation performance differs among models. In practical applications, strict supervision and continuous improvement are needed to reduce the risk of incorrect report interpretation.
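
As a minimal sketch of the statistical workflow named in Methods (stratified random sampling with the R sampling package, then Friedman and pairwise Wilcoxon tests), the R code below assumes hypothetical data frames reports and scores with illustrative column names (cancer_type, report_id, llm, score); the paper does not specify its data layout or stratification variable.

    library(sampling)

    set.seed(2024)  # assumed seed, for reproducibility of the sketch only

    # reports: one row per tumor marker report; cancer_type is an assumed
    # stratification variable. strata() requires data sorted by the strata.
    reports <- reports[order(reports$cancer_type), ]
    n_per_stratum <- table(reports$cancer_type)
    sizes <- round(200 * n_per_stratum / sum(n_per_stratum))  # proportional allocation to n = 200

    st <- strata(reports, stratanames = "cancer_type",
                 size = sizes, method = "srswor")  # simple random sampling without replacement
    sampled <- getdata(reports, st)

    # scores: long format, one quality score per report per model
    friedman.test(score ~ llm | report_id, data = scores)  # overall difference among the 4 LLMs

    # pairwise follow-up, e.g. DeepSeek R1 vs ChatGPT 4.1
    # (paired by report; rows assumed sorted identically within each model)
    wilcox.test(scores$score[scores$llm == "DeepSeek R1"],
                scores$score[scores$llm == "ChatGPT 4.1"],
                paired = TRUE)

With round(), the stratum sizes may not sum exactly to 200; an exact allocation rule is not stated in the abstract, so this proportional allocation is only one plausible reading.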

Key words: Large language model, Tumor marker, Report interpretation
