Journal of Peking University (Health Sciences) ›› 2018, Vol. 50 ›› Issue (2): 256-263. doi: 10.3969/j.issn.1671-167X.2018.02.010

• Article •

A customized method for information extraction from unstructured text data in the electronic medical records

BAO Xiao-yuan1,2, HUANG Wan-jing3, ZHANG Kai4, JIN Meng1,2, LI Yan2,5, NIU Cheng-zhi6△

  (1. Medical Informatics Center, Peking University, Beijing 100191, China; 2. National Clinical Service Data Center, Beijing 100191, China; 3. School of Mathematical Sciences, Peking University, Beijing 100871, China; 4. Peking University School of Basic Medical Science, Beijing 100191, China; 5. Department of Hospital Management, Peking University Health Science Center, Beijing 100191, China; 6. Department of Information, the First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, China)
  • Online: 2018-04-18 Published: 2018-04-18
  • Contact: NIU Cheng-zhi E-mail: nczfkb@126.com
  • Supported by:
    the Peking University Seed Fund for Medicine-Information Interdisciplinary Research Project (BMU20140434)

Abstract: Objective: Electronic medical records (EMRs) contain a large amount of diagnostic and treatment information that directly reflects clinicians' actual practice. Many EMR sections, such as chief complaints, history of present illness, past history, differential diagnosis, diagnostic imaging, and surgical records, are written as Chinese natural-language narrative. Extracting useful information from this narrative text and organizing it into a tabular form that can be analyzed is a difficult problem in Chinese medical data processing. This study aims to establish a method for doing so, for use in medical data processing and medical research. Methods: Based on real EMR narrative text from a tertiary hospital in China, we propose a customized method that combines the learning of extraction rules with rule-based information extraction. The method consists of three steps. (1) Sampling and annotation: the medical history sections (including history of present illness, past history, personal history, and family history) of 600 randomly sampled EMRs were taken as the raw corpus. Using a Chinese clinical narrative text annotation platform developed in this study, trained clinicians and nurses marked the tokens and phrases to be extracted, with history of diabetes as the example. (2) Rule induction and extraction: extraction templates were first summarized and induced from the annotated corpus, then rewritten as regular expressions in the Perl programming language to serve as extraction rules. With these rules as the knowledge base, we developed extraction packages in Perl to extract data from the EMR text, and organized the extracted items in tabular format for later use in clinical research or hospital surveillance. (3) Validation: the method was implemented on the National Clinical Service Data Integration Platform, and the extraction results were checked by a combination of manual and automated verification to assess the method's effectiveness. Results: The method was validated on site in the Department of Endocrinology of the hospital, using the medical history sections of patients diagnosed with diabetes. For the 1 436 patients discharged in 2015, diabetes history extraction achieved a recall of 87.6%, a precision of 99.5%, and an F-score of 0.93. For a 10% sample (1 223 patients) of all patients with diabetes in the same department with discharge dates up to August 2017, recall was 89.2%, precision was 99.2%, and the F-score was 0.94. Conclusion: This study combines natural language processing with rule-based information extraction to design and implement an algorithm that extracts customized information from unstructured Chinese EMR text. Its results compare favorably with existing work.
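The reported F-scores can be checked against the reported recall and the rate given alongside it. The sketch below assumes that the F-score is the balanced F1 (the harmonic mean of precision and recall) and that the rate reported alongside recall is precision; the abstract does not state the formula, so both are assumptions. The function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs as reported in the Results.
reported = {
    "2015 discharges (1 436 records)": (0.876, 0.995),
    "10% sample (1 223 records)": (0.892, 0.992),
}

for label, (recall, precision) in reported.items():
    print(f"{label}: F = {f_score(precision, recall):.2f}")
```

Under these assumptions the two cohorts give 0.93 and 0.94, matching the figures in the abstract.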

Key words: Medical records systems, computerized; Access to information; Diabetes mellitus; Medical history taking
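Step 2 of the Methods, in which induced templates are rewritten as regular-expression rules and the extracted items are organized as tabular rows, can be sketched as follows. This is a minimal illustration and not the authors' rule set: the paper's rules are written in Perl and are not given in the abstract, so the three patterns, the function `extract_diabetes_history`, and the column names are all hypothetical. Python's `re` module is used here because its syntax is similar to Perl's.

```python
import re

# Hypothetical extraction rules for "history of diabetes".
# Negated mention: a negation cue followed by "糖尿病" within the same clause.
NEGATION = re.compile(r"(?:否认|无)[^,。;,;]*?糖尿病")
# Positive mention with a duration in years, e.g. "糖尿病病史10余年".
DURATION = re.compile(r"糖尿病(?:病史|史)?\s*(?P<years>\d+)\s*余?年")
# Any other mention of diabetes.
MENTION = re.compile(r"糖尿病")


def extract_diabetes_history(text: str) -> dict:
    """Apply the rules in priority order and return one table row."""
    row = {"diabetes_history": None, "duration_years": None}
    if NEGATION.search(text):
        row["diabetes_history"] = False
        return row
    match = DURATION.search(text)
    if match:
        row["diabetes_history"] = True
        row["duration_years"] = int(match.group("years"))
    elif MENTION.search(text):
        row["diabetes_history"] = True
    return row


# Invented example narratives, not taken from any real record.
samples = [
    "患者既往糖尿病病史10余年,口服二甲双胍治疗。",
    "否认高血压、糖尿病、冠心病病史。",
    "发现血糖升高3年,诊断为2型糖尿病。",
    "既往体健。",
]
table = [extract_diabetes_history(s) for s in samples]
for row in table:
    print(row)
```

Checking the negation rule first keeps a denied history from being counted as a positive mention. A real rule base would need many more templates, which is the role the annotated corpus plays in Step 1.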

CLC Number:

  • R319