Journal of Peking University (Health Sciences), 2018, Vol. 50, Issue (2): 256-263. doi: 10.3969/j.issn.1671-167X.2018.02.010

• Article •

A customized method for information extraction from unstructured text data in the electronic medical records

BAO Xiao-yuan1,2, HUANG Wan-jing3, ZHANG Kai4, JIN Meng1,2, LI Yan2,5, NIU Cheng-zhi6△   

  1. Medical Informatics Center, Peking University, Beijing 100191, China; 2. National Clinical Service Data Center, Beijing 100191, China; 3. School of Mathematical Sciences, Peking University, Beijing 100871, China; 4. Peking University School of Basic Medical Science, Beijing 100191, China; 5. Department of Hospital Management, Peking University Health Science Center, Beijing 100191, China; 6. Department of Information, the First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, China
  • Online:2018-04-18 Published:2018-04-18
  • Contact: NIU Cheng-zhi E-mail:nczfkb@126.com
  • Supported by:
    Peking University Seed Fund for Medicine-Information Interdisciplinary Research Project (BMU20140434)

Abstract: Objective: Electronic medical records (EMRs) contain a large amount of diagnostic and treatment information that reflects the details of clinicians' actual practice. Many EMR sections, such as chief complaints, history of present illness, past history, differential diagnosis, diagnostic imaging, and surgical records, are written in Chinese natural language. Extracting useful information from this narrative text and organizing it into a tabular form suitable for medical research and real-world use of clinical data is a difficult problem in Chinese medical data processing. Methods: Based on narrative EMR text from a tertiary hospital in China, we propose a customized method that combines extraction-rule learning with rule-based information extraction. The method has three steps. (1) A random sample of 600 EMRs was drawn as the raw corpus, covering history of present illness, past history, personal history, family history, etc. Using a Chinese clinical narrative text annotation platform developed for this study, a trained clinician and nurses marked the tokens and phrases to be extracted, with diabetes history as the example. (2) Extraction templates were induced from the annotated corpus and rewritten as Perl regular expressions to serve as extraction rules. With these rules as the knowledge base, we developed Perl extraction packages to extract data from EMR text, and organized the extracted items in tabular format for later use in clinical research or hospital surveillance. (3) The method was implemented and evaluated on the National Clinical Service Data Integration Platform, and the extraction results were checked by a combination of manual and automated verification. Results: In the hospital's Department of Endocrinology, for the 1 436 patients with a diagnosis of diabetes who were discharged in 2015, extraction of diabetes history from the medical history section achieved a recall of 87.6%, a precision (reported as accuracy rate) of 99.5%, and an F-score of 0.93. For a 10% sample (1 223 patients) of all patients with diabetes discharged from the same department by August 2017, recall was 89.2%, precision was 99.2%, and the F-score was 0.94. Conclusion: This study combines natural language processing with rule-based information extraction to design and implement an algorithm for extracting customized information from unstructured Chinese EMR text. The results compare favorably with existing work.
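The rule-based extraction step and the reported F-scores can be illustrated with a minimal sketch. The study wrote its rules as Perl regular expressions, and the abstract does not reproduce them. The Python version below is therefore an assumption throughout: the three patterns, the field names, the function names, and the sample sentences are all hypothetical and chosen only to show the general shape of regex rules that turn narrative text into tabular rows. The `f_score` helper assumes the usual harmonic mean of precision and recall, which is consistent with the figures in the abstract.

```python
import re

# Hypothetical rules for illustration only; NOT the rules induced in the study.
# A mention is treated as negated when "否认" (denies) or "无" (no) precedes
# "糖尿病" (diabetes) inside the same clause, i.e. with no clause-ending
# punctuation in between.
NEGATED = re.compile(r"(?:否认|无)[^,。;,;]*糖尿病")
MENTION = re.compile(r"糖尿病")
# Captures N in phrases such as "糖尿病病史10年" (diabetes history of 10 years).
DURATION = re.compile(r"糖尿病(?:病史|史)?(\d+)余?年")


def extract_diabetes_history(record_id, text):
    """Return one tabular row (a dict) for a single history narrative."""
    negated = NEGATED.search(text) is not None
    mentioned = MENTION.search(text) is not None
    duration = DURATION.search(text)
    return {
        "record_id": record_id,
        "diabetes_history": mentioned and not negated,
        "duration_years": int(duration.group(1)) if duration and not negated else None,
    }


def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F-score)."""
    return 2 * precision * recall / (precision + recall)


# Made-up sample narratives, not taken from any real record.
samples = [
    ("r1", "既往糖尿病病史10年,口服二甲双胍治疗。"),
    ("r2", "否认高血压、糖尿病史。"),
    ("r3", "无明显诱因出现口干、多饮,诊断为糖尿病。"),
]
rows = [extract_diabetes_history(rid, text) for rid, text in samples]

# Consistency check against the precision and recall reported in the abstract.
f_2015 = round(f_score(0.995, 0.876), 2)
f_2017 = round(f_score(0.992, 0.892), 2)
```

In this sketch the third sample is not treated as negated because a comma separates "无" from "糖尿病". Real clinical text would need many more patterns than this, which is presumably why the study induced its templates from 600 annotated records.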

Key words: Medical records systems, computerized; Access to information; Diabetes mellitus; Medical history taking

CLC Number: 

  • R319