Article

A customized method for information extraction from unstructured text data in the electronic medical records

  • BAO Xiao-yuan ,
  • HUANG Wan-jing ,
  • ZHANG Kai ,
  • JIN Meng ,
  • LI Yan ,
  • NIU Cheng-zhi
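  • (1. Medical Informatics Center, Peking University, Beijing 100191, China; 2. National Clinical Service Data Center, Beijing 100191, China; 3. School of Mathematical Sciences, Peking University, Beijing 100871, China; 4. Peking University School of Basic Medical Science, Beijing 100191, China; 5. Department of Hospital Management, Peking University Health Science Center, Beijing 100191, China; 6. Department of Information, the First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, China)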

Published online: 2018-04-18

Funding

Supported by the Peking University Seed Fund for Medicine-Information Interdisciplinary Research Project (BMU20140434)

A customized method for information extraction from unstructured text data in the electronic medical records

  • BAO Xiao-yuan ,
  • HUANG Wan-jing ,
  • ZHANG Kai ,
  • JIN Meng ,
  • LI Yan ,
  • NIU Cheng-zhi
  • (1. Medical Informatics Center, Peking University, Beijing 100191, China; 2. National Clinical Service Data Center, Beijing 100191, China; 3. School of Mathematical Sciences, Peking University, Beijing 100871, China; 4. Peking University School of Basic Medical Science, Beijing 100191, China; 5. Department of Hospital Management, Peking University Health Science Center, Beijing 100191, China; 6. Department of Information, the First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, China)

Published online: 2018-04-18

Supported by

Supported by the Peking University Seed Fund for Medicine-Information Interdisciplinary Research Project (BMU20140434)

Abstract

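Objective: In electronic medical records (EMRs), sections such as the chief complaint, history of present illness, past history, differential diagnosis, imaging diagnosis, and surgical records are written mainly in Chinese natural language. They record how clinicians actually diagnose and treat patients and contain a large amount of detailed information. This study aims to establish a method that extracts useful information from these sections and organizes it into an analyzable form for medical data processing and medical research. Methods: Based on real EMR data from a hospital, we designed a customized method for rule learning and rule-based information extraction that extracts Chinese information in three steps: (1) sample annotation, in which the medical history sections (including history of present illness, past history, personal history, and family history) of 600 randomly selected EMRs were annotated, on an annotation platform developed in this study, for the information to be extracted (with diabetes history as the example); (2) induction of extraction templates from the annotations, followed by rewriting the templates as Perl regular-expression extraction rules that can be applied directly, and use of these rules for the actual extraction; (3) validation of the method by checking the extraction results with a combination of manual and automated verification. Results: The method has been implemented on the platform of the National Clinical Service Data Center and validated on site in a single hospital department for the extraction of diabetes history. For 1 436 records of patients with diabetes in 2015, extraction achieved a recall of 87.6%, a precision of 99.5%, and an F-score of 0.93; for a 10% sample of all patients with diabetes (1 223 records in total), it achieved a recall of 89.2%, a precision of 99.2%, and an F-score of 0.94. Conclusion: By combining natural language processing with rule-based information extraction, we designed and implemented an algorithm for extracting customized information from unstructured Chinese EMR text; its results compare favorably with existing work.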
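The reported F-scores can be checked against the reported recall and precision. The short Python sketch below assumes that "F-Score" means the balanced F1 score (the harmonic mean of precision and recall) and that the figure given as 准确率 ("accuracy rate") is precision; neither assumption is stated explicitly in the abstract. The function name `f_score` is introduced here for illustration only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# (label, precision, recall, F-score as reported in the abstract)
reported = [
    ("2015 cohort, 1 436 records", 0.995, 0.876, 0.93),
    ("10% sample, 1 223 records", 0.992, 0.892, 0.94),
]

for label, precision, recall, f_reported in reported:
    f1 = f_score(precision, recall)
    print(f"{label}: F1 = {f1:.2f} (reported {f_reported})")
```

Under these assumptions both computed values round to the reported figures, which supports reading the "accuracy rate" as precision.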

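Cite this article

BAO Xiao-yuan, HUANG Wan-jing, ZHANG Kai, JIN Meng, LI Yan, NIU Cheng-zhi. A customized method for information extraction from unstructured text data in the electronic medical records[J]. Journal of Peking University (Health Sciences), 2018, 50(2): 256-263. DOI: 10.3969/j.issn.1671-167X.2018.02.010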

Abstract

Objective: Electronic medical records (EMRs) contain a large amount of diagnostic and treatment information that concretely reflects clinicians' actual practice. Many EMR sections, such as chief complaints, history of present illness, past history, differential diagnosis, diagnostic imaging, and surgical records, describe the details of the clinical process in Chinese natural language. Extracting useful information from this Chinese narrative text and organizing it into tabular form, for medical research and for practical use of real-world clinical data, is a difficult problem in Chinese medical data processing. Methods: Based on EMR narrative text from a tertiary hospital in China, we propose a customized method that learns extraction rules and then performs rule-based information extraction. The method consists of three steps. (1) A random sample of 600 EMRs (including the history of present illness, past history, personal history, family history, etc.) was extracted as the raw corpus. Using a Chinese clinical narrative text annotation platform that we developed, trained clinicians and nurses marked the tokens and phrases to be extracted (with diabetes history as the example). (2) From the annotated corpus, extraction templates were first summarized and induced. These templates were then rewritten as Perl regular expressions to serve as extraction rules. With these rules as the knowledge base, we developed Perl packages for extracting data from EMR text. The extracted data items were organized in tabular format for later use in clinical research or hospital surveillance. (3) The method was implemented on the National Clinical Service Data Integration Platform, and the extraction results were checked by a combination of manual and automated verification to evaluate its effectiveness. Results: Among patients diagnosed with diabetes in the hospital's Department of Endocrinology, 1 436 were discharged in 2015; extraction of diabetes history from their medical history sections achieved a recall of 87.6%, a precision of 99.5%, and an F-score of 0.93. For a 10% sample of all patients with diabetes discharged from the same department as of August 2017 (1 223 patients in total), extraction achieved a recall of 89.2%, a precision of 99.2%, and an F-score of 0.94. Conclusion: This study combines natural language processing with rule-based information extraction to design and implement an algorithm for extracting customized information from unstructured Chinese EMR text. Its results compare favorably with those of existing work.
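Step 2 of the method, turning extraction templates into regular-expression rules and writing the matches out as a table, can be illustrated with a minimal sketch. The authors implemented their rules in Perl; this example uses Python's `re` module instead. The three patterns, the sample sentences, the field names, and the function `extract_diabetes_history` are all hypothetical, written for this illustration only, and are not taken from the paper's rule base.

```python
import csv
import re
import sys

# Hypothetical rules for this sketch only; NOT the rule base from the paper.
# Negated history, e.g. "否认高血压、糖尿病史" (denies hypertension and diabetes).
NEGATION = re.compile(r"(?:否认|无)[^,。;,;]{0,20}?糖尿病")
# Dated history, e.g. "10年前诊断为2型糖尿病" (type 2 diabetes diagnosed 10 years ago).
HISTORY = re.compile(
    r"(?P<duration>\d+)\s*(?P<unit>年|个月|月|周|天)(?:余)?前?"
    r"[^。;;]{0,15}?(?P<type>[12]型)?糖尿病"
)
# Bare mention of diabetes, used only as a fallback.
MENTION = re.compile(r"糖尿病")

FIELDS = ["record_id", "diabetes_history", "duration", "unit", "type"]


def extract_diabetes_history(record_id: str, text: str) -> dict:
    """Apply the rules in priority order and return one tabular row."""
    row = {"record_id": record_id, "diabetes_history": None,
           "duration": None, "unit": None, "type": None}
    if NEGATION.search(text):
        row["diabetes_history"] = False
        return row
    match = HISTORY.search(text)
    if match:
        row.update(diabetes_history=True,
                   duration=int(match["duration"]),
                   unit=match["unit"],
                   type=match["type"])
    elif MENTION.search(text):
        row["diabetes_history"] = True
    return row


# Made-up sample sentences standing in for EMR history sections.
samples = [
    ("r1", "患者10年前诊断为2型糖尿病,口服二甲双胍治疗。"),
    ("r2", "否认高血压、糖尿病史。"),
    ("r3", "既往体健。"),
]

# Write the extracted items as CSV rows (empty cells where nothing matched).
writer = csv.DictWriter(sys.stdout, fieldnames=FIELDS)
writer.writeheader()
for record_id, text in samples:
    writer.writerow(extract_diabetes_history(record_id, text))
```

Checking the negation rule first keeps a denied history from being counted as a positive mention. A production rule base would need many more templates, induced from the annotated corpus as the abstract describes.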