Journal of Peking University (Health Sciences), 2018, Vol. 50, Issue 2: 352-357. doi: 10.3969/j.issn.1671-167X.2018.02.025

• Technical Methods •

Construction of chemical information database based on optical structure recognition technique

LV Chuan-yu, LI Ming-na, ZHANG Liang-ren, LIU Zhen-ming△   

  1. (State Key Laboratory of Natural and Biomimetic Drugs, Peking University School of Pharmaceutical Sciences, Beijing 100191, China)
  • Online:2018-04-18 Published:2018-04-18
  • Contact: LIU Zhen-ming E-mail:zmliu@bjmu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (21772005, 21572010) and the Peking University Seed Fund for Medicine-Information Interdisciplinary Research Project (BMU20160579)

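Summary: Objective: To establish a workflow for building a chemical knowledge base from key information extracted from the research literature. Methods: Compound structures were extracted with name conversion technology and optical structure recognition software; bibliographic records were obtained with the reference manager EndNote X8; information inside the documents was extracted with the machine learning tool ChemDataExtractor and by manual annotation; predictable properties were obtained with the computational simulation platform Pipeline Pilot 7.5; and known bioactivities were obtained by linking to the open database ChEMBL. Results: A reasonable and efficient strategy for constructing chemical knowledge bases was established and used to build the Peking University marine natural product database PKU-MNPD. Conclusion: The proposed data aggregation strategy improves the efficiency of chemical knowledge base construction, and because it works from the original literature, the resulting database is accurate, comprehensive, and easy to search.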

Abstract: Objective: To create a protocol for constructing a chemical information database from the scientific literature quickly and automatically. Methods: Scientific literature, patents, and technical reports from different chemical disciplines were collected and stored in PDF format as the fundamental datasets. Chemical structures in published documents and images were converted to machine-readable data using name conversion technology and the optical structure recognition tool CLiDE. During molecular structure extraction, Markush structures were enumerated into well-defined discrete molecules by means of QueryTools in the molecule editor ChemDraw. The document management software EndNote X8 was used to acquire bibliographic records, including title, author, journal, and year of publication. The text-mining toolkit ChemDataExtractor was used to retrieve information from figures, tables, and text paragraphs for populating the structured chemical database. The extracted data were then manually revised and annotated in detail to ensure accuracy and completeness. In addition to the literature data, the computational simulation platform Pipeline Pilot 7.5 was used to calculate physicochemical properties and predict molecular attributes. Furthermore, the open database ChEMBL was linked to fetch known bioactivities, such as indications and targets. After information extraction and data expansion, five separate metadata files were generated: the molecular structure data file, molecular information, bibliographic references, predictable attributes, and known bioactivities. With the canonical simplified molecular input line entry specification (SMILES) string as the primary key, the metadata files were associated through common key nodes, including the molecule number and PDF number, to construct an integrated chemical information database.
Results: A reasonable protocol for constructing a chemical information database was established. A total of 174 research articles and 25 reviews published in Marine Drugs from January 2015 to June 2016 were collected as the essential data source, and an elementary marine natural product database named PKU-MNPD, containing 3,262 molecules and 19,821 records, was built in accordance with this protocol. Conclusion: Because it works from original documents, this data aggregation protocol improves the accuracy, comprehensiveness, and efficiency of chemical information database construction. The structured chemical information database can facilitate access to medical intelligence and accelerate the translation of scientific research achievements.

Key words: Scientific literature, Optical structure recognition, Data mining, Chemical information database
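
The linkage step described in the Methods (five metadata files keyed by canonical SMILES and associated through the molecule number and PDF number) can be illustrated with a small sketch. The abstract does not state which storage engine or schema the authors used, so everything here is an assumption: SQLite is swapped in for convenience, and the table names, column names, and the single demo row are hypothetical placeholders, not data from PKU-MNPD.

```python
import sqlite3


def build_database(conn):
    """Create five tables mirroring the five metadata files (hypothetical schema)."""
    conn.executescript("""
        CREATE TABLE structure (            -- molecular structure data file
            canonical_smiles TEXT PRIMARY KEY,
            mol_no INTEGER UNIQUE NOT NULL
        );
        CREATE TABLE reference (            -- bibliographic references
            pdf_no INTEGER PRIMARY KEY,
            title TEXT, journal TEXT, year INTEGER
        );
        CREATE TABLE molecule_info (        -- information extracted from the document
            mol_no INTEGER REFERENCES structure(mol_no),
            pdf_no INTEGER REFERENCES reference(pdf_no),
            source_organism TEXT
        );
        CREATE TABLE predicted_attribute (  -- predictable attributes
            mol_no INTEGER REFERENCES structure(mol_no),
            name TEXT, value REAL
        );
        CREATE TABLE known_bioactivity (    -- known bioactivities
            mol_no INTEGER REFERENCES structure(mol_no),
            target TEXT, indication TEXT
        );
    """)


def load_demo_rows(conn):
    """Insert one placeholder molecule (ethanol) so the joins can be exercised."""
    conn.execute("INSERT INTO structure VALUES ('CCO', 1)")
    conn.execute(
        "INSERT INTO reference VALUES (1, 'Placeholder article title', 'Marine Drugs', 2015)"
    )
    conn.execute("INSERT INTO molecule_info VALUES (1, 1, 'placeholder organism')")
    conn.execute("INSERT INTO predicted_attribute VALUES (1, 'molecular_weight', 46.07)")
    conn.execute(
        "INSERT INTO known_bioactivity VALUES (1, 'placeholder target', 'placeholder indication')"
    )


def records_for(conn, smiles):
    """Look up a molecule by canonical SMILES and follow mol_no / pdf_no to the other tables."""
    cur = conn.execute(
        """
        SELECT s.mol_no, r.title, r.year, p.name, p.value, b.target
        FROM structure AS s
        JOIN molecule_info AS m ON m.mol_no = s.mol_no
        JOIN reference AS r ON r.pdf_no = m.pdf_no
        LEFT JOIN predicted_attribute AS p ON p.mol_no = s.mol_no
        LEFT JOIN known_bioactivity AS b ON b.mol_no = s.mol_no
        WHERE s.canonical_smiles = ?
        """,
        (smiles,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_database(connection)
    load_demo_rows(connection)
    for row in records_for(connection, "CCO"):
        print(row)
```

Using the canonical SMILES string as the primary key means a structure that appears in several papers collapses to one row in the structure table, while the molecule number and PDF number carry the many-to-many link between molecules and source documents.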
