北京大学学报(医学版) ›› 2018, Vol. 50 ›› Issue (2): 368-374. doi: 10.3969/j.issn.1671-167X.2018.02.028

• 论著 • 上一篇    下一篇

一种基于图塌缩的药物分子检索方法

瞿经纬1,吕肖庆1,2△,刘振明3,廖媛1,孙鹏晖1,王蓓1,汤帜1,2   

  1. (1. 北京大学计算机科学技术研究所, 北京100080; 2. 数字出版技术国家重点实验室, 北京100080; 3. 北京大学药学院药物化学系, 北京100191)
  • 出版日期:2018-04-18 发布日期:2018-04-18
  • 通讯作者: 吕肖庆 E-mail:lvxiaoqing@pku.edu.cn
  • 基金资助:
     国家自然科学基金(61573028、61673029)、新闻出版业科技与标准重点实验室(新闻出版智能媒体技术重点实验室)和北京大学医学-信息科学交叉学科种子基金项目(BMU20160579)资助

A retrieval method of drug molecules based on graph collapsing

QU Jing-wei1, LV Xiao-qing1,2△, LIU Zhen-ming3, LIAO Yuan1, SUN Peng-hui1, WANG Bei1, TANG Zhi1,2   

  1. (1.Institute of Computer Science & Technology, Peking University, Beijing 100080, China; 2. State Key Laboratory of Digital Publishing Technology, Beijing 100080, China; 3. Department of Medical Chemistry, Peking University School of Pharmaceutical Sciences, Beijing 100191, China)
  • Online:2018-04-18 Published:2018-04-18
  • Contact: LV Xiao-qing E-mail:lvxiaoqing@pku.edu.cn
  • Supported by:
    Supported by National Natural Science Foundation of China (61573028, 61673029), Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology) and Peking University Seed Fund for Medicine-Information Interdisciplinary Research Project (BMU20160579)

摘要: 目的:为实现准确、高效的医药信息查询,本文探索了一种基于图结构的药物分子检索方法。方法:基于图结构的药物分子检索方法以接收智能终端的拍照或手绘作为输入,并将输入的结构式形式化为相应图结构,基于对图匹配效率的直接影响因素的分析,建立了结构式的一种紧凑有效的超图表示形式,其依据结构式的特点结合了子图匹配与频繁子图挖掘等方法对大图进行多级塌缩。为避免塌缩过程中子图交叠问题阻碍超图的准确构建,引入一种基于图同构的算法,借助子图之间交叠情况的分析,选择占优子图,利用多维度信息完成精确的分子匹配。结果:为证明检索方法的有效性,将本文检索方法和Wikipedia Chemical Structure Explorer(WCSE)进行检索准确率的对比,结果表明,本文方法的检索准确率更高,前10个检索结果的MAP(mean average precision)、DCG(discounted cumulative gain)、RBP(rank-biased precision)和ERR(expected reciprocal rank)四个指标均高于WCSE。上述指标的领先幅度分别为10%、1.41、6.42%、1.32%。进一步通过两个系统的具体检索结果实例对检索效果进行直观对比,发现本文方法在药物分子检索有效性方面更具优势,能为用户提供更为满意的检索结果。结论:本研究提出的基于图结构相似度的药物分子检索方法能够实现较为理想的检索结果,实验证明本检索系统具有可行性和有效性。

关键词: 信息存储和检索, 分子结构, 图结构, 超图, 算法

Abstract: Objective: To establish a compact and efficient hypergraph representation and a graph-similarity-based retrieval method of molecules to achieve effective and efficient medicine information retrieval. Methods: Chemical structural formula (CSF) was a primary search target as a unique and precise identifier for each compound at the molecular level in the research field of medicine information retrieval. To retrieve medicine information effectively and efficiently, a complete workflow of the graph-based CSF retrieval system was introduced. This system accepted the photos taken from smartphones and the sketches drawn on tablet personal computers as CSF inputs, and formalized the CSFs with the corresponding graphs. Then this paper proposed a compact and efficient hypergraph representation for molecules on the basis of analyzing factors that directly affected the efficiency of graph matching. According to the characteristics of CSFs, a hierarchical collapsing method combining graph isomorphism and frequent subgraph mining was adopted. There was yet a fundamental challenge, subgraph overlapping during the collapsing procedure, which hindered the method from establishing the correct compact hypergraph of an original CSF graph. Therefore, a graph-isomorphism-based algorithm was proposed to select dominant acyclic subgraphs on the basis of overlapping analysis. Finally, the spatial similarity among graphical CSFs was evaluated by multi-dimensional measures of similarity. Results: To evaluate the performance of the proposed method, the proposed system was firstly compared with Wikipedia Chemical Structure Explorer (WCSE), the state-of-the-art system that allowed CSF similarity searching within Wikipedia molecules dataset, on retrieval accuracy. The system achieved higher values on mean average precision, discounted cumulative gain, rank-biased precision, and expected reciprocal rank than WCSE from the top-2 to the top-10 retrieved results. Specifically, the system achieved 10%, 1.41, 6.42%, and 1.32% higher than WCSE on these metrics for top-10 retrieval results, respectively. Moreover, several retrieval cases were presented to intuitively compare with WCSE. The results of the above comparative study de-monstrated that the proposed method outperformed the existing method with regard to accuracy and effectiveness. Conclusion: This paper proposes a graph-similarity-based retrieval approach for medicine information. To obtain satisfactory retrieval results, an isomorphism-based algorithm is proposed for dominant subgraph selection based on the subgraph overlapping analysis, as well as an effective and efficient hyper-graph representation of molecules. Experiment results demonstrate the effectiveness of the proposed approach.

Key words: Information storage and retrieval, Molecular structure, Graph structure, Hypergraph, Algorithms

中图分类号: 

  •  
[1] 王斯维,黎敏,杨慧芳,赵一姣,王勇,刘怡. 3种生成大视野锥形束CT数据正中矢状面方法的比较[J]. 北京大学学报(医学版), 2016, 48(2): 330-335.
[2] 熊玉雪, 杨慧芳, 赵一姣, 王勇. 两种评价面部三维表面数据不对称度方法的比较[J]. 北京大学学报(医学版), 2015, 47(2): 340-343.
[3] 刘文龙, 王路漫, 贺东奇, 张天蓝, 苟宝迪, 李庆. 寡糖分子结构及其分形[J]. 北京大学学报(医学版), 2014, 46(5): 739-743.
[4] 姚艺桑, 高凌, 李玉玲, 马少丽, 吴子媺, 谈宁芝, 吴建勇, 倪陆群, 朱佳石. 丰度加权法分析冬虫夏草RAPD多态性高度差异及动态变化[J]. 北京大学学报(医学版), 2014, 46(4): 618-628.
[5] 霍长虹, 梁鸿, 林文翰, 赵玉英. 苯骈噁嗪酮类化合物的结构特征和谱学规律[J]. 北京大学学报(医学版), 2006, 38(3): 321-323.
[6] 郑璐, 吴刚, 王邠, 吴立军, 赵玉英. 合欢皂苷及苷元的分离鉴定[J]. 北京大学学报(医学版), 2004, 36(4): 421-425.
[7] 邹坤, 王邠, 赵玉英, 郑俊华, 张如意. 合欢皮中一个新的八糖苷[J]. 北京大学学报(医学版), 2004, 36(1): 18-20.
[8] 赵明, 王超, 彭师奇. 寡肽药物先导结构的发现与优化[J]. 北京大学学报(医学版), 2002, 34(5): 506-512.
[9] 王东升, 邱晓彦, 朱晓辉, 吕蓬, 姜南, 吕平, 张玲, 张岩, 高晓明. 人宫颈癌传代细胞Ig样蛋白的纯化分析[J]. 北京大学学报(医学版), 2000, 32(4): 310-312.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 田增民, 陈涛, Nanbert ZHONG, 李志超, 尹丰, 刘爽. 神经干细胞移植治疗遗传性小脑萎缩的临床研究(英文稿)[J]. 北京大学学报(医学版), 2009, 41(4): 456 -458 .
[2] 郭岩, 谢铮. 用一代人时间弥合差距——健康社会决定因素理论及其国际经验[J]. 北京大学学报(医学版), 2009, 41(2): 125 -128 .
[3] 成刚, 钱振华, 胡军. 艾滋病项目自愿咨询检测的技术效率分析[J]. 北京大学学报(医学版), 2009, 41(2): 135 -140 .
[4] 卢恬, 朱晓辉, 柳世庆, 郑杰, 邱晓彦. 白细胞介素2促进宫颈癌细胞系HeLaS3免疫球蛋白G的表达[J]. 北京大学学报(医学版), 2009, 41(2): 158 -161 .
[5] 袁惠燕, 张苑, 范田园. 离子交换型栓塞微球及其载平阳霉素的制备与性质研究[J]. 北京大学学报(医学版), 2009, 41(2): 217 -220 .
[6] 徐莉, 孟焕新, 张立, 陈智滨, 冯向辉, 释栋. 侵袭性牙周炎患者血清中抗牙龈卟啉单胞菌的IgG抗体水平的研究[J]. 北京大学学报(医学版), 2009, 41(1): 52 -55 .
[7] 董稳, 刘瑞昌, 刘克英, 关明, 杨旭东. 氯诺昔康和舒芬太尼用于颌面外科术后自控静脉镇痛的比较[J]. 北京大学学报(医学版), 2009, 41(1): 109 -111 .
[8] 祁琨, 邓芙蓉, 郭新彪. 纳米二氧化钛颗粒对人肺成纤维细胞缝隙连接通讯的影响[J]. 北京大学学报(医学版), 2009, 41(3): 297 -301 .
[9] Jian-wei GU, Emily YOUNG, Zhi-jun PAN, Kevan B. TUCKER, Megan SHPARAGO, Min HUANG, Amelia Purser BAILEY. SD大鼠长期高盐饮食可导致其高血压并改变肾细胞因子基因表达谱[J]. 北京大学学报(医学版), 2009, 41(5): 505 -515 .
[10] 李宏亮*, 安卫红*, 赵扬玉, 朱曦. 妊娠合并高脂血症性胰腺炎行血液净化治疗1例[J]. 北京大学学报(医学版), 2009, 41(5): 599 -601 .