Journal of Peking University (Health Sciences), 2021, Vol. 53, Issue 3: 566-572. doi: 10.19723/j.issn.1671-167X.2021.03.021

• Article •

Prediction of intensive care unit readmission for critically ill patients based on ensemble learning

LIN Yu1,2, WU Jing-yi3, LIN Ke1,2, HU Yong-hua2,4, KONG Gui-lan1,3,Δ

  1. National Institute of Health Data Science, Peking University, Beijing 100191, China
  2. Department of Epidemiology and Biostatistics, Peking University School of Public Health, Beijing 100191, China
  3. Advanced Institute of Information Technology, Peking University, Hangzhou 311215, China
  4. Peking University Medical Informatics Center, Beijing 100191, China
  • Received: 2019-07-05 Online: 2021-06-18 Published: 2021-06-16
  • Contact: Gui-lan KONG E-mail: guilan.kong@hsc.pku.edu.cn
  • Supported by:
    National Natural Science Foundation of China (81771938); National Natural Science Foundation of China (91846101); Beijing Municipal Natural Science Foundation (7212201); Project of the University of Michigan Health System-Peking University Health Science Center Joint Institute for Translational and Clinical Research (BMU2020JI011)

Abstract:

Objective: To develop risk prediction models for intensive care unit (ICU) readmission using ensemble learning algorithms, and to compare the predictive performance of the models.

Methods: The publicly accessible American ICU database Medical Information Mart for Intensive Care (MIMIC)-Ⅲ was used as the data source, and patients were selected according to the inclusion and exclusion criteria. Variables that might predict the outcome, including demographics, vital signs, laboratory tests, and comorbidities, were extracted from the dataset. ICU readmission prediction models were built with three ensemble learning methods, namely random forest, adaptive boosting (AdaBoost), and gradient boosting decision tree (GBDT), and their predictive performance was compared with that of a conventional Logistic regression model. Five-fold cross validation was used to train and validate the models. Average sensitivity, positive predictive value, negative predictive value, false positive rate, false negative rate, area under the receiver operating characteristic curve (AUROC), and Brier score were used as performance measures. The top 10 predictive variables by importance ranking were then identified from the best-performing model.

Results: In terms of AUROC, GBDT (AUROC=0.858) performed better than random forest (AUROC=0.827) and was slightly superior to AdaBoost (AUROC=0.851). Compared with Logistic regression (AUROC=0.810), the three ensemble learning models showed much better discrimination. The feature importance provided by GBDT showed that the top-ranking variables included vital signs and laboratory tests. The patients with ICU readmission had higher mean arterial pressure, systolic blood pressure, diastolic blood pressure, and heart rate than the patients without ICU readmission, and they also had lower urine output and higher serum creatinine. Overall, the patients readmitted to the ICU during their hospitalization showed worse cardiovascular and renal function than the patients without ICU readmission.

Conclusion: The ensemble learning based ICU readmission prediction models performed better than the Logistic regression model. Such models have the potential to help ICU physicians identify patients at high risk of ICU readmission, so that interventions can be targeted at these patients to improve overall clinical outcomes.

Key words: Intensive care units, Patient readmission, Machine learning, Predictive value of tests
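The abstract reports each measure as an average over five-fold cross validation. The sketch below illustrates one way such folds could be formed and summarized, using only the Python standard library. Stratifying the folds by outcome class, the fixed random seed, the use of the sample standard deviation, and the names `stratified_kfold` and `summarize` are all assumptions made for illustration; the article does not state these details.

```python
import random
from statistics import mean, stdev


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep the class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


def summarize(scores):
    """Format per-fold scores as mean±sample standard deviation."""
    return f"{mean(scores):.3f}±{stdev(scores):.3f}"


# Toy labels: 20 positives and 80 negatives (hypothetical, not MIMIC-III data).
labels = [1] * 20 + [0] * 80
splits = list(stratified_kfold(labels, k=5))
fold_positive_rates = [mean(labels[i] for i in test) for _, test in splits]
```

In a real analysis, each `train_idx` would be used to fit a model and each `test_idx` to score it; `summarize` then condenses the five per-fold scores into a single "mean±sd" entry.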

CLC Number: R459.7

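Table 1 Performance of random under-sampling and NearMiss in handling imbalanced data

| Classifier | Accuracy, random under-sampling | Accuracy, NearMiss |
| --- | --- | --- |
| Logistic regression | 0.615 | 0.838 |
| Random forest | 0.542 | 0.844 |
| AdaBoost | 0.620 | 0.873 |
| GBDT | 0.626 | 0.874 |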
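To make the two resampling strategies compared in Table 1 concrete, here is a minimal standard-library sketch. It assumes the NearMiss-1 variant, which keeps the majority-class samples with the smallest mean distance to their k nearest minority-class samples. The article does not say which NearMiss variant or which value of k was used, and the function names and toy points are hypothetical.

```python
import math
import random


def random_under_sample(majority, minority, seed=0):
    """Keep a random subset of the majority class, equal in size to the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority))


def near_miss_1(majority, minority, k=3):
    """Keep the majority points whose mean distance to their k nearest minority points is smallest."""
    def mean_knn_distance(point):
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_knn_distance)
    return ranked[:len(minority)]


# Hypothetical 2-D toy data: the minority class sits near the origin.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0), (9.0, 9.0), (2.0, 0.0)]

kept_random = random_under_sample(majority, minority)
kept_near = near_miss_1(majority, minority, k=3)
```

Random under-sampling discards majority samples without regard to where they lie, whereas NearMiss-1 keeps those closest to the minority class, that is, the cases near the class boundary.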

Figure 1 Recursive feature elimination based on Logistic regression

Figure 2 Schematic of the logit function
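The figure itself is not reproduced here. For orientation, the standard definition of the logit function and the Logistic regression model it implies can be written as follows. The symbols $p$ (probability of ICU readmission), $x_1, \dots, x_m$ (predictor variables), and $\beta_0, \dots, \beta_m$ (regression coefficients) are notation assumed for this sketch, not taken from the article.

$$
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m)}}
$$

The logit maps a probability in $(0, 1)$ to the whole real line, and its inverse, the logistic (sigmoid) function, maps the linear predictor back to a probability.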

Figure 3 Schematic of a decision tree

Table 2 Baseline characteristics of critically ill patients with and without ICU readmission

| Characteristics | Total | Patients with ICU readmission | Patients without ICU readmission | P value |
| --- | --- | --- | --- | --- |
| Age$^{b}$/years, $\bar{x} \pm s$ | 62.67±16.22 | 64.08±15.16 | 62.58±16.29 | <0.001 |
| Gender (female)$^{a}$, n (%) | 11 319 (42.38) | 671 (41.69) | 10 648 (42.49) | 0.16 |
| ICU stay$^{b}$/d, $\bar{x} \pm s$ | 4.56±5.60 | 5.46±6.12 | 4.50±5.56 | <0.001 |
| GCS total score$^{b}$, $\bar{x} \pm s$ | 14.33±1.42 | 14.17±1.58 | 14.34±1.41 | <0.001 |
| GCS motor$^{b}$, $\bar{x} \pm s$ | 5.89±0.50 | 5.88±0.50 | 5.89±0.50 | 0.08 |
| GCS verbal$^{b}$, $\bar{x} \pm s$ | 4.41±1.28 | 4.40±1.18 | 4.41±1.29 | <0.001 |
| GCS eyes$^{b}$, $\bar{x} \pm s$ | 3.75±0.54 | 3.73±0.56 | 3.75±0.54 | 0.10 |
| Admission type$^{a}$, n (%) | | | | <0.001 |
| Medical | 18 208 (68.17) | 1 054 (63.92) | 17 154 (68.45) | |
| Scheduled surgery | 3 620 (13.55) | 179 (10.86) | 3 441 (13.73) | |
| Unscheduled surgery | 4 881 (18.27) | 416 (25.23) | 4 465 (17.82) | |

ICU, intensive care unit; GCS, Glasgow Coma Scale.

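Table 3 Performance of the Logistic regression, random forest, AdaBoost, and GBDT models in predicting ICU readmission

| Performance | Logistic regression | Random forest | AdaBoost | GBDT |
| --- | --- | --- | --- | --- |
| Sensitivity | 0.763±0.029 | 0.787±0.010 | 0.821±0.029 | 0.817±0.028 |
| PPV | 0.843±0.022 | 0.858±0.056 | 0.876±0.040 | 0.892±0.038 |
| NPV | 0.784±0.018 | 0.802±0.012 | 0.832±0.020 | 0.832±0.017 |
| FPR | 0.143±0.026 | 0.134±0.057 | 0.120±0.044 | 0.101±0.040 |
| FNR | 0.237±0.029 | 0.213±0.010 | 0.179±0.029 | 0.183±0.028 |
| AUROC | 0.810±0.013 | 0.827±0.029 | 0.851±0.017 | 0.858±0.013 |
| Brier score | 0.190±0.013 | 0.173±0.029 | 0.149±0.017 | 0.142±0.013 |

PPV, positive predictive value; NPV, negative predictive value; FPR, false positive rate; FNR, false negative rate; AUROC, area under the receiver operating characteristic curve.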
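The measures in Table 3 follow standard definitions, which the sketch below computes from true labels and predicted probabilities using only the Python standard library. The 0.5 classification threshold, the function names, and the toy data are assumptions for illustration; the article does not state the threshold it used. AUROC is computed here with the pairwise (Mann-Whitney) formulation, counting ties as one half.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, PPV, NPV, FPR, and FNR at a given probability threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def auroc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy predictions (1 = readmitted to the ICU), not study data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

metrics = classification_metrics(y_true, y_prob)
brier = brier_score(y_true, y_prob)
auc = auroc(y_true, y_prob)
```

By these definitions the false negative rate equals one minus the sensitivity, which matches the paired rows in Table 3 (for example, 0.237 and 0.763 for Logistic regression).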

Table 4 Top 10 variables ranked by importance in the GBDT model and their distributions

| Rank | Variables | Importance | With ICU readmission, $\bar{x} \pm s$ | Without ICU readmission, $\bar{x} \pm s$ |
| --- | --- | --- | --- | --- |
| 1 | Platelet/(×10³/μL) | 0.109 | 240.05±154.37 | 230.05±131.88 |
| 2 | Glucose (maximum)/(mg/dL) | 0.093 | 208.36±109.04 | 199.68±104.80 |
| 3 | Urine output/mL | 0.089 | 1 908.70±1 111.36 | 2 089.51±1 129.90 |
| 4 | Mean arterial pressure (maximum)/mmHg | 0.067 | 122.04±43.03 | 116.87±38.68 |
| 5 | Glucose (minimum)/(mg/dL) | 0.058 | 96.06±33.03 | 99.83±35.78 |
| 6 | Heart rate (maximum)/(beat/min) | 0.058 | 114.38±25.07 | 109.35±23.49 |
| 7 | Creatinine/(mg/dL) | 0.053 | 1.33±1.24 | 1.19±1.23 |
| 8 | Systolic blood pressure (maximum)/mmHg | 0.042 | 160.42±29.69 | 158.04±28.14 |
| 9 | Diastolic blood pressure (minimum)/mmHg | 0.035 | 40.47±12.59 | 41.99±12.85 |
| 10 | Heart rate (minimum)/(beat/min) | 0.031 | 68.79±15.12 | 68.30±14.46 |