Prostate Cancer Risk Prediction and Online Calculation Based on Machine Learning Algorithm

doi:10.24920/004086

Abstract:

Objective To build a prostate cancer (PCa) risk prediction model from common clinical indicators using machine learning, to provide a scientific basis for the early diagnosis and treatment of PCa, and to evaluate the value of artificial intelligence (AI) technology on healthcare data platforms.
Methods The prostate cancer data set from the Population Health Data Archive was preprocessed, and smoothly clipped absolute deviation (SCAD) was used to select features. Random forest (RF), support vector machine (SVM), back-propagation neural network (BP), and convolutional neural network (CNN) models were used to predict the risk of PCa; the two neural networks (BP and CNN) were fitted on data augmented with SMOTE. Model performance was compared using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. After the optimal model was selected, Shiny was used to develop an online calculator for PCa risk prediction based on the predictive indicators.
Results Inorganic phosphorus, triglycerides, and calcium were closely related to PCa, in addition to the volume of fragmented biopsy tissue and free prostate-specific antigen (fPSA). Among the four models, RF had the best performance in predicting PCa (accuracy: 96.80%; AUC: 0.975, 95% CI: 0.964-0.986), followed by BP (accuracy: 85.36%; AUC: 0.892, 95% CI: 0.849-0.934) and SVM (accuracy: 82.67%; AUC: 0.824, 95% CI: 0.805-0.844). CNN performed worst (accuracy: 72.37%; AUC: 0.724, 95% CI: 0.670-0.779). An online platform for PCa risk prediction was developed based on the RF model and the predictive indicators.
Conclusions This study demonstrated the application value of traditional machine learning and deep learning models for disease risk prediction on healthcare data platforms and proposed a new approach to PCa risk prediction in patients who are suspected of having PCa and have undergone core needle biopsy. In addition, the online calculator may enhance the practicability of AI prediction technology and facilitate medical diagnosis.

Key words: prostate cancer, random forest, support vector machine, back-propagation neural network, convolutional neural network
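
The SMOTE augmentation named in the Methods can be illustrated with a minimal sketch. This is not the authors' code: the function name `smote`, the neighbour count `k`, the seed, and the two-feature toy data are all hypothetical and chosen only to show the core idea, which is to create a synthetic minority-class sample by interpolating between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Illustrative sketch only; k and seed are assumed values, not study settings.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        # The new point lies on the segment between base and neighbour
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two made-up features (not study data)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, it always falls inside the range spanned by the original minority class. In practice, oversampling of this kind is normally applied to the training split only, so that evaluation data remain untouched.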

Chun Wang, Qinxue Chang, Xiaomeng Wang, Keyun Wang, He Wang, Zhuang Cui, Changping Li. Prostate Cancer Risk Prediction and Online Calculation Based on Machine Learning Algorithm[J]. Chinese Medical Sciences Journal, 2022, 37(3): 210-217.

Indicator	Abbreviation
Demographic information
Age	-
Height	-
Weight	-
Body mass index	BMI
Prostate indicators
Free prostate specific antigen	fPSA
Total prostate specific antigen	tPSA
Free prostate-specific antigen ratio	rPSA
Volume of core-biopsy sampled tissue	CBV
Off-white fragile tissue in biopsied samples	OFT
Transurethral resection of the prostate	TURP
Presence of accompanying prostatic hyperplasia	-
Volume of prostate	PV
Serum enzymatic examination
Alkaline phosphatase	ALP
Creatine kinase isoenzyme	CK_MB
Lactate dehydrogenase	LDH
Creatine kinase	CK
Blood biochemical indicators
Creatinine	CR
Albumin	ALB
Serum uric acid	UA
Triglyceride	TG
High density lipoprotein cholesterol	HDLC
Low density lipoprotein cholesterol	LDLC
Apolipoprotein A1	ApoA1
Apolipoprotein B	ApoB
Estimated glomerular filtration rate	eGFR
Electrolyte indicators
Potassium	K
Inorganic phosphorus	iP
Sodium	Na
Inorganic calcium	iCa
Chloride	CL
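
The abstract names SCAD as the method used to screen feature indicators. As a rough illustration of what the SCAD penalty does to a single regression coefficient, the sketch below implements the standard piecewise penalty. It is not the authors' implementation: the function name `scad_penalty` is hypothetical, and the shape parameter `a = 3.7` is the commonly used default from the SCAD literature, not a value reported in this study.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for one coefficient theta with tuning parameter lam.

    a = 3.7 is an assumed default; the study does not report its settings.
    """
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear penalty as the lasso
        return lam * t
    if t <= a * lam:
        # Medium coefficients: quadratic transition that tapers the penalty
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further
    return lam * lam * (a + 1) / 2


for coef in (0.5, 1.0, 2.0, 3.7, 10.0):
    print(coef, round(scad_penalty(coef, lam=1.0), 4))
```

The penalty is continuous at both breakpoints and flat beyond `a * lam`, which is why SCAD can set weak predictors to zero while leaving strong predictors nearly unbiased.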

Actual	Predicted Non-PCa	Predicted PCa	Row total
RF model
Non-PCa	149	3	152
PCa	5	93	98
Column total	154	96	250
SVM model
Non-PCa	141	26	167
PCa	26	107	133
Column total	167	133	300
BP model
Non-PCa	264	47	311
PCa	19	270	289
Column total	283	317	600
CNN model
Non-PCa	74	21	95
PCa	35	71	106
Column total	109	92	201
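
As a worked example of how accuracy, precision, and recall follow from a confusion matrix, the sketch below uses the RF counts tabulated above, reading rows as actual classes and columns as predicted classes with PCa as the positive class. The function name `classification_metrics` is hypothetical and the code is only an illustration, not the authors' analysis script.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall from 2x2 confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all cases
        "precision": tp / (tp + fp),     # true PCa among predicted PCa
        "recall": tp / (tp + fn),        # detected PCa among actual PCa
    }


# RF counts from the table: actual Non-PCa row (149, 3), actual PCa row (5, 93)
rf = classification_metrics(tn=149, fp=3, fn=5, tp=93)
for name, value in rf.items():
    print(f"{name}: {value * 100:.2f}%")
# accuracy: 96.80%
# precision: 96.88%
# recall: 94.90%
```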

Model	Accuracy (%)	Precision (%)	Recall (%)	AUC (95%CI)
RF	96.80	96.88	94.90	0.975 (0.964, 0.986)
SVM	82.67	80.45	80.45	0.824 (0.805, 0.844)
BP	85.36	85.17	93.43	0.892 (0.849, 0.934)
CNN	72.37	77.18	66.98	0.724 (0.670, 0.779)
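
The AUC values above summarize how well each model ranks PCa cases above non-PCa cases. A minimal sketch of that interpretation follows: AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The function name `auc_from_scores` and the four toy scores are hypothetical, and the source does not state how its confidence intervals were computed, so none are reproduced here.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up, not study data)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(labels, scores)
print(auc)  # 0.75
```

This pairwise form is quadratic in the number of cases, which is fine for illustration; library implementations typically use a sorted or rank-based equivalent.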
