Medical Knowledge Extraction and Analysis from Electronic Medical Records Using Deep Learning

doi:10.24920/003589

Abstract

Objectives Electronic medical records (EMRs) are important digital carriers for recording the medical activities of patients, and medical knowledge extraction (MKE) plays a key role in natural language processing (NLP) research on EMRs. Named entity recognition (NER) and medical relation extraction (MRE) are the two basic tasks of MKE. This study aims to improve recognition accuracy on both tasks by exploring deep learning methods.

Methods This study discussed and built two application scenarios of the bidirectional long short-term memory combined with conditional random field (BiLSTM-CRF) model, one for the NER task and one for the MRE task. In the data preprocessing of both tasks, a GloVe word embedding model was used to vectorize words. In the NER task, a sequence labeling strategy was used: the tag of each word was classified through the joint probability distribution modeled by the CRF layer. In the MRE task, the classification problem of a single entity was transformed into a sequence classification problem, and the relation category of medical entities was predicted by linking the feature combinations between entities, again through the CRF layer.

Results Validated on the public I2B2 2010 dataset, the BiLSTM-CRF models built in this study outperformed the baseline methods on both tasks, with an F1-measure of about 0.88 on the NER task and about 0.78 on the MRE task. Moreover, the model converged faster and avoided problems such as overfitting.

Conclusion This study demonstrated the good performance of deep learning on medical knowledge extraction. It also verified the feasibility of the BiLSTM-CRF model in different application scenarios, laying a foundation for subsequent work in the EMR field.

Key words: medical knowledge extraction, electronic medical record, named entity recognition, medical relation extraction, deep learning, bidirectional long short-term memory, conditional random field
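The abstract states that the CRF layer classifies word tags through a joint distribution over the whole sentence rather than token by token. A minimal sketch of that idea is Viterbi decoding over per-token scores plus tag-to-tag transition scores. Everything concrete below is an assumption for illustration and is not taken from the paper: the BIO tag names, the score values, and the function name `viterbi_decode`. In a BiLSTM-CRF the per-token scores would come from the BiLSTM and the transition scores would be learned.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag]      -- score of `tag` at token t (hypothetical BiLSTM output)
    transitions[prev][cur] -- score of moving from tag `prev` to tag `cur`
    """
    if not emissions:
        return []
    # best[t][tag] = best score of any tag path that ends in `tag` at token t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] = best previous tag for `tag` at token t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Illustrative scores for the two tokens "chest pain" (made-up numbers).
TAGS = ["O", "B-problem", "I-problem"]
EMISSIONS = [
    {"O": 1.0, "B-problem": 0.8, "I-problem": 0.1},
    {"O": 0.2, "B-problem": 0.3, "I-problem": 2.0},
]
# All transitions are neutral except O -> I-problem, which is penalized
# because an entity cannot continue without having started.
TRANSITIONS = {prev: {cur: 0.0 for cur in TAGS} for prev in TAGS}
TRANSITIONS["O"]["I-problem"] = -10.0

greedy = [max(TAGS, key=lambda tag: scores[tag]) for scores in EMISSIONS]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(greedy)   # per-token argmax, ignores transitions
print(decoded)  # joint decoding over the whole sequence
```

With these numbers the per-token argmax picks `O` followed by `I-problem`, which is not a valid entity, while joint decoding selects `B-problem` followed by `I-problem`. This is the kind of label dependency a CRF layer is meant to capture.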

Funding: Natural Science Foundation of Zhejiang Province (No. LQ16H180004)

Li Peilin, Yuan Zhenming, Tu Wenbo, Yu Kai, Lu Dongxin. Medical Knowledge Extraction and Analysis from Electronic Medical Records Using Deep Learning[J]. Chinese Medical Sciences Journal, 2019, 34(2): 133-139.

References

[1]	Wu JW, Guan Y, Lv XB. Entity relation extraction from electronic medical records based on deep learning. Intell Comput Appl 2014; 4(3):35-8. doi: 10.3969/j.issn.2095-2163.2014.03.009
[2]	Chen L, Li Y, Chen W, et al. Utilizing soft constraints to enhance medical relation extraction from the history of present illness in electronic medical records. J Biomed Inform 2018; 87:108-17. doi: 10.1016/j.jbi.2018.09.013
[3]	Xue NW, Shen LB. Chinese word segmentation as LMR tagging. Proceedings of the second SIGHAN workshop on Chinese language processing. 2003 Jul 11-12; Sapporo, Japan. Stroudsburg, PA, USA: Association for Computational Linguistics; 2003. 17:176-9. doi: 10.3115/1119250.1119278
[4]	Finkel JR, Grenager T, Manning C. Incorporating non-local information into information extraction systems by Gibbs sampling. Proceedings of the 43rd annual meeting on association for computational linguistics 2005; 363-70. doi: 10.3115/1219840.1219885
[5]	Ye F, Chen W, Zhou GG, et al. Intelligent recognition of named entities in electronic medical records. Chin J Biomed Eng 2011; 30(2):256-62. Chinese. doi: 10.3969/j.issn.0258-8021.2011.02.014
[6]	Li W, Zhao DZ, Li B, et al. Entity recognition of medical records combined with CRF and rules. App Res Comput 2015; 32(4):1082-6. doi: 10.3969/j.issn.1001-3695.2015.04.029
[7]	Bollegala D, Matsuo Y, Ishizuka M. Relation adaptation: learning to extract novel relations with minimum supervision. Proceedings of the 22nd International Joint Conference of Artificial Intelligence (IJCAI). 2011 Jul 16-22; Barcelona, Spain. Menlo Park, California, USA: AAAI Press; 2011. p. 2205-10. doi: 10.5591/978-1-57735-516-8/IJCAI11-368
[8]	Suchanek FM, Ifrim G, Weikum G. Combining linguistic and statistical analysis to extract relations from web documents. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006 Aug 20-23; Philadelphia, USA. New York: ACM; 2006. p. 712-7. doi: 10.1145/1150402.1150492
[9]	Qin B, Liu AA, Liu T. Unsupervised Chinese open entity relation extraction. J Comput Res Dev 2015; 52(5):1029-35. Chinese. doi: 10.7544/issn1000-1239.2015.20131550
[10]	Uzuner O, Mailoa J, Ryan R, et al. Semantic relations for problem-oriented medical records. Artif Intell Med 2010; 50(2):63-73. doi: 10.1016/j.artmed.2010.05.006
[11]	Zhou G, Su J, Zhang J, et al. Exploring various knowledge in relation extraction. ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. 2005 Jun 25-30; Michigan, USA. Stroudsburg, PA, USA: Association for Computational Linguistics; 2005. p. 427-34. doi: 10.3115/1219840.1219893
[12]	Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011; 18(5):552-6. doi: 10.1136/amiajnl-2011-000203
[13]	Rink B, Harabagiu S, Roberts K. Automatic extraction of relations between medical concepts in clinical texts. J Am Med Inform Assoc 2011; 18(5):594-600. doi: 10.1136/amiajnl-2011-000153
[14]	Demner-Fushman D, Mork JG, Shooshan SE, et al. UMLS content views appropriate for NLP processing of the biomedical literature vs. clinical text. J Biomed Inform 2010; 43(4):587-94. doi: 10.1016/j.jbi.2010.02.005
[15]	Lv X, Yi G, Yang J, et al. Clinical relation extraction with deep learning. Int J Inf Tech Decis 2016; 9(7):237-48. doi: 10.14257/ijhit.2016.9.7.22
[16]	Pennington J, Socher R, Manning C. Glove: Global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014 Oct 25-29; Doha, Qatar. p. 1532-43. doi: 10.3115/v1/D14-1162
[17]	Yin W, Kann K, Yu M, et al. Comparative study of CNN and RNN for natural language processing. 2017; arXiv:1702.01923.
[18]	Yang JF, Yu QB, Guan Y, et al. A review of research on name recognition and entity relation extraction of electronic medical records. Acta Automatica Sinica 2014; 40(8):1537-62. doi: 10.3724/SP.J.1004.2014.01537
[19]	Jiang M, Chen Y, Liu M, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inform Assoc 2011; 18(5):601-6. doi: 10.1136/amiajnl-2011-000163
[20]	De Bruijn B, Cherry C, Kiritchenko S, et al. Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010. J Am Med Inform Assoc 2011; 18(5):557-62. doi: 10.1136/amiajnl-2011-000150

Category	Description
TrIP	Treatment improves medical problems.
TrWP	Treatment worsens medical problems.
TrCP	Treatment causes medical problems.
TrAP	Treatment is applied to medical problems.
TrNAP	Treatment is not applied to medical problems.
TeRP	Tests reveal medical problems.
TeCP	Tests are conducted to investigate medical problems.
PIP	A medical problem indicates another medical problem.

Models	Medical Problems	Treatment	Tests	Total
SVM	0.861	0.829	0.785	0.832
CRF	0.878	0.845	0.792	0.847
HMM [20]	0.875	0.851	0.804	0.852
LSTM	0.892	0.863	0.816	0.861
BiLSTM-CRF	0.902	0.896	0.832	0.879

Models	TrIP	TrWP	TrCP	TrAP	TrNAP	TeRP	TeCP	PIP	Total
SVM	0.23	0.05	0.496	0.806	0.17	0.872	0.45	0.87	0.737
ME	0.216	0.02	0.502	0.814	0.193	0.859	0.393	0.91	0.731
DNN+CRF [1]	0.225	0.03	0.534	0.86	0.225	0.916	0.451	0.96	0.752
BiLSTM-CRF	0.251	0.11	0.572	0.903	0.35	0.931	0.503	0.98	0.775
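
The two result tables above appear to report F1 values, since their Total rows for BiLSTM-CRF (0.879 and 0.775) match the F1 figures quoted in the abstract. As a reminder of how such a score is formed, the sketch below computes precision, recall, and F1 from a set of gold annotations and a set of predictions. It assumes exact-match scoring of (start, end, type) spans; the page does not state the matching criterion used in the study, and the span values and the function name `precision_recall_f1` are illustrative only.

```python
from typing import Set, Tuple

# (start token index, end token index, entity type) -- an assumed representation
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets of spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: two of the four predictions match the three gold spans.
gold_spans = {(0, 1, "problem"), (4, 4, "treatment"), (7, 8, "test")}
predicted_spans = {
    (0, 1, "problem"),
    (4, 4, "treatment"),
    (7, 7, "test"),      # wrong boundary, so it does not count as a match
    (10, 10, "problem"),  # spurious prediction
}

precision, recall, f1 = precision_recall_f1(gold_spans, predicted_spans)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The same function applies to relation extraction if each item is instead a (first entity, second entity, relation category) triple, again under the exact-match assumption.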
