Chinese medical concept normalization, which maps non-standard medical concepts to standard expressions, is an NLP task with wide-ranging applications in medical big data research and clinical statistics. Many previous works apply supervised methods that require large amounts of annotated data. However, they cannot address the challenge posed by the high cost of medical data annotation, which demands substantial professional knowledge and experience. Meanwhile, existing unsupervised methods perform poorly on the varied non-standard expressions found in different data sources. In this paper, we propose DUNE (Disease Unsupervised Normalization by Embedding), an unsupervised Chinese medical concept normalization framework built on a denoising auto-encoder (DAE) and network embedding. We formulate the task as finding mention-entity pairs with high text and comorbidity similarity. To handle noise in the text, we design a multi-view attention-based denoising auto-encoder (MADAE) that captures text information from multiple views, reduces the influence of noise, and transforms text into denoised vectors. To introduce comorbidity information, we construct a comorbidity network from medical records, with both standard and non-standard disease names as nodes. Because of the diversity of non-standard expressions, one disease may correspond to several different nodes, which introduces noise into the network structure. To handle such structural noise, we propose a denoising network embedding framework that reduces the structural noise with the help of text information and embeds the nodes into vectors for comorbidity similarity measurement. Experimental results show that our method outperforms existing unsupervised baselines and approaches the performance of classical supervised machine learning models on this task.
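To make the mention-entity matching formulation concrete, the following is a minimal sketch of how the final scoring step could look once denoised text vectors (from MADAE) and comorbidity-network embeddings are available. The function names, dictionary interfaces, and the linear weighting parameter `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def normalize_mention(mention, standard_entities, text_vec, graph_vec, alpha=0.5):
    """Map a non-standard mention to the standard entity with the highest
    combined text and comorbidity similarity.

    text_vec / graph_vec: dicts mapping disease names to their denoised text
    embeddings and comorbidity-network embeddings (hypothetical interfaces).
    alpha: weight balancing text vs. comorbidity similarity (assumed, not from the paper).
    """
    best_entity, best_score = None, float("-inf")
    for entity in standard_entities:
        score = (alpha * cosine(text_vec[mention], text_vec[entity])
                 + (1 - alpha) * cosine(graph_vec[mention], graph_vec[entity]))
        if score > best_score:
            best_entity, best_score = entity, score
    return best_entity, best_score
```

In this sketch, each non-standard mention is simply assigned to the standard entity that maximizes a weighted sum of the two cosine similarities; the paper's actual combination of text and comorbidity evidence may differ.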