
Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries

  • Yan Xu
  • Kai Hong
  • Junichi Tsujii
  • Eric I.Chao Chang*
  • *Corresponding author for this work
  • Microsoft USA
  • University of Pennsylvania

Research output: Contribution to journal, article, peer-reviewed

Abstract

Objective: A system that translates narrative text in the medical domain into a structured representation is in great demand. The system performs three sub-tasks: concept extraction, assertion classification, and relation identification. Design: The overall system consists of five steps: (1) pre-processing sentences, (2) marking noun phrases (NPs) and adjective phrases (APs), (3) extracting concepts, using a dosage-unit dictionary to switch dynamically between two models based on Conditional Random Fields (CRF), (4) classifying assertions by voting among five classifiers, and (5) identifying relations using normalized sentences with a set of effective discriminating features. Measurements: Macro-averaged and micro-averaged precision, recall and F-measure were used to evaluate results. Results: The performance is competitive with state-of-the-art systems, with micro-averaged F-measures of 0.8489 for concept extraction, 0.9392 for assertion classification and 0.7326 for relation identification. Conclusions: The system exploits an array of common features and achieves state-of-the-art performance. Prudent feature engineering sets the foundation of our systems. In concept extraction, we demonstrated that switching between models, one of which is designed specifically for telegraphic sentences, significantly improved extraction of the treatment concept. In assertion classification, a set of features derived from a rule-based classifier proved effective for classes such as conditional and possible. These classes would suffer from data scarcity in conventional machine-learning methods. In relation identification, we use a two-stage architecture, the second stage of which applies pairwise classifiers to possible candidate classes. This architecture significantly improves performance.
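The difference between the macro-averaged and micro-averaged measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes a single-label setting such as assertion classification, takes the macro F-measure to be the mean of the per-class F-measures (one common convention), and uses hypothetical names (`prf`, `micro_macro`) and made-up labels.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def micro_macro(gold, pred):
    """Return (micro, macro) triples of (precision, recall, F-measure).

    gold and pred are equal-length lists of class labels, one per item.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    # Micro-average: pool the counts over all classes, then compute once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))

    # Macro-average: compute per class, then take the unweighted mean,
    # so a rare class weighs as much as a frequent one.
    per_class = [prf(tp[c], fp[c], fn[c]) for c in labels]
    macro = tuple(sum(s[i] for s in per_class) / len(labels) for i in range(3))
    return micro, macro


# Hypothetical toy labels, for illustration only.
gold = ["present", "present", "present", "absent", "possible"]
pred = ["present", "present", "absent", "absent", "present"]
micro, macro = micro_macro(gold, pred)
print("micro P/R/F:", micro)
print("macro P/R/F:", macro)
```

In this toy example the rare class `possible` is never predicted correctly, which pulls the macro-averaged F-measure well below the micro-averaged one. This is why scarce classes such as conditional and possible matter more under macro-averaging.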

Original language: English
Pages (from-to): 824-832
Number of pages: 9
Journal: Journal of the American Medical Informatics Association
Volume: 19
Issue number: 5
DOI
Publication status: Published - Sep 2012
