A system for industrial-strength linguistic parsing of medical documents

Towards a comprehensive medical language processing system: methods and issues

Proceedings of the AMIA Annual Fall Symposium, 1997

Natural language processing (NLP) systems can help solve the data entry problem by providing coded data from textual reports for clinical applications. A number of NLP systems have shown promise, but have not yet achieved widespread use for practical applications. In order to achieve such use, a system must have broad coverage of the clinical domain and not be restricted to limited applications. In addition, an NLP system must perform satisfactorily for real-world applications. This paper describes methods and issues associated with an ongoing extension of MedLEE, an operational NLP system, from a limited domain to a domain that encompasses comprehensive clinical information.

A comparison of parsing technologies for the biomedical domain

2002

This paper reports on a number of experiments which are designed to investigate the extent to which current NLP resources are able to syntactically and semantically analyse biomedical text. We address two tasks: (a) parsing a real corpus with a hand-built wide-coverage grammar, producing both syntactic analyses and logical forms and (b) automatically computing the interpretation of compound nouns where the head is a nominalisation (e.g. hospital arrival means an arrival at hospital, while patient arrival means an arrival of a patient). For the former task we demonstrate that flexible and yet constrained pre-processing techniques are crucial to success: these enable us to use part-of-speech tags to overcome inadequate lexical coverage, and to package up complex technical expressions prior to parsing so that they are blocked from creating misleading amounts of syntactic complexity. We argue that the XML-processing paradigm is ideally suited for automatically preparing the corpus for parsing. For the latter task, we compute interpretations of the compounds by exploiting surface cues and meaning paraphrases, which in turn are extracted from the parsed corpus. This provides an empirical setting in which we can compare the utility of a comparatively deep parser vs. a shallow one, exploring the trade-off between resolving attachment ambiguities on the one hand and generating errors in the parses on the other. We demonstrate that a model of the meaning of compound nominalisations is achievable with the aid of current broad-coverage parsers.
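
The compound-interpretation task can be pictured with a toy example. The sketch below is not taken from the paper; it merely illustrates the idea of choosing a paraphrase relation ("of" vs. "at") for a nominalised head by counting the corresponding surface paraphrases in a corpus. The miniature corpus and the preposition inventory are invented.

```python
from collections import Counter

# Toy corpus of already-tokenised sentences; in the paper the evidence comes
# from a large parsed corpus, here it is a handful of invented examples.
corpus = [
    "the arrival of the patient was recorded",
    "arrival of a patient after midnight",
    "on arrival at the hospital the patient was stable",
    "shortly after arrival at hospital",
]

def paraphrase_counts(head, modifier, prepositions=("of", "at", "in", "by")):
    """Count surface paraphrases 'HEAD <prep> ... MODIFIER' as crude evidence
    for the semantic relation inside the compound 'MODIFIER HEAD'."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != head:
                continue
            # Look a few tokens ahead for "<preposition> ... modifier".
            window = tokens[i + 1 : i + 5]
            if window and window[0] in prepositions and modifier in window:
                counts[window[0]] += 1
    return counts

def interpret_compound(modifier, head):
    counts = paraphrase_counts(head, modifier)
    if not counts:
        return f"{modifier} {head}: no paraphrase evidence found"
    preposition, _ = counts.most_common(1)[0]
    return f"{modifier} {head} -> {head} {preposition} {modifier}"

print(interpret_compound("patient", "arrival"))   # arrival of patient
print(interpret_compound("hospital", "arrival"))  # arrival at hospital
```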

MEDSYNDIKATE--a natural language system for the extraction of medical information from findings reports

2002

MEDSYNDIKATE is a natural language processor which automatically acquires medical information from findings reports. In the course of text analysis, their content is transferred to conceptual representation structures which constitute a corresponding text knowledge base. MEDSYNDIKATE is particularly adapted to deal properly with text structures, such as various forms of anaphoric reference relations spanning several sentences. The strong demands MEDSYNDIKATE places on the availability of expressive knowledge sources are accounted for by two alternative approaches to acquiring medical domain knowledge (semi-)automatically. Finally, we present data on the information extraction performance of MEDSYNDIKATE in terms of the semantic interpretation of three major syntactic patterns in medical documents.
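
As a toy illustration of the cross-sentence anaphora mentioned above, the following sketch links a pronoun or definite mention to the most recently introduced compatible antecedent. The tiny semantic lexicon and the recency heuristic are invented for the example and are far simpler than MEDSYNDIKATE's conceptual machinery.

```python
# Toy semantic lexicon for candidate antecedents; invented for illustration.
SEMANTIC_TYPE = {"polyp": "FINDING", "tumor": "FINDING", "colon": "BODY_PART"}

def resolve_anaphora(sentences):
    """Link 'it' and definite mentions ('the X') to the most recently
    introduced compatible antecedent, scanning sentences left to right."""
    discourse = []   # antecedents in order of mention: (word, sentence index)
    links = []
    for i, sentence in enumerate(sentences):
        tokens = sentence.lower().rstrip(".").split()
        for j, tok in enumerate(tokens):
            definite = j > 0 and tokens[j - 1] == "the"
            if tok in SEMANTIC_TYPE and not definite:
                discourse.append((tok, i))                   # new mention
            elif tok == "it" or (tok in SEMANTIC_TYPE and definite):
                target = None if tok == "it" else tok
                for antecedent, k in reversed(discourse):    # most recent first
                    if target is None or antecedent == target:
                        links.append((tok, i, antecedent, k))
                        break
    return links

report = ["A small polyp was found in the colon",
          "It was removed endoscopically"]
print(resolve_anaphora(report))   # [('it', 1, 'polyp', 0)] under this naive heuristic
```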

Collection and linguistic processing of a large-scale corpus of medical articles

Proceedings of LREC, 2002

We have collected a large-scale corpus of electronic articles in the cardiology domain (85 million+ words) in the framework of a digital library project that tailors the presentation of online medical literature to both patients and healthcare providers. We describe the web-based and XML technologies we used for the collection, encoding and linguistic processing of the corpus. This resulted in a large-scale, high-quality, thoroughly marked-up resource which is used by many researchers in our project, in the areas of natural language processing, information retrieval and medical informatics. We show how the final use of the resource has influenced the design of its structural and linguistic encoding. The procedure we describe is general enough to be of use to researchers in a similar position wishing to compile, encode and linguistically annotate their own corpus from the web.
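
To give a rough idea of the kind of structural and linguistic mark-up such a resource carries, the sketch below wraps tokenised, part-of-speech-tagged sentences in XML. The element and attribute names are invented for the example and do not reproduce the project's actual encoding scheme.

```python
import xml.etree.ElementTree as ET

# Invented mini-document standing in for one article of the corpus.
article = {
    "id": "cardio-0001",
    "title": "Beta-blockers in chronic heart failure",
    "sentences": [
        [("Beta-blockers", "NNS"), ("reduce", "VBP"), ("mortality", "NN"), (".", ".")],
        [("The", "DT"), ("effect", "NN"), ("is", "VBZ"), ("dose-dependent", "JJ"), (".", ".")],
    ],
}

def encode_article(doc):
    """Build an XML tree with one <s> per sentence and one <w> per token,
    keeping the part-of-speech tag as an attribute."""
    root = ET.Element("article", id=doc["id"])
    ET.SubElement(root, "title").text = doc["title"]
    body = ET.SubElement(root, "body")
    for n, sentence in enumerate(doc["sentences"], start=1):
        s = ET.SubElement(body, "s", n=str(n))
        for token, pos in sentence:
            w = ET.SubElement(s, "w", pos=pos)
            w.text = token
    return root

xml_string = ET.tostring(encode_article(article), encoding="unicode")
print(xml_string)
```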

Natural language processing and the representation of clinical data

Journal of the American …, 1994

Objective: Develop a representation of clinical observations and actions, and a method of processing free-text patient documents, to facilitate applications such as quality assurance. Design: The Linguistic String Project (LSP) system of New York University utilizes syntactic analysis, augmented by a sublanguage grammar and an information structure that are specific to the clinical narrative, to map free-text documents into a database for querying.
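
The idea of mapping narrative statements into a queryable information structure can be pictured with a toy example; the column names and the trivial pattern standing in for the sublanguage grammar below are invented and do not reflect the LSP system's actual formats.

```python
import re
import sqlite3

# Toy "information format": one row per clinical statement. Column names invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE findings (finding TEXT, body_site TEXT, status TEXT)")

# A single pattern plays the role of the sublanguage grammar here: it only
# recognises statements of the shape '[no] <finding> in|of the <site>'.
PATTERN = re.compile(r"(no )?(\w+(?: \w+)?) (?:in|of) the (\w+)", re.IGNORECASE)

def load_sentence(sentence):
    """Map one narrative statement into a database row, if the pattern applies."""
    match = PATTERN.search(sentence)
    if match:
        status = "negated" if match.group(1) else "asserted"
        conn.execute("INSERT INTO findings VALUES (?, ?, ?)",
                     (match.group(2).lower(), match.group(3).lower(), status))

load_sentence("No acute infiltrate in the lungs.")
load_sentence("Mild effusion of the pericardium.")

for row in conn.execute("SELECT * FROM findings WHERE status = 'asserted'"):
    print(row)   # ('mild effusion', 'pericardium', 'asserted')
```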

From Linguistic Resources to Medical Entity Recognition: a Supervised Morpho-syntactic Approach

2015

Due to the importance of the information it conveys, Medical Entity Recognition is one of the most investigated tasks in Natural Language Processing. Much research has aimed at solving the problem of Text Extraction, also with a view to developing Decision Support Systems in the field of Health Care. In this paper, we propose a Lexicon-grammar method for the automatic extraction from raw texts of the semantic information referring to medical entities and, furthermore, for the identification of the semantic categories that describe the located entities. Our work is grounded in an electronic dictionary of neoclassical formative elements of the medical domain, an electronic dictionary of nouns indicating drugs, body parts and internal body parts, and a grammar network composed of morphological and syntactic rules in the form of Finite-State Automata. The outcome of our research is an Extensible Markup Language (XML) annotated corpus of medical reports with information pertaining to t...
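
A minimal sketch of the dictionary-plus-rules approach is given below: two invented noun dictionaries and two regular expressions (standing in for the finite-state morphological rules) assign semantic categories and emit XML-style annotations. None of the resources shown reproduce the paper's actual dictionaries or grammar network.

```python
import re

# Invented miniature stand-ins for the electronic dictionaries described above.
DRUG_NOUNS = {"aspirin", "ibuprofen", "amoxicillin"}
BODY_PART_NOUNS = {"liver", "stomach", "kidney"}

# Two morphological rules written as regular expressions; in the paper such
# rules are expressed as finite-state automata over formative elements.
MORPH_RULES = [
    (re.compile(r"\w+itis", re.IGNORECASE), "DISEASE"),
    (re.compile(r"\w+ectomy", re.IGNORECASE), "PROCEDURE"),
]

def annotate(text):
    """Return the text with recognised medical entities wrapped in XML-style tags."""
    annotated = []
    for token in text.split():
        word = token.strip(".,;").lower()
        if word in DRUG_NOUNS:
            annotated.append(f'<entity cat="DRUG">{token}</entity>')
        elif word in BODY_PART_NOUNS:
            annotated.append(f'<entity cat="BODY_PART">{token}</entity>')
        else:
            for pattern, category in MORPH_RULES:
                if pattern.fullmatch(word):
                    annotated.append(f'<entity cat="{category}">{token}</entity>')
                    break
            else:
                annotated.append(token)
    return " ".join(annotated)

print(annotate("Appendicitis was treated by appendectomy and ibuprofen."))
```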

Morpho-semantic parsing of medical expressions

Proceedings of the AMIA Annual Symposium, 1998

The task of editing, indexing, storing, and retrieving medical expressions within medical records remains a major objective for the years to come. The need is therefore paramount for a parser with semantic capabilities that can robustly extract an essential part of the knowledge embedded in the medical record. The minimal requirements before considering clinical trials are that such a system must be able to handle any source of medical information and to grasp the key concepts with low silence, good recognition of modalities, and acceptable noise. This paper shows that morpho-semantic parsing has high potential to meet these conditions. The technique is an important complement to the traditional lexical approach and to expression-oriented systems such as controlled vocabularies.
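
To make the idea of morpho-semantic parsing concrete, the sketch below decomposes a medical term into neoclassical morphemes and maps each one to a concept label. The morpheme table and the labels are invented for illustration and do not come from the paper.

```python
from typing import Optional

# Invented morpheme-to-concept table; a real morpho-semantic lexicon is far larger.
MORPHEMES = {
    "gastr": "stomach", "enter": "intestine", "hepat": "liver",
    "cardi": "heart", "my": "muscle", "itis": "inflammation",
    "megaly": "enlargement", "pathy": "disease",
}

def decompose(term: str) -> Optional[list]:
    """Split a term into known morphemes, allowing the linking vowels 'o'/'a'
    between them, by recursive matching with backtracking."""
    term = term.lower()
    if not term:
        return []
    for end in range(len(term), 0, -1):
        piece = term[:end]
        if piece in MORPHEMES or piece in ("o", "a"):
            rest = decompose(term[end:])
            if rest is not None:
                return ([] if piece in ("o", "a") else [piece]) + rest
    return None

def semantic_reading(term: str) -> str:
    pieces = decompose(term)
    if pieces is None:
        return f"{term}: not covered by the morpheme lexicon"
    return f"{term} = " + " + ".join(f"{p} ({MORPHEMES[p]})" for p in pieces)

print(semantic_reading("gastroenteritis"))   # stomach + intestine + inflammation
print(semantic_reading("cardiomyopathy"))    # heart + muscle + disease
print(semantic_reading("hepatomegaly"))      # liver + enlargement
```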

Natural Language Processing in Biomedicine: A Unified System Architecture Overview

Methods in Molecular Biology, 2014

In contemporary electronic medical records, much of the clinically important data (signs and symptoms, symptom severity, disease status, etc.) is not provided in structured data fields but rather is encoded in clinician-generated narrative text. Natural language processing (NLP) provides a means of unlocking this important data source for applications in clinical decision support, quality assurance, and public health. This chapter provides an overview of representative NLP systems in biomedicine based on a unified architectural view. A general architecture for an NLP system consists of two main components: background knowledge, which includes biomedical knowledge resources, and a framework that integrates NLP tools to process text. Systems differ in both components, which we review briefly. Additionally, the challenges facing current research efforts in biomedical NLP include the paucity of large, publicly available annotated corpora, although initiatives that facilitate data sharing, system evaluation, and collaborative work between researchers in clinical NLP are starting to emerge.
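
The two-component view (background knowledge plus a framework that chains NLP tools) can be sketched as a minimal pipeline. The stage names, the tiny terminology and the concept codes below are invented; note that modality handling (e.g. the negation in "No dyspnea") is deliberately omitted.

```python
# Background knowledge component: a toy terminology mapping strings to
# invented concept codes (not real terminology identifiers).
TERMINOLOGY = {"chest pain": "CONCEPT:001", "dyspnea": "CONCEPT:002"}

# Framework component: an ordered list of processing stages, each a function
# taking and returning a shared document dictionary.
def sentence_splitter(doc):
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc

def concept_tagger(doc):
    doc["concepts"] = []
    for sentence in doc["sentences"]:
        for term, code in TERMINOLOGY.items():
            if term in sentence.lower():
                doc["concepts"].append({"term": term, "code": code, "sentence": sentence})
    return doc

def run_pipeline(text, stages):
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

result = run_pipeline("Patient reports chest pain. No dyspnea on exertion.",
                      [sentence_splitter, concept_tagger])
print(result["concepts"])   # note: the negation of 'dyspnea' is not handled here
```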

The SYNODOS Project: System for the Normalization and Organization of Textual Medical Data for Observation in Healthcare

IRBM

Introduction: The electronic health record (EHR) is a very important potential source of data for various areas, such as medical decision-support tools, evidence-based medicine or epidemiological surveillance. Much of this data is available in text format. Methods of natural language processing can be used to perform data mining and facilitate interpretation. The purpose of this project was to develop a generic semantic solution for extracting and structuring medical data for epidemiological analyses or for medical decision support. The solution was developed with the objective of making it as independent as possible from the field of medical application, in order to allow any new user to write his or her own expert rules regardless of their area of medical expertise. Material and methods: SYNODOS offers a modular architecture that makes a clear distinction between the linguistic rules and the medical expert rules. Different modules have been developed or adapted for this purpose: an interface between the multi-terminology server and the semantic analyzer during the extraction phase, linguistic rules to extract temporal expressions, expert rules adapted to two areas of application (nosocomial infections, cancer), and an interface between the engine and the linguistic knowledge base. Results: Modular integrations were performed consecutively. The multi-terminology extractor and the semantic analyzer were first interfaced during the extraction phase. The output of this data processing was then integrated into a knowledge base. A user interface to access documents and write expert rules was developed. Expert rules for the detection of nosocomial infections and for the evaluation of colon cancer management have been developed. It was necessary to develop an additional module, the need for which had not been identified when the protocol was drafted. This module aims to structure the output of the data processing described above according to the patient's care pathway, and is based on the writing of medical expert rules. Evaluation indicators were obtained at different stages of the process (terminology extraction, semantic relations, data structuring, detection of events of interest). Discussion: This project helped to highlight the value of combining different technologies (natural language processing, terminology, expert system integration) to allow the use of unstructured data in epidemiology. However, the need to develop an additional module meant that a complete and operational solution was not achieved. Furthermore, the response time of the multi-terminology extractor (ECMT V2) is too long (6 minutes per report). A change of technology was envisaged at the end of the project to reduce this time.
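
The separation between linguistic processing and medical expert rules can be illustrated with a toy example: structured events produced by an extraction step are handed to a user-written rule, here a simplified nosocomial-infection check. The field names, dates, threshold and the rule itself are invented and do not reflect SYNODOS's actual rule base.

```python
from datetime import date

# Output of the (not shown) linguistic/terminological extraction step:
# structured events attached to a patient's care pathway. All values invented.
extracted_events = [
    {"type": "admission", "date": date(2023, 3, 1)},
    {"type": "infection", "site": "urinary tract", "date": date(2023, 3, 5)},
]

def nosocomial_infection_rule(events, min_delay_days=2):
    """Toy expert rule: flag an infection documented at least `min_delay_days`
    after admission. Real rules involve many more criteria."""
    admissions = [e["date"] for e in events if e["type"] == "admission"]
    infections = [e["date"] for e in events if e["type"] == "infection"]
    if not admissions or not infections:
        return False
    first_admission = min(admissions)
    return any((inf - first_admission).days >= min_delay_days for inf in infections)

print(nosocomial_infection_rule(extracted_events))  # True: infection 4 days after admission
```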

The power and limits of a rule-based morpho-semantic parser

Proceedings of the AMIA Annual Symposium, 1999

The advent of the Electronic Patient Record (EPR) implies an increasing amount of medical text readily available for processing, as soon as convenient tools are made available. The chief application is text analysis, from which one can derive other applications such as indexing for retrieval, knowledge representation, translation, and inferencing for intelligent medical systems. Prerequisites for a convenient analyzer of medical texts are: building the lexicon, developing a semantic representation of the domain, having a large corpus of texts available for statistical analysis, and finally mastering robust and powerful parsing techniques in order to satisfy the constraints of the medical domain. This article presents an easy-to-use parser ready to be adapted to different settings, and describes its power together with its practical limitations as experienced by the authors.