Reusable NLP Components
Contact: Leonard D'Avolio, PhD
Mass. Veterans Epidemiology Research and Information Center (MAVERIC)
VA Boston Healthcare System
Leonard.Davolio@va.gov or firstname.lastname@example.org
UIMA is a Java-based framework for managing unstructured information (such as free text). Originally developed by IBM and later released as open source through Apache (http://incubator.apache.org/uima/), UIMA provides a workflow and interfaces to support the processing of unstructured data. UIMA-compliant components (called 'annotators') follow a shared convention that allows them to be dropped into and out of pipelines, or collections of components (called 'AggregateAnalysisEngines'). The power of this convention is the ability to 'lego' components together for a given task. In our case, we capitalize on this modularity to import new modules using ARC.
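The plug-in idea described above can be illustrated without UIMA itself. The following is a minimal sketch in Python, not UIMA code: `Document`, `run_pipeline`, and the two toy annotators are hypothetical names invented for this illustration, and `Document` only loosely plays the role that the CAS plays in UIMA.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a CAS: the text plus annotations added so far."""
    text: str
    annotations: List[Dict[str, object]] = field(default_factory=list)


# In this sketch an "annotator" is any callable that reads the document
# and appends annotations to it. This is an assumption, not the UIMA interface.
Annotator = Callable[[Document], None]


def sentence_counter(doc: Document) -> None:
    # Toy annotator: records how many periods appear in the text.
    doc.annotations.append({"type": "SentenceCount", "value": doc.text.count(".")})


def keyword_flagger(doc: Document) -> None:
    # Toy annotator: marks every occurrence of "carcinoma", case-insensitively.
    keyword = "carcinoma"
    lowered = doc.text.lower()
    start = lowered.find(keyword)
    while start != -1:
        doc.annotations.append(
            {"type": "Keyword", "begin": start, "end": start + len(keyword)}
        )
        start = lowered.find(keyword, start + 1)


def run_pipeline(annotators: List[Annotator], text: str) -> Document:
    # Plays the role of an aggregate engine: each component runs in order
    # on one shared document.
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


sample = "Prostatic adenocarcinoma. Carcinoma present."
result = run_pipeline([sentence_counter, keyword_flagger], sample)
keywords_only = run_pipeline([keyword_flagger], sample)
```

Because every component has the same shape, removing `sentence_counter` from the list changes the pipeline without touching the other component, which is the property the 'lego' analogy refers to.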
We have developed and continue to develop UIMA-based components for specific clinical tasks. Binary and source packages for the UIMA annotation engines are available at http://code.google.com/p/maveric-clinical-nlp-annotators/.
This UIMA annotator identifies and extracts Gleason scores from prostate cancer-related pathology reports. It has achieved an F-measure above 95% at 8 hospitals (6 VA and 2 industry). Coming soon.
The regular expression-based Gleason Score Annotator extracts the Gleason string from the data files. This annotator creates a Gleason score system object with properties of GleasonStr and GleasonNum. The regular expression used is:
All annotated information is stored into the UIMA CAS object. The information in CAS objects can be retrieved for further processing depending on application requirements and annotation results.
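To make the extract-then-retrieve flow concrete, here is a minimal sketch in Python. The pattern below is a hypothetical illustration and is not the regular expression used by the annotator. The property names GleasonStr and GleasonNum come from the description above, but treating GleasonNum as the total score is an assumption, and the list of dictionaries only stands in for annotations stored in a CAS.

```python
import re

# Hypothetical pattern for strings such as "Gleason score 3+4=7" or "Gleason 4 + 3".
# It is NOT the expression used by the MAVERIC annotator.
GLEASON_RE = re.compile(
    r"gleason(?:'s)?\s*(?:score|grade|sum)?\s*:?\s*"
    r"(\d)\s*\+\s*(\d)(?:\s*=\s*(\d{1,2}))?",
    re.IGNORECASE,
)


def extract_gleason(text):
    """Return one annotation-like dict per Gleason mention found in text."""
    annotations = []
    for m in GLEASON_RE.finditer(text):
        primary, secondary, total = m.group(1), m.group(2), m.group(3)
        # Assumption: GleasonNum is the total score. Use the stated total
        # when present, otherwise add the primary and secondary grades.
        gleason_num = int(total) if total else int(primary) + int(secondary)
        annotations.append(
            {
                "GleasonStr": m.group(0),
                "GleasonNum": gleason_num,
                "begin": m.start(),
                "end": m.end(),
            }
        )
    return annotations


report = "Prostate, biopsy: adenocarcinoma, Gleason score 3+4=7."
found = extract_gleason(report)
```

Keeping the character offsets alongside the extracted values lets a downstream step return to the exact span in the report, which is what storing annotations rather than bare values makes possible.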
D'Avolio LW, Litwin MS, Rogers SO, Bui AAT. "Facilitating Clinical Outcomes Assessment through the automated identification of quality measures for prostate cancer surgery." JAMIA. 2008;15(3):341-8.
This UIMA-based Tumor Stage Annotator was developed to extract tumor stage from prostate cancer pathology reports. It has achieved an F-measure above 95% on more than 10 years' worth of data from 8 hospitals (6 VA and 2 industry). It has also been used in pilots on breast cancer and colorectal cancer tumor stages. Those results were promising but preliminary, so we have no numbers to report. Coming soon.
The Tumor Stage Annotator annotates tumor stage information in pathology reports. It saves all annotated information into CAS objects. The information in CAS objects can be retrieved for further processing based on application requirements. For example, we developed a post-process to split the annotated tumor stage string into several parts such as tumor stage number, tumor stage number string, lymph nodes, and metastasis.
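A post-processing step of the kind described above might look like the following minimal sketch in Python. It is not the authors' post-process: the pattern, the function `split_stage`, and the output field names are hypothetical, and the sketch assumes a compact TNM-style string such as "pT2cN0MX". It reads "tumor stage number string" as the T component with its subcategory, which is also an assumption.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for compact TNM-style strings, e.g. "pT2cN0MX" or "T3a N1".
# It is an illustration only, not the expression used in the annotator.
STAGE_RE = re.compile(
    r"[pcy]?"                            # optional prefix (pathologic, clinical, post-therapy)
    r"T(?P<num>[0-4X])(?P<sub>[a-c]?)"   # T category and optional subcategory
    r"[\s,]*(?P<n>N[0-3X])?"             # optional lymph node component
    r"[\s,]*(?P<m>M[01X])?"              # optional metastasis component
)


def split_stage(stage_str: str) -> Optional[Dict[str, Optional[str]]]:
    """Split an annotated stage string into parts; return None if it does not parse."""
    m = STAGE_RE.fullmatch(stage_str.strip())
    if m is None:
        return None
    return {
        "stage_number": m.group("num"),
        "stage_number_string": "T" + m.group("num") + m.group("sub"),
        "lymph_nodes": m.group("n"),    # None when the N component is absent
        "metastasis": m.group("m"),     # None when the M component is absent
    }


parts = split_stage("pT2cN0MX")
partial = split_stage("T3a N1")
```

Returning `None` for strings that do not parse, rather than guessing, keeps unparseable annotations visible for manual review.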
The regular expression used in the annotator:
D'Avolio LW, Litwin MS, Rogers SO, Bui AAT. "Facilitating Clinical Outcomes Assessment through the automated identification of quality measures for prostate cancer surgery." JAMIA. 2008;15(3):341-8.
The breast cancer concept-extraction tools come in two flavors: one for reading pathology notes and another for clinic notes. The extracted variables are AJCC, Grade, Tumor Stage, Nodal Stage, ER Status, PR Status, Her2Neu Method, and Her2Neu Test Value. Source is available at Google Code.
Dodgion C, Nguyen T, Karcz A, Hu YY, Jiang W, Corso K, Lipsitz SR, D'Avolio LW, Greenberg CC. Man or Machine: Multi-Institutional Evaluation of Automated Chart Review. American College of Surgeons Academic Congress. October 2011.
Greenberg CC, Dodgion C, Nguyen T, Jiang W, Hu YY, Karcz A, Lipsitz SR, D'Avolio LW. Expanding the Use of Free Text in EMR to Study Breast Cancer. Academy Health. June 2013.