Flexible NLP toolkit for automatic curation of outcomes for breast cancer patients

Banerjee, Imon (PI)
Rubin, Daniel D.L (CoPI)
Keegan, Theresa T.H (CoPI)

Diagnostic Radiology

Project: Research project

Description

Project summary/Abstract: Breast cancer accounts for the largest number of new cancer cases in the world (11.7%). Although the prognosis of breast cancer patients is generally favorable due to early detection and comprehensive treatment, 20%–30% of patients will still develop distant metastases, and patients with progressive-stage disease have a median survival of only two years. Breast cancer is widely recognized as a heterogeneous disease in terms of both the metastatic capacity of the primary tumor and the time to metastatic spread. High-quality population-based cancer surveillance data are needed to: (1) describe cancer burden, patterns, and outcomes in order to (2) inform cancer prevention, detection, and control activities; and (3) evaluate interventions on the basis of past and future trends so that optimal approaches to alleviate burden and suffering from cancer can be adopted. However, the laborious manual curation process makes population-wide collection of surveillance data challenging. Studies have shown that a large percentage of total registry cost is devoted to labor for data curation, even in low-income countries. In this project, our mission is to build a flexible NLP toolset that can be executed locally at the institution level and will curate the clinical and patient-centered outcomes of breast cancer patients by parsing longitudinally acquired clinic notes and radiology and pathology reports. To test the generalizability of the tools and to initiate their deployment for data collection, we will partner with both Georgia SEER and the California state cancer registry and will curate outcome data for breast cancer patients from the past 10 years at two institutions across the US representing diverse patient populations: Emory University Hospital (Georgia) and Stanford Medical Center (California).
We will leverage previously developed tools and technologies and extend them to automatically curate clinical and patient-centered outcome data (recurrence date and site of recurrence, treatment administered, and mental and physical outcomes) from clinic notes and convert these into a structured, queryable format. The NLP tools will be dockerized and run locally at the hospital registry level for automated outcome curation. Finally, the NLP-extracted outcomes will be shared with the state cancer registries for evaluation. From a methodological perspective, the framework and the open-source software tools developed can be employed for cancer research beyond the scope of our project, for curating outcomes regardless of the problem domain.

Status	Active
Effective start/end date	8/1/22 → 7/31/24

Funding

National Cancer Institute: $525,159.00
National Cancer Institute: $540,872.00

Access Project

https://projectreporter.nih.gov/project_info_details.cfm?aid=10675009
