Semi-structured Information Retrieval in Clinical Text for Cohort Identification

Liu, Hongfang H (CoPI)
Hersh, William R. (CoPI)
Wu, Stephen T. (CoPI)
Liu, Hongfang D (PI)

Digital Health Sciences

University of Texas Health Science Center at Houston

Project: Research project

Description

Project Summary

The widespread adoption of Electronic Health Records (EHRs) has enabled the use of clinical data for clinical research and healthcare delivery. Many institutions have established clinical data warehouses (CDWs) in conjunction with cohort discovery tools (e.g., i2b2) to support the use of clinical data for clinical research, including retrospective clinical studies as well as feasibility assessment or patient recruitment for clinical trials. However, a significant portion of relevant patient information is embedded in clinical narratives, and natural language processing (NLP) techniques such as information extraction are therefore critical when using EHR data for clinical research. Many clinical NLP systems have been developed to extract information from text for various downstream applications, but they have suffered from unsatisfactory performance and portability issues. Information retrieval (IR), a technique used in search engines for storing, retrieving, and ranking documents from a large collection of text documents based on users’ queries, can provide an alternative approach to leveraging clinical narratives for cohort discovery, as it is less dependent on semantics. Accomplishing this requires additional work, because current IR approaches are generally document-based, and formulating cohort discovery as an IR task requires innovative IR approaches that can handle complex EHR data and cohort criteria with contextual (e.g., spatial or temporal) constraints. Our long-term goal is to develop informatics solutions to accelerate the use of EHR data for clinical research. The main goal of this proposal is to develop innovative IR methods that formulate cohort discovery from EHR data as an IR task, aiming to accelerate the identification of patient cohorts for cohort studies or the recruitment of eligible patients for clinical trials.
In our current R01-supported study (R01LM011934), we introduced novel language models to enable the reuse of NLP-produced artifacts for IR-based cohort retrieval and developed parallel resources for IR evaluation at two institutions (Mayo Clinic and OHSU). We hypothesize that, given complex cohort criteria with contextual constraints, an IR framework with tailored architecture components (e.g., indexing, ranking, evaluation, and query processing) for storing and querying EHR data has an advantage over traditional cohort discovery tools for querying unstructured EHR data as well as an advantage over text-based search engines for querying both structured and unstructured EHR data. For the proposed renewal, we plan to i) adopt common data models (CDMs) and deploy the framework at one additional site to assess the generalizability of methods, ii) extend the IR framework to incorporate contextual information, and iii) incorporate deep semantic representations into the IR framework. If successful, the proposed project will advance informatics research on cohort discovery and identification, which impacts many applications based on EHR data such as learning healthcare systems, predictive modeling, or AI in healthcare.

Status	Active
Effective start/end date	9/20/14 → 4/30/24

Funding

U.S. National Library of Medicine: $637,539.00

Access Project

https://projectreporter.nih.gov/project_info_details.cfm?aid=10879792
