MedShift: Automated Identification of Shift Data for Medical Image Dataset Curation

Xiaoyuan Guo, Judy Wawira Gichoya, Hari Trivedi, Saptarshi Purkayastha, Imon Banerjee

Research output: Contribution to journal, Article, peer-review

Abstract

Automated curation of noisy external data in the medical domain has long been in high demand, as AI technologies need to be validated using various sources with clean, annotated data. Identifying the variance between internal and external sources is a fundamental step in curating a high-quality dataset, as the data distributions from different sources can vary significantly and subsequently affect the performance of AI models. The primary challenges in detecting data shifts are (1) accessing private data across healthcare institutions for manual detection and (2) the lack of automated approaches that learn an efficient shift-data representation without training samples. To overcome these problems, we propose an automated pipeline called MedShift to detect top-level shift samples and evaluate the significance of shift data without sharing data between internal and external organizations. MedShift employs unsupervised anomaly detectors to learn the internal distribution and identify external samples that show significant shift, and then compares the detectors' performance. To quantify the effects of the detected shift data, we train a multi-class classifier that learns internal domain knowledge and evaluate its classification performance for each class in the external domains after dropping the shift data. We also propose a data quality metric to quantify the dissimilarity between internal and external datasets. We verify the efficacy of MedShift using musculoskeletal radiographs (MURA) and chest X-ray datasets from multiple external sources. Our experiments show that the proposed shift data detection pipeline can help medical centers curate high-quality datasets more efficiently.
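The detect-then-drop idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the anomaly detectors, so a Gaussian model with Mahalanobis distance is swapped in as a stand-in, and the feature vectors are synthetic. The function names (`fit_internal`, `anomaly_scores`, `flag_shift`) and the 0.99 quantile threshold are assumptions made for illustration only. Only the fitted statistics (mean and inverse covariance) would need to leave the internal site, which loosely mirrors the no-data-sharing constraint.

```python
import numpy as np


def fit_internal(features):
    """Estimate the mean and inverse covariance of internal feature vectors."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    # A small ridge term keeps the covariance matrix invertible.
    cov += 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def anomaly_scores(features, mean, cov_inv):
    """Squared Mahalanobis distance of each row to the internal distribution."""
    diff = features - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def flag_shift(internal, external, quantile=0.99):
    """Flag external samples scoring above a quantile of the internal scores.

    The quantile is a hypothetical choice, not a value from the paper.
    """
    mean, cov_inv = fit_internal(internal)
    internal_scores = anomaly_scores(internal, mean, cov_inv)
    threshold = np.quantile(internal_scores, quantile)
    external_scores = anomaly_scores(external, mean, cov_inv)
    return external_scores > threshold, external_scores


# Synthetic stand-ins for image embeddings (assumed, not real data).
rng = np.random.default_rng(0)
internal = rng.normal(0.0, 1.0, size=(500, 8))
external_similar = rng.normal(0.0, 1.0, size=(100, 8))
external_shifted = rng.normal(6.0, 1.0, size=(20, 8))
external = np.vstack([external_similar, external_shifted])

is_shift, scores = flag_shift(internal, external)
curated = external[~is_shift]  # external data after dropping shift samples
print(f"flagged {int(is_shift.sum())} of {len(external)} external samples")
```

In a fuller pipeline, a classifier trained on the internal data would then be evaluated per class on `curated` versus the full external set to gauge the effect of the dropped samples.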

Original language: English (US)
Pages (from-to): 3936-3947
Number of pages: 12
Journal: IEEE Journal of Biomedical and Health Informatics
Volume: 27
Issue number: 8
State: Published - Aug 1 2023

Keywords

  • Anomaly detection
  • OOD detection
  • X-ray
  • dataset curation
  • medical shift data

ASJC Scopus subject areas

  • Health Information Management
  • Health Informatics
  • Electrical and Electronic Engineering
  • Computer Science Applications
