Ark+: Supervised training a single high-performance AI foundation model from many differently labeled datasets—no label consolidation required

  • Dong Ao Ma
  • Jiaxuan Pang
  • Shivasakthi Senthil Velan
  • Michael B. Gotway
  • Jianming Liang

Research output: Contribution to journal › Article › peer-review

Abstract

This article presents a methodological breakthrough in supervised learning for training a single, robust, and high-performance artificial intelligence (AI) model from a multitude of differently labeled datasets, requiring no manual label harmonization and imposing no restriction on application domains. Deep learning offers expert-level, and sometimes even super-expert-level, performance, but achieving such performance demands massive labeled data for training. For example, Google’s proprietary CXR Foundation Model (CXR-FM) was trained on 821,544 labeled and mostly private chest radiographs (CXRs). Numerous labeled datasets have been publicly released in medical imaging; individually they are small and heterogeneous in expert labels, but collectively they are large. We therefore envision a powerful and robust AI foundation model trained in a supervised manner by aggregating numerous public, labeled datasets. Realizing this vision requires overcoming a long-standing barrier: label heterogeneity across the datasets used for the supervised training of a single outstanding model. To this end, we have developed Ark+, a framework that accrues and reuses knowledge from the heterogeneous expert annotations associated with various datasets. To demonstrate the capability of Ark+, we pretrained a family of Ark+ models, including Ark+5 and Ark+6, on 335,484 and 704,363 CXRs, respectively, by merging multiple public datasets, including MIMIC-CXR, CheXpert, ChestX-ray14, RSNA Pneumonia, VinDr-CXR, and Shenzhen-CXR. We evaluated the two Ark+ models on a wide range of imaging tasks covering classification, segmentation, and localization via fine-tuning, linear probing, and sex-bias analysis, and demonstrated Ark+’s superior and robust performance over state-of-the-art fully/self-supervised baselines and Google’s proprietary CXR-FM. Ark+ has several distinctive and advantageous properties by design.
To unveil its inner workings, we performed a series of ablation studies to show the contribution of each of its components and its superiority over alternative strategies. To demonstrate its capability to incorporate privacy-preserving data and address heterogeneous annotations across private clients in federated learning, we simulated Ark+ in various distributed training environments. To highlight its independence of architecture and its scalability in image resolution, we developed two Ark+6 models with different architectures at different resolutions. To assess its neutrality with respect to imaging modality and its extensibility to other modalities, we pretrained a new Ark+ model on fundus photography. The enhanced performance of Ark+ is attributable to our simple yet powerful insight: aggregating various datasets diversifies patient populations and accrues knowledge from many experts, yielding unprecedented performance while simultaneously reducing annotation costs. Given the ubiquity of heterogeneous data and labels across disciplines including biology, chemistry, physics, medicine, and the social sciences, the concept underlying Ark+ is poised to have far-reaching implications beyond imaging. With all code and pretrained models released at https://github.com/JLiangLab/Ark, we anticipate that Ark+ will exert an important impact on open science, as it demonstrates that accruing and reusing knowledge from expert annotations in public datasets alone can surpass the performance of proprietary models trained on unusually large data. We hope that Ark+ will inspire more researchers worldwide to share code and datasets to build open, superior, and robust foundation models, accelerate open science, and democratize AI.
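The core idea described in the abstract — one model trained on many datasets whose label sets were never harmonized — is commonly realized with a shared encoder plus one classification head per dataset, so each dataset keeps its own label vocabulary. The following is a minimal, hypothetical sketch of that pattern in NumPy; all names, shapes, and the simple linear encoder are illustrative assumptions and not the released Ark+ code.

```python
import numpy as np

# Hypothetical sketch: a shared encoder with one head per dataset.
# Each dataset keeps its own label space; no label consolidation occurs.
rng = np.random.default_rng(0)

D_IN, D_FEAT = 16, 8
datasets = {
    "dataset_a": {"num_classes": 3},   # e.g. 3 expert labels
    "dataset_b": {"num_classes": 5},   # a different, unharmonized label set
}

W_enc = rng.normal(scale=0.1, size=(D_IN, D_FEAT))  # shared backbone weights
heads = {name: rng.normal(scale=0.1, size=(D_FEAT, cfg["num_classes"]))
         for name, cfg in datasets.items()}          # dataset-specific heads

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_step(name, x, y, lr=0.1):
    """One SGD step: gradients flow into the shared encoder and
    into the head belonging to this dataset only."""
    h = x @ W_enc                        # shared features
    p = softmax(h @ heads[name])         # logits from this dataset's head
    onehot = np.eye(datasets[name]["num_classes"])[y]
    g_logits = (p - onehot) / len(x)     # softmax cross-entropy gradient
    g_head = h.T @ g_logits
    g_enc = x.T @ (g_logits @ heads[name].T)
    heads[name] -= lr * g_head
    np.subtract(W_enc, lr * g_enc, out=W_enc)  # in-place shared update
    return -np.log(p[np.arange(len(y)), y]).mean()

# Cycle through the differently labeled datasets; the encoder accrues
# knowledge from all of them while each head stays dataset-specific.
losses = {}
for name, cfg in datasets.items():
    x = rng.normal(size=(32, D_IN))
    y = rng.integers(0, cfg["num_classes"], size=32)
    first = train_step(name, x, y)
    last = first
    for _ in range(50):
        last = train_step(name, x, y)
    losses[name] = (first, last)
```

In this sketch the per-dataset heads are what sidestep label heterogeneity: a 3-class and a 5-class dataset train the same backbone without ever mapping their labels onto a common taxonomy.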

Original language: English (US)
Article number: 103828
Journal: Medical Image Analysis
Volume: 108
DOIs
State: Published - Feb 2026

Keywords

  • Accruing and reusing knowledge
  • Federated learning
  • Foundation models
  • Label heterogeneity

ASJC Scopus subject areas

  • Radiological and Ultrasound Technology
  • Radiology, Nuclear Medicine and Imaging
  • Computer Vision and Pattern Recognition
  • Health Informatics
  • Computer Graphics and Computer-Aided Design
