Deep Denoising of Raw Biomedical Knowledge Graph From COVID-19 Literature, LitCovid, and Pubtator: Framework Development and Validation

Chao Jiang; Victoria Ngo; Richard Chapman; Yue Yu; Hongfang Liu; Guoqian Jiang; Nansu Zong

doi:10.2196/38584

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Multiple types of biomedical associations of knowledge graphs, including COVID-19-related ones, are constructed based on co-occurring biomedical entities retrieved from recent literature. However, the applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions as co-occurrences in the literature do not always mean there is a true biomedical association between two entities.

Objective: Data quality plays an important role in training deep neural network models; however, most of the current work in this area has been focused on improving a model's performance with the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information.

Methods: The proposed framework used generative-based deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for the edge classification (ie, link prediction), leveraging unlabeled link information based on a real knowledge graph built from LitCovid and Pubtator.

Results: The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and 0.7 for the real data set), despite the limited amount of testing data available.

Conclusions: Our preliminary findings showed the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.

Original language	English (US)
Article number	e38584
Journal	Journal of Medical Internet Research
Volume	24
Issue number	7
DOIs	https://doi.org/10.2196/38584
State	Published - Jul 1 2022

Keywords

COVID-19
adversarial generative network
biomedical
deep denoising
knowledge graph
machine learning
network model
neural network
training data

ASJC Scopus subject areas

Health Informatics

Cite this

@article{b6dafa68ac32407eb42d0235354befed,

title = "Deep Denoising of Raw Biomedical Knowledge Graph From COVID-19 Literature, LitCovid, and Pubtator: Framework Development and Validation",

abstract = "Background: Multiple types of biomedical associations of knowledge graphs, including COVID-19-related ones, are constructed based on co-occurring biomedical entities retrieved from recent literature. However, the applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions as co-occurrences in the literature do not always mean there is a true biomedical association between two entities. Objective: Data quality plays an important role in training deep neural network models; however, most of the current work in this area has been focused on improving a model's performance with the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information. Methods: The proposed framework used generative-based deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for the edge classification (ie, link prediction), leveraging unlabeled link information based on a real knowledge graph built from LitCovid and Pubtator. Results: The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and 0.7 for the real data set), despite the limited amount of testing data available. Conclusions: Our preliminary findings showed the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.",

keywords = "COVID-19, adversarial generative network, biomedical, deep denoising, knowledge graph, machine learning, network model, neural network, training data",

author = "Chao Jiang and Victoria Ngo and Richard Chapman and Yue Yu and Hongfang Liu and Guoqian Jiang and Nansu Zong",

note = "Publisher Copyright: {\textcopyright} Chao Jiang, Victoria Ngo, Richard Chapman, Yue Yu, Hongfang Liu, Guoqian Jiang, Nansu Zong.",

year = "2022",

month = jul,

day = "1",

doi = "10.2196/38584",

language = "English (US)",

volume = "24",

journal = "Journal of Medical Internet Research",

issn = "1439-4456",

publisher = "Journal of Medical Internet Research",

number = "7",

}

TY - JOUR

T1 - Deep Denoising of Raw Biomedical Knowledge Graph From COVID-19 Literature, LitCovid, and Pubtator

T2 - Framework Development and Validation

AU - Jiang, Chao

AU - Ngo, Victoria

AU - Chapman, Richard

AU - Yu, Yue

AU - Liu, Hongfang

AU - Jiang, Guoqian

AU - Zong, Nansu

N1 - Publisher Copyright: © Chao Jiang, Victoria Ngo, Richard Chapman, Yue Yu, Hongfang Liu, Guoqian Jiang, Nansu Zong.

PY - 2022/7/1

Y1 - 2022/7/1

N2 - Background: Multiple types of biomedical associations of knowledge graphs, including COVID-19-related ones, are constructed based on co-occurring biomedical entities retrieved from recent literature. However, the applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions as co-occurrences in the literature do not always mean there is a true biomedical association between two entities. Objective: Data quality plays an important role in training deep neural network models; however, most of the current work in this area has been focused on improving a model's performance with the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information. Methods: The proposed framework used generative-based deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for the edge classification (ie, link prediction), leveraging unlabeled link information based on a real knowledge graph built from LitCovid and Pubtator. Results: The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and 0.7 for the real data set), despite the limited amount of testing data available. Conclusions: Our preliminary findings showed the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.

KW - COVID-19

KW - adversarial generative network

KW - biomedical

KW - deep denoising

KW - knowledge graph

KW - machine learning

KW - network model

KW - neural network

KW - training data

UR - http://www.scopus.com/inward/record.url?scp=85134339757&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85134339757&partnerID=8YFLogxK

U2 - 10.2196/38584

DO - 10.2196/38584

M3 - Article

C2 - 35658098

AN - SCOPUS:85134339757

SN - 1439-4456

VL - 24

JO - Journal of Medical Internet Research

JF - Journal of Medical Internet Research

IS - 7

M1 - e38584

ER -
