TY - JOUR
T1 - Information integration and knowledge acquisition from semantically heterogeneous biological data sources
AU - Caragea, Doina
AU - Pathak, Jyotishman
AU - Bao, Jie
AU - Silvescu, Adrian
AU - Andorf, Carson
AU - Dobbs, Drena
AU - Honavar, Vasant
N1 - Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2005
Y1 - 2005
AB - We present INDUS (Intelligent Data Understanding System), a federated, query-centric system for knowledge acquisition from autonomous, distributed, semantically heterogeneous data sources that can be viewed (conceptually) as tables. INDUS employs ontologies and inter-ontology mappings to enable a user or an application to view a collection of such data sources (regardless of location, internal structure, and query interfaces) as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology. We used the INDUS framework to design algorithms for learning probabilistic models (e.g., Naive Bayes models) that predict the GO functional classification of a protein from training sequences distributed among the SWISSPROT and MIPS data sources. Mappings such as EC2GO and MIPS2GO were used to resolve the semantic differences between these data sources when answering queries posed by the learning algorithms. Our results show that INDUS can be successfully used for the integrative analysis of data from multiple sources that is needed for collaborative discovery in computational biology.
UR - http://www.scopus.com/inward/record.url?scp=26444450699&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=26444450699&partnerID=8YFLogxK
U2 - 10.1007/11530084_15
DO - 10.1007/11530084_15
M3 - Conference article
AN - SCOPUS:26444450699
SN - 0302-9743
VL - 3615
SP - 175
EP - 190
JO - Lecture Notes in Bioinformatics (Subseries of Lecture Notes in Computer Science)
JF - Lecture Notes in Bioinformatics (Subseries of Lecture Notes in Computer Science)
T2 - Second International Workshop on Data Integration in the Life Sciences, DILS 2005
Y2 - 20 July 2005 through 22 July 2005
ER -