Overview of the gene ontology task at BioCreative IV

Yuqing Mao; Kimberly Van Auken; Donghui Li; Cecilia N. Arighi; Peter McQuilton; G. Thomas Hayman; Susan Tweedie; Mary L. Schaeffer; Stanley J.F. Laulederkind; Shur Jen Wang; Julien Gobeill; Patrick Ruch; Anh Tuan Luu; Jung Jae Kim; Jung Hsien Chiang; Yu De Chen; Chia Jung Yang; Hongfang Liu; Dongqing Zhu; Yanpeng Li; Hong Yu; Ehsan Emadzadeh; Graciela Gonzalez; Jian Ming Chen; Hong Jie Dai; Zhiyong Lu

doi:10.1093/database/bau086

Digital Health Sciences

Research output: Contribution to journal › Review article › peer-review

29 Scopus citations

Abstract

Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation.

Original language	English (US)
Journal	Database
Volume	2014
DOIs	https://doi.org/10.1093/database/bau086
State	Published - 2014

ASJC Scopus subject areas

Information Systems
General Biochemistry, Genetics and Molecular Biology
General Agricultural and Biological Sciences

Access to Document

10.1093/database/bau086

Cite this

Mao, Y., Van Auken, K., Li, D., Arighi, C. N., McQuilton, P., Hayman, G. T., Tweedie, S., Schaeffer, M. L., Laulederkind, S. J. F., Wang, S. J., Gobeill, J., Ruch, P., Luu, A. T., Kim, J. J., Chiang, J. H., Chen, Y. D., Yang, C. J., Liu, H., Zhu, D., ... Lu, Z. (2014). Overview of the gene ontology task at BioCreative IV. Database, 2014. https://doi.org/10.1093/database/bau086

Mao, Y, Van Auken, K, Li, D, Arighi, CN, McQuilton, P, Hayman, GT, Tweedie, S, Schaeffer, ML, Laulederkind, SJF, Wang, SJ, Gobeill, J, Ruch, P, Luu, AT, Kim, JJ, Chiang, JH, Chen, YD, Yang, CJ, Liu, H, Zhu, D, Li, Y, Yu, H, Emadzadeh, E, Gonzalez, G, Chen, JM, Dai, HJ & Lu, Z 2014, 'Overview of the gene ontology task at BioCreative IV', Database, vol. 2014. https://doi.org/10.1093/database/bau086

@article{2e1a9f3f9ae84dbc8c5552aa3039bc6a,

title = "Overview of the gene ontology task at BioCreative IV",

author = "Yuqing Mao and {Van Auken}, Kimberly and Donghui Li and Arighi, {Cecilia N.} and Peter McQuilton and Hayman, {G. Thomas} and Susan Tweedie and Schaeffer, {Mary L.} and Laulederkind, {Stanley J.F.} and Wang, {Shur Jen} and Julien Gobeill and Patrick Ruch and Luu, {Anh Tuan} and Kim, {Jung Jae} and Chiang, {Jung Hsien} and Chen, {Yu De} and Yang, {Chia Jung} and Hongfang Liu and Dongqing Zhu and Yanpeng Li and Hong Yu and Ehsan Emadzadeh and Graciela Gonzalez and Chen, {Jian Ming} and Dai, {Hong Jie} and Zhiyong Lu",

note = "Funding Information: This research is supported by NIH Intramural Research Program, National Library of Medicine (Y.M. and Z.L.). The BioCreative IV Workshop is funded by NSF/DBI-0850319. WormBase is funded by National Human Genome Research Institute [U41-HG002223] and the Gene Ontology Consortium by National Human Genome Research Institute (NHGRI) [U41-HG002273]. FlyBase is funded by an NHGRI/NIH grant [U41-HG000739] and the UK Medical Research Council [G1000968]. Team 238 is funded by NSF/ABI-0845523 (H.L. and D.Z.), NIH R01LM009959A1 (H.L. and D.Z.). The SIBtex (Swiss Institute of Bioinformatics) team has been partially supported by the SNF (neXtpresso #153437) and the European Union (Khresmoi #257528).",

year = "2014",

doi = "10.1093/database/bau086",

language = "English (US)",

volume = "2014",

journal = "Database",

issn = "1758-0463",

publisher = "Oxford University Press",

}

TY - JOUR

T1 - Overview of the gene ontology task at BioCreative IV

AU - Mao, Yuqing

AU - Van Auken, Kimberly

AU - Li, Donghui

AU - Arighi, Cecilia N.

AU - McQuilton, Peter

AU - Hayman, G. Thomas

AU - Tweedie, Susan

AU - Schaeffer, Mary L.

AU - Laulederkind, Stanley J.F.

AU - Wang, Shur Jen

AU - Gobeill, Julien

AU - Ruch, Patrick

AU - Luu, Anh Tuan

AU - Kim, Jung Jae

AU - Chiang, Jung Hsien

AU - Chen, Yu De

AU - Yang, Chia Jung

AU - Liu, Hongfang

AU - Zhu, Dongqing

AU - Li, Yanpeng

AU - Yu, Hong

AU - Emadzadeh, Ehsan

AU - Gonzalez, Graciela

AU - Chen, Jian Ming

AU - Dai, Hong Jie

AU - Lu, Zhiyong

N1 - Funding Information: This research is supported by NIH Intramural Research Program, National Library of Medicine (Y.M. and Z.L.). The BioCreative IV Workshop is funded by NSF/DBI-0850319. WormBase is funded by National Human Genome Research Institute [U41-HG002223] and the Gene Ontology Consortium by National Human Genome Research Institute (NHGRI) [U41-HG002273]. FlyBase is funded by an NHGRI/NIH grant [U41-HG000739] and the UK Medical Research Council [G1000968]. Team 238 is funded by NSF/ABI-0845523 (H.L. and D.Z.), NIH R01LM009959A1 (H.L. and D.Z.). The SIBtex (Swiss Institute of Bioinformatics) team has been partially supported by the SNF (neXtpresso #153437) and the European Union (Khresmoi #257528).

PY - 2014

Y1 - 2014

N2 - Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. 
Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation.

UR - http://www.scopus.com/inward/record.url?scp=84996554576&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84996554576&partnerID=8YFLogxK

U2 - 10.1093/database/bau086

DO - 10.1093/database/bau086

M3 - Review article

C2 - 25157073

AN - SCOPUS:84996554576

SN - 1758-0463

VL - 2014

JO - Database

JF - Database

ER -
