As the pace of biological research increases, computers are being used to manage the rapidly growing volume of biological information. Much of the information relevant to biological research is recorded either as coded data in biological databases or as free text in journal articles and in the annotation fields of biological databases. Natural language processing tools have been shown to have the potential to make information in biomedical free text easier to manage.
This project aims to use online resources (e.g., genetic databases, free-text corpora, or machine-readable dictionaries) and machine learning techniques to construct a biological entity tagging system that associates terms mentioned in text with entries in databases. Biological entity tagging is extremely challenging because of the novelty, synonymy, and ambiguity of the terms that represent biological entities in text: new names appear continually, one entity may have several names, and one name may refer to several entities. The project includes the construction of a biological entity dictionary and the acquisition of disambiguation knowledge from online resources. It also includes the development of a dictionary lookup method and the use of machine learning techniques for resolving ambiguity, discovering novel terms, and recognizing synonyms. The research will generate several deliverables, and the enriched information on gene/protein names, bibliography, and other annotation fields will be integrated into the UniProt/PIR databases, which are part of an ongoing international effort on protein databases.
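To make the dictionary lookup step concrete, the following is a minimal sketch in Python, not the project's actual method. It assumes a toy dictionary that maps made-up entry identifiers to names, normalizes spelling variants, and tags the longest matching term at each position. All names here (`TOY_ENTRIES`, `normalize`, `build_index`, `tag`) and the identifiers are hypothetical. Synonyms resolve to the same entry, and names shared by several entries are only flagged as ambiguous; resolving them and discovering novel terms would be left to the machine learning components.

```python
import re
from collections import defaultdict

# Hypothetical toy dictionary: entry ID -> known names (IDs are made up).
TOY_ENTRIES = {
    "ID_0001": ["tumor protein p53", "p53", "TP53"],
    "ID_0002": ["insulin", "INS"],
    "ID_0003": ["insulinoma-associated protein", "INS"],
}


def normalize(name):
    """Lowercase and collapse hyphens/whitespace so spelling variants match."""
    return re.sub(r"[\s\-]+", " ", name.lower()).strip()


def build_index(entries):
    """Invert the dictionary: normalized name -> set of entry IDs."""
    index = defaultdict(set)
    for entry_id, names in entries.items():
        for name in names:
            index[normalize(name)].add(entry_id)
    return index


def tag(text, index, max_len=4):
    """Greedy longest-match lookup over word n-grams of up to max_len tokens.

    Returns (matched text, sorted entry IDs, is_ambiguous) tuples.
    """
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    results = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            key = normalize(span)
            if key in index:
                ids = sorted(index[key])
                results.append((span, ids, len(ids) > 1))
                i += n
                matched = True
                break
        if not matched:
            i += 1  # no dictionary entry starts at this token
    return results


if __name__ == "__main__":
    idx = build_index(TOY_ENTRIES)
    sentence = "Mutations in TP53 and INS were studied; tumor protein p53 binds DNA."
    for span, ids, ambiguous in tag(sentence, idx):
        print(span, ids, "ambiguous" if ambiguous else "unambiguous")
```

In this sketch, "TP53" and "tumor protein p53" both map to the same entry (synonymy), while "INS" maps to two entries and is flagged for later disambiguation.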
The project provides an opportunity to further the collaboration among Columbia University, Georgetown University Medical Center, and the University of Maryland at Baltimore County. It also integrates educational and research activities by involving graduate and undergraduate students throughout the project.
Effective start/end date: 9/1/04 → 8/31/06
Funding: National Science Foundation, $823,109.00