TY - GEN
T1 - Identification of Genetic Causality Statements in Medline Abstracts Leveraging Distant Supervision
AU - Wang, Liwei
AU - Rastegar-Mojarad, Majid
AU - Komandur Elayavilli, Ravikumar
AU - Wang, Yanshan
AU - Liu, Hongfang
N1 - Funding Information: This work is supported by NIH 1R01LM011934-01A1 and R01EB019403. Publisher Copyright: © 2018 IEEE.
PY - 2018/7/16
Y1 - 2018/7/16
AB - In the era of precision medicine, the clinical utility of next-generation sequencing technology depends heavily on the ability to interpret causal associations between genetic variants and phenotypes, which can be a labor-intensive process. Various resources, such as HGMD and ClinVar, catalog such associations. Given the exponential growth of the literature in the field, it is desirable to accelerate this process by automatically identifying genetic causality statements in the literature. Here, we define the task of identifying these statements as a classification task over sentences containing gene and disease entities. We used the Cancer Gene Census available at the Catalogue of Somatic Mutations in Cancer (COSMIC) to generate a weakly labeled data set for our classification task. We evaluated multiple feature sets (words, bi-grams, and word embeddings) and several machine-learning methods, achieving a weighted F-measure of around 95%. Evaluation using the top 50 genetic variant-disease sentences demonstrated that the proposed method can identify genetic causality statements.
KW - ClinVar
KW - MutD
KW - Semantic Medline
KW - cancer
KW - causality
KW - classification
KW - disease
KW - distant supervision
KW - genetic variant
UR - http://www.scopus.com/inward/record.url?scp=85051030840&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85051030840&partnerID=8YFLogxK
U2 - 10.1109/ICHI-W.2018.00008
DO - 10.1109/ICHI-W.2018.00008
M3 - Conference contribution
AN - SCOPUS:85051030840
T3 - Proceedings - 2018 IEEE International Conference on Healthcare Informatics Workshops, ICHI-W 2018
SP - 1
EP - 8
BT - Proceedings - 2018 IEEE International Conference on Healthcare Informatics Workshops, ICHI-W 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th IEEE International Conference on Healthcare Informatics Workshops, ICHI-W 2018
Y2 - 4 June 2018 through 7 June 2018
ER -