TY - JOUR
T1 - Cobind
T2 - quantitative analysis of the genomic overlaps
AU - Ma, Tao
AU - Guo, Lingyun
AU - Yan, Huihuang
AU - Wang, Liguo
N1 - Publisher Copyright: © 2023 The Author(s). Published by Oxford University Press.
PY - 2023
Y1 - 2023
N2 - Motivation: Analyzing the overlap between two sets of genomic intervals is a frequent task in the field of bioinformatics. Typically, this is accomplished by counting the number (or proportion) of overlapped regions, which applies an arbitrary threshold to determine if two genomic intervals are overlapped. By making binary calls but disregarding the magnitude of the overlap, such an approach often leads to biased, non-reproducible, and incomparable results. Results: We developed the cobind package, which incorporates six statistical measures: the Jaccard coefficient, Sørensen-Dice coefficient, Szymkiewicz-Simpson coefficient, collocation coefficient, pointwise mutual information (PMI), and normalized PMI. These measures allow for a quantitative assessment of the collocation strength between two sets of genomic intervals. To demonstrate the effectiveness of these methods, we applied them to analyze CTCF's binding sites identified from ChIP-seq, cancer-specific open-chromatin regions (OCRs) identified from ATAC-seq of 17 cancer types, and oligodendrocytes-specific OCRs identified from scATAC-seq. Our results indicated that these new approaches effectively re-discover CTCF's cofactors, as well as cancer-specific and oligodendrocytes-specific master regulators implicated in disease and cell type development.
AB - Motivation: Analyzing the overlap between two sets of genomic intervals is a frequent task in the field of bioinformatics. Typically, this is accomplished by counting the number (or proportion) of overlapped regions, which applies an arbitrary threshold to determine if two genomic intervals are overlapped. By making binary calls but disregarding the magnitude of the overlap, such an approach often leads to biased, non-reproducible, and incomparable results. Results: We developed the cobind package, which incorporates six statistical measures: the Jaccard coefficient, Sørensen-Dice coefficient, Szymkiewicz-Simpson coefficient, collocation coefficient, pointwise mutual information (PMI), and normalized PMI. These measures allow for a quantitative assessment of the collocation strength between two sets of genomic intervals. To demonstrate the effectiveness of these methods, we applied them to analyze CTCF's binding sites identified from ChIP-seq, cancer-specific open-chromatin regions (OCRs) identified from ATAC-seq of 17 cancer types, and oligodendrocytes-specific OCRs identified from scATAC-seq. Our results indicated that these new approaches effectively re-discover CTCF's cofactors, as well as cancer-specific and oligodendrocytes-specific master regulators implicated in disease and cell type development.
UR - http://www.scopus.com/inward/record.url?scp=85169908026&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85169908026&partnerID=8YFLogxK
U2 - 10.1093/bioadv/vbad104
DO - 10.1093/bioadv/vbad104
M3 - Article
AN - SCOPUS:85169908026
SN - 2635-0041
VL - 3
JO - Bioinformatics Advances
JF - Bioinformatics Advances
IS - 1
M1 - vbad104
ER -