Scalable gene sequence analysis on Spark

Muthahar Syed; Taehyun Hwang; Jinoh Kim

doi:10.1007/978-3-319-63917-8_6

Artificial Intelligence and Informatics

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Scientific and technological advances have made it possible to digitize genetic information, producing an enormous volume of genetic sequences; analyzing such large-scale sequencing data is now a primary concern. This chapter introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark SQL (Structured Query Language). By holding data in memory, the Spark framework allows efficient data reuse, which increases performance substantially. The system also provides a web-based interface through which users specify search criteria, and Spark SQL performs the search on the data held in memory. The experiments detailed in this chapter use publicly available 1000 Genomes Variant Call Format (VCF) data (1.2 TB) as input. The input data are analyzed using Spark, and the results are evaluated to measure the scalability and performance of the system.
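The query pattern the abstract describes (load variant records once, keep them in memory, then answer user-specified searches with SQL) can be illustrated with a minimal sketch. This is not the chapter's code: it substitutes Python's standard-library `sqlite3` in-memory database for Spark SQL so that it runs without a cluster, and the sample records, the `variants` table, and the `parse_vcf`, `load_variants`, and `search` helpers are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical sample in VCF layout (tab-separated); NOT real 1000 Genomes data.
SAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t10177\trs0001\tA\tAC\t100\tPASS\tAF=0.42
1\t10352\trs0002\tT\tTA\t100\tPASS\tAF=0.44
2\t10235\trs0003\tT\tTA\t100\tPASS\tAF=0.001
"""


def parse_vcf(text):
    """Yield (chrom, pos, id, ref, alt) tuples, skipping '#' header lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt = line.split("\t")[:5]
        yield chrom, int(pos), vid, ref, alt


def load_variants(text):
    """Load parsed records into an in-memory table (stand-in for cached Spark data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants "
        "(chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", parse_vcf(text))
    return conn


def search(conn, chrom, start, end):
    """Return variants on one chromosome within an inclusive position range."""
    rows = conn.execute(
        "SELECT id, pos, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return rows.fetchall()


conn = load_variants(SAMPLE_VCF)
print(search(conn, "1", 10000, 10300))  # [('rs0001', 10177, 'A', 'AC')]
```

The data is loaded once and every later search reuses the same in-memory table, which is the reuse property the abstract attributes to Spark. A Spark-based version would presumably cache a DataFrame and query it through Spark SQL across a cluster; how the chapter actually does this is not stated in this record.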

Original language	English (US)
Title of host publication	Big Data and Visual Analytics
Publisher	Springer International Publishing
Pages	97-113
Number of pages	17
ISBN (Electronic)	9783319639178
ISBN (Print)	9783319639154
DOIs	https://doi.org/10.1007/978-3-319-63917-8_6
State	Published - Jan 15 2018

ASJC Scopus subject areas

General Computer Science
General Mathematics

Cite this

@inbook{c6ced4e650524285b87aae90a4a34cff,
  title = "Scalable gene sequence analysis on {S}park",
  abstract = "Scientific and technological advances have made it possible to digitize genetic information, producing an enormous volume of genetic sequences; analyzing such large-scale sequencing data is now a primary concern. This chapter introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark SQL (Structured Query Language). By holding data in memory, the Spark framework allows efficient data reuse, which increases performance substantially. The system also provides a web-based interface through which users specify search criteria, and Spark SQL performs the search on the data held in memory. The experiments detailed in this chapter use publicly available 1000 Genomes Variant Call Format (VCF) data (1.2 TB) as input. The input data are analyzed using Spark, and the results are evaluated to measure the scalability and performance of the system.",
  author = "Muthahar Syed and Taehyun Hwang and Jinoh Kim",
  note = "Publisher Copyright: {\textcopyright} Springer International Publishing AG 2017.",
  year = "2018",
  month = jan,
  day = "15",
  doi = "10.1007/978-3-319-63917-8_6",
  language = "English (US)",
  isbn = "9783319639154",
  pages = "97--113",
  booktitle = "Big Data and Visual Analytics",
  publisher = "Springer International Publishing",
}

TY  - CHAP
T1  - Scalable gene sequence analysis on Spark
AU  - Syed, Muthahar
AU  - Hwang, Taehyun
AU  - Kim, Jinoh
PY  - 2018/1/15
Y1  - 2018/1/15
AB  - Scientific and technological advances have made it possible to digitize genetic information, producing an enormous volume of genetic sequences; analyzing such large-scale sequencing data is now a primary concern. This chapter introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark SQL (Structured Query Language). By holding data in memory, the Spark framework allows efficient data reuse, which increases performance substantially. The system also provides a web-based interface through which users specify search criteria, and Spark SQL performs the search on the data held in memory. The experiments detailed in this chapter use publicly available 1000 Genomes Variant Call Format (VCF) data (1.2 TB) as input. The input data are analyzed using Spark, and the results are evaluated to measure the scalability and performance of the system.
UR  - http://www.scopus.com/inward/record.url?scp=85054943329&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85054943329&partnerID=8YFLogxK
U2  - 10.1007/978-3-319-63917-8_6
DO  - 10.1007/978-3-319-63917-8_6
M3  - Chapter
AN  - SCOPUS:85054943329
SN  - 9783319639154
SP  - 97
EP  - 113
BT  - Big Data and Visual Analytics
PB  - Springer International Publishing
ER  -
