Scalable gene sequence analysis on Spark

Muthahar Syed, Taehyun Hwang, Jinoh Kim

Research output: Chapter in Book/Report/Conference proceeding › Chapter


Advances in sequencing technology have made it possible to digitize genetic information, producing very large volumes of genetic sequence data, and analyzing sequencing data at this scale is a primary concern. This chapter introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark SQL (Spark Structured Query Language). Spark supports efficient data reuse by holding data in memory, which increases performance substantially. The system also provides a web-based interface through which users specify search criteria, and Spark SQL performs the search operations on the data held in memory. The experiments detailed in this chapter use publicly available 1000 Genomes Project Variant Call Format (VCF) data (1.2 TB) as input. The input data are analyzed using Spark, and the results are evaluated to measure the scalability and performance of the system.
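The search flow described above (load VCF records into memory, then answer user-specified criteria with SQL) can be sketched in a few lines. This is only an illustration under stated assumptions, not the chapter's implementation: Python's standard-library `sqlite3` in-memory database stands in for a cached Spark SQL table, the function names (`parse_vcf_line`, `load_variants`, `search`) and column names are hypothetical, and the sample records are invented for the example.

```python
import sqlite3

# Invented sample records in VCF layout (tab-delimited); not real 1000 Genomes data.
SAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t50\tvar1\tA\tT\t100\tPASS\tAF=0.10",
    "1\t150\tvar2\tG\tA\t100\tPASS\tAF=0.25",
    "2\t150\tvar3\tC\tG\t100\tPASS\tAF=0.05",
]


def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed fields (sample columns are dropped)."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return (chrom, int(pos), vid, ref, alt, qual, flt, info)


def load_variants(lines):
    """Load VCF data lines into an in-memory table.

    Assumption: sqlite3 ':memory:' is a stand-in for a Spark SQL table cached in memory.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants ("
        "chrom TEXT, pos INTEGER, variant_id TEXT, ref TEXT, alt TEXT, "
        "qual TEXT, filter_status TEXT, info TEXT)"
    )
    # Skip blank lines and the '#'-prefixed meta/header lines.
    rows = [parse_vcf_line(l) for l in lines if l.strip() and not l.startswith("#")]
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def search(conn, chrom, start, end):
    """Return variants on `chrom` with start <= pos <= end, ordered by position."""
    cur = conn.execute(
        "SELECT chrom, pos, variant_id, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = load_variants(SAMPLE_VCF)
    for row in search(conn, "1", 100, 200):
        print(row)
```

The table is loaded once and then queried repeatedly, which mirrors the data-reuse idea in the abstract: the parsing cost is paid up front, and each search only touches data already held in memory.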

Original language: English (US)
Title of host publication: Big Data and Visual Analytics
Publisher: Springer International Publishing
Number of pages: 17
ISBN (Electronic): 9783319639178
ISBN (Print): 9783319639154
State: Published - Jan 15 2018

ASJC Scopus subject areas

  • Computer Science(all)
  • Mathematics(all)

