We leverage a big data framework to detect structural variants from next-generation sequencing (NGS) data.
In addition to single nucleotide variants (SNVs) and small insertions or deletions (indels), whole-genome sequencing (WGS) data can also be used to identify large-scale alterations such as copy number variations (CNVs). Existing CNV detection methods mostly rely on read depth, paired-end distance, or a combination of the two. Additionally, resolving small regions in deeply sequenced WGS samples can be very time-consuming because of the massive I/O cost.
To facilitate CNV detection from WGS data, we developed HadoopCNV, a hidden Markov model-based algorithm that infers aberration events such as copy number changes and loss of heterozygosity from information encoded in both allelic and overall read depth. Our implementation is built on the Hadoop MapReduce paradigm, enabling multiple processors on multiple hosts to process separate genomic regions in parallel. We employ a Viterbi scoring algorithm to infer the most likely copy number state for each region of the genome. We applied HadoopCNV to a 10-member pedigree sequenced on an Illumina HiSeq sequencer. Overall, our method has a lower Mendelian inconsistency rate than competing approaches. It also has comparable performance on the NA12878 individual from the 1000 Genomes Project and on simulated data sets. Most importantly, our method scaled linearly in Hadoop clusters: analysis speed increases linearly with the number of virtual machines, and a human genome with 50X coverage requires 1.6 hours on a cluster of 30 virtual machines.
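To illustrate the Viterbi step described above, the following is a minimal sketch in Python, not the HadoopCNV implementation. It assumes a toy model that is not taken from the study: three copy number states (1, 2, 3), one observation per genomic bin given as the ratio of observed depth to expected diploid depth, Gaussian emissions centered at copy number / 2, and a fixed probability of staying in the same state. The names `viterbi`, `call_copy_number`, `SIGMA`, and `STAY` are hypothetical, and the sketch uses overall read depth only, ignoring the allelic depth that the method also exploits.

```python
import math

STATES = (1, 2, 3)   # assumed copy number states (toy model)
SIGMA = 0.15         # assumed standard deviation of the depth ratio
STAY = 0.98          # assumed probability of keeping the same state


def log_emit(copy_number, ratio):
    """Log density of a depth ratio under a Gaussian centered at copy_number / 2."""
    mean = copy_number / 2.0
    z = (ratio - mean) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2.0 * math.pi))


def viterbi(observations, states, log_start, log_trans, emit):
    """Return the most likely hidden state path for the observations."""
    # best[i][s]: highest log-probability of any path ending in state s at bin i
    best = [{s: log_start[s] + emit(s, observations[0]) for s in states}]
    back = []
    for obs in observations[1:]:
        prev = best[-1]
        scores, pointers = {}, {}
        for t in states:
            # best predecessor of state t at this bin
            p = max(states, key=lambda s: prev[s] + log_trans[s][t])
            scores[t] = prev[p] + log_trans[p][t] + emit(t, obs)
            pointers[t] = p
        best.append(scores)
        back.append(pointers)
    # trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def call_copy_number(ratios):
    """Decode copy number states for a list of per-bin depth ratios."""
    switch = (1.0 - STAY) / (len(STATES) - 1)
    log_start = {s: math.log(1.0 / len(STATES)) for s in STATES}
    log_trans = {
        s: {t: math.log(STAY if s == t else switch) for t in STATES}
        for s in STATES
    }
    return viterbi(ratios, STATES, log_start, log_trans, log_emit)


if __name__ == "__main__":
    ratios = [1.0, 1.02, 0.97, 0.5, 0.52, 0.48, 0.51, 1.0, 0.99, 1.01]
    print(call_copy_number(ratios))
```

Under these assumed parameters, a run of four bins near a ratio of 0.5 is decoded as a one-copy segment, whereas a single low bin surrounded by normal bins is absorbed into the two-copy state, because one outlier does not outweigh the cost of two state transitions. In a MapReduce setting, a function like `call_copy_number` could plausibly be applied to each genomic region independently, although how HadoopCNV partitions the work is not detailed here.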
The combination of high-resolution allele-specific read depth from WGS data and the Hadoop framework can yield efficient and accurate detection of whole-genome CNVs. Our study also highlights that Big Data frameworks such as Hadoop can be a future direction for genome analysis.