SeqHBase: A big data toolset for family-based sequencing data analysis.

 

Introduction

This is a collaborative project with Dr. Max He at Marshfield Clinic.

Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are widely used to identify disease mutations in human genetics studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Therefore, tools to efficiently manipulate genome-wide variants, functional annotations, and coverage are needed. Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop, we developed SeqHBase, a big data-based toolset for analyzing family-based sequencing data to detect de novo, inherited homozygous or compound heterozygous mutations that may be disease causal. SeqHBase takes as input BAM files (for coverage at every site), VCF files (for variant calls) and functional annotations by ANNOVAR (for variant prioritization). We demonstrate SeqHBase’s efficiency and scalability with analyses of WGS data on a 5-member nuclear family and a 10-member three-generation family, as well as WES data on a 4-member nuclear family. SeqHBase is almost linearly scalable with the number of data nodes; with 20 nodes, SeqHBase took ~1 minute for analyzing WGS family data and ~5 seconds for WES family data, illustrating the ability to scale up to a high volume of sequencing data.
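The three inheritance patterns named above can be illustrated with a small sketch. This is not SeqHBase's implementation (which runs on Hadoop); it is a simplified, single-machine illustration for one child/father/mother trio. The function names, the genotype encoding (count of alternative alleles), and the gene labels are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Assumed encoding: a genotype is the count of alternative alleles.
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternative.


def classify_trio_site(child: int, father: int, mother: int) -> str:
    """Label one variant site in a child/father/mother trio (simplified rules)."""
    # De novo: the child carries the variant but neither parent does.
    if child >= 1 and father == 0 and mother == 0:
        return "de_novo"
    # Inherited homozygous: the child has two copies, one from each carrier parent.
    if child == 2 and father >= 1 and mother >= 1:
        return "inherited_homozygous"
    return "other"


def compound_het_genes(sites: Iterable[Tuple[str, int, int, int]]) -> List[str]:
    """Return genes with one heterozygous variant inherited from each parent.

    Each item is (gene, child, father, mother) using the encoding above.
    """
    paternal, maternal = set(), set()
    for gene, child, father, mother in sites:
        if child != 1:
            continue
        if father == 1 and mother == 0:
            paternal.add(gene)   # variant transmitted by the father only
        elif father == 0 and mother == 1:
            maternal.add(gene)   # variant transmitted by the mother only
    # A gene qualifies only if it has at least one variant from each side.
    return sorted(paternal & maternal)
```

For example, `classify_trio_site(1, 0, 0)` labels a site as de novo, and a gene appears in the compound heterozygous list only when the child has at least one paternally inherited and one maternally inherited heterozygous variant in it.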

Features

SeqHBase extracts the following information from sequencing data and then incorporates family structure data to infer disease-causal variants in families.

| Data Source | Data Type | Extracted Information |
|---|---|---|
| Annotated variant files | Annotation | Chromosome, start position, end position, reference allele, alternative allele, allele frequency in the 1000 Genomes Project and the NHLBI-ESP6500 project, ClinVar, biological function (such as SIFT, PolyPhen and CADD score), and many others |
| VCF files | Variation | Sample family ID, individual ID, called variant genotypes, read depths, and Phred quality scores |
| BAM files | Coverage (read depth) | Coverage of each site of every sequencing sample (~3 billion sites in a WGS) |
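One way the three data sources above could be combined is sketched below. It assumes that per-site coverage is used to decide whether a sample with no variant call at a site can be trusted as homozygous reference, and that the annotated allele frequency is used to keep only rare variants. The thresholds, dictionary layouts, and function names are hypothetical and are not taken from SeqHBase.

```python
from typing import Dict, Optional, Tuple

Variant = Tuple[str, int, str, str]   # (chromosome, position, ref allele, alt allele)
Site = Tuple[str, int]                # (chromosome, position)

MIN_DEPTH = 10        # hypothetical minimum read depth
MAX_ALT_FREQ = 0.01   # hypothetical population allele frequency cutoff


def is_confident_hom_ref(
    sample: str,
    variant: Variant,
    calls: Dict[Tuple[str, Variant], int],   # from VCF: alt-allele count per sample
    depth: Dict[Tuple[str, Site], int],      # from BAM: read depth per sample and site
    min_depth: int = MIN_DEPTH,
) -> bool:
    """Treat a missing call as homozygous reference only if the site is well covered."""
    if calls.get((sample, variant), 0) != 0:
        return False  # the sample carries the variant
    chrom, pos, _ref, _alt = variant
    return depth.get((sample, (chrom, pos)), 0) >= min_depth


def is_rare(
    variant: Variant,
    alt_freq: Dict[Variant, Optional[float]],  # from annotation files
    max_freq: float = MAX_ALT_FREQ,
) -> bool:
    """Keep variants that are absent from, or rare in, the population data."""
    freq = alt_freq.get(variant)
    return freq is None or freq <= max_freq
```

Under these assumptions, a de novo candidate in a child would be kept only if both parents pass `is_confident_hom_ref` and the variant passes `is_rare`; a parent with low coverage at the site cannot rule out an inherited variant.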

 

Workflow

Availability

SeqHBase is available at http://seqhbase.omicspace.org/.

Reference

He et al. SeqHBase: a big data toolset for family-based sequencing data analysis. Submitted