We develop fully automated pipelines for genome and exome sequencing analysis: all the way from FASTQ to biological insights.


A new generation of high-throughput sequencing (HTS) platforms has made it possible for individual laboratories to generate enormous amounts of DNA sequence data. However, there is a growing gap between the generation of massively parallel sequencing output and the ability to process, analyze and interpret the resulting data. Users new to sequencing are left to navigate a bewildering maze of base calling, alignment, assembly and analysis tools. Many software tools developed for sequencing data are not sufficiently robust and work on only one type of data generated from one type of sequencing experiment, which limits the biological insights that can be drawn from sequencing experiments. Additionally, many academic software tools are not well maintained or documented, perhaps because there is little motivation to support them after the methodology or software is published. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 or $100,000 analysis price tag. The bioinformatics challenge in fact prevents many biologists from embracing the new sequencing technology in their own research, for fear of having too much data to handle, a phenomenon that we refer to as the “genomic deluge”.

In the lab, we develop a suite of bioinformatics approaches to facilitate better understanding of the functional content and clinical insights from genome sequencing data. These tools include:

1. SeqMule

We developed a computational pipeline, SeqMule, and a number of helper methods, to perform automated variant calling from NGS data. SeqMule adds parallelization on top of the variant callers without requiring a computational cluster, and normalizes and intersects variant calls to generate a high-confidence consensus set. Currently, SeqMule integrates 5 alignment tools and 5 variant calling algorithms, and accepts various combinations of them through a one-line command, allowing highly flexible yet fully automated variant calling that addresses different user needs. On a machine with 2 Intel Xeon X5650 CPUs and 48 GB of memory, when fast turn-around is needed, SeqMule generates annotated VCF files within a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates a consensus call set that improves over single callers, as measured by both Mendelian error rate and consistency. In addition to integrating external tools, SeqMule provides automated quality checks, Mendelian error checks, consistency evaluation, efficient variant calling, variant normalization, integration of calls, and HTML-based summary and visualization of results.
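The normalization and intersection step can be illustrated with a minimal sketch. This is not SeqMule's code: the function names, the tuple layout `(chrom, pos, ref, alt)`, and the `min_callers` threshold are assumptions made for illustration, and the normalization shown only trims shared bases and the `chr` prefix (full left-alignment of indels would also need the reference sequence).

```python
from collections import Counter

def normalize(chrom, pos, ref, alt):
    """Reduce a call to a comparable form (illustrative, not full left-alignment)."""
    # Assumption: callers differ only in "chr" prefix, case, and padding bases.
    chrom = chrom[3:] if chrom.lower().startswith("chr") else chrom
    ref, alt = ref.upper(), alt.upper()
    # Trim a shared suffix, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim a shared prefix, advancing the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return (chrom, pos, ref, alt)

def consensus(call_sets, min_callers=2):
    """Return calls reported by at least `min_callers` of the call sets."""
    counts = Counter()
    for calls in call_sets:
        # A set ensures each caller votes at most once per variant.
        counts.update({normalize(*call) for call in calls})
    return sorted(key for key, n in counts.items() if n >= min_callers)

# Hypothetical calls from three callers on the same sample.
caller_a = [("chr1", 100, "A", "G"), ("chr1", 200, "CTT", "CT")]
caller_b = [("1", 100, "a", "g"), ("1", 200, "CT", "C")]
caller_c = [("1", 300, "T", "C")]

print(consensus([caller_a, caller_b, caller_c]))
# [('1', 100, 'A', 'G'), ('1', 200, 'CT', 'C')]
```

Setting `min_callers` to the number of callers gives a strict intersection; lower values trade confidence for sensitivity.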

2. HadoopCNV

Compared to SNP arrays, whole-genome sequencing (WGS) allows the interrogation of the genome at much finer resolution. In addition to small-scale SNPs and indels, WGS may also identify large-scale alterations such as copy number variations and other types of structural variants. Existing methods mostly rely on read depth, paired-end distance, or a combination of the two. To facilitate the interpretation of WGS data, we introduce new software that infers aberration events such as copy number changes and loss of heterozygosity from information encoded in both allelic and overall read depth. Resolving small regions in samples with deep coverage can be very time consuming because of the massive I/O cost, so our implementation is built on the MapReduce paradigm on top of a Hadoop framework, enabling multiple processors to efficiently process separate regions in parallel. We employ a Viterbi scoring algorithm to infer the most likely copy number/heterozygosity state for each region of the genome.
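As a rough illustration of Viterbi decoding over copy number states, the sketch below runs on a single machine and uses only the overall depth ratio per region. It is not HadoopCNV's model: the three states, the Gaussian emission with mean `state / 2`, the standard deviation, and the transition probabilities are all assumptions chosen for the example, and allelic depth and the MapReduce layer are omitted.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path for a sequence of observations."""
    # best[t][s]: best log score of any path that ends in state s at step t.
    best = [{s: log_start[s] + log_emit(s, observations[0]) for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit(s, obs)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Assumed toy model: copy number 1, 2 or 3; depth ratio is expected to be cn / 2.
states = [1, 2, 3]
stay, switch = math.log(0.98), math.log(0.01)
log_start = {s: math.log(1 / 3) for s in states}
log_trans = {p: {s: (stay if p == s else switch) for s in states} for p in states}

def log_emit(state, ratio):
    return gaussian_logpdf(ratio, state / 2, 0.15)

# Hypothetical depth ratios for nine consecutive regions.
ratios = [1.0, 1.05, 0.95, 0.5, 0.55, 0.45, 0.5, 1.0, 1.02]
print(viterbi(ratios, states, log_start, log_trans, log_emit))
# [2, 2, 2, 1, 1, 1, 1, 2, 2]
```

The high self-transition probability keeps a single noisy region from being called as its own event, while a run of consistently low ratios is decoded as a one-copy segment.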


3. ANNOVAR

To fill the growing gap between the large number of variants detected and the biological interpretation of variant calls, we developed the ANNOVAR (ANNOtate VARiation) software. ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human, mouse, worm, fly and many others) with different versions of genome builds. Given a list of variants with chromosome, start position, end position, reference nucleotide and observed nucleotide, ANNOVAR can perform gene-based annotation (identify whether SNPs or CNVs cause protein-coding changes and which amino acids are affected, or whether they are located in intergenic regions, with the distance to the two neighboring genes), region-based annotation (identify variants in specific genomic regions, such as regions conserved among 44 species or ChIP-Seq peaks), and filter-based annotation (identify variants that are previously annotated in public databases or that have particular functional deleteriousness scores).
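The idea behind region-based annotation can be sketched as an interval overlap test on the five-column variant description given above. This is an illustration only and does not reflect ANNOVAR's implementation or file formats: the function names, the region tuples, and the region labels are hypothetical, and both variants and regions are assumed to use 1-based inclusive coordinates.

```python
from collections import defaultdict

def build_index(regions):
    """Group (chrom, start, end, name) regions by chromosome."""
    index = defaultdict(list)
    for chrom, start, end, name in regions:
        index[chrom].append((start, end, name))
    return index

def annotate(variant, index):
    """Return names of regions that overlap a (chrom, start, end, ref, obs) variant."""
    chrom, start, end, _ref, _obs = variant
    # Two closed intervals overlap when each starts no later than the other ends.
    return [name for r_start, r_end, name in index.get(chrom, [])
            if r_start <= end and start <= r_end]

# Hypothetical regions, e.g. conserved elements or ChIP-Seq peaks.
regions = [
    ("1", 1000, 2000, "conserved_A"),
    ("1", 1500, 1600, "peak_B"),
    ("2", 50, 60, "peak_C"),
]
index = build_index(regions)

print(annotate(("1", 1550, 1550, "A", "G"), index))    # ['conserved_A', 'peak_B']
print(annotate(("1", 2001, 2003, "ACT", "-"), index))  # []
```

A linear scan per chromosome is enough for a small example; a sorted or binned index would be the natural choice for genome-scale region sets.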

4. Phenolyzer

Prior biological knowledge and phenotype information may help pinpoint disease-contributing genes in whole-genome and exome sequencing studies of human diseases. We developed a computational tool called Phenolyzer, which follows a biologist's natural thought process through four steps: term interpretation, seed gene generation, seed gene growth and data integration. Compared to competing approaches, Phenolyzer performs better at finding known disease genes and at prioritizing recently published novel disease genes.

5. SeqHBase

We developed a software framework called SeqHBase to help quickly identify disease genes from family-based sequencing studies. SeqHBase is built on the Apache Hadoop and HBase infrastructure and works in a distributed, parallel manner over multiple data nodes. Its input includes coverage information for 3 billion sites, over 3 million variant calls and their associated functional annotations for each person. With 20 data nodes, SeqHBase took about 5 seconds to analyze whole-exome sequencing data for a family quartet and 1 minute to analyze whole-genome sequencing data for a 10-member family.

Together, these software tools enable fast, accurate and reliable analysis of whole-genome and exome sequencing data from personal genomes.