We develop fully automated pipelines for genome and exome sequencing analysis: all the way from FASTQ to biological insights.


A new generation of high-throughput sequencing (HTS) platforms has made it possible for individual laboratories to generate enormous amounts of DNA sequence data. However, there is a growing gap between the generation of massively parallel sequencing output and the ability to process, analyze and interpret the resulting data. Users new to sequencing are left to navigate a bewildering maze of base calling, alignment, assembly and analysis tools. Many software tools developed for sequencing data are not sufficiently robust and work on only one type of data generated from one type of sequencing experiment, which limits the biological insights that can be drawn from sequencing experiments. Additionally, many academic software tools are not well maintained or documented, perhaps because of a lack of motivation after the methodology or software is published. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 or $100,000 analysis price tag. The bioinformatics challenge in fact prevents many biologists from embracing the new sequencing technology in their own research, for fear of having too much data to handle, a phenomenon that we refer to as the “genomic deluge”.

In the lab, we develop a suite of bioinformatics approaches to facilitate better understanding of the functional content and clinical insights from genome sequencing data. These tools include:

1. SeqMule

We developed a computational pipeline, SeqMule, and a number of helper methods, to perform automated variant calling from NGS data. SeqMule provides parallelization that does not require a computational cluster, built on top of the variant callers, and facilitates normalization and intersection of variant calls to generate a high-confidence consensus set. Currently, SeqMule integrates 5 alignment tools and 5 variant calling algorithms, and accepts various combinations of them through a one-line command, allowing highly flexible yet fully automated variant calling to address different user needs. On a modern machine (2 Intel Xeon X5650 CPUs and 48GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates a consensus call set that improves over single callers, as measured by both Mendelian error rate and consistency. In addition to integrating external tools, SeqMule supports automated quality checks, Mendelian error checks, consistency evaluation, efficient variant calling, variant normalization, integration of calls, and HTML-based summary and visualization of results.
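The idea of normalizing calls and then intersecting them across callers can be illustrated with a minimal Python sketch. This is not SeqMule's code: the function names, the tuple representation of a variant and the `min_callers` threshold are assumptions made for illustration, and the normalization shown only trims shared bases (it does not left-align indels against a reference genome).

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant representation: (chromosome, 1-based position, ref, alt)
Variant = Tuple[str, int, str, str]


def normalize(chrom: str, pos: int, ref: str, alt: str) -> Variant:
    """Trim shared trailing bases, then shared leading bases.

    Simplified on purpose: real normalization also left-aligns indels,
    which requires the reference sequence.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return (chrom, pos, ref, alt)


def consensus(calls_by_caller: Dict[str, Iterable[Variant]],
              min_callers: int) -> Set[Variant]:
    """Keep variants reported by at least `min_callers` callers."""
    counts: Counter = Counter()
    for calls in calls_by_caller.values():
        # A set per caller, so one caller cannot vote twice for a variant.
        counts.update({normalize(*v) for v in calls})
    return {v for v, n in counts.items() if n >= min_callers}


if __name__ == "__main__":
    calls = {
        "caller_a": [("chr1", 100, "A", "G"), ("chr1", 200, "CTT", "CT")],
        "caller_b": [("chr1", 100, "A", "G"), ("chr1", 200, "CT", "C")],
        "caller_c": [("chr1", 300, "G", "T")],
    }
    for variant in sorted(consensus(calls, min_callers=2)):
        print(variant)
```

Setting `min_callers` to the number of callers gives a strict intersection; lower values trade confidence for sensitivity.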

2. HadoopCNV

Compared to SNP arrays, whole-genome sequencing (WGS) allows interrogation of the genome at much finer resolution. In addition to small-scale SNPs and indels, WGS may also identify large-scale alterations such as copy number variations and other types of structural variants. Existing methods mostly rely on read depth, paired-end distance, or a combination of the two. To facilitate the interpretation of WGS data, we introduce new software that infers aberration events such as copy number changes and loss of heterozygosity from information encoded in both allelic and overall read depth. Resolving small regions in samples with deep coverage can be very time-consuming because of the massive I/O cost. Our implementation is therefore built on the MapReduce paradigm on top of the Hadoop framework, enabling multiple processors to efficiently process separate regions in parallel. We employ a Viterbi scoring algorithm to infer the most likely copy number/heterozygosity state for each region of the genome.
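To make the Viterbi step concrete, the following is a minimal single-machine sketch in Python. It is not the HadoopCNV implementation: it uses overall read depth only (no allelic depth, so it cannot call loss of heterozygosity), and the three copy-number states, the Gaussian emission model and the `stay_prob` and `sd` parameters are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

# Hypothetical hidden states: total copy number of each genomic bin.
STATES = [1, 2, 3]


def log_gauss(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def viterbi_copy_number(depth_ratios: Sequence[float],
                        stay_prob: float = 0.99,
                        sd: float = 0.15) -> List[int]:
    """Most likely copy-number path for per-bin depth ratios.

    A depth ratio is observed depth divided by the expected diploid depth,
    so copy number cn is assumed to emit values centered on cn / 2.
    """
    if not depth_ratios:
        return []
    n = len(STATES)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def trans(i: int, j: int) -> float:
        return log_stay if i == j else log_switch

    # Uniform prior over states for the first bin.
    score = [math.log(1.0 / n) + log_gauss(depth_ratios[0], cn / 2, sd)
             for cn in STATES]
    backpointers = []
    for x in depth_ratios[1:]:
        new_score, pointers = [], []
        for j, cn in enumerate(STATES):
            best_i = max(range(n), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j)
                             + log_gauss(x, cn / 2, sd))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[i] for i in path]


if __name__ == "__main__":
    ratios = [1.0, 1.02, 0.98, 0.5, 0.52, 0.48, 0.5, 1.0, 1.01]
    print(viterbi_copy_number(ratios))
```

The high `stay_prob` makes the path resistant to a single noisy bin while still switching state when several consecutive bins support a different copy number.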

3. PennCNV-Seq

BACKGROUND: The use of high-throughput sequencing data has improved the results of genomic analysis because of the resolution of mapping algorithms. Although several tools for copy-number variation calling in whole genome sequencing have been published, the noisy nature of sequencing data still limits accuracy and concordance among such tools. To assess the performance of the original PennCNV algorithm, developed for array data, on whole genome sequencing data, we processed mapping (BAM) files to extract coverage, which stands in for the log R ratio (LRR) of signal intensity, and B allele frequency (BAF). RESULTS: We used the high-quality sample NA12878 from the recently reported NIST database and created 10 artificial samples with several CNVs spread across all chromosomes. We compared PennCNV-Seq with other tools on general deletions and duplications, as well as on different copy numbers and copy-neutral loss of heterozygosity (LOH). CONCLUSION: PennCNV-Seq was able to find the correct CNVs and can be integrated into existing CNV calling pipelines to accurately report the number of copies in specific genomic regions.
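One simple way to derive array-like signals from sequencing counts is sketched below in Python. This is an illustration under stated assumptions, not the PennCNV-Seq procedure: the baseline for LRR is taken to be the median depth across the supplied sites, and BAF is taken to be the fraction of reads carrying the alternate allele. Both function names are hypothetical.

```python
import math
from statistics import median
from typing import List, Optional, Sequence


def log_r_ratio(depths: Sequence[float]) -> List[float]:
    """LRR-like value per site: log2(depth / median depth).

    Assumption: the median depth stands in for the expected diploid signal.
    Sites with zero depth are reported as negative infinity.
    """
    baseline = median(depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return [math.log2(d / baseline) if d > 0 else float("-inf")
            for d in depths]


def b_allele_freq(ref_reads: int, alt_reads: int) -> Optional[float]:
    """BAF-like value: fraction of reads supporting the alternate allele.

    Returns None when the site has no coverage.
    """
    total = ref_reads + alt_reads
    return alt_reads / total if total else None


if __name__ == "__main__":
    print(log_r_ratio([30, 30, 15, 60, 30]))
    print(b_allele_freq(ref_reads=12, alt_reads=12))
```

Under this model a heterozygous deletion halves the depth (LRR near -1) and removes heterozygous BAF values near 0.5, which is the kind of pattern an array-based caller looks for.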


4. ANNOVAR

To fill the growing gap between the large number of variants and the biological interpretation of variant calls, we developed the ANNOVAR (ANNOtate VARiation) software. ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human, mouse, worm, fly and many others) with different versions of genome builds. Given a list of variants with chromosome, start position, end position, reference nucleotide and observed nucleotide, ANNOVAR can perform gene-based annotation (identify whether SNPs or CNVs cause protein coding changes and the amino acids that are affected, or whether they are located in intergenic regions, with the distance to the two neighboring genes), region-based annotation (identify variants in specific genomic regions such as conserved regions among 44 species or ChIP-Seq peaks), and filter-based annotation (identify variants that are previously annotated in public databases or that have particular functional deleteriousness scores).
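The concept behind region-based annotation can be shown with a short Python sketch. It is not ANNOVAR's code or command-line interface: the `Variant` and `Region` types and the `annotate_regions` function are hypothetical, coordinates are assumed to be 1-based and inclusive, and a linear scan replaces the indexing a real tool would use.

```python
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    """The five fields named above; the type itself is hypothetical."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


class Region(NamedTuple):
    """A named genomic interval, e.g. a conserved element or a peak."""
    chrom: str
    start: int
    end: int
    name: str


def annotate_regions(variants: List[Variant],
                     regions: List[Region]) -> List[Tuple[Variant, List[str]]]:
    """Pair each variant with the names of the regions it overlaps.

    Two closed intervals overlap when each one starts no later than
    the other one ends.
    """
    annotated = []
    for v in variants:
        hits = [r.name for r in regions
                if r.chrom == v.chrom and r.start <= v.end and v.start <= r.end]
        annotated.append((v, hits))
    return annotated


if __name__ == "__main__":
    variants = [Variant("chr1", 150, 150, "A", "G"),
                Variant("chr2", 500, 500, "C", "T")]
    regions = [Region("chr1", 100, 200, "conserved_1"),
               Region("chr2", 501, 600, "peak_1")]
    for variant, names in annotate_regions(variants, regions):
        print(variant, names)
```

Filter-based annotation follows the same pattern with an exact match on all five fields against a database of known variants, instead of an interval overlap.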

5. Phenolyzer

Prior biological knowledge and phenotype information may help pinpoint disease contributory genes in whole genome and exome sequencing studies on human diseases. We developed a computational tool called Phenolyzer, which follows a biologist's natural thought processes through four steps: term interpretation, seed gene generation, seed gene growth and data integration. Compared to competing approaches, Phenolyzer has superior performance on finding known disease genes, and on prioritizing recently published novel disease genes.

6. SeqHBase

We developed a software framework called SeqHBase to help quickly identify disease genes from family-based sequencing studies. SeqHBase is based on Apache Hadoop and HBase infrastructure, and works in a distributed and parallel manner over multiple data nodes. Its input includes coverage information for 3 billion sites, over 3 million variant calls and their associated functional annotations for each person. With 20 data nodes, SeqHBase took about 5 seconds to analyze whole-exome sequencing data for a family quartet and 1 minute to analyze whole-genome sequencing data for a 10-member family.

7. InterVar

In 2015, the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) published updated standards and guidelines for the clinical interpretation of sequence variants with respect to human diseases on the basis of 28 criteria. However, variability between individual interpreters can be extensive, for reasons such as differing understandings of these guidelines and the lack of standard algorithms for implementing them, and computational tools for semi-automated variant interpretation are not available. To address these problems, we propose a suite of methods for implementing these criteria and have developed a tool called InterVar to help human reviewers interpret the clinical significance of variants. InterVar can take a pre-annotated file or a VCF file as input and generate an automated interpretation on 18 criteria. Furthermore, we have developed a companion web server, wInterVar, to enable user-friendly variant interpretation with an automated interpretation step and a manual adjustment step. These tools are especially useful for addressing severe congenital or very early-onset developmental disorders with high penetrance. Using results from a few published sequencing studies, we demonstrate the utility of InterVar in significantly reducing the time to interpret the clinical significance of sequence variants.
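Once individual criteria have been scored, they are combined into one of five classifications. The Python sketch below encodes the combining rules as we understand them from the 2015 ACMG/AMP guideline; it is not InterVar's implementation, the `classify` function and its argument names are hypothetical, and the rules should be checked against the published guideline before any real use. Inputs are counts of met criteria at each evidence strength (pathogenic: very strong, strong, moderate, supporting; benign: stand-alone, strong, supporting).

```python
def classify(pvs: int = 0, ps: int = 0, pm: int = 0, pp: int = 0,
             ba: int = 0, bs: int = 0, bp: int = 0) -> str:
    """Combine counts of met criteria into a five-tier classification."""
    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3
                         or (pm >= 2 and pp >= 2)
                         or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Contradictory pathogenic and benign evidence is left unresolved.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "Uncertain significance"


if __name__ == "__main__":
    print(classify(pvs=1, ps=1))
    print(classify(pm=3))
    print(classify(bp=2))
```

Keeping the combination step separate from criterion scoring matches the two-step workflow described above: a reviewer can adjust individual criteria manually and then recompute the classification.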

8. EHR-Phenolyzer

Integration of detailed phenotypic information with genetic data is well established to facilitate accurate diagnosis of hereditary disorders. As a rich source of phenotypic information, electronic health records (EHRs) could empower diagnostic variant interpretation. However, how to accurately and efficiently extract phenotypes from heterogeneous EHR narratives remains a challenge. Here we present EHR-Phenolyzer, a high-throughput EHR phenotype extraction and analysis framework. EHR-Phenolyzer performs Human Phenotype Ontology (HPO) concept extraction and normalization from EHR narratives, then prioritizes disease genes based on the HPO-coded phenotypic manifestations. In summary, EHR-Phenolyzer leverages EHR data to automate phenotype-driven analysis of clinical exomes or genomes, facilitating the implementation of genomic medicine at scale.
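Concept extraction and normalization can be pictured as mapping free-text mentions onto HPO identifiers. The Python sketch below uses plain dictionary matching and is far simpler than what EHR-Phenolyzer does: the `LEXICON` is a tiny illustrative sample (its identifiers should be verified against the current HPO release), the `extract_hpo` function is hypothetical, and negation, misspellings and abbreviations are not handled.

```python
import re
from typing import Dict, List

# Illustrative phrase-to-HPO mapping; an assumption, not a complete lexicon.
LEXICON: Dict[str, str] = {
    "seizure": "HP:0001250",
    "seizures": "HP:0001250",
    "microcephaly": "HP:0000252",
    "global developmental delay": "HP:0001263",
}


def extract_hpo(note: str) -> List[str]:
    """Return the sorted, de-duplicated HPO IDs mentioned in a note.

    Matching is case-insensitive and restricted to whole words.
    """
    lowered = note.lower()
    found = set()
    for phrase, hpo_id in LEXICON.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(hpo_id)
    return sorted(found)


if __name__ == "__main__":
    note = "Patient presents with Seizures and microcephaly since infancy."
    print(extract_hpo(note))
```

The resulting list of HPO identifiers is the kind of normalized input a phenotype-driven gene prioritization step would consume.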

9. Integration of genomic data with EHRs

We are outlining strategies to integrate genomic information within EHRs, which would facilitate precision treatments for patients affected by genetic diseases. We are also developing automated systems to integrate EHR-Phenolyzer into EHRs, allowing physicians to specify patient phenotypes in the requisition form when ordering a genetic test. These clinical informatics tools will greatly facilitate the incorporation of genomic information in healthcare, ultimately enabling the implementation of genomic medicine.

10. EHR-based drug discovery and repurposing

We are developing a generalized platform for computational drug repositioning using longitudinal information stored in EHRs, which allows the discovery of approved drugs that have unintended beneficial roles in a different disease.


Together, these software tools enable fast, accurate and reliable analysis of whole-genome and whole-exome sequencing data from personal genomes.