We develop fully automated pipelines for genome and exome sequencing analysis: all the way from raw sequencing data to biological insights.


A new generation of high-throughput sequencing (HTS) platforms has made it possible for individual laboratories to generate enormous amounts of DNA sequence data. However, there is a growing gap between the generation of massively parallel sequencing output and the ability to process, analyze and interpret the resulting data. In the lab, we develop a suite of bioinformatics approaches that help extract functional content and clinical insights from genome sequencing data. Some examples include:

1. SeqMule

We developed a computational pipeline, SeqMule, and a number of helper methods, to perform automated variant calling from NGS data. SeqMule provides parallelization that does not require a computational cluster, built on top of the variant callers, and supports normalization and intersection of variant calls to generate a high-confidence consensus set. Currently, SeqMule integrates 5 alignment tools and 5 variant calling algorithms, and accepts various combinations of them through a single one-line command, allowing highly flexible yet fully automated variant calling to address different user needs. On a modern machine (2 Intel Xeon X5650 CPUs and 48GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates a consensus call set that improves over single callers, as measured by both Mendelian error rate and consistency. In addition to integrating external tools, SeqMule provides automated quality checks, Mendelian error checks, consistency evaluation, efficient variant calling, variant normalization, integration of calls, and HTML-based summary and visualization of results. For more details, see https://www.nature.com/articles/srep14283.
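The idea of normalizing and intersecting calls from several callers can be illustrated with a minimal Python sketch. This is not SeqMule's code: the function names `normalize` and `consensus`, the `min_support` threshold, and the tuple layout `(chrom, pos, ref, alt)` are assumptions made for illustration, and the normalization shown only trims shared bases (full left-alignment would also need the reference sequence).

```python
from collections import Counter


def normalize(variant):
    """Trim bases shared by REF and ALT so equivalent calls compare equal.

    Hypothetical helper: trimming only, no reference-based left-alignment.
    """
    chrom, pos, ref, alt = variant
    # Trim a shared suffix, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim a shared prefix, advancing the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return (chrom, pos, ref, alt)


def consensus(call_sets, min_support=2):
    """Return variants reported by at least `min_support` callers."""
    counts = Counter()
    for calls in call_sets:
        # A set ensures each caller votes at most once per variant.
        counts.update({normalize(v) for v in calls})
    return sorted(v for v, n in counts.items() if n >= min_support)


if __name__ == "__main__":
    # Toy call sets from three hypothetical callers.
    caller_a = [("chr1", 100, "A", "G"), ("chr1", 200, "CTT", "CT")]
    caller_b = [("chr1", 100, "A", "G"), ("chr1", 200, "CT", "C")]
    caller_c = [("chr1", 300, "G", "T")]
    for variant in consensus([caller_a, caller_b, caller_c]):
        print(variant)
```

Raising `min_support` trades sensitivity for confidence; a strict intersection corresponds to setting it to the number of callers.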

2. PennCNV-Seq

Although several tools for calling copy-number variations (CNVs) from whole-genome sequencing data have been published, the noisy nature of sequencing data still limits the accuracy of such tools and the concordance among them. To assess how the original PennCNV algorithm, which was designed for array data, performs on whole-genome sequencing data, we processed mapping (BAM) files to extract coverage, which stands in for the log R ratio (LRR) of signal intensity, and the B allele frequency (BAF). This method, called PennCNV-Seq, was able to find correct CNVs and can be integrated into existing CNV calling pipelines to accurately report the number of copies in specific genomic regions. For more details, see https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1802-x.
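As a rough illustration of how sequencing coverage can be turned into array-like signals, the Python sketch below computes an LRR-like value per window and a BAF per heterozygous site. The formulas are assumptions chosen for clarity (log2 of depth over the sample median depth, and alternate-allele reads over total reads); the normalization actually used by PennCNV-Seq may differ, and the function names are hypothetical.

```python
import math
from statistics import median


def log_r_ratio(depths):
    """LRR-like signal: log2 of each window's depth over the median depth.

    Assumed formula for illustration; zero-depth windows map to -inf.
    """
    reference = median(depths)
    if reference <= 0:
        raise ValueError("median depth must be positive")
    return [math.log2(d / reference) if d > 0 else float("-inf") for d in depths]


def b_allele_frequency(ref_reads, alt_reads):
    """BAF-like signal: fraction of reads supporting the alternate allele."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else None


if __name__ == "__main__":
    window_depths = [30, 30, 15, 60, 30]  # toy per-window mean coverage
    print(log_r_ratio(window_depths))     # halved depth -> -1, doubled -> +1
    print(b_allele_frequency(10, 10))     # balanced heterozygous site
```

In this toy input, a window with half the median depth gives a value of -1 and a window with double the median depth gives +1, which mirrors how deletions and duplications shift the LRR on arrays.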

3. HadoopCNV

Compared to SNP arrays, whole-genome sequencing (WGS) allows the interrogation of the genome at much finer resolution. In addition to small-scale SNPs and indels, WGS may also identify large-scale alterations such as copy number variations and other types of structural variants. Existing methods mostly rely on read depth, paired-end distance, or a combination of the two. To facilitate the interpretation of WGS data, we introduce new software that infers aberration events such as copy number changes and loss of heterozygosity from the information encoded in both allelic and overall read depth. In particular, resolving small regions in samples with deep coverage can be very time consuming due to the massive I/O cost. Our implementation is built on the MapReduce paradigm on top of a Hadoop framework, enabling multiple processors to efficiently process separate regions in parallel. We employ a Viterbi scoring algorithm to infer the most likely copy number/heterozygosity state for each region of the genome. For more details, see https://www.biorxiv.org/content/early/2017/04/05/124339.
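The Viterbi step can be sketched in Python with a toy hidden Markov model. Everything specific here is an assumption for illustration: the three states, the discretized depth observations ("low", "mid", "high"), and all probabilities are made up, and the sketch uses overall depth only, whereas the software described above also uses allelic depth. It is a generic Viterbi decoder, not HadoopCNV's implementation.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state path for a sequence of observations.

    Works in log space; assumes all probabilities used are non-zero.
    """
    # best[t][s]: log-probability of the best path ending in state s at step t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model (all values are illustrative assumptions).
STATES = ["loss", "normal", "gain"]
START = {"loss": 0.1, "normal": 0.8, "gain": 0.1}
TRANS = {a: {b: 0.9 if a == b else 0.05 for b in STATES} for a in STATES}
EMIT = {
    "loss": {"low": 0.8, "mid": 0.15, "high": 0.05},
    "normal": {"low": 0.1, "mid": 0.8, "high": 0.1},
    "gain": {"low": 0.05, "mid": 0.15, "high": 0.8},
}

if __name__ == "__main__":
    depth_bins = ["mid", "mid", "low", "low", "low", "low", "mid", "mid"]
    print(viterbi(depth_bins, STATES, START, TRANS, EMIT))
```

Because each genomic region can be decoded independently, a function like this is the kind of per-region work that a MapReduce job can distribute across processors.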


4. ANNOVAR

To bridge the growing gap between the large number of variants and the biological interpretation of variant calls, we developed the ANNOVAR (ANNOtate VARiation) software. ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human, mouse, worm, fly and many others) with different versions of genome builds. Given a list of variants with chromosome, start position, end position, reference nucleotide and observed nucleotide, ANNOVAR can perform gene-based annotation (identify whether SNPs or CNVs cause protein coding changes and which amino acids are affected, or whether they are located in intergenic regions, with the distance to the two neighboring genes), region-based annotation (identify variants in specific genomic regions such as conserved regions among 44 species or ChIP-Seq peaks), and filter-based annotation (identify variants that are previously annotated in public databases or that have particular functional deleteriousness scores). For more details, see https://academic.oup.com/nar/article/38/16/e164/1749458.
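To make the five-column input and the notion of region-based annotation concrete, here is a minimal Python sketch. It is not ANNOVAR's implementation (ANNOVAR is a separate software package): the `Variant` and `Region` types, the function names, the region labels, and the assumption that both variants and regions use 1-based inclusive coordinates are all hypothetical choices for illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    label: str


def parse_variant(line):
    """Parse 'chrom start end ref alt' from a whitespace-separated line."""
    chrom, start, end, ref, alt = line.split()[:5]
    return Variant(chrom, int(start), int(end), ref, alt)


def annotate_regions(variant, regions):
    """Return labels of regions overlapping the variant.

    Assumes 1-based inclusive coordinates for variants and regions alike.
    """
    return [
        r.label
        for r in regions
        if r.chrom == variant.chrom
        and r.start <= variant.end
        and variant.start <= r.end
    ]


if __name__ == "__main__":
    # Toy regions; labels and coordinates are made up.
    regions = [
        Region("1", 948900, 949000, "conserved_element"),
        Region("1", 950000, 951000, "chip_peak"),
    ]
    variant = parse_variant("1 948921 948921 T C")
    print(variant, annotate_regions(variant, regions))
```

A linear scan is enough for a handful of regions; genome-scale region sets would normally be indexed (for example by chromosome and position bins) so that each variant is compared against only nearby regions.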

5. Phenolyzer

Prior biological knowledge and phenotype information may help pinpoint disease contributory genes in whole genome and exome sequencing studies on human diseases. We developed a computational tool called Phenolyzer, which follows a biologist's natural thought processes through four steps: term interpretation, seed gene generation, seed gene growth and data integration. Compared to competing approaches, Phenolyzer has superior performance on finding known disease genes, and on prioritizing recently published novel disease genes. For more details, see https://www.nature.com/articles/nmeth.3484.

6. SeqHBase

We developed a software framework called SeqHBase to help quickly identify disease genes from family-based sequencing studies. SeqHBase is built on Apache Hadoop and HBase infrastructure and works in a distributed and parallel manner over multiple data nodes. Its input includes coverage information for 3 billion sites, over 3 million variant calls, and their associated functional annotations for each person. With 20 data nodes, SeqHBase took about 5 seconds to analyze whole-exome sequencing data for a family quartet and 1 minute to analyze whole-genome sequencing data for a 10-member family. For more details, see https://jmg.bmj.com/content/52/4/282.long.

7. SparkText

Many new biomedical research articles are published every day, accumulating rich information on topics such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining of large-scale scientific literature can uncover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. We designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance in classifying cancer types, we extracted information (e.g., on breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research. For more details, see https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0162721.
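As a small-scale illustration of one of the classifiers named above, the following Python sketch implements a multinomial Naïve Bayes text classifier with add-one smoothing. It is a single-machine toy, not SparkText: it does not use Apache Spark or Cassandra, the function names are hypothetical, and the three training sentences are invented placeholders rather than data from the study.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def train(documents):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in documents:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_docs": class_docs, "word_counts": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokenize(text):
            if token in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Invented toy training sentences, one per cancer type.
    toy_corpus = [
        ("BRCA1 mutation in mammary tumor", "breast"),
        ("PSA screening androgen receptor", "prostate"),
        ("EGFR smoking tumor", "lung"),
    ]
    model = train(toy_corpus)
    print(predict(model, "BRCA1 tumor"))
    print(predict(model, "EGFR smoking"))
```

The counting in `train` is the part that parallelizes naturally: per-document word counts can be computed independently and then summed, which is the pattern a distributed framework exploits on tens of thousands of articles.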

8. EHR-Phenolyzer

Clinicians enter a patient history, including a description of the patient's physical traits (their phenotype), into the electronic health record (EHR) using natural language. The EHR-Phenolyzer extracts phenotypic information from the EHR and translates it into standardized terms. It then searches for genes known to be associated with that phenotype and generates a list of under 1000 genes, one of which has a high probability of underlying the patient's condition. Incorporating phenotypic information improves the quality of the genetic findings. Using natural language processing to probe EHRs automates the process and makes genetic diagnosis faster and more efficient. For more details, see https://www.cell.com/ajhg/fulltext/S0002-9297(18)30171-X.

Together, these software tools enable fast, accurate and reliable analysis of whole-genome and whole-exome sequencing data from personal genomes.