Graduate Student in Biostatistics, University of Southern California, 2012

B.S. in Biological Sciences, Nanjing University, China, 2011

Current Position

Software engineer at Roche Diagnostics


Research Area

SeqMule: automated pipeline for variant detection

Enlight: Integrating GWAS results with epigenetic and functional annotation

Hierarchical modeling for selecting variants in GWAS


Research Summary


SeqMule performs a series of automated steps for identifying variants from next-generation sequencing (NGS) data. It requires minimal human involvement, while sophisticated fine-tuning remains possible. It integrates four popular alignment tools and five popular variant-calling algorithms, and allows various combinations of them. The intersection of the variant sets generated by different programs can be extracted to achieve higher accuracy. Multi-sample variant calling is possible when the input files come from a single family. Analysis of a family trio with an autosomal recessive Mendelian disease demonstrated the effectiveness of the approach in identifying causal variants. The following figure shows its workflow. SeqMule is available at

[Figure: SeqMule workflow]
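The idea of intersecting variant sets from several callers can be illustrated with a minimal Python sketch. This is not SeqMule's own code or interface: the function name `consensus_variants`, the `(chrom, pos, ref, alt)` tuple representation, and the toy call sets are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation of a variant: (chromosome, position, ref, alt).
Variant = Tuple[str, int, str, str]


def consensus_variants(call_sets: Iterable[Set[Variant]]) -> Set[Variant]:
    """Return the variants reported by every caller (plain set intersection)."""
    call_sets = list(call_sets)
    if not call_sets:
        return set()
    return set.intersection(*call_sets)


# Toy call sets standing in for the output of three variant callers.
caller_a = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")}
caller_b = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A"), ("chr3", 10, "T", "C")}
caller_c = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A")}

shared = consensus_variants([caller_a, caller_b, caller_c])
print(sorted(shared))
```

Keeping only variants that every caller agrees on trades some sensitivity for fewer caller-specific false positives, which is the accuracy gain described above.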


Identifying causal variants remains a key challenge in the post-GWAS (genome-wide association study) era, since many GWAS SNPs fall into non-coding regions, making it especially difficult to associate statistical significance with predicted functionality. Therefore, we created a web-based tool, Enlight, which overlays functional annotation information, such as histone modification states, methylation patterns, and transcription factor binding peaks, onto GWAS results. The following diagram shows its workflow. Enlight is available at

[Figure: Enlight server architecture]

Hierarchical Modeling

Association tests on multiple levels of regions, such as genes and biological pathways, provide an alternative to conventional genome-wide association studies (GWAS). Compared with single-SNP methods, this approach may have increased power, can facilitate direct comparison across different populations or genotyping platforms, and can ease interpretation. Quintana and Conti (Genetic Epidemiology, 2011) proposed combining Bayesian model uncertainty with a hierarchical model to construct a Bayesian risk index that measures the strength of regional association. In their method, various sources of biological annotation can be incorporated into the model to help select variants, and Metropolis-Hastings (MH) and Gibbs sampling are used for inference. There are, however, two major hurdles for this approach. First, the computational cost of the MH method grows cubically with the number of variants, limiting its scalability to next-generation sequencing data. Second, the approach is limited to a few sources of biological information, and it is unclear how to select annotations from the thousands of options available.


Here we alter the algorithm to make the computation more efficient. At the beginning of the MH procedure, reversible jump Markov chain Monte Carlo is used to select a subset of variants, which constitutes a more plausible model, as a starting point. In each iteration of the stochastic search, the proposal kernel is more likely to add a variant that is close to an elevated signal of activity or in high linkage disequilibrium with a nearby GWAS hit. To select prior information, we applied the statistical model (fgwas) by Pickrell (American Journal of Human Genetics, 2014) to pick a few annotations that best differentiate associated variants from non-associated variants. Using this approach, we incorporated multiple DNase, chromHMM, and ChIP annotation data tracks from the UCSC Genome Browser. Our method has several conceptual advantages over existing approaches and can be easily scaled up from GWAS to sequencing studies on large population cohorts. Through simulation and real data analysis, we demonstrate the applicability of our method and show performance improvements compared to competing methods.
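To make the idea of an annotation-informed proposal concrete, here is a minimal Python sketch of one "add a variant" move. It is an illustration under stated assumptions, not the kernel used in this work: the function `propose_addition`, the exponential weighting, the `ld_weight` parameter, and the toy scores are all hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict, Set


def propose_addition(
    current_model: Set[str],
    annotation_score: Dict[str, float],
    ld_with_hit: Dict[str, float],
    rng: random.Random,
    ld_weight: float = 1.0,
) -> str:
    """Pick a variant to add, favoring high annotation score and high LD.

    Assumed weighting: w(v) = exp(annotation_score[v] + ld_weight * LD[v]).
    """
    candidates = sorted(v for v in annotation_score if v not in current_model)
    if not candidates:
        raise ValueError("every variant is already in the model")
    weights = [
        math.exp(annotation_score[v] + ld_weight * ld_with_hit.get(v, 0.0))
        for v in candidates
    ]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy inputs: hypothetical variant IDs, annotation scores, and LD (r^2) values.
scores = {"rs_high": 2.0, "rs_mid": 1.0, "rs_low": 0.0, "rs_in_model": 3.0}
ld = {"rs_high": 0.9, "rs_mid": 0.3, "rs_low": 0.0}
model = {"rs_in_model"}

rng = random.Random(0)
counts = Counter(propose_addition(model, scores, ld, rng) for _ in range(2000))
print(counts.most_common())
```

Because such a proposal is asymmetric, a full Metropolis-Hastings sampler would also need the forward and reverse proposal probabilities in its acceptance ratio; the sketch covers only the proposal step.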



  1. Shi L*, Guo Y*, Dong C, Huddleston J, Yang H, Han X, Fu A, Li Q, Li N, Gong S, Lintner KE, Ding Q, Wang Z, Hu J, Wang D, Wang F, Wang L, Lyon GJ, Guan Y, Shen Y, Evgrafov OV, Knowles JA, Thibaud-Nissen F, Schneider V, Yu CY, Zhou L, Eichler EE, So KF, Wang K. Long read sequencing and de novo assembly of a Chinese genome, Nature Communications, 7:12065, 2016
  2. Guo Y, Ding X, Shen Y, Lyon GJ, Wang K: SeqMule: automated human exome/genome variants detection. Scientific Reports, doi: 10.1038/srep14283, 2015
  3. Guo Y, Conti DV, Wang K: Enlight: web-based integration of GWAS results with biological annotations. Bioinformatics, doi: 10.1093/bioinformatics/btu639, 2014
  4. Wei WH, Guo Y, Kindt AS, Merriman TR, Semple CA, Wang K, Haley CS: Abundant local interactions in the 4p16.1 region suggest functional mechanisms underlying SLC2A9 associations with human serum uric acid. Human Molecular Genetics, 2014
  5. Jia H, Guo Y, Zhao W, Wang K: Long-range PCR in next-generation sequencing: comparison of six enzymes and evaluation on the MiSeq sequencer. Scientific Reports, 2014
  6. Wang K, Kim C, Bradfield J, Guo Y, Toskala E, Otieno FG, Hou C, Thomas K, Cardinale C, Lyon GJ, et al: Whole-genome DNA/RNA sequencing identifies truncating mutations in RBCK1 in a novel Mendelian disease with neuromuscular and cardiac involvement. Genome Medicine, 5:67, 2013
  7. Chen GK, Guo Y: Discovering epistasis in large scale genetic association studies by exploiting graphics cards. Frontiers in Genetics, 4:266, 2013
  8. Wu Yan, Y. Guo, Shan Lu (2011). "Plant terpenoid volatile as defensive signals." Plant Physiology Journal 48(4): 7.