We offer a variety of dry and wet lab opportunities for rotation students and undergraduate research students to perform genomics research in our lab.


We strive to create an environment in which everybody can develop computational or experimental tools that improve genetic diagnostic rates and advance our understanding of the genetic basis of human diseases. A variety of rotation projects or undergraduate research projects are available in the lab.


1. Detection and annotation of complex structural variants

Although short-read sequencing has been widely used in research and clinical settings, it has limited ability to identify structural variants (SVs) due to the presence of repeat elements. Pathogenic SVs can be missed by short-read sequencing, potentially contributing to the low diagnostic rates (~30-40%) in clinical genome/exome sequencing. The lack of reliable tools for clinical interpretation of SVs further limits our ability to identify mutations that contribute to human diseases. The main goal of this project is to develop a suite of computational tools that detect SVs using multiple genomics technologies (including but not limited to linked-read sequencing, optical mapping, and long-read sequencing), and to evaluate various tools on several real data sets generated from undiagnosed subjects or cancer samples. In addition, building on InterVar, a framework we previously developed for clinical interpretation of SNPs/indels, the student may also participate in the development of an SV annotation workflow to identify and prioritize disease-relevant SVs from the tens of thousands of SV calls per human subject.

2. Detection of microsatellites (tandem repeats) in genetic syndromes

Microsatellites, tandem repeats of short DNA motifs (typically 1-6 bases), are widespread in the human genome. Repeat expansions of microsatellites have been found to cause many brain diseases, such as Huntington's disease and spinocerebellar ataxia. However, traditional next-generation sequencing techniques cannot assay microsatellites accurately, due to technical limitations of the sequencing platforms. Long-read sequencing platforms developed by PacBio and Oxford Nanopore can potentially address these limitations. Our lab has developed RepeatHMM, a computational tool for detecting microsatellites from long-read sequencing data on human genomes. However, the error rates of base calls in repeat regions are higher than in other regions due to their low complexity, which poses a challenge for repeat quantification from sequence data alone. We are currently developing a novel deep learning method, which uses a recurrent neural network to estimate the self-similarity of Nanopore signals from neighboring subsequences of the repeat length, and then automatically detects repeat regions in long reads. The student will evaluate a variety of bioinformatics methods for repeat detection, participate in the development of novel methods, and may participate in the writing of high-impact scientific manuscripts.
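To make the definition of a microsatellite concrete, here is a minimal Python sketch that scans a DNA string for perfect tandem repeats of 1-6 base motifs. It is only an illustration and is not how RepeatHMM or the deep learning method works: the function name `find_microsatellites` and the `min_copies` threshold are assumptions made for this example, and exact string matching like this would not tolerate the base-calling errors in long reads described above.

```python
import re

def find_microsatellites(seq, min_motif=1, max_motif=6, min_copies=4):
    """Return sorted (start, end, motif, copies) tuples for perfect tandem repeats.

    Hypothetical helper for illustration only: it requires exact copies of the
    motif, so it is not suitable for error-prone long reads.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_motif, max_motif + 1):
        # A k-base motif followed by at least (min_copies - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "CAGCAG"),
            # since the shorter unit is reported on its own pass.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            copies = (m.end() - m.start()) // k
            hits.append((m.start(), m.end(), motif, copies))
    return sorted(hits)

# Toy sequence: ten CAG copies (the motif expanded in Huntington's disease)
# flanked by arbitrary bases.
print(find_microsatellites("TTGA" + "CAG" * 10 + "TCCA"))
# prints [(4, 34, 'CAG', 10)]
```

Coordinates are 0-based and half-open, as in Python slicing.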

3. Exploring relationships between human diseases and phenotypes through text mining

Translating advances in genomic medicine into precise disease prevention and treatment for patients can be realized if and only if we can accurately identify and interpret genetic variants associated with human disease. It is well known that prior knowledge on disease-gene relationships can be mined from diverse knowledgebases such as electronic health records (EHRs) and published scientific literature. However, innovative and scalable methods are still lacking for abstracting relevant patient phenotypes from EHRs or scientific manuscripts, for using phenotypic data to inform the prioritization of genes/variants, and for systematically comparing and aggregating disease phenotypes across patients. This research project is part of an effort to develop comprehensive gene-phenotype-disease knowledgebases to help interpret genome sequencing data for patients with genetic diseases. We will build a benchmarking data set from published scientific manuscripts on human genetic diseases or genetic case reports, as well as internal clinical notes on patients with a confirmed genetic diagnosis at CHOP. We will address three questions: (1) whether natural language processing algorithms can be applied to automate gene-finding based on clinical descriptions from published literature or clinical notes; (2) what the optimal parameters and approaches are for ranking causal genes higher within the list of all genes, based on the benchmarking data sets; and (3) how state-of-the-art methods such as BERT and GPT-2 can be applied in the context of genetic diagnosis using phenotype information hidden in EHRs.
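As an illustration of how the ranking of causal genes could be scored on a benchmarking data set, the sketch below computes two common ranking metrics: top-k accuracy and mean reciprocal rank. The choice of metrics, the function names, and the toy rankings are all assumptions made for this example; they are not the lab's benchmark or results.

```python
def rank_of_causal_gene(ranked_genes, causal_gene):
    """Return the 1-based rank of the causal gene, or None if it is not listed."""
    try:
        return ranked_genes.index(causal_gene) + 1
    except ValueError:
        return None

def top_k_accuracy(cases, k):
    """Fraction of cases whose causal gene is ranked within the top k."""
    hits = 0
    for ranked_genes, causal_gene in cases:
        rank = rank_of_causal_gene(ranked_genes, causal_gene)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(cases)

def mean_reciprocal_rank(cases):
    """Average of 1/rank over all cases; a missing causal gene contributes 0."""
    total = 0.0
    for ranked_genes, causal_gene in cases:
        rank = rank_of_causal_gene(ranked_genes, causal_gene)
        if rank is not None:
            total += 1.0 / rank
    return total / len(cases)

# Toy benchmark (made-up rankings): each case pairs a tool's ranked gene list
# with the gene confirmed as causal for that patient.
cases = [
    (["HTT", "ATXN1", "FMR1"], "HTT"),     # causal gene ranked 1st
    (["ATXN3", "ATXN1", "HTT"], "ATXN1"),  # causal gene ranked 2nd
    (["FMR1", "DMPK"], "C9orf72"),         # causal gene not in the list
]
print(top_k_accuracy(cases, k=1))
print(top_k_accuracy(cases, k=2))
print(mean_reciprocal_rank(cases))
```

Comparing these scores across parameter settings or phenotype-extraction methods is one way to decide which configuration ranks causal genes highest.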

4. Development of novel genomic assays for neurological diseases

We will develop novel genomic assays for repeat expansions involved in various ataxias and other neurological disorders, and evaluate their clinical utility by comparing them to requested clinical tests (WES and repeat panels). We have previously developed barcoded amplicon assays for various ATXN genes, HTT and C9orf72, and we will extend them to other repeat types so that we can test all known repeat expansions simultaneously. We will compare results to commercial assays, which only examine specific repeat expansions and are only offered by three diagnostic labs across the US, to evaluate the technical advantage and commercial viability of the novel genomic approach. More importantly, our novel approach allows long-range haplotype phasing for each individual patient, thus making individualized gene therapy using CRISPR technologies feasible through long-range allele-specific gene targeting.