Artificial Intelligence in Next-generation Biology and Medicine



We develop fully automated pipelines for genome and exome sequencing analysis: all the way from raw sequencing data to biological and clinical insights.

A new generation of high-throughput sequencing platforms has made it possible to generate enormous amounts of DNA sequence data on individual human genomes. We develop a suite of bioinformatics methods to understand the functional content of personal genome sequencing data and to extract clinical insights from it. Our tools are widely used by research labs, clinical diagnostic labs and pharmaceutical companies to understand genome sequencing data.

Some examples of the computational tools that we created include: PennCNV, ANNOVAR, Phenolyzer, InterVar/wInterVar, EHR-Phenolyzer and CancerVar. For example, our flagship tool, ANNOVAR, is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human, mouse, worm, fly and many others). It has been widely used by the community, with over 2 million page views per year and over 7,500 citations. Its web-service version, wANNOVAR, has processed almost 300,000 genome files from submissions worldwide as of early 2021. Its commercial version was licensed by Qiagen (with a video tutorial). The combination of these tools enables end-to-end, fast and reliable whole-genome/exome sequencing data analysis on personal genomes, and facilitates the implementation of genomic medicine at scale.



We develop novel methods and software tools for long-read sequencing data on diverse technical platforms, such as PacBio, Nanopore, 10X Genomics and Bionano Genomics.

Although not widely known yet, we and others have shown that long-read sequencing technologies, such as Oxford Nanopore and Pacific Biosciences, have revolutionized the field of biomedical research and genomic medicine, with significant advantages compared to conventional short-read sequencing technologies. For example, long-read sequencing can identify pathogenic structural variants (SVs) missed by short-read sequencing, and even detect SVs traditionally regarded as "unsequenceable". Furthermore, long-read sequencing can detect DNA and RNA modifications directly, contributing to our understanding of the epigenetic regulation of the human genome. ARK considers it one of the most innovative "Big Ideas" of the next 10 years. We are developing various deep neural network approaches to handle the unique challenges in long-read sequencing data, to facilitate novel genetic discoveries and accelerate the implementation of precision medicine.

Some examples of the tools that we developed include RepeatHMM, NextSV, LinkedSV, NanoMod, DeepMod, DeepRepeat and NanoCaller. Spearheaded by Chris Liu in our lab, DeepMod is a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) that detects DNA modifications from the raw electrical signals of Oxford Nanopore sequencing. DeepRepeat converts a DNA repeat unit and its upstream and downstream units into the RGB (red-green-blue) channels of a color image represented by different pixels, thereby transforming the repeat detection problem into an image recognition problem that deep neural networks can solve. NanoCaller, led by Umair Ahsan, detects SNPs/indels by examining pileup information from other candidate SNPs that share the same long reads, relying solely on long-range haplotype information in a deep neural network. What is exciting is that we recently won the PrecisionFDA Variant Calling Challenge on MHC (the most difficult region in the entire human genome) using a NanoCaller-based ensemble caller.
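The image-encoding idea described for DeepRepeat can be illustrated with a minimal sketch. This is not DeepRepeat's actual implementation: the function names `to_byte` and `units_to_rgb`, the assumption that each repeat unit is a list of signal values already normalized to [0, 1], and the zero-filling at read ends are all hypothetical choices made for illustration. It also assumes all units have the same length.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_byte(value: float) -> int:
    """Map a signal value in [0, 1] to an 8-bit channel intensity."""
    clipped = min(max(value, 0.0), 1.0)
    return round(clipped * 255)


def units_to_rgb(units: List[List[float]], index: int) -> List[Pixel]:
    """Encode the unit at `index` and its two neighbours as one row of RGB pixels.

    R = upstream unit, G = the unit itself, B = downstream unit.
    Missing neighbours at either end are filled with zeros (an assumption).
    """
    width = len(units[index])
    blank = [0.0] * width
    upstream = units[index - 1] if index > 0 else blank
    downstream = units[index + 1] if index + 1 < len(units) else blank
    return [
        (to_byte(r), to_byte(g), to_byte(b))
        for r, g, b in zip(upstream, units[index], downstream)
    ]


# Three toy repeat units, two signal values each (made-up numbers).
toy_units = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
row = units_to_rgb(toy_units, 1)
```

Stacking one such row per repeat unit would give a small color image that an off-the-shelf image classifier could consume, which is the sense in which repeat detection becomes an image recognition problem.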

Finally, we operate a sequencing platform at the Wang Genomics lab. We are evaluating the use of long-read sequencing to study microbial communities, and have demonstrated how this technology can significantly improve our understanding of the microbiome and human health. We are also applying the method to transcriptome (RNA-Seq) data, where we have demonstrated superior performance in detecting differential isoform usage and novel fusion genes in various types of cancer. We are super excited and fully prepared to embrace this new frontier and the revolution of long-read sequencing!



We develop and apply deep learning based natural language processing approaches to mine the wealth of clinically relevant information in electronic health records.

Unlike the massive amounts of text of questionable quality from Twitter, Reddit and Facebook, clinical notes in the electronic health records (EHRs) of our hospital system represent expert-compiled and human-curated information on different diseases. These billions of records are sitting there, waiting to be explored by the next generation of data scientists. By demonstrating the clinical utility of EHR-Phenolyzer with Dr. Chunhua Weng, we are among the pioneers in integrating deep phenotyping and clinical sequencing data for patients with rare diseases, to facilitate diagnosis and to improve decision making in patient care. Developments in deep learning based NLP over the past few years (such as BERT/GPT-2/GPT-3) have revolutionized the entire field, such that machines can finally understand the human logic and thinking behind clinical notes written for patients. We leverage these latest developments in the field, use clinical phenotypic data to inform the prioritization of variants, and systematically build knowledgebases on every human disease.

Translating advances in genomic medicine into precise disease diagnoses and personalized disease prevention and treatment for patients can be realized only if we can accurately identify and interpret genetic variants associated with human diseases. Integration of phenotypes from electronic health records (EHRs) and clinical sequencing data for patients likely to have monogenic disorders promises to improve the efficiency and effectiveness of genomic diagnostics. However, innovative and scalable methods are still lacking for abstracting relevant patient phenotypes from EHRs, for using phenotypic data to inform the prioritization of variants, and for systematically comparing disease phenotypes across diseases.

We are developing and validating scalable approaches to abstract characteristic phenotypes of all genetic disorders from EHR narratives and to standardize the concept representations of these EHR phenotypes. We developed Doc2HPO, which automatically converts a clinical note into computer-understandable terminologies, as well as Phen2Gene, which automatically predicts likely genetic syndromes from these phenotypic terminologies. We also built PhenCards as a one-stop shop that catalogs all known biomedical knowledge related to clinical phenotypes. With advanced NLP methods, we are also building Knowledge Graphs for emerging diseases such as COVID-19, so that researchers can quickly gain biological insights from thousands of manuscripts within a few hours by asking computers to read all these papers.
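To make the phenotype-driven prioritization step concrete, here is a minimal sketch of ranking candidate genes from a list of HPO terms. It is not the Phen2Gene algorithm: the knowledgebase `TOY_KB`, the gene names `GENE_A` to `GENE_C`, the weights, and the simple additive scoring in `rank_genes` are all hypothetical placeholders chosen only to show the shape of the computation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical toy knowledgebase: HPO term -> {gene: association weight}.
# Gene names and weights are placeholders, not real gene-phenotype associations.
TOY_KB: Dict[str, Dict[str, float]] = {
    "HP:0001250": {"GENE_A": 0.9, "GENE_B": 0.4},  # Seizure
    "HP:0001263": {"GENE_A": 0.5, "GENE_C": 0.7},  # Global developmental delay
    "HP:0000252": {"GENE_B": 0.6, "GENE_C": 0.2},  # Microcephaly
}


def rank_genes(hpo_terms: Iterable[str],
               kb: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Score each gene by summing its weights over the patient's HPO terms.

    Terms absent from the knowledgebase are ignored; duplicate terms count once.
    Results are sorted by descending score, then by gene name for stable ties.
    """
    scores: Dict[str, float] = defaultdict(float)
    for term in set(hpo_terms):
        for gene, weight in kb.get(term, {}).items():
            scores[gene] += weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Terms as they might come out of a note-to-HPO extraction step.
patient_terms = ["HP:0001250", "HP:0001263"]
ranked = rank_genes(patient_terms, TOY_KB)
```

In this toy example `GENE_A` ranks first because it is linked to both of the patient's terms. A real system would draw weights from a curated gene-phenotype knowledgebase and would likely account for term specificity and ontology structure.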



We offer a variety of dry and wet lab opportunities for rotation students and undergraduate research students to perform genomics research in our lab. Some example projects are described below, but we are flexible and can tailor projects to the specific interests of the trainees.
  1. Detection and annotation of complex structural variants
     Although short-read sequencing has been widely used in research and clinical settings, it has limited ability to identify SVs due to the presence of repeat elements. It is known that pathogenic SVs might be missed by short-read sequencing, potentially contributing to the low diagnostic rates (~30-40%) in clinical genome/exome sequencing. The lack of reliable tools for clinical interpretation of SVs further limits our ability to identify mutations that contribute to human diseases. The main goal of this project is to develop a suite of computational tools to detect structural variants (SVs) using multiple genomics technologies (including but not limited to linked-read sequencing, optical mapping and long-read sequencing), and to evaluate various tools on several real data sets generated on undiagnosed subjects or cancer samples. In addition, based on InterVar, a framework we previously developed for clinical interpretation of SNPs/indels, the student may also participate in the development of a clinically relevant SV annotation workflow to identify and prioritize disease-relevant SVs from tens of thousands of SV calls per human subject.

  2. Efficient quantification of gene expression from long-read RNA-Seq data using weighted likelihood method
     Long-read RNA sequencing (RNA-seq) technologies have made it possible to sequence full-length transcripts, facilitating the exploration of isoform-specific gene expression beyond what conventional short-read RNA-seq allows. We previously developed LIQA (Hu et al., 2020, bioRxiv), a statistical method for isoform-specific gene expression estimation by modeling read length distribution. However, in addition to the accuracy of estimated isoform expression, efficiency, which measures the degree of estimation uncertainty, is also an important factor for downstream analysis. We will develop a weighted likelihood framework to improve estimation efficiency. Moreover, similar to Cuffdiff, we can estimate the variance of each isoform expression estimate, which allows 1 case vs 1 control comparison of differential expression and has some specific application scenarios in personalized medicine.

  3. Exploring relationships between human diseases and phenotypes through text mining
     Translating advances in genomic medicine into precise disease prevention and treatment for patients can be realized only if we can accurately identify and interpret genetic variants associated with human disease. It is well known that prior knowledge on disease-gene relationships can be mined from diverse sources such as electronic health records (EHRs) and the published scientific literature. However, innovative and scalable methods are still lacking for abstracting relevant patient phenotypes from EHRs or scientific manuscripts, for using phenotypic data to inform the prioritization of genes/variants, and for systematically comparing and aggregating disease phenotypes across patients. This research project is part of an effort to develop comprehensive gene-phenotype-disease knowledgebases (some existing components include the Phen2Gene Knowledgebase and the PhenCards data source developed by Dr. Jim Havrilla) to help interpret genome sequencing data for patients with genetic diseases. We will build a benchmarking data set from published scientific manuscripts on human genetic diseases or genetic case reports, as well as internal clinical notes on patients with a confirmed genetic diagnosis at CHOP. We will address three questions: (1) whether natural language processing algorithms can be applied to automate gene-finding, based on clinical descriptions from published literature or clinical notes; (2) what the optimal parameters and approaches are for ranking causal genes higher within the list of all genes, based on the benchmarking data sets; and (3) how state-of-the-art methods such as BERT and GPT-2 can be applied in the context of genetic diagnosis using phenotype information hidden in EHRs.

  4. Development of novel genomic assays for neurological diseases
     Our lab maintains a wet lab operation with Nanopore sequencers and a Bionano optical mapping platform. We will develop novel genomic assays for repeat expansions involved in various ataxias and other neurological disorders, and evaluate their clinical utility by comparing them to requested clinical tests (WES and repeat panels). We have previously developed barcoded amplicon assays on ATXN3, HTT and C9orf72, and we will extend them to other repeat types so that we can test all known repeat expansions simultaneously. We will compare results to commercial assays, which only examine specific repeat expansions and are only offered by three diagnostic labs across the US, to evaluate the technical advantage and commercial viability of the novel genomic approach. More importantly, together with a novel multiplex barcoding approach developed by Dr. Fang Li in our lab, our approach allows long-range haplotype phasing for each individual patient among thousands of DNA samples, thus making individualized gene therapy using CRISPR/Cas9 technologies feasible through long-range allele-specific gene targeting.

  5. Cell type specific perturbation of gene regulatory networks
     Our previous studies on a number of brain diseases demonstrated that genetic perturbations of specific genes can influence genetic networks in a cellular system, which may explain the resulting cellular, molecular and clinical phenotypes. For example, we demonstrated that genetic perturbation of NRXN1 influences the coordinated expression of a network of genes, that genetic perturbation of NLGN4X delays neuronal development and compromises neurite formation, and, more recently, that the transcription factor TCF4 influences a large gene regulatory network in hiPSC-derived neural progenitor cells (NPCs) and glutamatergic neurons (Glut_Ns). We have developed a cell type deconvolution approach, led by Dr. Abolfazl Doostparast, that directly imputes cell type specific gene expression data from bulk (mixed cell type) RNA-Seq data. Using a large set of post-mortem brain gene expression data from control subjects and patients with various brain disorders such as autism and schizophrenia, we will evaluate how gene regulatory networks change in the disease state in different cell types, and identify the most relevant cell type for different brain diseases. Using single-cell RNA-Seq data on post-mortem brain from patients with autism or Alzheimer's disease, we will further delineate the gene regulatory network changes that result from genetic perturbations of key regulatory genes. This project will facilitate our understanding of how multiple genes in the same genetic pathway or regulatory network work together to confer risk of complex brain disorders.
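For the long-read RNA-Seq quantification project listed above, the general idea of a weighted likelihood can be sketched as follows. This is a generic illustration, not LIQA or the framework the project will develop: it assumes each read has a known likelihood under each isoform (`compat`) and a non-negative weight (`weights`), both hypothetical inputs, and it maximizes the weighted log-likelihood sum_r w_r * log(sum_i theta_i * compat[r][i]) over isoform proportions theta with a standard EM iteration.

```python
from typing import List


def weighted_em(compat: List[List[float]], weights: List[float],
                n_iter: int = 200) -> List[float]:
    """Estimate isoform proportions by maximizing a weighted log-likelihood.

    compat[r][i] is the likelihood of read r given isoform i (0 if incompatible).
    weights[r] is a non-negative weight for read r (hypothetical input).
    """
    n_iso = len(compat[0])
    theta = [1.0 / n_iso] * n_iso  # start from uniform proportions
    for _ in range(n_iter):
        expected = [0.0] * n_iso
        for row, w in zip(compat, weights):
            denom = sum(t * c for t, c in zip(theta, row))
            if denom == 0.0:
                continue  # read is incompatible with every isoform; skip it
            # E-step: weighted posterior share of this read for each isoform
            for i in range(n_iso):
                expected[i] += w * theta[i] * row[i] / denom
        total = sum(expected)
        if total == 0.0:
            break  # no informative reads; keep the current estimate
        # M-step: renormalize the weighted expected counts
        theta = [e / total for e in expected]
    return theta


# Toy gene with two isoforms: three reads unique to isoform 0,
# one read unique to isoform 1, and one read compatible with both.
toy_compat = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
equal_weights = [1.0] * 5
theta_hat = weighted_em(toy_compat, equal_weights)
```

With equal weights this reduces to the ordinary EM estimate. Down-weighting less reliable reads changes how much each read contributes to the estimate, which is one way a weighting scheme could affect estimation uncertainty.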