We develop novel methods and software tools for long-read sequencing data on diverse technical platforms, such as PacBio, Nanopore, 10X Genomics and Bionano Genomics.


Conventional short-read sequencing technologies have severe limitations in detecting pathogenic structural variants (SVs), potentially contributing to the low diagnostic rates (~30-40%) in clinical sequencing studies. The reasons include the inability to provide information on breakpoints of intronic SVs, the inability to reliably resolve repeats longer than the read length, and the low power to detect balanced translocations or inversions. These limitations can be addressed by the new generation of long-read sequencing technologies, including 10X Genomics linked-read sequencing, PacBio/Nanopore long-read sequencing and Bionano optical mapping. We and others have shown that long-read sequencing is able to detect SVs far more comprehensively than short-read sequencing. Coupled with innovative bioinformatics analysis, long-read sequencing can identify pathogenic SVs missed by short-read sequencing, and even detect SVs traditionally considered “unsequenceable”. Furthermore, long-read sequencing can detect DNA and RNA modifications (such as 5mC, 4mC and 6mA methylation) directly, contributing to our understanding of the epigenetic regulation of the human genome. We are developing novel methods and software tools to improve our understanding of long-read sequencing data, facilitate genetic discoveries and accelerate the implementation of precision medicine.

1. RepeatHMM

Microsatellite expansion, such as trinucleotide repeat expansion (TRE), is known to cause a number of genetic diseases. Sanger sequencing and next-generation short-read sequencing are unable to interrogate TRE reliably. We developed a novel algorithm called RepeatHMM to estimate repeat counts from long-read sequencing data. Evaluation on simulation data, real amplicon sequencing data on two repeat expansion disorders, and whole-genome sequencing data generated by PacBio and Oxford Nanopore technologies showed superior performance over competing approaches. We concluded that long-read sequencing coupled with RepeatHMM can estimate repeat counts on microsatellites and can interrogate the “unsequenceable” genomic trinucleotide repeat disorders. For more details, see https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0456-7.
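
To illustrate why repeat counting on error-prone long reads needs a model-based approach, the following minimal sketch implements a naive baseline: it reports the longest uninterrupted run of a repeat unit in a read. This is not the RepeatHMM algorithm; the function name `longest_repeat_run` and the toy reads are hypothetical and chosen only for illustration.

```python
import re


def longest_repeat_run(read: str, unit: str = "CAG") -> int:
    """Return the copy number of the longest uninterrupted run of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(read)]
    return max(runs, default=0)


# Hypothetical reads: the same 12-copy region, without and with an error.
clean_read = "TTGA" + "CAG" * 12 + "CCAT"
# A single inserted base ("CAAG") splits the repeat tract into two runs.
noisy_read = "TTGA" + "CAG" * 5 + "CAAG" + "CAG" * 6 + "CCAT"

clean_count = longest_repeat_run(clean_read)
noisy_count = longest_repeat_run(noisy_read)
```

On the error-free read the naive count recovers all 12 copies, but a single inserted base in the second read splits the tract and the count drops to 6. Exact matching therefore underestimates repeat counts on reads with sequencing errors, which is the kind of problem an error-tolerant estimator has to handle.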

2. NextSV

Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers. In this study, we developed NextSV, a meta-caller to perform SV calling from low-coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated the SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, the NextSV stringent call set had higher precision and balanced accuracy (F1 score), while the NextSV sensitive call set had higher recall. At 10X coverage, the recall of the NextSV sensitive call set was 93.5 to 94.1% for deletions and 87.9 to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between sequencing costs and recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset. Our results provide useful guidelines for SV detection from low-coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data. For more details, see https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2207-1
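
The trade-off between a sensitive and a stringent call set can be sketched with toy data. The example below assumes, purely for illustration, that the sensitive set is the union and the stringent set is the intersection of two callers' outputs, and that calls can be compared by identifier. Real SV comparison matches calls by breakpoint position and size. The call names, the helper `precision_recall_f1` and the toy sets are hypothetical and are not taken from NextSV.

```python
def precision_recall_f1(calls: set, truth: set) -> tuple:
    """Compute precision, recall and F1 of a call set against a truth set."""
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical truth set and two hypothetical callers.
truth = {"sv1", "sv2", "sv3", "sv4", "sv5"}
caller_a = {"sv1", "sv2", "sv3", "fp1"}
caller_b = {"sv2", "sv3", "sv4", "fp2"}

# Assumption for illustration: union = sensitive, intersection = stringent.
sensitive = caller_a | caller_b
stringent = caller_a & caller_b

sens_precision, sens_recall, sens_f1 = precision_recall_f1(sensitive, truth)
strin_precision, strin_recall, strin_f1 = precision_recall_f1(stringent, truth)
```

In this toy case the union recovers 4 of the 5 true SVs (recall 0.8) but keeps both false positives, while the intersection contains only true calls (precision 1.0) at the cost of a lower recall (0.4).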

3. LinkedSV

Reliable detection of structural variants (SVs) from short-read sequencing remains challenging, mainly due to the presence of repetitive DNA elements that are longer than typical short reads (~100-150bp). Linked-read sequencing provides long-range information from short-read sequencing data by linking reads originating from the same high-molecular-weight (HMW) DNA molecule, and thus has the potential to improve the sensitivity of SV detection and the accuracy of breakpoint identification for certain classes of SVs. We present LinkedSV (https://github.com/WGLab/LinkedSV), a novel SV detection algorithm which combines two types of evidence. Simulation and real data analysis demonstrated that LinkedSV outperforms several existing tools including Longranger, GROC-SVs and NAIBR. LinkedSV works particularly well on exome sequencing data and on SVs with low variant allele frequencies due to somatic mosaicism. Our results support the use of linked-read sequencing to detect hidden SVs missed by conventional short-read sequencing approaches and to help resolve negative cases from clinical genome or exome sequencing. For more details, see https://www.biorxiv.org/content/10.1101/409789v3.
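
The long-range information carried by linked reads can be illustrated with a small sketch. Reads from the same HMW DNA molecule share a barcode, so two genomic windows that are far apart in the reference but share many barcodes may have been brought together by an SV. The code below is a simplified illustration of that general idea under our own assumptions, not the LinkedSV algorithm. The read tuples, the 10 kb window size and the function names are hypothetical.

```python
from collections import defaultdict


def barcodes_by_window(reads, window_size=10_000):
    """Group read barcodes by (chromosome, window index)."""
    windows = defaultdict(set)
    for chrom, pos, barcode in reads:
        windows[(chrom, pos // window_size)].add(barcode)
    return windows


def shared_barcode_fraction(windows, win_a, win_b):
    """Fraction of barcodes shared, relative to the smaller window."""
    a = windows.get(win_a, set())
    b = windows.get(win_b, set())
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical aligned reads: (chromosome, position, barcode).
reads = [
    ("chr1", 105_000, "BX1"), ("chr1", 108_000, "BX2"),
    ("chr1", 102_000, "BX3"), ("chr1", 5_203_000, "BX1"),
    ("chr1", 5_207_500, "BX2"), ("chr1", 5_201_000, "BX9"),
]

windows = barcodes_by_window(reads)
# Windows ~5 Mb apart that still share two of three barcodes.
fraction = shared_barcode_fraction(windows, ("chr1", 10), ("chr1", 520))
```

A high shared fraction between distant windows is only a candidate signal. A practical caller would also need to model the expected level of barcode sharing by chance and the molecule length distribution.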

4. Bionano optical mapping

Facioscapulohumeral Muscular Dystrophy (FSHD) is a common adult muscular dystrophy in which the muscles of the face, shoulder blades and upper arms are among the most affected. FSHD is the only disease in which "junk" DNA was reactivated to cause disease, and the only repeat-related disease where fewer repeats cause disease. More than 95% of FSHD cases are associated with copy number loss of a 3.3kb tandem repeat (D4Z4 repeat) at the subtelomeric chromosomal region 4q35, where the pathogenic allele contains fewer than 10 repeats and has a specific genomic configuration called 4qA. Currently, genetic diagnosis of FSHD requires pulsed-field gel electrophoresis followed by Southern blot, which is labor-intensive, semi-quantitative and requires a long turnaround time. Here, we developed a novel approach for genetic diagnosis of FSHD by leveraging the Bionano Saphyr single-molecule optical mapping platform. Using a bioinformatics pipeline developed for this assay, we found that the method gives direct quantitative measures of repeat numbers, can differentiate 4q35 from the highly paralogous 10q26 region, can determine the 4qA/4qB allelic configuration, and can quantitate levels of post-zygotic mosaicism. We evaluated this approach on 5 patients (including two with post-zygotic mosaicism) and 2 patients (including one with post-zygotic mosaicism) from two separate cohorts, and observed complete concordance with Southern blots, with improved quantification of repeat numbers. We concluded that single-molecule optical mapping is a viable approach for molecular diagnosis of FSHD and may be applied in clinical diagnostic settings once further validation is performed. For more details, see https://www.biorxiv.org/content/10.1101/286104v2.
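
The arithmetic behind a quantitative repeat count can be sketched as follows, using the 3.3kb unit size and the fewer-than-10-repeats criterion stated above. This is a simplified illustration, not our pipeline: it assumes the measured length covers only the D4Z4 array (no flanking sequence) and that the 4qA/4qB configuration is already known. The function names and example lengths are hypothetical.

```python
D4Z4_UNIT_BP = 3_300          # size of one D4Z4 repeat unit (3.3kb)
PATHOGENIC_MAX_REPEATS = 10   # pathogenic alleles have fewer than 10 repeats


def estimate_d4z4_repeats(array_length_bp: float) -> int:
    """Estimate the repeat count from the measured array length.

    Assumes the length spans only the repeat array, with no flanking DNA.
    """
    return round(array_length_bp / D4Z4_UNIT_BP)


def is_candidate_pathogenic(array_length_bp: float, haplotype: str) -> bool:
    """Flag a contracted array on the 4qA configuration."""
    repeats = estimate_d4z4_repeats(array_length_bp)
    return haplotype == "4qA" and repeats < PATHOGENIC_MAX_REPEATS


# Hypothetical measurements from single molecules.
short_allele = estimate_d4z4_repeats(23_100)   # contracted array
long_allele = estimate_d4z4_repeats(82_500)    # normal-length array
flagged = is_candidate_pathogenic(23_100, "4qA")
```

With these example lengths the contracted array corresponds to 7 repeat units and the longer array to 25. Only the contracted array on a 4qA background is flagged.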

5. Genome assembly

We are working on a number of methods to improve genome assembly by combining short-read sequencing, moderately long-read sequencing, ultralong-read sequencing and chromosome-scale scaffolding. As an earlier example, we have published one of the first genome assemblies for a human genome, using a combination of Illumina, PacBio and Bionano optical mapping. This study was published a few years ago at https://www.nature.com/articles/ncomms12065. We are currently improving this assembly using the 10X Genomics platform, the Hi-C platform and ultralong-read sequencing. We are also working on a few challenging yet extremely important genomes for model organisms that are routinely used in human genetic studies.

6. NanoMod

Recent advances in single-molecule sequencing techniques, such as Nanopore sequencing, improved read length, increased sequencing throughput, and enabled direct detection of DNA modifications through the analysis of raw signals. These DNA modifications include naturally occurring modifications such as DNA methylation, as well as modifications that are introduced by DNA damage or through synthetic modifications to one of the four standard nucleotides. To improve the performance of detecting DNA modifications, especially synthetically introduced modifications, we developed a novel computational tool called NanoMod. NanoMod takes raw signal data on a pair of DNA samples with and without modified bases, extracts signal intensities, performs base error correction based on a reference sequence, and then identifies bases with modifications by comparing the distribution of raw signals between the two samples, while taking into account the effects of neighboring bases on modified bases ("neighborhood effects"). We evaluated NanoMod on simulation data sets, based on different types of modifications and different magnitudes of neighborhood effects, and found that NanoMod outperformed other methods in identifying known modified bases. Additionally, we demonstrated superior performance of NanoMod on an E. coli data set with 5mC (5-methylcytosine) modifications. In summary, NanoMod is a flexible tool to detect DNA modifications with single-base resolution from raw signals in Nanopore sequencing, and will greatly facilitate large-scale functional genomics experiments in the future that use modified nucleotides. For more details, see https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5372-8.
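
Comparing signal distributions between two samples at one position can be sketched with a two-sample Kolmogorov-Smirnov (KS) statistic. The KS statistic is used here as one standard way to compare two empirical distributions. The exact test used by NanoMod, and its handling of neighborhood effects, are not reproduced. The signal values and the function name `ks_statistic` are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS statistic: the maximum gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The largest gap between two step functions occurs at a data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical signal intensities at one reference position, one per read.
unmodified = [80.1, 79.8, 80.5, 80.0, 79.9]
modified = [83.2, 82.9, 83.5, 80.3, 83.0]

shifted = ks_statistic(unmodified, modified)
identical = ks_statistic(unmodified, unmodified)
```

A statistic near 1 indicates well-separated signal distributions at that position, while a value near 0 indicates no detectable difference. In practice the statistic would be converted to a p-value and corrected for testing many positions.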

7. DeepMod

DNA base modifications, such as C5-methylcytosine (5mC) and N6-methyldeoxyadenosine (6mA), are important forms of epigenetic regulation and play critical roles in cellular and molecular functions. We designed DeepMod, a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM), to detect DNA 5mC and 6mA modifications using raw electrical signals from Oxford Nanopore sequencing. We evaluated DeepMod on three types of genomes (Escherichia coli, Chlamydomonas reinhardtii and human genomes), and demonstrated that it performs well for genome-scale detection of DNA modifications. For more details, see https://www.nature.com/articles/s41467-019-10168-2.

8. Complete characterization of microbiome

The era of complete genome assemblies for microbiome studies is coming! We are evaluating the use of long-read sequencing to perform complete genome assembly on mixed microbial communities, and assessing how this technology can significantly improve our understanding of the microbiome and human health, compared to traditional 16S studies or shotgun sequencing studies using Illumina sequencing. Some exciting results are forthcoming.