We are developing algorithms and software tools for handling and analyzing long-read sequencing data.


Long-read sequencing technologies, such as those from 10X Genomics, Bionano Genomics, Pacific Biosciences and Oxford Nanopore, now generate sequenced fragments of 10 kb or longer, allowing the interrogation of complex genomic regions that are otherwise inaccessible to short-read sequencing. However, to make the most of these new generations of sequencing technology, novel computational approaches are urgently needed.

Our lab is among the first groups to publish a de novo assembly of a personal human genome using PacBio long-read sequencing. We have also been working with 10X Genomics genome and exome sequencing data to identify structural variants that are missed by typical Illumina sequencing. Currently, we are developing several methods to improve the analysis of long-read sequencing data:

1. RepeatHMM

Microsatellite expansion, such as trinucleotide repeat expansion (TRE), is known to cause a number of genetic diseases. For example, over 55 CAG repeats in the ATXN3 gene can cause spinocerebellar ataxia type 3 (SCA3). Sanger sequencing and next-generation short-read sequencing are unable to interrogate TRE reliably. Long-read sequencing has the potential to address this problem, but its high error rates pose significant challenges to data analysis. We developed an algorithm called RepeatHMM to estimate repeat counts from long-read whole-genome or amplicon sequencing data. RepeatHMM takes a set of reads, uses a split-and-align strategy to improve alignments, performs optional error correction, and leverages a hidden Markov model and a peak calling algorithm to infer repeat counts. We also analyzed real data on SCA3 and SCA10 generated by the PacBio sequencer, and found that RepeatHMM showed much higher concordance with the gold standard capillary electrophoresis than BAMself and another competing tool TRhist. We confirmed the reliability of RepeatHMM on whole-genome datasets on NA12878 generated by PacBio and Oxford Nanopore technologies. Details can be found here.
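The final peak-calling step can be illustrated with a small sketch. This is not RepeatHMM's actual implementation: it assumes each read has already been reduced to a single repeat-count estimate, and the function name `call_allele_peaks` and the thresholds `min_separation` and `min_fraction` are hypothetical choices made only for illustration.

```python
from collections import Counter


def call_allele_peaks(read_counts, min_separation=3, min_fraction=0.2):
    """Return one or two repeat-count peaks from per-read estimates.

    Illustrative only: a simple smoothed-histogram heuristic, not the
    peak caller used by RepeatHMM.
    """
    if not read_counts:
        return []
    hist = Counter(read_counts)
    total = len(read_counts)
    # Smooth the histogram: reads within +/-1 repeat of c support c,
    # which tolerates small per-read counting errors.
    support = {c: hist[c - 1] + hist[c] + hist[c + 1] for c in hist}
    # Rank candidate counts by support (ties broken by smaller count).
    ranked = sorted(support, key=lambda c: (-support[c], c))
    first = ranked[0]
    peaks = [first]
    # Accept a second allele only if it is well separated from the first
    # and supported by a reasonable fraction of the reads.
    for c in ranked[1:]:
        if abs(c - first) >= min_separation and support[c] / total >= min_fraction:
            peaks.append(c)
            break
    return sorted(peaks)


# Hypothetical per-read CAG counts for a heterozygous sample.
reads = [14, 14, 15, 14, 13, 14, 68, 70, 69, 69, 71, 69, 14, 69]
print(call_allele_peaks(reads))  # [14, 69]
```

With these made-up counts, the larger peak exceeds the 55-repeat threshold mentioned above for ATXN3, which is the kind of call such a step is meant to support.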

2. NextSV

Structural variants (SVs) in the human genome are implicated in a variety of human diseases. Long-read sequencing (such as that from PacBio) delivers much longer reads than short-read sequencing (such as that from Illumina) and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, users often face questions such as what coverage is needed and how to make the best use of aligners and SV callers. To automate SV calling, we developed a computational pipeline called NextSV, which integrates PBhoney and Sniffles and generates the union (high sensitivity) or intersection (high specificity) call sets. Our results provide useful guidelines for SV identification from low-coverage whole-genome PacBio data, and we expect that NextSV will facilitate the analysis of SVs in long-read sequencing data. Details can be found here.
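The union and intersection call sets can be sketched as follows. This is a minimal illustration, not NextSV's own merging logic: the `SV` record, the helper names, and the 50% reciprocal-overlap matching rule are all assumptions introduced here.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    svtype: str


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions; 0.0 if the calls cannot match."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def merge_call_sets(calls_a, calls_b, min_overlap=0.5):
    """Return (union, intersection) of two SV call sets.

    Two calls are treated as the same event when they share chromosome
    and type and overlap reciprocally by at least `min_overlap`
    (an assumed matching rule).
    """
    def matches(x, others):
        return any(reciprocal_overlap(x, y) >= min_overlap for y in others)

    # Intersection: calls from A that are confirmed by some call in B.
    intersection = [a for a in calls_a if matches(a, calls_b)]
    # Union: every call from A plus calls from B that A did not report.
    union = list(calls_a) + [b for b in calls_b if not matches(b, calls_a)]
    return union, intersection


caller_a = [SV("chr1", 1000, 2000, "DEL"), SV("chr2", 500, 900, "DEL")]
caller_b = [SV("chr1", 1100, 2100, "DEL"), SV("chr3", 100, 400, "INV")]
union, intersection = merge_call_sets(caller_a, caller_b)
print(len(union), len(intersection))  # 3 1
```

The union keeps every event seen by either caller, which favors sensitivity, while the intersection keeps only events both callers agree on, which favors specificity.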

3. LinkedSV

10X Genomics allows the generation of pseudo long reads that are linked by a common barcode. However, manufacturer-provided software tools cannot fully analyze linked-read data and often miss important structural variants, even ones that can be identified visually from dot plots. To address these limitations, we developed LinkedSV, an algorithm that incorporates multiple sources of information, including alignment and link distance, to reliably identify various types of structural variants (deletions, insertions, inversions, translocations). Details can be found here.
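One kind of evidence that linked reads provide is barcode sharing between two regions that are far apart in the reference. The sketch below shows that idea only; it is not the algorithm described above. The read representation `(barcode, chrom, pos)`, the function names, and the normalization by the smaller barcode set are assumptions for illustration.

```python
def barcodes_in_region(reads, chrom, start, end):
    """Set of barcodes with at least one read aligned in [start, end)."""
    return {bc for bc, c, pos in reads if c == chrom and start <= pos < end}


def shared_barcode_fraction(reads, region_a, region_b):
    """Fraction of barcodes shared by two regions.

    Normalized by the smaller barcode set (an assumed choice). A high
    value for regions far apart in the reference suggests they may be
    adjacent in the sample genome.
    """
    a = barcodes_in_region(reads, *region_a)
    b = barcodes_in_region(reads, *region_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical alignments: (barcode, chromosome, position).
reads = [
    ("BC1", "chr1", 10_000), ("BC1", "chr1", 510_000),
    ("BC2", "chr1", 12_000), ("BC2", "chr1", 505_000),
    ("BC3", "chr1", 11_000), ("BC4", "chr1", 520_000),
]
frac = shared_barcode_fraction(
    reads, ("chr1", 0, 50_000), ("chr1", 500_000, 550_000)
)
print(round(frac, 2))  # 0.67
```

In practice such a score would need to be compared against the background sharing expected by chance for regions at that distance, which this sketch does not attempt.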

4. LongSV

Due to the relatively high error rates of long-read sequencing technologies, variant calling from long-read sequencing data remains a major challenge. Existing software tools are not parameterized for error-prone data and cannot be easily adapted to data from PacBio and Nanopore platforms. We have been developing a machine-learning approach that incorporates context-dependent sequence characteristics into the variant calling process, to improve the accuracy and power of variant detection.

5. NanoMod

Recent advances in single-molecule sequencing techniques, such as Nanopore sequencing, have improved read length, increased sequencing throughput, and enabled direct detection of DNA modifications through the analysis of raw signals. These DNA modifications include naturally occurring modifications such as DNA methylation, as well as modifications that are introduced by DNA damage or through synthetic modifications to one of the four standard nucleotides. To improve the detection of DNA modifications, especially synthetically introduced modifications, we developed a novel computational tool called NanoMod. NanoMod takes raw signal data on a pair of DNA samples with and without modified bases, extracts signal intensities, performs base error correction based on a reference sequence, and then identifies bases with modifications by comparing the distribution of raw signals between the two samples, while taking into account the effects of neighboring bases on modified bases (“neighborhood effects”). We evaluated NanoMod on simulated data sets, based on different types of modifications and different magnitudes of neighborhood effects, and found that NanoMod outperformed other methods in identifying known modified bases. Additionally, we demonstrated superior performance of NanoMod on an E. coli data set with 5mC (5-methylcytosine) modifications. In summary, NanoMod is a flexible tool to detect DNA modifications with single-base resolution from raw signals in Nanopore sequencing, and will greatly facilitate future large-scale functional genomics experiments that use modified nucleotides. Details can be found here.
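The core comparison, signal distributions at the same reference position in a modified and an unmodified sample, can be sketched with a two-sample Kolmogorov-Smirnov statistic. This is an illustrative stand-in rather than NanoMod's code: the choice of statistic, the function names, and the input layout (a dict from position to a list of signal values) are assumptions, and neighborhood effects are not modeled.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def rank_positions(signals_mod, signals_ctrl):
    """Rank positions present in both samples by decreasing KS statistic."""
    shared = signals_mod.keys() & signals_ctrl.keys()
    scores = {p: ks_statistic(signals_mod[p], signals_ctrl[p]) for p in shared}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-position signal intensities (arbitrary units).
modified = {100: [80.1, 81.0, 79.5, 80.7], 101: [95.0, 96.2, 94.8, 95.5]}
control = {100: [80.3, 80.9, 79.8, 80.5], 101: [88.1, 87.5, 88.9, 88.3]}
print(rank_positions(modified, control))  # [(101, 1.0), (100, 0.25)]
```

Position 101 ranks first because its two signal distributions do not overlap at all, while position 100 looks the same in both samples.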

6. Optical mapping

Optical mapping from Bionano Genomics quantifies the distances between specific 7-bp motifs in large DNA fragments up to several megabases. It allows the detection of SVs with kilobase resolution and complements linked-read and long-read sequencing. We have used optical mapping in the diagnosis of an “unsequenceable” disease called Facioscapulohumeral Muscular Dystrophy (FSHD) with post-zygotic mosaicism and in congenital chromothripsis. We are currently developing methods to detect complex SVs from optical mapping data and to integrate these SV calls with base-resolution SV calls, to better understand the complex structure of the human genome. Details can be found here.
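The quantity that optical mapping measures, the spacing between motif sites, can be illustrated by an in-silico version computed on a known sequence. This is only a sketch: the motif `GCTCTTC` is used as an example 7-bp recognition site, the function names are hypothetical, and real optical maps cannot resolve labels that lie very close together, which this code ignores.

```python
def label_positions(sequence, motif="GCTCTTC"):
    """Sorted start positions of the motif on either strand."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse_complement = motif.translate(complement)[::-1]
    seq = sequence.upper()
    positions = set()
    # A site on the reverse strand appears as the reverse complement
    # of the motif on the forward strand.
    for pattern in {motif, reverse_complement}:
        start = seq.find(pattern)
        while start != -1:
            positions.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(positions)


def label_distances(positions):
    """Distances between consecutive label positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence with one forward-strand and one reverse-strand site.
seq = "A" * 10 + "GCTCTTC" + "A" * 20 + "GAAGAGC" + "A" * 5
sites = label_positions(seq)
print(sites, label_distances(sites))  # [10, 37] [27]
```

Comparing such reference-derived distances with the distances observed on a molecule is the basic idea behind SV detection from optical maps: a deletion shortens an interval between labels and an insertion lengthens it.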


In summary, long-read sequencing technologies can facilitate the generation of biological insights that are missed by Illumina short-read sequencing data, and the development of more powerful computational approaches will allow us to fully utilize these new generations of sequencing technologies.