We are developing algorithms and software tools for handling and analyzing long-read sequencing data.


Long-read sequencing technologies, such as those from 10X Genomics, Pacific Biosciences and Oxford Nanopore, now enable the generation of sequenced fragments that are 10 kb or longer, allowing for the interrogation of complex genomic regions that are otherwise inaccessible by short-read sequencing. However, novel computational approaches are urgently needed to extract the full benefit of these new generations of sequencing technology.

Our lab is among the first groups to publish a de novo assembly of a personal human genome using PacBio long-read sequencing. We have also been working with 10X Genomics genome and exome sequencing data to identify structural variants that are missed by typical Illumina sequencing. Currently, we are developing a number of methods to improve the analysis of long-read sequencing data:

1. RepeatHMM: Microsatellite expansion, such as trinucleotide repeat expansion (TRE), is known to cause a number of genetic diseases. For example, over 55 CAG repeats in the ATXN3 gene can cause spinocerebellar ataxia type 3 (SCA3). Sanger sequencing and next-generation short-read sequencing are unable to interrogate TRE reliably. Long-read sequencing has the potential to address this problem, but its high error rates pose significant challenges to data analysis. We developed an algorithm called RepeatHMM to estimate repeat counts from long-read whole-genome or amplicon sequencing data. RepeatHMM takes a set of reads, uses a split-and-align strategy to improve alignments, performs optional error correction, and leverages a hidden Markov model and a peak calling algorithm to infer repeat counts. We analyzed real SCA3 and SCA10 data generated by the PacBio sequencer, and found that RepeatHMM showed much higher concordance with the gold standard, capillary electrophoresis, than BAMself and another competing tool, TRhist. We also confirmed the reliability of RepeatHMM on whole-genome NA12878 datasets generated by PacBio and Oxford Nanopore technologies.

2. NextSV: Structural variants (SVs) in the human genome are implicated in a variety of human diseases. Long-read sequencing (such as that from PacBio) delivers much longer read lengths than short-read sequencing (such as that from Illumina) and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, users often face questions such as what coverage is needed and how to use the aligners and SV callers optimally. To automate SV calling, we developed a computational pipeline called NextSV, which integrates PBhoney and Sniffles and generates the union (high sensitivity) or intersection (high specificity) call sets. Our results provide useful guidelines for SV identification from low-coverage whole-genome PacBio data, and we expect that NextSV will facilitate the analysis of SVs in long-read sequencing data.

3. LinkedReadSV: 10X Genomics allows the generation of pseudo long reads that are linked by a common barcode. However, manufacturer-provided software tools cannot fully analyze linked-read data and often miss important structural variants that can even be identified visually from dot plots. To address these limitations, we developed LinkedReadSV, an algorithm that incorporates multiple sources of information, including alignment and link-distance, to reliably identify various types of structural variants (deletions, insertions, inversions, translocations).

4. LongVar: Due to the relatively high error rates of long-read sequencing technologies, variant calling from long-read sequencing data remains a major challenge. Existing software tools are not parameterized for error-prone data, and cannot be easily adapted for data from PacBio and Nanopore platforms. We have been working on a machine-learning approach that incorporates context-dependent sequence characteristics into the variant calling process, to improve the accuracy and power of variant detection.
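
To give a feel for the peak calling step mentioned for RepeatHMM in item 1, here is a minimal Python sketch. It is not RepeatHMM's actual implementation: it assumes that each read has already been reduced to one noisy repeat-count estimate, and it picks up to two well-supported counts (one per allele) from a smoothed histogram. The function name `call_alleles` and the parameters `window` and `min_support` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def call_alleles(read_counts, window=1, min_support=0.2):
    """Pick up to two repeat-count peaks from per-read estimates.

    Hypothetical helper for illustration; not part of RepeatHMM.
    read_counts: one estimated repeat count per read.
    window:      neighbouring counts pooled into a peak to absorb read noise.
    min_support: minimum fraction of reads a peak must gather.
    """
    hist = Counter(read_counts)
    if not hist:
        return []
    total = sum(hist.values())

    def smoothed(count):
        # Reads supporting `count` or any count within +/- window of it.
        return sum(hist.get(count + d, 0) for d in range(-window, window + 1))

    # Strongest candidates first; raw support breaks ties between neighbours.
    candidates = sorted(hist, key=lambda c: (smoothed(c), hist[c]), reverse=True)

    peaks = []
    for count in candidates:
        if smoothed(count) / total < min_support:
            break  # every remaining candidate has even less support
        # Skip counts that sit on the shoulder of an already accepted peak.
        if all(abs(count - p) > 2 * window for p in peaks):
            peaks.append(count)
        if len(peaks) == 2:  # diploid sample: at most two alleles
            break
    return sorted(peaks)


if __name__ == "__main__":
    # Made-up estimates: one allele near 14 repeats, one near 60.
    reads = [14, 14, 15, 14, 13, 14, 60, 61, 60, 59, 60, 62]
    print(call_alleles(reads))
```

On error-prone reads, the estimates scatter around the true allele sizes, so pooling neighbouring counts before ranking keeps a single noisy allele from being reported as two.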

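The union and intersection call sets described for NextSV in item 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not NextSV's code: calls are represented by a hypothetical `SV` record, and two calls from different callers are treated as the same event when they share chromosome and type and have at least 50% reciprocal overlap, a common convention that NextSV may or may not use.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record; real callers report many more fields."""
    chrom: str
    svtype: str  # e.g. "DEL", "DUP", "INV"
    start: int
    end: int


def reciprocal_overlap(a, b):
    """Fraction of overlap relative to the longer of the two calls (0.0 if none)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def merge_call_sets(calls_a, calls_b, min_overlap=0.5):
    """Return (union, intersection) of two call sets.

    min_overlap is an assumed matching threshold, not a NextSV default.
    Matched events are represented by the call from calls_a.
    """
    intersection, only_a, matched_b = [], [], set()
    for a in calls_a:
        hits = [i for i, b in enumerate(calls_b)
                if reciprocal_overlap(a, b) >= min_overlap]
        if hits:
            intersection.append(a)
            matched_b.update(hits)
        else:
            only_a.append(a)
    only_b = [b for i, b in enumerate(calls_b) if i not in matched_b]
    union = intersection + only_a + only_b
    return union, intersection


if __name__ == "__main__":
    # Made-up calls standing in for the output of two SV callers.
    caller_1 = [SV("chr1", "DEL", 1000, 2000), SV("chr2", "DEL", 500, 900)]
    caller_2 = [SV("chr1", "DEL", 1100, 2100), SV("chr3", "INV", 100, 5000)]
    union, intersection = merge_call_sets(caller_1, caller_2)
    print(len(union), len(intersection))
```

The union keeps every event reported by either caller, which favors sensitivity, while the intersection keeps only events supported by both, which favors specificity.
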
In summary, long-read sequencing technologies can facilitate the generation of biological insights that are missed by Illumina short-read sequencing data, and the development of more powerful computational approaches will allow us to fully utilize these new generations of sequencing technologies.