We are developing statistical approaches for extracting biological insights from transcriptome data.


We are part of several large-scale transcriptome profiling projects, including the Single Cell Analysis Transcriptome Program (USC), the Psychiatric ENCODE project consortium (USC) and Transcriptome Sequencing of Neuronal Cell Lines From Patients with Schizophrenia. To better extract biological information from these data, we are working on a few areas:

1. Evaluation of computational approaches for single-cell analysis

A number of software tools have been developed for RNA-Seq to perform splice-aware alignment, quantify gene expression levels, or test for differential expression. However, their relative advantages and disadvantages remain unclear, especially on single-cell data, which present more technical challenges than conventional RNA-Seq. We aim to evaluate analytical tools for RNA-Seq using both artificially created spike-in data sets and real single-cell data sets. We generated RNA-Seq data at 3 different RNA quantities (10 pg, 100 pg and 1,000 pg), each with 3-6 replicates, from two human reference RNA samples: Human Brain Reference (HBR) and Universal Human Reference (UHR). We also analyzed two types of mouse single-cell RNA-Seq samples: three live single neurons and one neuronal pool from brain slices, and six neuronal cells cultured from embryonic mouse hippocampus. We compared five commonly used alignment tools (TopHat, STAR, MapSplice, Subread and GSNAP) in combination with four quantification methods (Cufflinks FPKM, HTSeq, multiBamCov from BEDTools and featureCounts from the Subread tools). For differential expression analysis, we compared four tools (DESeq, DESeq2, edgeR and Cuffdiff).
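To make the difference between count-based and length-normalized quantification concrete, the following is a minimal sketch of the textbook FPKM formula applied to raw per-gene read counts. It is not the estimator that Cufflinks implements (Cufflinks fits a statistical model over isoforms), and the function name and the toy gene names are hypothetical.

```python
def fpkm_from_counts(counts, lengths_bp):
    """Convert raw per-gene fragment counts to FPKM.

    FPKM = count * 1e9 / (gene_length_bp * total_mapped_fragments)

    counts and lengths_bp are dicts keyed by gene name (hypothetical inputs).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped fragments")
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }


# Toy example: gene B has 3x the counts of gene A but is also 3x longer,
# so both genes receive the same FPKM value.
toy_counts = {"geneA": 100, "geneB": 300}
toy_lengths = {"geneA": 1000, "geneB": 3000}
toy_fpkm = fpkm_from_counts(toy_counts, toy_lengths)
print(toy_fpkm)
```

The toy example shows why raw counts (as reported by HTSeq or featureCounts) and FPKM values are not directly comparable: a gene with more reads can have the same length-normalized expression as a shorter gene with fewer reads.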

2. lncRNA prediction from RNA-Seq data

Despite the increased interest in lncRNAs, our understanding of them is still rudimentary, and both studying known lncRNAs and identifying novel lncRNAs remain challenging problems without ideal solutions. Typical features of lncRNAs include a transcript length greater than 200 bp, a likely genomic location between genes, and the lack of a complete ORF, but these features alone cannot be used to infer lncRNAs confidently. To address this issue, we developed a tool called lncScore for predicting novel lncRNAs directly from the alignment files of RNA-Seq data sets. Our strategy is based on a scoring system that uses multiple lncRNA features, and uses well-known, annotated lncRNAs to train a statistical model, a Support Vector Machine (SVM), to identify optimized parameters that separate candidate lncRNAs from random noise.
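As an illustration of two of the sequence-level features mentioned above (transcript length and the presence of a complete ORF), here is a minimal sketch that computes them for a single transcript sequence. It is not the lncScore feature set or its SVM; the function names and the returned fields are assumptions made for this example, and only the three forward reading frames are scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_complete_orf(seq):
    """Length (nt, stop codon included) of the longest ATG...stop ORF
    found in the three forward reading frames; 0 if none is complete."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def transcript_features(seq, min_length=200):
    """Return simple features that a downstream classifier could use."""
    orf_len = longest_complete_orf(seq)
    return {
        "length": len(seq),
        "is_long": len(seq) > min_length,  # the >200 bp criterion
        "longest_orf": orf_len,
        "orf_coverage": orf_len / len(seq) if seq else 0.0,
    }


# Toy transcript: a 9-nt ORF followed by 201 nt without any start codon.
print(transcript_features("ATGAAATAG" + "C" * 201))
```

A long transcript with low ORF coverage, as in the toy example, is consistent with an lncRNA but does not prove it, which is why such features are combined in a trained model rather than used as hard filters.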

3. SplineAdjust for p-value adjustment of RNA-Seq data

Recent studies have shown that differential expression (DE) analysis of RNA-Seq data is often prone to gene length bias: longer genes are more likely to be declared significant than shorter genes. When popular software tools such as Cuffdiff, DESeq and edgeR are used to calculate DE p-values, substantial length bias often exists, even after normalizing counts by gene length. We proposed a novel statistical approach, "SplineAdjust", that takes p-values from any DE analysis tool and generates adjusted p-values for an unbiased assessment of DE. SplineAdjust models gene length nonlinearly using a penalized spline regression model, can accommodate a variety of study designs in addition to case-control studies, and is also useful for adjusting p-values of isoforms or exons. Using both simulated and real RNA-Seq data sets, we showed that SplineAdjust reduced bias more effectively than previously proposed methods, and that the adjusted p-values had improved power and a well-controlled type I error rate. With the ever-increasing application of RNA-Seq in gene expression studies, the proposed method will help improve our ability to detect genuinely DE genes.
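The length bias described above can be checked on any list of DE p-values before an adjustment is applied. The sketch below is only a diagnostic, not the SplineAdjust method itself: it sorts genes by length, splits them into equal-sized bins, and reports the fraction of genes called significant in each bin. The function name, the number of bins and the significance threshold are assumptions chosen for illustration.

```python
import statistics


def significant_fraction_by_length(lengths, pvalues, n_bins=4, alpha=0.05):
    """Return [(median_length, fraction_significant), ...] for genes
    grouped into n_bins equal-sized bins of increasing gene length."""
    if len(lengths) != len(pvalues):
        raise ValueError("lengths and pvalues must have the same size")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    n = len(order)
    summary = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue  # more bins than genes
        frac = sum(pvalues[i] < alpha for i in idx) / len(idx)
        median_len = statistics.median(lengths[i] for i in idx)
        summary.append((median_len, frac))
    return summary


# Toy data in which only the longer genes reach significance.
toy_lengths = [500, 800, 3000, 6000]
toy_pvalues = [0.5, 0.2, 0.04, 0.001]
for median_len, frac in significant_fraction_by_length(
        toy_lengths, toy_pvalues, n_bins=2):
    print(median_len, frac)
```

If the fraction of significant genes rises steadily with the median length of the bin, the p-values are likely affected by length bias, and a length-aware adjustment is worth considering.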

4. Statistical models for IsoSeq analysis

IsoSeq allows direct inference of full-length transcript isoforms. However, typical library preparation protocols (such as using 3-4 size-selected libraries) complicate the task of quantification and result in vastly different base error rates among consensus reads. We are developing models that address these issues to estimate the true expression levels of allele-specific transcript isoforms, which will be very important for studying splicing QTLs and isoform-switching QTLs in the future. We are also developing transcript assembly methods that leverage both full-length and partial-length IsoSeq transcripts to identify all isoforms, and have demonstrated improved performance over Cufflinks and other tools developed for short reads.

5. Circular RNA identification from RNA-Seq data

Recent evidence suggests that many endogenous circular RNAs (circRNAs) may play roles in biological processes. However, the presence of circRNAs and their expression patterns in humans are not well understood. Computationally identifying circRNAs from RNA-seq data is a primary step toward studying their expression patterns and biological roles. We have developed a computational pipeline named UROBORUS to detect circRNAs in total RNA-seq data. Compared with two available computational approaches, UROBORUS is an easy-to-use and efficient tool that can detect circRNAs with low expression levels in total RNA-seq data without RNase R treatment.
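A common signal for circRNAs in RNA-seq data is a read that spans a back-splice junction, where the two aligned pieces of a split read appear in reversed genomic order relative to linear splicing. The sketch below illustrates that ordering test only. It is not the UROBORUS implementation; the `Segment` type, the `is_back_splice` function and the `max_span` limit are hypothetical, and the two segments are assumed to be given in transcript 5' to 3' order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a split read (0-based, half-open coordinates)."""
    chrom: str
    strand: str
    start: int
    end: int


def is_back_splice(first, second, max_span=100_000):
    """True if the two segments are ordered as in a back-splice junction.

    first is the 5' piece and second the 3' piece of the read in transcript
    orientation. max_span is an illustrative upper limit on the genomic
    distance covered by the candidate circle.
    """
    if first.chrom != second.chrom or first.strand != second.strand:
        return False
    if first.strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if first.strand == "+":
        # Linear splicing on '+': 3' piece lies downstream of the 5' piece.
        reversed_order = second.end <= first.start
        span = first.end - second.start
    else:
        # On '-', transcript order runs toward lower coordinates.
        reversed_order = first.end <= second.start
        span = second.end - first.start
    return reversed_order and span <= max_span


donor_side = Segment("chr1", "+", 5000, 5030)
acceptor_side = Segment("chr1", "+", 1000, 1030)
print(is_back_splice(donor_side, acceptor_side))   # reversed order
print(is_back_splice(acceptor_side, donor_side))   # linear order
```

In practice, candidate junctions found this way would still need filtering (for example by splice-site motifs and read support) before being reported as circRNAs.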