Ph.D. in Biomedical Science, Catholic School of Medicine, 2012
M.S. in Computer Science, State University of New York at Buffalo, 2005
B.S. in Electronic Engineering, ChungAng University, 1997
Software for RNA-Seq data analysis, novel algorithm for lncRNA detection
Software for RNA-Seq data analysis
Background: Over the past few years, single-cell RNA-Seq techniques have been widely adopted to evaluate cell-specific gene expression. Compared to conventional RNA-Seq using bulk samples, the much higher technical noise, lower mapping rate and higher level of amplification lead to significant challenges in data analysis. A number of software tools have been developed for RNA-Seq to perform splice-aware alignments, quantify gene expression levels, or test for differential expression. However, it is unclear what their relative advantages and disadvantages are, especially on single-cell data, which present more technical challenges than conventional RNA-Seq. We aim to evaluate analytical tools for RNA-Seq, using both artificially created spike-in data sets and real single-cell data sets.
Methods: We generated RNA-Seq data on 3 different RNA quantities (10pg, 100pg and 1,000pg), each with 3-6 replicates, of two human reference RNA samples: Human Brain Reference (HBR) and Universal Human Reference (UHR). We also analyzed two types of mouse single-cell RNA-Seq samples: three live single neurons and one neuronal pool from brain slices, and six neuronal cells cultured from embryonic mouse hippocampus. We compared five commonly used alignment tools (TopHat, STAR, MapSplice, Subread and GSNAP) with four different quantitative methods (Cufflinks FPKM, HTSeq, multiBamCov from BEDTools and featureCounts from the Subread tools). For differential expression analysis, we compared four tools (DESeq, DESeq2, edgeR and Cuffdiff).
Results: STAR showed the best sequence alignment performance in terms of speed and mapping rate (when the truncation option is on), while yielding quantification results comparable to other tools. For the same alignments, the different quantitative methods were highly correlated (correlation coefficients of 0.814 to 0.954 for neuronal cultured mouse samples, and 0.873 to 0.989 for 1,000pg of HBR). For differential expression analysis, our test indicated that edgeR and DESeq were less likely to yield false positive hits than the other two methods.
Conclusion: STAR is one of the best aligners. Tools for quantifying gene expression levels perform similarly. edgeR and DESeq perform well in differential expression analysis for single-cell RNA-Seq data. All data sets are made publicly available for future benchmarking studies.
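As an illustration of the quantification comparison summarized above, the following minimal sketch computes pairwise correlations between per-gene values from several quantification methods. The abstract does not state which correlation coefficient or transformation was used; Pearson correlation on log2(x + 1) values is an assumption made here, and the function names and toy numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log2p1(values):
    """log2(x + 1) transform, so that zero counts stay finite (assumed choice)."""
    return [math.log2(v + 1.0) for v in values]


def compare_methods(quant):
    """Pairwise correlations between quantification methods.

    quant maps a method name to per-gene values, all in the same gene order.
    Returns a dict keyed by (method_a, method_b) with method_a < method_b.
    """
    names = sorted(quant)
    result = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            result[(a, b)] = pearson(log2p1(quant[a]), log2p1(quant[b]))
    return result


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    toy = {
        "HTSeq": [0, 12, 150, 7, 980],
        "featureCounts": [0, 14, 160, 5, 1010],
        "multiBamCov": [1, 20, 170, 9, 1100],
    }
    for pair, r in compare_methods(toy).items():
        print(pair, round(r, 3))
```

Such a table of pairwise coefficients would be computed once per sample, on the same alignment file, so that differences reflect the quantification method rather than the aligner.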
Novel algorithm for lncRNA detection
RNA-Seq techniques are widely used in biomedical research to investigate the transcriptional landscape of different tissues or conditions in human and other species. Although most studies have focused on the analysis of protein-coding genes, non-coding transcripts, especially long non-coding RNAs (lncRNAs), are increasingly recognized as important regulators of transcriptional activity. However, compared with protein-coding genes, our understanding of lncRNAs is still rudimentary, and the category of lncRNA and other non-coding RNA could be substantially larger than previously thought. Despite the increased interest in lncRNAs, studying known lncRNAs and identifying novel ones remain challenging problems without ideal solutions. Typical features of lncRNAs include a transcript length greater than 200bp, a likely genomic location between genes, and the lack of a complete ORF, but these features alone cannot be used to infer lncRNAs confidently.
To address this issue, we developed a machine-learning method called LNCScore for predicting novel lncRNAs directly from alignment files generated from RNA-Seq data sets, on either well-annotated or poorly annotated species. Our strategy is to build a scoring system from multiple informative features of lncRNAs, and to use well-known, annotated lncRNAs to train a statistical model that identifies optimized parameters for separating candidate lncRNAs from random noise. We incorporated most of the features commonly used by other lncRNA prediction algorithms, and added further criteria to score the likelihood that a predicted transcript is a genuine lncRNA: phylogenetic modeling and comparison with related species, SNPs that may result in new open reading frames, epigenetic markers from ChIP-Seq experiments where available, gene family information of neighboring genes, and comparison to well-established databases such as NONCODE and LNCipedia. LNCScore can be used on poorly studied tissues from model organisms with rich annotation data, but can also be applied to newly sequenced species with limited functional annotation of genes.
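The general idea of a trained, feature-based confidence score can be sketched as follows. This is not the LNCScore implementation: the text only says that a statistical model is trained on annotated lncRNAs, so logistic regression fitted by gradient descent is an assumption, and the two features in the toy data (a conservation-like value and an ORF-coverage-like value) are hypothetical placeholders.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(features, weights, bias):
    """Confidence score in (0, 1) from a weighted sum of feature values."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(z)


def train(X, y, epochs=500, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent.

    X: list of feature vectors; y: labels (1 = annotated lncRNA, 0 = background).
    """
    n_samples = len(X)
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = score(xi, weights, bias) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


if __name__ == "__main__":
    # Hypothetical training rows: [conservation-like value, ORF-coverage-like value]
    X = [[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 0.9]]
    y = [1, 1, 0, 0]
    w, b = train(X, y)
    print(round(score([0.8, 0.2], w, b), 3))
```

In a sketch like this, each additional criterion (for example, overlap with a known database entry) would simply become another column in the feature vector.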
The input to our software is a set of alignment files in BAM format generated with reference-guided assembly, and the output is a list of possible novel lncRNAs, each with an LNCScore (a confidence score for the likelihood of being a genuine lncRNA), as well as FPKM values and read counts, to facilitate downstream analysis such as differential expression or splicing. Our method will help researchers identify novel lncRNAs and determine the potential functional significance of these novel non-coding RNAs.
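For reference, the FPKM values mentioned above follow the standard definition: fragments per kilobase of transcript per million mapped fragments. A minimal sketch of that formula is shown here; the function and parameter names are hypothetical, and how the software itself derives counts from the BAM files is not described in the text.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    FPKM = fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # 100 fragments on a 2,000 bp transcript in a library of 10 million fragments
    print(fpkm(100, 2000, 10_000_000))
```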
- Kim J, Kim JM, Evgrafov O, Knowles JA, Wang K. Comparative analysis of five splice-aware alignment tools and eight expression analysis tools for single-cell RNA-Seq data. Submitted