We develop and apply systems biology approaches to better understand the biology of complex diseases using genomic, transcriptomic and epigenomic data.


It is increasingly recognized that complex disorders such as schizophrenia and autism spectrum disorders (ASDs) are influenced by many genes, most of which individually confer only small to moderate risk. In the most extreme case, the “omnigenic” model, almost all genes expressed in disease-relevant cell types may confer risk through widespread network interactions with a core set of genes. Systems biology approaches based on co-expression and protein-protein interaction have started to unravel gene networks that are relevant to complex diseases. However, most such studies have not integrated the rapidly expanding body of multi-dimensional genomic data and literature knowledge, and have not identified the core genes driving network functions. This has delayed progress toward the ultimate goal of this work: the development of more effective therapeutic approaches. We are developing and applying a suite of systems biology approaches that help researchers make sense of the massive amounts of genomic data on complex diseases and formulate biological hypotheses for functional validation.


1. iMEGES

Background: A range of rare and common genetic variants have been found to be potentially associated with mental diseases, but many more remain undiscovered. Powerful integrative methods are needed to systematically prioritize both variants and genes that confer susceptibility to mental diseases in the personal genomes of individual patients, and to facilitate the development of personalized treatments or therapeutic approaches. Methods: Leveraging a deep neural network built on the TensorFlow framework, we developed a computational tool, integrated Mental-disorder GEnome Score (iMEGES), for analyzing whole genome/exome sequencing data on personal genomes. iMEGES takes as input genetic mutations and phenotypic information from a patient with a mental disorder, and outputs a ranking of genome-wide susceptibility variants and a prioritized list of disease-specific genes by integrating contributions from coding and non-coding variants, structural variants (SVs), known brain expression quantitative trait loci (eQTLs), and epigenetic information from PsychENCODE. Results: iMEGES was evaluated on multiple datasets of mental disorders, and it outperformed competing approaches when large training data sets were available. Conclusion: iMEGES can be used in population studies to help prioritize novel genes or variants that might be associated with susceptibility to mental disorders, and on individual patients to help identify novel genes or variants related to mental diseases.

2. DeepSZ

We plan to build a knowledge integration framework that incorporates multi-dimensional biological knowledge to predict schizophrenia (SZ) genes. Decades of SZ genetics research have accumulated valuable knowledge. We previously developed Phenolyzer, which constructs networks from prior knowledge, and demonstrated its application to mental disorders by incorporating proteomics data (neurocomplex.usc.edu). Here, extending our recently developed deep neural network with group lasso regularization, we will build a knowledge integration framework called DeepSZ that incorporates various sources of biological knowledge, such as phenotype similarity, disease-gene relationships, and pathway information, as well as exome, GWAS, and CNV studies, to increase the specificity for SZ. Similar to cBioPortal for cancer, we will build a framework that allows researchers to interactively mine information on SZ genes.
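To make the group lasso idea concrete, the sketch below computes a group lasso penalty over first-layer weights and applies the corresponding block soft-thresholding step, which shrinks whole groups of input features toward zero together. It is a minimal illustration in plain Python, not the DeepSZ implementation: the feature names, the grouping by knowledge source, and the function names are all hypothetical.

```python
import math

def group_lasso_penalty(weights, groups, lam=1.0):
    """Sum over groups of sqrt(group size) * L2 norm of the group's weights.

    weights: dict mapping an input feature to its list of first-layer weights.
    groups:  dict mapping a group name to the features it contains.
    """
    penalty = 0.0
    for members in groups.values():
        values = [w for feature in members for w in weights[feature]]
        norm = math.sqrt(sum(w * w for w in values))
        penalty += math.sqrt(len(values)) * norm
    return lam * penalty

def group_soft_threshold(weights, groups, threshold):
    """Proximal step for the penalty: shrink each group as a block.

    A group whose norm falls below its cutoff is set to zero entirely,
    which is how group lasso drops an uninformative knowledge source.
    """
    shrunk = {}
    for members in groups.values():
        values = [w for feature in members for w in weights[feature]]
        norm = math.sqrt(sum(w * w for w in values))
        cutoff = threshold * math.sqrt(len(values))
        scale = max(0.0, 1.0 - cutoff / norm) if norm > 0 else 0.0
        for feature in members:
            shrunk[feature] = [scale * w for w in weights[feature]]
    return shrunk

# Hypothetical features, each feeding two hidden units.
example_weights = {"gwas_score": [3.0, 4.0], "cnv_score": [0.1, -0.1]}
example_groups = {"gwas": ["gwas_score"], "cnv": ["cnv_score"]}
example_shrunk = group_soft_threshold(example_weights, example_groups, 0.5)
```

In this toy input the weakly weighted "cnv" group is removed as a whole while the "gwas" group is only scaled down, which is the group-level sparsity that distinguishes group lasso from an ordinary per-weight L1 penalty.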


3. KNDA

Existing co-expression based methods (e.g., ARACNe and WGCNA) do not incorporate prior information, consider the differences between cases and controls as part of the network reconstruction process, or account for the effects of different genotypes. We will develop an expression network deconvolution method called KNDA (Knowledge-based Network Deconvolution Algorithm) that takes into account prior biological knowledge (for example, known transcriptional regulators and their downstream targets, and known biological pathways), so that it can be applied to data sets with relatively small sample sizes and can increase the signal-to-noise ratio in network inference.
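One simple way to picture how prior knowledge can stabilize network inference on small samples is to blend a data-driven edge score with a prior edge indicator. The sketch below does this with absolute Pearson correlation and a fixed mixing weight. It is only an illustration of the general idea and is not the KNDA algorithm; the scoring rule, the `prior_weight` parameter, and the gene names are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def prior_weighted_edges(expression, prior_edges, prior_weight=0.3):
    """Score every gene pair by mixing |correlation| with prior support.

    expression:  dict mapping a gene to its expression values across samples.
    prior_edges: set of (gene, gene) pairs known from prior knowledge.
    Returns a dict keyed by alphabetically ordered gene pairs.
    """
    genes = sorted(expression)
    scores = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            evidence = abs(pearson(expression[g], expression[h]))
            known = (g, h) in prior_edges or (h, g) in prior_edges
            prior = 1.0 if known else 0.0
            scores[(g, h)] = (1 - prior_weight) * evidence + prior_weight * prior
    return scores

# Hypothetical four-sample data set with one known regulator-target pair.
example_expression = {
    "TF1": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0],
    "GENE_B": [1.0, -1.0, 1.0, -1.0],
}
example_scores = prior_weighted_edges(example_expression, {("TF1", "GENE_A")})
```

With only four samples, correlation alone is noisy, so the prior term raises edges that are already supported by existing knowledge while leaving unsupported edges to depend on the data.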

4. CellR

We are developing CellR, a computational method to deconvolve bulk-tissue RNA-Seq data and infer the cellular composition and cell type-specific gene expression values of bulk samples. Given an external reference single-cell RNA-seq data set from a tissue of interest, CellR first decomposes the data and extracts cell type-specific gene markers, without prior biological knowledge of the various cell types in the bulk tissue. CellR then employs genome-wide tissue-wise expression signatures to address cross-individual variation by weighting gene markers differently, and transforms the cellular deconvolution problem into a linear programming model that considers inter- and intra-cellular correlations. Comparative analysis demonstrated superior performance of CellR against competing approaches that rely on a few known cell type-specific gene markers.
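The core of any reference-based deconvolution is to find cell type proportions that are non-negative, sum to one, and reproduce the bulk expression from a marker signature matrix. The sketch below solves a simplified version of that problem with projected gradient descent on a least-squares objective. Note that this swaps in least squares for the linear programming formulation described above and ignores marker weighting and correlations, so it is not the CellR model; the signature values and function names are hypothetical.

```python
def project_to_simplex(values):
    """Euclidean projection onto {p : p >= 0, sum(p) == 1}."""
    ordered = sorted(values, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, u in enumerate(ordered, start=1):
        cumulative += u
        candidate = (cumulative - 1.0) / i
        if u - candidate > 0:
            theta = candidate
    return [max(v - theta, 0.0) for v in values]

def estimate_proportions(signature, bulk, iterations=2000):
    """Estimate cell type proportions for one bulk sample.

    signature: rows are marker genes, columns are cell types.
    bulk:      bulk expression of the same marker genes.
    Minimizes ||signature @ p - bulk||^2 subject to p lying on the simplex.
    """
    n_types = len(signature[0])
    # Conservative step size: the squared Frobenius norm bounds the curvature.
    step = 1.0 / (2.0 * sum(x * x for row in signature for x in row))
    props = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residual = [sum(s * p for s, p in zip(row, props)) - b
                    for row, b in zip(signature, bulk)]
        gradient = [2.0 * sum(row[j] * r for row, r in zip(signature, residual))
                    for j in range(n_types)]
        props = project_to_simplex([p - step * g for p, g in zip(props, gradient)])
    return props

# Hypothetical signature for 4 marker genes and 2 cell types; the bulk
# profile was generated as a 30% / 70% mixture of the two columns.
example_signature = [[10.0, 0.0], [8.0, 1.0], [0.0, 9.0], [1.0, 7.0]]
example_bulk = [3.0, 3.1, 6.3, 5.2]
example_props = estimate_proportions(example_signature, example_bulk)
```

The simplex projection is what enforces the biological constraint that proportions are non-negative and sum to one; a linear programming formulation would encode the same constraint as linear equalities and bounds.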

5. Application to schizophrenia: Deconvolution of transcriptional network

Tissue-specific reverse engineering of transcriptional networks has uncovered master regulators (MRs) of cellular networks in various cancers, yet the application of this method to neuropsychiatric disorders is largely unexplored. Here, using RNA-Seq data on postmortem dorsolateral prefrontal cortex (DLPFC) from schizophrenia (SCZ) patients and control subjects, we deconvoluted the transcriptional network to identify MRs that mediate the expression of a large number of target genes. Together with an independent RNA-Seq data set from cultured primary neuronal cells derived from olfactory neuroepithelium, we identified TCF4, a leading SCZ risk locus implicated by genome-wide association studies, as a candidate MR dysregulated in SCZ. We validated the dysregulated TCF4-related transcriptional network by examining the transcription factor binding footprints inferred from human induced pluripotent stem cell (hiPSC)-derived neuronal ATAC-Seq data and direct binding sites obtained from ChIP-seq data in SH-SY5Y cells, which serve as in vitro models of neuronal function and differentiation. The predicted TCF4 transcriptional targets were enriched for genes showing transcriptomic changes upon knockdown of TCF4 in hiPSC-derived neural progenitor cells (NPC) and glutamatergic neurons (Glut_N), where the hiPSC line was derived from an SCZ patient. The TCF4 gene network perturbations in NPC were more similar than those in Glut_N to the expression differences in the TCF4 gene network observed in the DLPFC of individuals with SCZ. Moreover, TCF4-associated gene expression changes in NPC were more enriched than those in Glut_N for pathways involved in neuronal activity, genome-wide significant SCZ risk genes, and SCZ-associated de novo mutations.
Our results suggest that TCF4 serves as an MR of a gene network that confers susceptibility to SCZ at an early stage of neurodevelopment, highlighting the importance of network dysregulation involving core genes and many hundreds of peripheral genes in conferring susceptibility to neuropsychiatric diseases. For more details, see https://www.biorxiv.org/content/early/2018/05/16/133363.
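An enrichment statement such as "predicted targets were enriched for genes changed upon knockdown" is commonly quantified with a one-sided hypergeometric test on the overlap of two gene sets. The sketch below shows that calculation in plain Python. The specific test used in the study is not stated here, so this is an assumption for illustration, and the gene identifiers are placeholders rather than real results.

```python
from math import comb

def hypergeom_enrichment_p(overlap, n_targets, n_changed, n_background):
    """P(X >= overlap) when n_targets genes are drawn from a background
    of n_background genes, of which n_changed are differentially expressed."""
    max_overlap = min(n_targets, n_changed)
    total = comb(n_background, n_targets)
    tail = sum(
        comb(n_changed, k) * comb(n_background - n_changed, n_targets - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

def target_enrichment(targets, changed, background):
    """Overlap size and enrichment p-value for two gene sets."""
    targets = set(targets) & set(background)
    changed = set(changed) & set(background)
    overlap = len(targets & changed)
    p_value = hypergeom_enrichment_p(
        overlap, len(targets), len(changed), len(background)
    )
    return overlap, p_value

# Placeholder identifiers: 10 background genes, 3 predicted targets,
# 4 genes changed after knockdown, 2 genes shared between the sets.
example_background = [f"g{i}" for i in range(10)]
example_overlap, example_p = target_enrichment(
    ["g0", "g1", "g2"], ["g1", "g2", "g3", "g4"], example_background
)
```

Restricting both sets to a shared background matters in practice, because genes that could never have been detected in either experiment would otherwise inflate the apparent enrichment.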