Courses and training activities



The 2019 Dragon Star Bioinformatics course was taught by Dr. Kai Wang from July 19 to August 2, 2019. It was a five-day course with the theme "Genomics of Human Diseases".

Housekeeping issues and acknowledgements are given here. The daily schedule was as follows:

  1. Day 1
      Genomic technologies in disease studies: Course slide, Computing exercise (in Chinese)
      NGS data formats and variant calling: Course slide, Computing exercise
  2. Day 2
      Alignment of short/long-read sequencing data: Course slide, Computing exercise
      Genome assembly by short/long-read sequencing: Course slide, Computing exercise
  3. Day 3
      Detection of structural variants in human diseases: Course slide, Computing exercise
      Annotation and phenotype-driven interpretation of genetic variants: Course slide, Computing exercise 1, Computing exercise 2
  4. Day 4
      SNP- and sequencing-based genome-wide association studies (GWAS): Course slide, Computing exercise
      Rare-variant and de novo variant association studies: Course slide, Computing exercise
      Introduction to cloud computing in bioinformatics
  5. Day 5
      RNA-Seq in human diseases: Course slide, Computing exercise
      Deep Learning in sequencing data analysis: Course slide



This is a quantitative genomics summer training course (June 6 - June 11, 2020) developed by Iuliana Ionita-Laza (Columbia), Hae Kyung Im (Chicago) and Kai Wang (Penn). The topic of the course is "Methods and tools for whole-genome and transcriptome analyses".

The course is not restricted to Columbia; it is open to all educational and research institutes and biopharmaceutical companies in the NYC region. However, due to COVID-19, the course was switched to an online-only format.

In session 1 of the first day, we discussed various computational approaches that can be used for genomic variant annotation. In session 2, we discussed phenotype-driven prioritization of candidate genes. After the lunch break, we had a hands-on exercise to analyze real-world genomic data using a provided VCF file containing genetic variants from a patient with known phenotypic presentations.

The exercise can be reproduced by those who registered for the course but did not have enough time to complete all exercises.
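As a rough illustration of the kind of analysis described above (this is not the course exercise itself), the following sketch reads variant records from VCF-formatted text and keeps those that pass the quality filter and fall in a phenotype-derived list of candidate genes. The three-variant VCF excerpt, the `GENE` INFO key, the gene names, and the helper names `parse_vcf` and `prioritize` are all hypothetical; real annotation tools write gene information in their own fields.

```python
import io

# Hypothetical three-variant VCF excerpt (not data from the course).
# The GENE key in the INFO column is an assumed annotation field.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t12345\t.\tA\tG\t50\tPASS\tGENE=GENE_A
2\t67890\t.\tC\tT\t12\tLowQual\tGENE=GENE_B
7\t13579\t.\tG\tGA\t99\tPASS\tGENE=GENE_C
"""


def parse_vcf(handle):
    """Yield one dict per variant record, skipping header lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # meta-information and column header lines
        chrom, pos, _vid, ref, alt, _qual, flt, info = line.split("\t")[:8]
        # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
        info_map = dict(
            item.split("=", 1) if "=" in item else (item, "")
            for item in info.split(";")
        )
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "filter": flt,
            "info": info_map,
        }


def prioritize(variants, candidate_genes):
    """Keep PASS variants whose annotated gene is in the candidate set."""
    return [
        v for v in variants
        if v["filter"] == "PASS" and v["info"].get("GENE") in candidate_genes
    ]


# Hypothetical candidate genes derived from the patient's phenotype.
candidates = {"GENE_B", "GENE_C"}
hits = prioritize(parse_vcf(io.StringIO(VCF_TEXT)), candidates)
for v in hits:
    print(v["chrom"], v["pos"], v["ref"], v["alt"], v["info"]["GENE"])
```

Only the chromosome 7 variant is reported here: the `GENE_B` variant is in the candidate list but fails the quality filter, and the `GENE_A` variant passes the filter but is not a candidate gene.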



I started a bioinformatics lab in 2010. I built a computing cluster for lab members, and we have been using the cluster ever since. Occasionally people ask me how I administer a cluster myself, and I usually point them to the Rocks cluster documentation. However, that documentation can be overly complicated, yet it is insufficient to solve many of the practical problems that I have encountered over the past few years. Therefore, I decided to write a simple tutorial on building a computing cluster for a bioinformatics lab.

This tutorial is organized into several main sections:

  1. Hardware
  2. Installation
  3. System customization
  4. User account administration
  5. Network administration
  6. SGE administration
  7. Storage administration
  8. NFS administration
  9. Remote management
  10. Misc Linux configuration

The entire tutorial is available online and will be updated regularly.



This is a quantitative genomics summer training course (June 24 - June 25, 2021) developed by Iuliana Ionita-Laza (Columbia), Hae Kyung Im (Chicago) and Kai Wang (Penn). The topic of the course is "Methods and tools for whole-genome and transcriptome analyses".

This two-day intensive workshop will provide a rigorous introduction to several different techniques for analyzing whole-genome sequencing and transcriptome data. Led by a team of experts in statistical genomics and bioinformatics who have developed their own methods to analyze such data, the training will integrate seminar lectures with hands-on computer lab sessions to put concepts into practice. The training will focus on reviewing existing approaches based on predicted expression association with traits, colocalization of causal variants, and Mendelian randomization, including discussion of how they relate to each other and of their advantages and limitations. Emphasis will also be given to reviewing integrative sequence-based association studies for whole-genome sequencing data, and functional annotation of variants in noncoding regions of the genome.

You can register for the 2021 course here.



The IDDRC Data Science Core Machine Learning Short Course will be open to members of the Penn and CHOP community who work on intellectual and developmental disabilities (IDD). The goal of the short course is to provide a basic introduction to machine-learning concepts and to demonstrate examples of their research and clinical use, in order to increase awareness of these computational approaches in the IDD community.

The course will be composed of two modules, each consisting of four 2-hour sessions. Module 1 will introduce basic concepts of machine learning to researchers with limited background. On completion, an attendee should be able to understand the methods commonly applied in the biomedical literature. The module will cover classification, regression, clustering, the main steps of machine learning and predictive modeling (training, testing, cross-validation), feature engineering, validation metrics, and decision making.

Module 2 will involve implementation of basic machine-learning techniques. This module targets trainees with Linux and command-line experience, who will gain hands-on experience through a series of exercises using Python. Jupyter Notebook will be used to document software code and the figures and graphs generated in the course. We will introduce Pandas for rapid access to and management of large data sets, NumPy and SciPy for simple mathematical operations on the data sets, simple TensorFlow or PyTorch scripts for machine-learning exercises, and a cornerstone project that uses existing deep-learning tools to analyze omics data sets. We will illustrate reproducible workflows using Conda and reproducible code using GitHub.

Details of the course, including registration links, will be announced once they are finalized.
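To make the training, testing, and cross-validation steps mentioned above concrete, here is a minimal sketch of k-fold cross-validation. It is not course material: it assumes a synthetic two-class data set and a simple nearest-centroid classifier, and it uses only the Python standard library rather than the Pandas, NumPy, TensorFlow, or PyTorch tools named above. All function names are illustrative.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Training step: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [s / counts[label] for s in acc]
        for label, acc in sums.items()
    }


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validate(X, y, k=5, seed=0):
    """Return the held-out accuracy of each of the k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Fit on the training folds only, then score on the held-out fold.
        centroids = fit_centroids(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        correct = sum(predict(centroids, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Synthetic data (an assumption for illustration): two well-separated
# Gaussian clusters in two dimensions, 50 samples per class.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

scores = cross_validate(X, y, k=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Each sample is used for testing exactly once and for training in the other four folds, so the mean of the per-fold accuracies estimates performance on unseen data without setting aside a separate test set.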