NATIONAL CHIAO TUNG UNIVERSITY

INSTITUTE OF STATISTICS

ANALYSIS OF HIGH-THROUGHPUT GENOMIC DATA

FALL 2020

Instructor:	Guan-Hua Huang, Ph.D.
	Office: 423 Joint Education Hall
	Phone: 03-513-1334
	Email: ghuang@stat.nctu.edu.tw
Class meetings:	Monday 9:00 - 12:00 at 406 Joint Education Hall
Office hours:	By appointment
Class website:	http://ghuang.stat.nctu.edu.tw/course/htgenomic20/
Credit:	Three (3) credits

COURSE SUMMARY

Novel statistical methodology can enhance our understanding of how multiple genes and environmental factors interact in a complex disease. The massive volume of high-throughput genomic data poses a major challenge for developing advanced statistical and computational data mining tools. In this course, we will cover effective statistical methods for analyzing such data. The course focuses on three types of high-throughput data: gene expression microarrays, single nucleotide polymorphism (SNP) markers, and next-generation sequencing (NGS) reads.

Topics include

• Gene expression:

- Technology and measurement

- Quality assessment

- Preprocessing Affymetrix GeneChip: background adjustment, normalization and summarization

- Differential expression

- Clustering and prediction

- Gene set enrichment analysis

• SNP markers:

- Preliminary analyses: Hardy-Weinberg equilibrium, haplotype and genotype data, measures of linkage disequilibrium, estimates of recombination rates, SNP tagging

- Population-based association study: case-control and family study

- Candidate-gene and genome-wide association studies

- Population stratification

- Tests of association: single and multiple SNPs

- Epistatic effects and gene-environment interactions

- Multiple testing

• NGS reads:

- DNA sequencing

- Next-generation sequencing platforms

- 1000 Genomes project

- Genotype and SNP calling from NGS

- Tests of association for common and rare SNPs

- Structural variation in the human genome

- Best practice in analyzing next-generation sequencing data

HANDOUTS AND TEXTBOOKS

Handouts for each lecture will be available on the class website before class. There is no required textbook for this course. The following books are recommended for further reading:

Gentleman R, Carey VJ, Huber W, Irizarry RA, Dudoit S (Editors) (2005). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer.

Draghici S (2012). Statistics and Data Analysis for Microarrays Using R and Bioconductor, 2nd Edition. Chapman & Hall/CRC Press.

Datta S, Nettleton D (Editors) (2014). Statistical Analysis of Next Generation Sequencing Data. Springer.

Thomas DC (2004). Statistical Methods in Genetic Epidemiology. Oxford.

Collins AR (Editor) (2007). Linkage Disequilibrium and Association Mapping: Analysis and Applications. Humana Press.

PREREQUISITES

Students are expected to be familiar with the R programming language and the Bioconductor packages. A background in probability and mathematical statistics is required.

METHOD OF STUDENT EVALUATION

The course grade will be based on three homework assignments, attendance, participation, and a final project.