Data Science, Statistics and R
Fall 2017 Course Syllabus
Instructor: Professor Guan-Hua Huang (黃冠華)
Office: Room 423, Assembly Building I
Phone: 03-5131334
Email: ghuang@stat.nctu.edu.tw
Class time and location: Tuesdays 9:00-12:00, Room 202, Assembly Building I
Course web page:
Offering unit: Institute of Statistics
Permanent course number: IST5575
Credits: 3
Course Overview and Objectives
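In this era of exploding data, skillful use of "big data" can bring new value and innovation to every aspect of our lives, from health care and government to education, the economy, and the humanities. Big data, however, is often messy and incomplete, uneven in quality, and scattered across countless servers. Extracting the value hidden in big data has therefore become one of today's most urgent tasks, and a new scientific field, "data science," has emerged in response. Statistics is the discipline of extracting useful information from complex data, so it plays a pivotal role in data science. Traditional statistics emphasizes the development of mathematical methodology, and its high barrier to entry often deters people in other fields who want to apply statistical analysis. In recent years, the R statistical software (https://www.r-project.org/) has changed this inaccessible image: with R, users can readily carry out many complex statistical analyses without fully understanding the deep theoretical background of the methods.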
This course is built around real big data and the use of the R statistical software. It guides participants through the basic principles of statistics, exploratory data analysis, the concepts and methods of statistical hypothesis testing, regression analysis, principal component and factor analysis, cluster analysis, classification and discrimination analysis, and other concepts and methods related to data mining.
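The lectures cover the relevant knowledge broadly, with emphasis on basic concepts and, where needed, model interpretation. In-depth theory and other details are only highlighted briefly or covered by references. Real examples supplement the lecture material in class, and we discuss how to implement the related methods in R. The semester grade is based on submitted homework and the course project reports. Students from different backgrounds will be combined into project teams; each team chooses its own big data analysis topic, proposes a solution to a specific problem, and carries out the entire analysis.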
Course Components
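Lectures
In principle, lectures are held every Tuesday from 9:00 to 12:00 and are given by the instructor or an invited speaker on course-related topics. The lectures cover the relevant knowledge broadly, with emphasis on the motivation behind each method, basic concepts, and, where needed, model interpretation. In-depth theory and other details are only highlighted briefly or covered by references. The hope is that when students later carry out statistical analyses on their own, this breadth of knowledge will widen the angles from which they approach a problem and give them many candidate solutions to choose from. When they need to study a model in more depth or work through theoretical derivations, they will know where to start and where to look for supporting information.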
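Recitation Sessions
Held occasionally on Tuesdays from 11:10 to 12:00, recitation sessions are led by the teaching assistant or an invited speaker and supplement the lecture material on a particular topic. They focus on real examples that supplement the lectures, or on implementing the related analysis methods in R.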
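Self-Directed Reading and Study Before and After Class
Lectures cover all relevant topics broadly and emphasize concepts. For supplementary and extended material, sources and web links are provided, and students are required to read them on their own before or after class. Because big data analysis is developing rapidly, related open courses, analysis methods, analysis tools, applications, open data, and so on are spread across the web, so students will often need (or have the opportunity) to learn new software and tools on their own and to absorb new knowledge and applications. Note that many web links and documents are written in English, so English reading ability is very important.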
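Homework
Homework centers on real statistical data analysis and practices data acquisition, cleaning, and storage and retrieval (data wrangling); the use of correct and up-to-date statistical methods; and the visualization of data and results. The purpose of the homework is to learn hands-on data analysis skills and to test your understanding of the lecture material. Treat homework as an opportunity to learn, not as a way to earn points.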
Because most homework problems must be implemented and analyzed in R, homework must be written in R Markdown (http://rmarkdown.rstudio.com/) format. R Markdown combines your explanatory text, mathematical expressions, R code, R output, and so on into a single document, which makes it easy for others to read and reproduce your analysis.
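You may discuss the homework with other students to help understand the questions and clarify course concepts. However, you must complete the work you submit independently: the computer programs, data analyses, and interpretations of results required by the homework must not be done jointly with others.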
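Course Project
Each enrolled student must complete a big data analysis project. Its purpose is to let you take a topic that you care about or find interesting and apply the methods and techniques learned in class to carry out an entire big data analysis project, from problem formulation, identification of data sources, data collection, storage and organization, and model building and analysis to the presentation, explanation, and visualization of results, so as to see the full picture of big data analysis.
Each project report is completed jointly by at most four enrolled students, and teams are expected to combine different professional backgrounds (statistics, computer science, and other domain knowledge). Each team chooses a topic that it cares about or finds interesting (not a technical investigation of models, methods, or theory). At midterm, each team member individually submits a written report on the project topic (including a description of the problem and how it is expected to be answered). At the end of the semester, the whole team gives a 15-minute oral presentation and submits a final written report covering the project's problem (What is the goal? What do you want to predict or estimate?), data (Where did it come from? What does it look like?), analysis models, and results (new findings, communication with the audience, visualization).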
Prerequisites or Required Background
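1. Experience writing computer programs
• For example: C, C++, Java, Python, R, …
2. Preferably, a prior course in basic statistics
• Familiarity with random variables, confidence intervals, hypothesis testing, …
3. Willingness to learn new software and tools
• This often takes a great deal of time
• Requires extensive reading of online documents
• Requires reading many English-language documents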
Course Software and Textbooks
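This course uses the R statistical software (http://www.r-project.org/) as the tool for hands-on data analysis. Recitation sessions and homework problems are therefore all based on operating and programming in R. Homework must be written in R Markdown (http://rmarkdown.rstudio.com/) format, so that your explanatory text, mathematical expressions, R code, R output, and so on are combined into a single document that others can easily read and use to reproduce your analysis.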
Although this course has no assigned textbook that must be purchased, the self-directed reading and supplementary material will be drawn from the following reference books:
1. Irizarry RA, Love MI (2015): Data Analysis for the Life Sciences. Information about this book is available at https://leanpub.com/dataanalysisforthelifesciences
2. Montgomery DC, Peck EA, Vining GG (2012): Introduction to Linear Regression Analysis (5th Edition). Wiley. This is the main reference for regression analysis.
3. Johnson RA, Wichern DW (2007): Applied Multivariate Statistical Analysis (6th Edition). Prentice Hall, Upper Saddle River, NJ. This is the main reference for multivariate analysis.
All lecture slides and related supplementary material, together with the R code used to run the recitation examples and to produce the figures in the lecture notes, will be posted on the course web page.
Grading
The semester grade is computed as follows:
1. Homework: 50% (based on individually submitted homework)
2. Project midterm report: 20% (based on individually submitted written reports)
3. Project final report: 30% (based on the whole team's report)
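As a quick illustration of the weighting above, the following Python sketch computes a semester grade from the three components. The function name `semester_grade`, the component keys, the 0-100 scoring scale, and the example scores are all assumptions made for illustration; the syllabus specifies only the 50/20/30 weights. Python is used here only to keep the example compact and self-contained, although the course itself works in R.

```python
# Weights taken from the grading scheme above (50% / 20% / 30%).
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {"homework": 0.50, "midterm_report": 0.20, "final_report": 0.30}


def semester_grade(scores):
    """Return the weighted semester grade.

    `scores` maps each component name in WEIGHTS to a numeric score.
    A 0-100 scale is assumed; the syllabus does not state the scale.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    # Weighted sum over the three graded components.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student.
example = {"homework": 80, "midterm_report": 90, "final_report": 70}
print(semester_grade(example))
```

With these hypothetical scores the weighted grade comes to about 79 (40 from homework, 18 from the midterm report, 21 from the final report).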
Course Outline
• Fundamentals of statistics
━ Summary statistics
━ Measures of association
━ Random variables
━ Probability mass (density) function
━ Cumulative distribution function
━ Mean and variance
━ Central limit theorem
━ Statistical inference
━ Point estimate
━ Confidence interval
━ Test of significance
━ P-value
• Exploratory data analysis
━ Measurement scales, data types
━ R graphic package: ggplot2
━ Displaying distribution of univariate data: stem-and-leaf plot, q-q plot, histogram, box plot, bar chart, pie chart
━ Displaying correlation for bivariate data: scatterplot, box plots, stacked bar chart, faceting bar charts, stacked area chart, time series plot
━ Displaying association for multivariate data: 3d scatterplot, lattice in the 3rd dim, map the 3rd dim to colors, lay out panels in the 3rd dim, scatterplot matrices, heatmap
• Statistical decision making: hypothesis testing
━ Basic concepts: null versus alternative hypothesis, type I and type II errors, significance level, test statistic, power, p-values
━ Hypothesis testing for continuous random variables: one-sample t-test, two-sample t-test, F-test for equal variance, ANOVA, paired t-test
━ Hypothesis testing for categorical data: binomial test, χ² test / Fisher's exact test, McNemar's test, Cohen's kappa test, Mantel-Haenszel test
━ Nonparametric statistical methods: sign test, Wilcoxon signed-rank test, Wilcoxon rank-sum test, Kruskal-Wallis test
━ Computational methods: permutation test, bootstrap
• Regression analysis
━ Simple and multiple linear regressions for continuous data
━ Interpretation and estimation of regression coefficients
━ Confounding and interaction
━ Regression diagnostics
━ Logistic regressions for binary data
• Principal component and factor analysis
━ Population principal components
━ Summarizing sample variation by principal components
━ Orthogonal factor model
━ Factor rotation
━ Factor scores
• Cluster analysis
━ Similarity and distance measures
━ Hierarchical clustering methods
━ K-means clustering methods
━ Multidimensional scaling
• Classification and discrimination analysis
━ Linear/quadratic discriminant analysis
━ Support vector machine (SVM)
━ Neural networks (NN)
━ Classification and regression trees (CART)
━ K-nearest neighbor (KNN)