Data Science, Statistics and R

Fall 2017 Course Syllabus

 

 


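Instructor

Professor Guan-Hua Huang (黃冠華)

Office: Room 423, Assembly Building I (綜合一館)

Phone: 03-5131334

Email: ghuang@stat.nctu.edu.tw

Class time and location

Tuesdays 9:00-12:00, Room 202, Assembly Building I

Course website

http://ghuang.stat.nctu.edu.tw/course/datasci17/

Offering unit

Institute of Statistics (master's program)

Permanent course number

IST5575

Credits

3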

 

Course overview and objectives

 

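In this era of data explosion, the skillful use of "big data" can bring new value and innovation to every aspect of our lives, from health care, government, and education to the economy and the humanities. Big data, however, is often messy, of uneven quality, and scattered across countless servers. Extracting the value hidden in it has therefore become one of today's most urgent tasks, and a new scientific field, "data science," has emerged in response. Statistics is the discipline of extracting useful information from complex data, so it plays a pivotal role in data science. Traditional statistics emphasizes the development of mathematical methodology and has a high barrier to entry, which often deters people in other fields who want to apply statistical analysis. In recent years, the R statistical software (https://www.r-project.org/) has made statistical analysis far more approachable: with R, users can readily carry out many complex statistical analyses without fully understanding the deep theoretical background of the methods.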

 

This course is built around real big data sets and the use of R. It guides participants through the basic principles of statistics, exploratory data analysis, the concepts and methods of statistical hypothesis testing, regression analysis, principal component and factor analysis, cluster analysis, classification and discriminant analysis, and related data mining concepts and methods.

 

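The lectures cover a broad range of relevant knowledge, with an emphasis on basic concepts and, where needed, model interpretation. In-depth theory and other details are only highlighted briefly or left to references. Real examples supplement the lecture material, and the R implementation of the methods is discussed in class. The semester grade is based on the submitted homework and the course project reports. Students from different backgrounds will be combined into project teams; each team chooses a big data analysis topic, proposes a solution to a specific problem, and carries out the entire analysis.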

 

Course components

 

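Lectures

In principle, every Tuesday 9:00-12:00, the instructor or an invited speaker lectures on course topics. The lectures cover a broad range of relevant knowledge, with an emphasis on the motivation behind it, basic concepts, and, where needed, model interpretation. In-depth theory and other details are only highlighted briefly or left to references. The hope is that when students later carry out statistical analyses on their own, this broad knowledge will widen the angles from which they approach a problem and give them many candidate solutions to choose from. When deeper model study or theoretical derivation is needed, they will know where to start and where to look for supporting information.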

 

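Lab sessions

Occasionally, on Tuesday 11:10-12:00, the teaching assistant or an invited speaker supplements the lecture material on a particular topic. Lab sessions focus on real examples that supplement the lectures, or on the R implementation of the analysis methods.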

 

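Self-directed reading and study before and after class

The lectures cover all relevant topics broadly and focus on concepts. For supplementary and extended material, sources and web links are provided, and students are required to read them on their own before or after class. Because big data analysis is developing rapidly, open courses, analysis methods, analysis tools, applications, open data, and more are available all over the web, so students will often need (or be able) to learn new software and tools on their own and absorb new knowledge and applications. Note that many web links and documents are written in English, so English reading ability is very important.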

 

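Homework

Homework centers on real statistical data analysis and practices three things: acquiring, cleaning, and storing data (data wrangling); applying correct and up-to-date statistical methods; and visualizing data and results. The purpose of homework is to build practical data analysis skills and to test your understanding of the lecture material. Treat homework as a learning opportunity rather than a way to earn points.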

 

Because most homework problems require implementation and analysis in R, homework must be written in R Markdown (http://rmarkdown.rstudio.com/). R Markdown combines your written explanations, mathematical expressions, R code, R output, and so on into a single document, which makes your analysis easy for others to read and reproduce.
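As a rough illustration of the R Markdown format described above, the following R sketch writes a minimal homework file and then renders it. The file name `homework1.Rmd`, the title, and the chunk contents are hypothetical examples, not course requirements. Rendering assumes the `rmarkdown` package and pandoc are installed and is skipped otherwise.

```r
# Minimal sketch: write a tiny R Markdown homework file, then render it.
# The file name, title, and chunk contents are hypothetical examples.
rmd_lines <- c(
  "---",
  "title: \"Homework 1\"",
  "author: \"Your Name\"",
  "output: html_document",
  "---",
  "",
  "## Problem 1",
  "",
  "Written explanation, with math such as $\\bar{x} = \\frac{1}{n}\\sum_{i} x_i$.",
  "",
  "```{r speed-summary}",
  "summary(cars$speed)  # R code; its output is inserted below the chunk",
  "```"
)

# Write the document to a temporary directory.
rmd_file <- file.path(tempdir(), "homework1.Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering needs the rmarkdown package and pandoc; skip if unavailable.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

When rendering succeeds, the resulting HTML file shows the text, the formula, the R code, and its output together in one document. In practice the `.Rmd` file would usually be edited directly rather than generated from a script.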

 

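You may discuss homework with other students to help understand the questions and clarify course concepts. However, you must complete the homework you submit independently: the computer programs you write, the data analyses you run, and the interpretations of results required in the homework must not be done jointly with others.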

 

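Course project

Each student must complete a big data analysis project. Its purpose is to let you take a topic you care about or find interesting and apply the methods and techniques learned in class through every stage: problem formulation, data source identification, data collection, storage, and cleaning, model building and analysis, and the presentation, explanation, and visualization of results. In this way you experience a complete big data analysis from start to finish.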

 

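Each project report is completed jointly by at most 4 students; team members should ideally combine different backgrounds (statistics, computer science, other domain expertise). Each team chooses a topic it cares about or finds interesting (not a technical study of models, methods, or theory). At midterm, each team member individually submits a written report on the project topic (including a description of the problem and the planned approach to answering it). At the end of the semester, the whole team gives a 15-minute oral presentation and submits a final written report covering the project's problem (what is the goal? what do you want to predict or estimate?), data (where did it come from? what does it look like?), analysis models, and results (new findings, communication with the audience, visualization).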

 

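Prerequisites

 

1. Experience writing computer programs

   • For example: C, C++, Java, Python, R, …

2. Preferably, completion of an introductory statistics course

   • Familiarity with: random variables, confidence intervals, hypothesis testing, …

3. Willingness to learn new software and tools

   • This is often very time-consuming

   • Requires reading a large amount of online documentation

   • Requires reading many documents in English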

 

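Course software and textbooks

 

This course uses the R statistical software (http://www.r-project.org/) as the tool for hands-on data analysis. Both the lab sessions led by the teaching assistant and the homework problems are therefore based on working with and writing R code. Homework must be written in R Markdown (http://rmarkdown.rstudio.com/), which combines your written explanations, mathematical expressions, R code, R output, and so on into a single document that others can easily read and reproduce.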

 

There is no required textbook that must be purchased, but the self-study reading and supplementary material are drawn from the following reference books:

1. Irizarry RA, Love MI (2015): Data Analysis for the Life Sciences. Information about this book is available at: https://leanpub.com/dataanalysisforthelifesciences

2. Montgomery DC, Peck EA, Vining GG (2012): Introduction to Linear Regression Analysis (5th Edition). Wiley. This is the main reference for regression analysis.

3. Johnson RA, Wichern DW (2007): Applied Multivariate Statistical Analysis (6th Edition). Prentice Hall, Upper Saddle River, NJ. This is the main reference for multivariate analysis.

 

All lecture slides and supplementary materials, together with the R code used to run the lab-session examples and to produce the figures in the lecture notes, will be posted on the course website.

 

Grading

 

The semester grade is computed as follows:

1. Homework: 50% (based on individually submitted homework)

2. Project midterm report: 20% (based on the individually submitted written report)

3. Project final report: 30% (based on the whole team's report)
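
The weighting above amounts to a simple weighted sum. The R sketch below works through one case with hypothetical component scores on a 0-100 scale; the scale and the scores are assumptions for illustration, and only the three weights come from this syllabus.

```r
# Grading weights as stated in the syllabus.
weights <- c(homework = 0.50, midterm_report = 0.20, final_report = 0.30)

# Hypothetical component scores on a 0-100 scale, for illustration only.
scores <- c(homework = 80, midterm_report = 90, final_report = 70)

# Weighted sum; indexing by name keeps each score paired with its weight.
final_grade <- sum(weights * scores[names(weights)])
print(final_grade)  # 0.5 * 80 + 0.2 * 90 + 0.3 * 70 = 79
```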

 

Course outline

 

•   Fundamentals of statistics

      Summary statistics

      Measures of association

      Random variables

      Probability mass (density) function

      Cumulative distribution function

      Mean and variance

      Central limit theorem

      Statistical inference

      Point estimate

      Confidence interval

      Test of significance

      P-value

•   Exploratory data analysis

      Measurement scales, data types

      R graphic package: ggplot2

      Displaying distribution of univariate data: stem-and-leaf plot, q-q plot, histogram, box plot, bar chart, pie chart

      Displaying correlation for bivariate data: scatterplot, box plots, stacked bar chart, faceting bar charts, stacked area chart, time series plot

      Displaying association for multivariate data: 3d scatterplot, lattice in the 3rd dim, map the 3rd dim to colors, lay out panels in the 3rd dim, scatterplot matrices, heatmap

•   Statistical decision making: hypothesis testing

      Basic concepts: null versus alternative hypothesis, type I and type II errors, significance level, test statistic, power, p-values

      Hypothesis testing for continuous random variables: one-sample t-test, two-sample t-test, F-test for equal variance, ANOVA, paired t-test

      Hypothesis testing for categorical data: binomial test, χ² test / Fisher's exact test, McNemar's test, Cohen's kappa test, Mantel-Haenszel test

      Nonparametric statistical methods: sign test, Wilcoxon signed-rank test, Wilcoxon rank-sum test, Kruskal-Wallis test

      Computational methods: permutation test, bootstrap

•   Regression analysis

      Simple and multiple linear regressions for continuous data

      Interpretation and estimation of regression coefficients

      Confounding and interaction

      Regression diagnostics

      Logistic regressions for binary data

•   Principal component and factor analysis

      Population principal components

      Summarizing sample variation by principal components

      Orthogonal factor model

      Factor rotation

      Factor scores

•   Cluster analysis

      Similarity and distance measures

      Hierarchical clustering methods

      K-means clustering methods

      Multidimensional scaling

•   Classification and discriminant analysis

      Linear/quadratic discriminant analysis

      Support vector machine (SVM)

      Neural networks (NN)

      Classification and regression trees (CART)

      K-nearest neighbor (KNN)