====== Statistics for Data Science (628PP) A.Y. 2022/23 ======

=====Instructor=====

  * **Salvatore Ruggieri**
  * Università di Pisa
  * [[http://pages.di.unipi.it/ruggieri/]]
  * [[salvatore.ruggieri@unipi.it]]
  * **Office hours:** Wednesdays h 14:00 - 16:00 or by appointment, at the Department of Computer Science, room 321/DO, or via Teams.

=====Classes=====

^ Day of Week ^ Hour ^ Room ^
| Wednesday | 9:00 - 11:00 | Fib-C |
| Thursday | 11:00 - 13:00 | Fib-C |
| Friday | 14:00 - 16:00 | Fib-C |

=====Pre-requisites=====

Students should be comfortable with most of the topics on mathematical calculus covered in:

  * **[P]** J. Ward, J. Abdey. **Mathematics and Statistics**. University of London, 2013. __Chapters 1-8 of Part 1__.

Extra lessons refreshing these notions may be planned in the first part of the course.

=====Mandatory Teaching Material=====

The following are //mandatory text books//:

  * **[T]** F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. **A Modern Introduction to Probability and Statistics**. Springer, 2005.
  * **[R]** P. Dalgaard. **Introductory Statistics with R**. 2nd edition, Springer, 2008.
  * selected chapters of other books for advanced topics

=====Software=====

  * [[https://cran.r-project.org/|R]]
  * [[https://www.rstudio.com/|R Studio]]

=====Preliminary program and calendar=====

  * [[https://esami.unipi.it/programma.php?c=57053&aa=2022|Preliminary program]].
  * [[https://didattica.di.unipi.it/laurea-magistrale-in-data-science-and-business-informatics/orario-magistrale-data-science-business-informatics/|Calendar of lessons]].

=====Exams=====

__//There are no mid-terms//.__ The exam consists of a written part and an oral part. The written part consists of open questions and exercises on the topics of the course. Each question is assigned a grade, summing up to 30 points. Students are admitted to the oral part if they receive a grade of at least 18 points.
Example written texts: **{{ :mds:sds:sds_sample1.pdf | sample1}}**, **{{ :mds:sds:sds_sample2.pdf | sample2}}**. The oral part consists of a critical discussion of the written part and of open questions and problem solving on the topics (both theory and R programming) of the course.

Registration for exams is mandatory (**beware of the registration deadline!**): [[https://esami.unipi.it/esami2/|register here]]\\

^ Date ^ Hour ^ Room ^ Notes ^
| 6/3/2024 | 11:00 - 13:00 | Dip. Inf. - Seminari Est | [[https://didattica.di.unipi.it/en/appelli-straordinari/|Extraordinary exam]] |

=====Student project=====

  * The project replaces the written part of the examination
  * {{:mds:sds:sds.project.2023.pdf |Project description, rules, and Q&A}}.

=====Teams channel=====

A [[https://teams.microsoft.com/l/team/19%3aUXLp8LsaQdVRG5tOpd1wu8iBkhzgz8uUt-eEfWGgoNk1%40thread.tacv2/conversations?groupId=ee415f6c-9177-47d7-9be4-639da1fe5ea0&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams channel]] will be used to post news, Q&A, and other material related to the course.

=====Class calendar=====

Lessons will **NOT** be live-streamed, but recordings of past years are available here for non-attending students.\\
To watch the recordings online, you must be connected to the [[https://start.unipi.it/en/help-ict/vpn/|unipi.it VPN]]. Alternatively, right click on the link and download the whole file, then watch it locally on your device using e.g. [[http://www.videolan.org/vlc/|VLC media player]].

Slides and R scripts might be updated after the classes to align with the actual content of the lessons and to correct typos. Be sure to download the updated versions.

^ # ^ Date ^ Room ^ Topic ^ Teaching material ^
|01| 22/02 9-11| Fib-C | Introduction. Probability and independence. [[http://131.114.72.230/sds/video/sds01_20220215.mp4|rec01 (.mp4)]] | **[T]** Chpts. 1-3 {{:mds:sds:sds01.pdf|slides01 (.pdf)}}|
|02| 23/02 11-13| Fib-C | R basics. [[http://131.114.72.230/sds/video/sds02_20220217.mp4|rec02 (.mp4)]] | **[R]** Chpts. 1,2.1-2.3 {{:mds:sds:sds02.pdf|slides02 (.pdf)}}, {{:mds:sds:sds02.r|script02 (.R)}}|
|03| 24/02 14-16| Fib-E | Bayes' rule and applications. [[http://131.114.72.230/sds/video/sds03_20220218.mp4|rec03 (.mp4)]] | **[T]** Chpt. 3 {{:mds:sds:sds03.pdf|slides03 (.pdf)}}, {{:mds:sds:sds03.r|script03 (.R)}}|
|04| 01/03 9-11 | Fib-C | Discrete random variables. [[http://131.114.72.230/sds/video/sds04_20220222.mp4|rec04 (.mp4)]] | **[T]** Chpts. 4, 9.1, 9.2, 9.4 **[R]** Chpt. 3 {{:mds:sds:sds04.pdf|slides04 (.pdf)}}, {{:mds:sds:sds04.r|script04 (.R)}}|
|05| 02/03 11-13 | Fib-C | Discrete random variables (continued). [[http://131.114.72.230/sds/video/sds05_20220224.mp4|rec05 (.mp4)]] | |
|06| 03/03 14-16 | Fib-C | Review: derivatives and integrals. [[http://131.114.72.230/sds/video/sds06_20220225.mp4|rec06 (.mp4)]] | **[P]** Chpts. 1-8 {{:mds:sds:sds06.pdf|slides06 (.pdf)}}, {{:mds:sds:sds06.r|script06 (.R)}}|
|07| 08/03 9-11 | Fib-C | R data access and programming. [[http://131.114.72.230/sds/video/sds07_20220301.mp4|rec07 (.mp4)]] | **[R]** Chpt. 2.3,2.4 {{:mds:sds:sds07.zip|script07 (.zip)}} |
|08| 09/03 11-13 | Fib-C | Continuous random variables. [[http://131.114.72.230/sds/video/sds08_20220303.mp4|rec08 (.mp4)]] | **[T]** Chpts. 5, 9.2-9.4 **[R]** Chpt. 3 {{:mds:sds:sds08.pdf|slides08 (.pdf)}}, {{:mds:sds:sds08.r|script08 (.R)}}|
|09| 10/03 14-16 | Fib-C | Expectation and variance. Computations with random variables. [[http://131.114.72.230/sds/video/sds09_20220304.mp4|rec09 (.mp4)]] | **[T]** Chpts. 7,8 {{:mds:sds:sds09.pdf|slides09 (.pdf)}}, {{:mds:sds:sds09.r|script09 (.R)}}|
|10| 15/03 9-11| Fib-C | Expectation and variance. Computations with random variables (continued). [[http://131.114.72.230/sds/video/sds10_20220308.mp4|rec10 (.mp4)]] | |
|11| 16/03 11-13| Fib-C | Moments. Functions of random variables. [[http://131.114.72.230/sds/video/sds11_20220310.mp4|rec11 (.mp4)]] | **[T]** Chpts. 9-11 {{:mds:sds:sds11.pdf|slides11 (.pdf)}}, {{:mds:sds:sds11.zip|script11 (.zip)}} |
|12| 17/03 14-16 | Fib-C | Simulation. [[http://131.114.72.230/sds/video/sds12_20220311.mp4|rec12 (.mp4)]] | **[T]** Chpts. 6.1-6.2 {{:mds:sds:sds12.pdf|slides12 (.pdf)}}, {{:mds:sds:sds12.r|script12 (.R)}} {{:mds:sds:sds12_sol07.r|script12_sol07 (.R)}}|
|13| 22/03 9-11 | Fib-C | Power laws and Zipf's law. [[http://131.114.72.230/sds/video/sds13_20220315.mp4|rec13 (.mp4)]] | [[https://arxiv.org/pdf/cond-mat/0412004.pdf | Newman's paper]] Sect. I, II, III(A,B,E,F) {{:mds:sds:sds13.pdf|slides13 (.pdf)}}, {{:mds:sds:sds13.r|script13 (.R)}}|
|14| 23/03 11-13| Fib-C | Law of large numbers. The central limit theorem. [[http://131.114.72.230/sds/video/sds14_20220317.mp4|rec14 (.mp4)]] | **[T]** Chpts. 13-14 {{:mds:sds:sds14.pdf|slides14 (.pdf)}}, {{:mds:sds:sds14.R|script14 (.R)}} |
|15| 24/03 14-16 | Fib-C | Graphical summaries. Kernel Density Estimation. [[http://131.114.72.230/sds/video/sds15_20220322.mp4|rec15 (.mp4)]] | **[T]** Chpt. 15, **[R]** Chpt. 4 {{:mds:sds:sds15.pdf|slides15 (.pdf)}}, {{:mds:sds:sds15.r|script15 (.R)}}|
|16| 29/03 9-11| Fib-C | Numerical summaries. [[http://131.114.72.230/sds/video/sds16_20220324.mp4|rec16 (.mp4)]] | **[T]** Chpt. 16, **[R]** Chpt. 4 {{:mds:sds:sds16.pdf|slides16 (.pdf)}}, {{:mds:sds:sds16.r|script16 (.R)}} |
|17| 30/03 11-13 | Fib-C | Data preprocessing in R. Estimators. [[http://131.114.72.230/sds/video/sds17_20220325.mp4|rec17 (.mp4)]] | **[R]** Chpt. 10, **[T]** Chpts. 17.1-17.3 {{:mds:sds:sds17.r|script17 (.R)}}, {{ :mds:sds:dataprep.r | dataprep.R}} |
|18| 31/03 14-16 | Fib-C | Unbiased estimators. Efficiency and MSE. [[http://131.114.72.230/sds/video/sds18_20220329.mp4|rec18 (.mp4)]] | **[T]** Chpts. 19, 20 {{:mds:sds:sds18.pdf|slides18 (.pdf)}}, {{:mds:sds:sds18.r|script18 (.R)}} |
|19| 05/04 9-11 | Fib-C | Maximum likelihood estimation. [[http://131.114.72.230/sds/video/sds19_20220331.mp4|rec19 (.mp4)]] | **[T]** Chpt. 21 {{ :mds:sds:sdsln.pdf |}} Chpt. 1 {{:mds:sds:sds19.pdf|slides19 (.pdf)}}, {{:mds:sds:sds19.r|script19 (.R)}} |
|20| 06/04 11-13 | Fib-C | Linear regression. Least squares estimation. [[http://131.114.72.230/sds/video/sds20_20220405.mp4|rec20 (.mp4)]] | **[T]** Chpts. 17.4,22 **[R]** Chpt. 6 {{ :mds:sds:sdsln.pdf |}} Chpt. 2 {{:mds:sds:sds20.pdf|slides20 (.pdf)}}, {{:mds:sds:sds20.r|script20 (.R)}} |
|21| 12/04 9-11 | Fib-C | Non-linear and multiple linear regression. [[http://131.114.72.230/sds/video/sds21_20220407.mp4|rec21 (.mp4)]] | **[R]** Chpt. 12.1,13,16.1-16.2 {{ :mds:sds:sdsln.pdf |}} Chpt. 2 {{:mds:sds:sds21.pdf|slides21 (.pdf)}}, {{:mds:sds:sds21.R|script21 (.R)}} |
|22| 13/04 11-13 | Fib-C | Issues with linear regression. Logistic regression. [[http://131.114.72.230/sds/video/sds22_20220408.mp4|rec22 (.mp4)]] | **[R]** Chpt. 12.1,13,16.1-16.2 {{:mds:sds:sds22.pdf|slides22 (.pdf)}}, {{:mds:sds:sds21.zip|script22 (.zip)}} |
|23| 14/04 14-16 | Fib-C | Statistical decision theory. [[http://131.114.72.230/sds/video/sds23_20220412.mp4|rec23 (.mp4)]] | {{ :mds:sds:sdsln.pdf |}} Chpt. 4 {{:mds:sds:sds23.pdf|slides23 (.pdf)}}, {{:mds:sds:sds23.r|script23 (.R)}} |
|24| 19/04 9-11 | Fib-C | Statistical decision theory (continued). [[http://131.114.72.230/sds/video/sds24_20220421.mp4|rec24 (.mp4)]] | |
|25| 20/04 11-13 | Fib-C | Statistical decision theory (continued). Project presentation. | [[http://didawiki.di.unipi.it/doku.php/mds/sds/start#student_project|See student project]] |
|26| 21/04 14-16 | Fib-C | Confidence intervals: mean, proportion, linear regression. [[http://131.114.72.230/sds/video/sds26_20220422.mp4|rec26 (.mp4)]] | **[T]** Chpts. 23.1,23.2,23.4,24.3,24.4 {{ :mds:sds:sdsln.pdf |}} Chpt. 3 {{:mds:sds:sds26.pdf|slides26 (.pdf)}}, {{:mds:sds:sds26.r|script26 (.R)}} |
|27| 26/04 9-11| Fib-C | Bootstrap and resampling methods. [[http://131.114.72.230/sds/video/sds27_20220426.mp4|rec27 (.mp4)]] | **[T]** Chpts. 18.1-18.3,23.3 {{:mds:sds:sds27.pdf|slides27 (.pdf)}}, {{:mds:sds:sds27.r|script27 (.R)}} |
|28| 27/04 11-13| Fib-C | Bootstrap and resampling methods (continued). [[http://131.114.72.230/sds/video/sds28_20220428.mp4|rec28 (.mp4)]] | |
|29| 28/04 14-16| Fib-C | Hypothesis testing. One-sample tests of the mean and application to linear regression. [[http://131.114.72.230/sds/video/sds29_20220429.mp4|rec29 (.mp4)]] | **[T]** Chpts. 25,26,27, **[R]** Chpts. 5.1,5.2 {{ :mds:sds:sdsln.pdf |}} Chpt. 3.3 {{:mds:sds:sds29.pdf|slides29 (.pdf)}}, {{:mds:sds:sds29.r|script29 (.R)}} |
|30| 3/05 9-11| Fib-C | One-sample tests of the mean and application to linear regression (continued). [[http://131.114.72.230/sds/video/sds30_2022comp.mp4|rec30 (.mp4)]] | |
|31| 4/05 11-13| Fib-C | Two-sample tests of the mean and applications to classifier comparison. [[http://131.114.72.230/sds/video/sds31_2022comp.mp4|rec31 (.mp4)]] | **[T]** Chpt. 28, **[R]** Chpts. 5.3-5.7 {{:mds:sds:sds31.pdf|slides31 (.pdf)}}, {{:mds:sds:sds31.r|script31 (.R)}} |
|32| 5/05 14-16| Fib-C | Two-sample tests of the mean and applications to classifier comparison (continued). [[http://131.114.72.230/sds/video/sds32_2022comp.mp4|rec32 (.mp4)]] | |
|33| 10/05 9-11| Fib-C | Multiple-sample tests of the mean and applications to classifier comparison. [[http://131.114.72.230/sds/video/sds33_2022comp.mp4|rec33 (.mp4)]] | **[R]** Chpt. 7 {{:mds:sds:sds33.pdf|slides33 (.pdf)}}, {{:mds:sds:sds33.r|script33 (.R)}} |
|34| 11/05 11-13| Fib-C | Fitting distributions. Testing independence/association. [[http://131.114.72.230/sds/video/sds34_2022comp.mp4|rec34 (.mp4)]] | **[R]** Chpt. 8 {{ :mds:smd:ks.pdf | K-S}}, {{:mds:sds:sds34.pdf|slides34 (.pdf)}}, {{:mds:sds:sds34.r|script34 (.R)}} |
|35| 12/05 14-16| Fib-C | Fitting distributions. Testing independence/association (continued). Project Q&A. | |
|36| 17/05 9-11| Fib-C | Project Q&A. | |

=====Seminars of past years=====

In some years, speakers were invited to give a seminar on advanced topics. Here is a list of seminars held in past years.

^ # ^ Date ^ Room ^ Topic ^ Teaching material ^
|s01| 04/05/2022 9-11| Gerace+Teams | Bias in statistics and causal reasoning. Speaker: prof. Fabrizia Mealli [[http://131.114.72.230/sds/video/sds_s01_20220504.mp4|rec_s01 (.mp4)]] | {{:mds:sds:s4ds_s01.pdf|slides_s01 (.pdf)}} [[https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf|Optional reading]] |
|s02| 04/05/2022 11-13| Gerace+Teams | Bias in statistics and causal reasoning (continued). Speaker: prof. Fabrizia Mealli [[http://131.114.72.230/sds/video/sds_s02_20220504.mp4|rec_s02 (.mp4)]] | |

=====Past years=====

This course of 9 ECTS replaces an older 6 ECTS version: [[mds:smd: |Statistical Methods for Data Science A.Y. 2020/21 (500PP)]]. The 6 ECTS version is discontinued. Students who have the 6 ECTS version in their study plan can still take its exam in A.Y. 2021/22, 2022/23 and 2023/24. However, there will be no specific project for the 6 ECTS version.