====== LABORATORY OF DATA SCIENCE (2021/2022) ====== **Instructors**: * **Anna Monreale** * KDD Laboratory, Università di Pisa * [[http://pages.di.unipi.it/amonreale/]] * [[anna.monreale@unipi.it]] * Office hours: Wednesday: 11:00-13:00 online using Teams (Appointment by email). * Telephone +39-050-2213119 * **Roberto Pellungrini** * KDD Laboratory, Università di Pisa * [[roberto.pellungrini@di.unipi.it]] * Office hours: Thursday 14:00-16:00, Online using Teams (Appointment by email). * Telephone +39-050-2212728 ====== News ===== * [01-12-2021]: Third and final part of the project is up * [16-11-2021]: Instructions for the SSAS project in the Lecture of today: to avoid conflicts in deployment/process follow this steps once the solution is opened: (1) rename the project as _foodmart (2) from project properties select 'Deployment', then rename the database as _foodmart; (3) click on the button "show all files" just above "Solution explorer" right click on "view code" on the .database file that is visualized, and then change the ID from ruggieri_foodmart into _foodmart, and finally save the file; (4) change the credentials of connection to database on SQL Server. As an alternative solution you may[[ http://technet.microsoft.com/en-us/library/ms175630.aspx#bkmk_newusingwizard|import the project]] from the SSAS server and rename it as _foodmart (step 4 is still necessary). * [15/10/2021] Instructions for installing Data Tools for Visual Studio 2019 are in the software section of the wiki. Please follow them closely, step by step. * [15/10/2021] **IMPORTANT** The first part of the project is available. Checkpoint: 15 November. * [02/10/2021] **The lecture of Monday 4th October will be canceled.** * [08/09/2021] The first lecture will be on 16 Sept. * [08/09/2021] You can join the class by using this link: https://teams.microsoft.com/l/team/19%3amm3HFMqMSvpUrGY2sMYlpzxQ-atdxhfXreRUHhvrODs1%40thread.tacv2/conversations?groupId=c196ac40-93a2-4436-adfe-a81af3d06eef&tenantId=c7456b31-a220-47f5-be52-473828670aa1 * [16/09/2021] **IMPORTANT** Please, fill the document at the following link with your information, so that we can provide you access to teaching database and mailing list: https://docs.google.com/spreadsheets/d/1yYzHXmykhbfwy7G9uB_Z1fGcW_Vtvjugy4Yvlj-aM2Y/edit?usp=sharing ====== Hours and Rooms ====== **Classes ** Lessons will be held onilne by Teams Platform ^ Day of Week ^ Hour ^ Room ^ | Monday | 11:00 - 12:45 | Teams | | Thursday | 09:00 - 10:45 | Teams | **Link to Teams module:** https://teams.microsoft.com/l/team/19%3amm3HFMqMSvpUrGY2sMYlpzxQ-atdxhfXreRUHhvrODs1%40thread.tacv2/conversations?groupId=c196ac40-93a2-4436-adfe-a81af3d06eef&tenantId=c7456b31-a220-47f5-be52-473828670aa1 ====== Learning Material ====== ===== Slides & Registration of the classes ===== * The slides used in the course will be inserted in the calendar after each class. * Registration of each lecture will be available on Teams ===== Past Exams ===== * {{ :mds:lbi:2016midterm1text.pdf |2016/17 text}}, {{ :mds:lbi:2015fallmidterm1text.pdf | 2015/16 text}} and {{ :mds:lbi:2015wintermidterm1.zip | 2015/16 solution}}, {{:mds:lbi:2015midterm1text.pdf | 2014/15 text}} and {{ :mds:lbi:2015midterm1.zip |2014/2015 solution}}, {{ :mds:lbi:2014midterm1text.pdf | 2013/14 text}},{{ :mds:lbi:2013midterm1.pdf | 2012/13 text }} and {{ :mds:lbi:2013midterm1.zip |2012/13 solution}}. ===== Software===== * Anaconda with Python 3.7 (Please, avoid Python 3.8) * SQL Server 2019 Developer Edition:[[https://docs.microsoft.com/it-it/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver15|SQL Server 2019 Management Studio]]. * Data Tools for Visual Studio 2019: instructions here Italian: [[https://docs.microsoft.com/it-it/sql/ssdt/download-sql-server-data-tools-ssdt?view=sql-server-ver15#ssdt-for-visual-studio-2019|Data Tools Visual Studio 2019 IT]] English: [[https://docs.microsoft.com/en-us/sql/ssdt/download-sql-server-data-tools-ssdt?view=sql-server-ver15#ssdt-for-visual-studio-2019|Data Tools Visual Studio 2019 EN]] * Microsoft Excel * [[https://powerbi.microsoft.com/it-it/desktop/| Power BI Desktop]] ===== F.A.Q. ===== * [[http://www.sid.unipi.it/polo2/2015/03/26/connessione-alle-reti-wifi/ | Connection to wi-fi]] * [[http://www.sid.unipi.it/polo2/studenti/ | F.A.Q.s about the labs]] * [[https://start.unipi.it/help-ict/vpn/ | Unipi VPN ]] * [[https://autenticazione.unipi.it/auth/auth.signin | Unipi Authentication]] to access the VPN, make sure that network access services are enabled on you profile. Follow this link to access your Unipi profile. ====== Class calendar - (2021-2022) ====== ^ ^ Day ^ Topic ^ Slides ^ Data/Software ^ References ^ Video Lectures ^ Teacher | | | 13.09 11:00-12:45| Lecture canceled | | | | | |1. | 16.09 09:00-10:45| Introduction. File data access. | {{ :mds:lbi:2021-lds.01.introduction.pdf |}} {{ :mds:lbi:2020-lds.02.bi_architectures.pptx.pdf |}} {{ :mds:lbi:2020-lds.03.file_data_access.pptx.pdf |}}| | -** BI technology:** [[https://cacm.acm.org/magazines/2011/8/114953-an-overview-of-business-intelligence-technology/fulltext | An Overview of Business Intelligence Technology]] - **File access:** {{ :mds:lbi:filesystem.pdf | File System Interface}} |[[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lecture-20210916_091511-Meeting%20Recording.mp4?web=1|Video1]] [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lecture-20210916_102822-Meeting%20Recording.mp4?web=1|Video2]] | Monreale | |2. | 20.09 11:00-12:45| Representation formats: CSV, FLV, ARFF, XML. Python Recap | {{ :mds:lbi:2020-lds.04.python.pptx.pdf |}}| | - **File Formats:** [[http://www.stat.auckland.ac.nz/~paul/ItDT | Introduction to data technologies(Chps. 5, 6)]], [[http://weka.wikispaces.com/ARFF+(stable+version)|Weka ARFF Format]], [[http://weka.wikispaces.com/XRFF|XRFF Format]] - **Python reference:** [[https://www.spronck.net/pythonbook/ | Free python book with exercises]] |[[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lecture-20210920_110330-Registrazione%20della%20riunione.mp4?web=1|Video1]] [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lecture-20210920_121602-Registrazione%20della%20riunione%201.mp4?web=1|Video2]] | Pellungrini | |3. | 23.09 11:00-12:45 | File Access in Python | {{ :mds:lbi:lds.05.fileaccess-python2021.pdf |}} | {{ :mds:lbi:census.csv.zip |}} {{ :mds:lbi:data1.zip |Collection of files}} {{ :mds:lbi:solutionspart1.zip | Partial Solutions to Python Exercises}} | | [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lectures-20210923_090410-Registrazione%20della%20riunione.mp4?web=1|Video1]] [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lectures-20210923_103339-Registrazione%20della%20riunione.mp4?web=1|Video2]]| Pellungrini | |4. | 27.09 9:00-10:45 | File Access in Python Practice | {{ :mds:lbi:lds.05.fileaccess-python2021.pdf |}} | {{ :mds:lbi:census.csv.zip |}} {{ :mds:lbi:data1.zip |Collection of files}} {{ :mds:lbi:solutionspart1.zip | Partial Solutions to Python Exercises}} {{ :mds:lbi:csv2arrf.py.zip | csv to Arff conversion solution}}| | [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/Riunione%20in%20_General_-20210927_113233-Registrazione%20della%20riunione.mp4?web=1|Video]] | Pellungrini | |5. | 30.09 9:00-10:45 | Python Exercises | {{ :mds:lbi:ex-customers.pdf |}} | {{ :mds:lbi:ex-customers_solution.zip |}} {{ :mds:lbi:data-customers.zip |}} {{ :mds:lbi:lds.file.format.zip |}}| | [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lectures-20210930_090341-Registrazione%20della%20riunione.mp4?web=1|Video1]] [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lectures-20210930_095844-Registrazione%20della%20riunione.mp4?web=1|Video2]] [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lectures-20210930_103921-Registrazione%20della%20riunione.mp4?web=1|Video3]] |Pellungrini| | | 04.10 11:00-12:45 | Lecture canceled | | | | | |6. | 07.10 9:00-10:45 |RDBMS access protocols: ODBC, OLE DB, JDBC. ODBC Programming. | {{ :mds:lbi:lds.06.relational_data_access-2021.pdf |}} | | | | Monreale| |7. | 11.10 11:00-12:45 |RDBMS access protocols: ODBC, OLE DB, JDBC. ODBC Programming. | {{ :mds:lbi:lds.06.relational_data_access-2021.pdf |}} | {{ :mds:lbi:2021-code-db-samples.zip |}}| | | Monreale| |8. | 14.10 9:00-10:45 |Stratified sampling | {{ :mds:lbi:lds.07.sqlserver.pdf |}} | {{ :mds:lbi:stratifiedsampling.zip |}}| |[[https://unipiit.sharepoint.com/:v:/r/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lectures-20211014_090430-Registrazione%20della%20riunione.mp4?csf=1&web=1&e=XxhSv7|Video]] | Pellungrini| |9. | 18.10 12:00-12:45 | ETL Introduction | {{ :mds:lbi:lds.08.etlandssis.pdf |}} | | |[[https://unipiit.sharepoint.com/:v:/r/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lecture-20211018_121039-Registrazione%20della%20riunione.mp4?csf=1&web=1&e=Nq6CMT|Video]] | Monreale| |10. | 21.10 9:00-10:45 | SSIS: toCSV, FromCSV | | {{ :mds:lbi:2021-lds-etl-project.zip |}} | | [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lectures-20211021_091055-Regi{{ :mds:lbi:foodmart_monreale_full.zip |}}strazione%20della%20riunione.mp4|Video]]| Monreale | |11. | 25.10 11:00-12:45 |SSIS exercises: Pipeline, Update | {{ :mds:lbi:exercisefact_table.pdf |}} | | | [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lecture-20211025_110645-Registrazione%20della%20riunione.mp4|Video]]| Monreale | |12. | 28.10 9:00-10:45 | SSIS exercises: Stratified Subsampling | {{ :mds:lbi:ex-midterm.pdf |}} | | | | Monreale | |13. | 04.11 9:00-10:45 | Project Support & Discussion | | | | | Monreale | |14. | 08.11 11:00-12:45 | SSIS: Surrogate keys + Slowly changing dimensions | | | | | Monreale | |15. | 11.11 9:00-10:45 | SSIS: Slowly changing dimensions + Datawarehousing and OLAP recap. | {{ :mds:lbi:lds.09.dwandolap.pdf |}} | {{ :mds:lbi:2021-lds-etl-project_full.zip |}} | | | Monreale | |16. | 15.11 11:00-12:45 | OLAP with SQL Server Analysis Services (SSAS): data source views, dimensions, hierarchies. Data cubes, Parent-child hierarchies. | {{ :mds:lbi:lds.09.ssas-21.pdf |}} | {{ :mds:lbi:foodmart_monreale_full-cube.zip |}} | **1) SSAS (olap):** [[http://msdn.microsoft.com/en-us/library/bb522607.aspx|documentation]]; 2) S. Harinath et al. {{ :mds:lbi:ssas2012ch456.pdf |Professional Microsoft SQL Server Analysis Services 2012 with MDX and DAX, Wrox publisher, 2012. Chps. 4-6}}. | | Monreale | |17. | 18.11 11:00-12:45 | Cube deployment. Measure setup, Calculated Members, Excel power pivot integration. ROLAP, MOLAP, HOLAP definition and setup. Cache management.| | {{ :mds:lbi:foodmartexplorative.xlsx |}} {{ :mds:lbi:foodmart_monreale_complete.zip |}}| | | Monreale | |18. | 22.11 11:00-12:45 | Introduction to MDX | | {{ :mds:lbi:2021-mdxquery-demo-partial.mdx.zip |}}| **MDX:** 1) [[http://msdn.microsoft.com/en-us/library/bb500184.aspx|documentation]] and a [[https://www.mssqltips.com/sqlservertip/3129/order-and-sort-with-mdx-in-sql-server-analysis-services/|useful guide on ordering]]; 2) S. Harinath ed al. {{ :mds:lbi:ssas2012ch3.pdf |Professional Microsoft SQL Server Analysis Services 2012 with MDX and DAX, Wrox publisher, 2012. Chp. 3.}} | Since the video of this lecture has some issue I'm linking the Video of the last year. It is not exactly equal but very similar. The videos are two because the lectures of these year could not be completely aligned. [[https://web.microsoftstream.com/video/691fe567-b0f9-4994-8ec8-1b4dc5a3d9c4 | Video1]] [[https://web.microsoftstream.com/video/f1392414-f329-4769-9475-cb3457dbd712|Video2]]| Monreale | |19. | 25.11 09:00-11:00 | Practice on MDX | Exercises to be done before next lecture: 1) Queries you have already answered by Excel in Lecture N.17 2) This list of queries: {{ :mds:lbi:mdx-ex.pdf |}} | {{ :mds:lbi:2021-mdxquery-demo.mdx.zip |}} | | [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/General-20211125_091132-Registrazione%20della%20riunione.mp4?web=1|Video 1]] [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/General-20211125_103239-Registrazione%20della%20riunione.mp4?web=1|Video 2]] | Monreale| |20. | 29.11 11:00-12:45 | Practice on MDX | | {{ :mds:lbi:lbi.09.mdxpractice.mdx.zip |}} | | [[https://unipiit.sharepoint.com/sites/a__td_524292/Shared%20Documents/General/Recordings/LDS%20Lecture-20211129_111215-Registrazione%20della%20riunione.mp4?web=1|Video 1]] | Pellungrini| |21. | 02.12 09:00-10:45 | Practice on MDX | | {{ :mds:lbi:mdx_exercises_2021.txt.zip |}} {{ :mds:lbi:mdxquerytop.mdx.zip |}} | | | | | |22. | 09.12 09:00-10:45 | Practice on MDX + PowerBI | | | | | | |23. | 13.12 11:00-13:00 ROOM C| Lecture by Microstrategy| | | | | | |24. | 16.12 09:00-10:45 | Lecture only ONLINE dedicated to Project discussion with groups| | | | | | ====== Exams ====== **PROJECT ** A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting. The project has to be performed by a team of 2 students (at most 3 after asking authorization for that to the teachers). **Project to be delivered within 31 December 2021 ** * First part of the project consists in the **assignments** described here: {{ :mds:lbi:lds_project_2021_part_1.pdf |}} * A note about the first part of the project: for the 'language' attribute in the geography table, you should search for the necessary information elsewhere. Some examples are: http://download.geonames.org/export/dump/countryInfo.txt and http://www.fullstacks.io/2016/07/countries-and-their-spoken-languages.html * Second part of the project consist in the **assignments** described here:{{ :mds:lbi:lds_project_2021_part_2.pdf |}} * Third part of the project consist in the **assignments** described here: {{ :mds:lbi:lds_project_2021_part_3.pdf |}} * Remember to re-submit all three parts of the project with your third part, as specified in the document above. * **Dataset:** {{ :mds:lbi:data2021.zip |}} * **Deadline**: First deadline - 15 Nov 2021 22 Nov 2021 * **Deadline**: Second deadline - 10 Dec 2021 * **Deadline**: Third deadline - 31 Dec 2021 * **Hardware dataset**: {{ :mds:lbi:hardware.zip |}} **Project to be delivered during the exam sessions ** Students who did not deliver the above project within 31 December 2021 need to ask by email a new project to the teachers. The project that will be assigned will require about 2 weeks of work and after the delivery it will be discussed during the oral exam. For those students, the oral exams will also cover some practical parts that could not be included in the project. ** Please write to both teachers!** ===== Exam sessions ===== ^ Session ^ Date ^ Time ^ Room ^ Notes ^ Marks ^ =====Past Editions ===== * [[LDS 2020-2021]] * [[LDS 2019-2020]] * [[LDS 2018-2019]] * [[LBI 2017-2018]]