These are the differences between the selected revision and the current version of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
dm:start [14/10/2021 at 12:54 (18 months ago)] Mirco Nanni [News] |
dm:start [21/03/2023 at 07:55 (8 days ago)] (current version) Riccardo Guidotti [Second Semester (DM2 - Data Mining: Advanced Topics and Applications)] |
||
---|---|---|---|
Line 51: | Line 50: | ||
- | ====== Data Mining A.A. 2021/22 ====== | + | ====== Data Mining A.A. 2022/23 ====== |
===== DM1 - Data Mining: Foundations (6 CFU) ===== | ===== DM1 - Data Mining: Foundations (6 CFU) ===== | ||
Line 61: | Line 60: | ||
* [[dino.pedreschi@unipi.it]] | * [[dino.pedreschi@unipi.it]] | ||
- | * **Mirco Nanni** | + | * **Riccardo Guidotti** |
- | * KDDLab, | + | * KDDLab, |
- | * [[http://www-kdd.isti.cnr.it]] | + | * [[https:// |
- | * [[mirco.nanni@isti.cnr.it]] | + | * [[riccardo.guidotti@di.unipi.it]] |
Teaching Assistant | Teaching Assistant | ||
- | * **Salvatore Citraro** | + | * **Francesco Spinnato** |
- | * KDDLab, | + | * KDDLab, |
- | * [[http://www-kdd.isti.cnr.it]] | + | * [[https:// |
- | * [[salvatore.citraro@phd.unipi.it]] | + | * [[francesco.spinnato@sns.it]] |
===== DM2 - Data Mining: Advanced Topics and Applications (6 CFU) ===== | ===== DM2 - Data Mining: Advanced Topics and Applications (6 CFU) ===== | ||
Line 79: | Line 78: | ||
* [[riccardo.guidotti@di.unipi.it]] | * [[riccardo.guidotti@di.unipi.it]] | ||
+ | Teaching Assistant | ||
+ | * **Francesco Spinnato** | ||
+ | * KDDLab, Scuola Normale Superiore | ||
+ | * [[https:// | ||
+ | * [[francesco.spinnato@sns.it]] | ||
====== News ====== | ====== News ====== | ||
- | * [14.10.2021] On Monday 18.10.2021, the class is cancelled | + | * **[23.02.2023]** Spinnato Booking Agenda: [[https:// |
- | * [06.09.2021] The first lesson will be held on 16/09/2021. | + | * **[20.02.2023]** Project Groups [[https:// |
+ | * [23.11.2022] In order to recover from skipped and suspended lectures we signal | ||
+ | * [15.09.2022] Project Groups [[https:// | ||
+ | * [15.09.2022] MS Teams [[https://teams.microsoft.com/ | ||
+ | * [15.09.2022] Lectures will be held in person only. Recordings of lectures from past years can be found at the bottom of this web page. |
+ | |||
====== Learning Goals ====== | ====== Learning Goals ====== | ||
* DM1 | * DM1 | ||
Line 91: | Line 99: | ||
* Classification | * Classification | ||
* Pattern Mining and Association Rules | * Pattern Mining and Association Rules | ||
- | | + | |
* DM2 | * DM2 | ||
* Outlier Detection | * Outlier Detection | ||
- | * Regression | + | * Dimensionality Reduction |
- | * Advanced Classification | + | * Regression |
+ | * Advanced Classification | ||
* Time Series Analysis | * Time Series Analysis | ||
- | * Sequential Pattern Mining | ||
- | * Advanced Clustering | ||
* Transactional Clustering | * Transactional Clustering | ||
- | | + | |
====== Hours and Rooms ====== | ====== Hours and Rooms ====== | ||
Line 110: | Line 117: | ||
^ Day of Week ^ Hour ^ Room ^ | ^ Day of Week ^ Hour ^ Room ^ | ||
- | | Monday | + | | Monday |
- | | Thursday | + | | Thursday |
**Office hours - Ricevimento: | **Office hours - Ricevimento: | ||
- | * Prof. Pedreschi: Monday 16:00 - 18:00, Online | + | * Prof. Pedreschi |
- | * Prof. Nanni: appointment | + | * Monday 16:00 - 18:00 |
+ | * Online | ||
+ | * Prof. Guidotti | ||
+ | * Wednesday 15-17 or Appointment | ||
+ | * Room 363 Dept. of Computer Science or MS Teams | ||
| | ||
Line 125: | Line 136: | ||
^ Day of Week ^ Hour ^ Room ^ | ^ Day of Week ^ Hour ^ Room ^ | ||
- | | Monday | + | | Monday |
- | | | + | | |
**Office Hours - Ricevimento: | **Office Hours - Ricevimento: | ||
- | * Room 268 Dept. of Computer Science | + | |
- | * Tuesday: 15-17, Room: MS Teams | + | |
- | * Appointment by email | + | |
====== Learning Material -- Materiale didattico ====== | ====== Learning Material -- Materiale didattico ====== | ||
Line 140: | Line 150: | ||
* Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, | * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, | ||
* [[http:// | * [[http:// | ||
- | * The chapters | + | * The chapters |
* Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. **GUIDE TO INTELLIGENT DATA ANALYSIS.** Springer Verlag, 1st Edition., 2010. ISBN 978-1-84882-259-7 | * Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. **GUIDE TO INTELLIGENT DATA ANALYSIS.** Springer Verlag, 1st Edition., 2010. ISBN 978-1-84882-259-7 | ||
* Laura Igual et al.** Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications**. 1st ed. 2017 Edition. | * Laura Igual et al.** Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications**. 1st ed. 2017 Edition. | ||
Line 154: | Line 164: | ||
===== Software ===== | ===== Software ===== | ||
- | * Python - Anaconda (3.7 version!!!): Anaconda is the leading open data science platform powered by Python. [[https:// | + | * Python - Anaconda (>3.7): Anaconda is the leading open data science platform powered by Python. [[https:// |
* Scikit-learn: | * Scikit-learn: | ||
* Pandas: pandas is an open source, BSD-licensed library providing high-performance, | * Pandas: pandas is an open source, BSD-licensed library providing high-performance, | ||
+ | |||
+ | Other software for Data Mining |
* [[http:// | * [[http:// | ||
* [[http:// | * [[http:// | ||
Line 165: | Line 177: | ||
===== First Semester (DM1 - Data Mining: Foundations) ===== | ===== First Semester (DM1 - Data Mining: Foundations) ===== | ||
- | ^ ^ Day ^ Room ^ Topic ^ Learning | + | ^ ^ Day ^ Time ^ Room ^ Topic ^ Learning |
- | |1.| 16.09.2021 | + | |01.| 15.09.2022 | 11-13 |A1| Overview, Intro, KDD and CRISP. | {{ :dm:00_dm1_introduction_2022_23.pdf | Intro}} | Pedreschi/ |
- | |2.| 20.09.2021 | + | | | 19.09.2022 | 11-13 | | No Lecture | | | |
- | |3.| 23.09.2021 | + | |02.| 22.09.2022 | 11-13 |A1| Project Guidelines & Intro to Python | {{ : |
- | |4.| 27.09.2021 | + | | | 26.09.2022 | 11-13 | | No Lecture | |
- | |5.| 30.09.2021 | + | |03.| 29.09.2022 | 11-13 |A1| Data Understanding |
- | |6.| 04.10.2021 | + | |04.| 03.10.2022 | 11-13 |A1| Data Understanding & Data Preparation |
- | |7. | 07.10.2021 | + | |05.| 06.10.2022 | 11-13 |A1| Lab. Data Understanding | {{ :dm:data_understanding.zip | Data Und Python}} | Spinnato/Guidotti | |
- | | | < | + | | | 10.10.2022 | 11-13 | | No Lecture | |
- | |8. | 14.10.2021 | + | |06.| 13.10.2022 | 11-13 |A1| Data Preparation, Similarity |
- | | | < | + | |07.| 17.10.2022 | 11-13 |A1| Intro Clustering, K-Means | {{ : |
- | |9. | 21.10.2021 | + | |08.| 20.10.2022 | 11-13 |A1| K-Means | {{ :dm: |
- | |10. | 25.10.2021 11:00-12:45 | Aula Fib C | Clustering: density-based methods & exercises| | | Nanni | | + | |09.| 24.10.2022 | 11-13 |A1| Hierarchical & Density-based |
- | |11. | 28.10.2021 | + | |10.| 27.10.2022 | 11-13 |A1| Lab. Clustering | {{ :dm:clustering.zip | Clustering |
+ | | | 30.10.2022 | 11-13 | | No Lecture | | ||
+ | |11.| 03.11.2022 | 11-13 |A1| Exercises Clustering | ||
+ | |12.| 07.11.2022 | 11-13 |A1| Intro Classification | {{ :dm:08_dm1_classification_intro_2022_23.pdf | Intro Classification}}, {{ :dm:09_dm1_knn_2022_23.pdf | kNN}} | Guidotti | | ||
+ | |13.| 10.11.2022 | 11-13 |A1| Eval Measures, Exercises kNN | {{ :dm:08_dm1_classification_intro_2022_23.pdf | Intro Classification}}, {{ :dm:09_dm1_knn_2022_23.pdf | kNN}} | Guidotti | | ||
+ | |14.| 14.11.2022 | 11-13 |A1| Decision Tree | {{ : | ||
+ | |15.| 17.11.2022 | 11-13 |A1| Decision Tree, Exercises DT | {{ :dm: | ||
+ | |16.| 22.11.2022 | 11-13 |A1| Decision Tree | {{ : | ||
+ | |17.| 24.11.2022 | 11-13 |A1| Naive Bayes Classifier | {{ : | ||
+ | |18.| 28.11.2022 | 11-13 |A1| Lab. Classification | {{ :dm: | ||
+ | |19.| 01.12.2022 | ||
+ | |20.| 05.12.2022 | 11-13 |A1| Pattern Mining | {{ :dm: | ||
+ | |21.| 07.12.2022 | 14-16 |A1| Pattern Mining | ||
+ | | | ||
+ | |22.| 12.12.2022 | 11-13 |A1| Exercises Apriori | {{ :dm: | ||
+ | |23.| 14.12.2022 | 14-16 |A1| Pattern Mining (FP-Growth) | ||
+ | |24.| 15.12.2022 | 11-13 |A1| Lab. Pattern Mining | {{ :dm: | ||
===== Second Semester (DM2 - Data Mining: Advanced Topics and Applications) ===== | ===== Second Semester (DM2 - Data Mining: Advanced Topics and Applications) ===== | ||
- | ^ ^ Day ^ Room ^ Topic ^ Learning | + | ^ ^ Day ^ Room ^ Topic ^ Learning |
- | |1.| ??.02.2022 ??:00-??:00 | link teams | Introduction, CRISP, KNN | {{ :dm:00_dm2_intro_2021.pdf | Intro}}, {{ :dm:01_dm2_crispdm_2021.pdf | CRISP}}, {{ :dm:02_dm2_knn_2021.pdf | KNN}} | Guidotti |recording link | + | | 01.| 20.02.2023 09:00--11:00 | |
+ | | 02.| 21.02.2023 09: | ||
+ | | 03.| 27.02.2023 09: | ||
+ | | 04.| 28.02.2023 09: | ||
+ | | 05.| 06.03.2023 09: | ||
+ | | 06.| 07.03.2023 09: | ||
+ | | 07.| 13.03.2023 09: | ||
+ | | 08.| 14.03.2023 09: | ||
+ | | 09.| 20.03.2023 09: | ||
+ | | 10.| 21.03.2023 09: | ||
+ | | | 27.03.2023 09: | ||
+ | | | 28.03.2023 09: | ||
====== Exams ====== | ====== Exams ====== | ||
- | ===== Exam DM1 ====== | + | ** How and Where: ** |
+ | The exam will take place in oral mode only at the teacher' | ||
+ | The exam will be held online on the 420AA Data Mining course channel only at the request of the | ||
+ | student in accordance with current legislation. | ||
- | The exam is composed of two parts: | + | ** When: ** |
- | + | The dates relating to the start of the three exams are/will be published on the online platform | |
- | | + | https:// |
- | + | various orals. The dates and slots to take the exam will be published on the course page by the end of | |
- | * A **project**, that consists | + | May. Each student |
- | === Project 1 === | + | In the event that the oral exam is not passed, it will not be possible to take it for 20 days. If the project |
- | - Assigned: 30/ | + | |
- | - MidTerm Deadline: **21/ | + | |
- | - Final Deadline: **TBD** (complete | + | |
- | - Data: choose between {{ : | + | |
+ | ** What: ** | ||
+ | The oral test will evaluate the practical understanding of the algorithms. The exam will evaluate three aspects. | ||
+ | - Understanding of the theoretical aspects of the topics addressed during the course. The student may be required to write formulas or pseudocode. During the explanations, |
+ | - Understanding of the algorithms illustrated during the course and their practical implementation. You will be asked to perform one or more simple exercises. The text will be shown on the teacher' | ||
+ | - Discussion of the project with questions from the teacher regarding unclear aspects, | ||
+ | questionable steps or choices. | ||
+ | ** Final Mark: ** for the 12-credit exam, the final mark is the |
+ | average of the DM1 and DM2 marks. |
+ | **Exam Booking Periods** | ||
+ | * Exam portal link: [[https:// | ||
+ | * 1st Appello: 11/12/2022 00:00 - 05/01/2023 23:59 | ||
+ | * 2nd Appello: 01/01/2023 00:00 - 26/01/2023 23:59 | ||
- | ===== Exam DM part II (DMA) ====== | ||
- | |||
- | ** Exam Rules** | ||
- | * Rules for DM2 exam available {{ : | ||
- | |||
- | **Exam Booking Periods** | ||
- | * 3rd Appello: ??/??/2022 00:00 - ??/??/2022 23:59 | ||
- | * 4th Appello: ??/??/2022 00:00 - ??/??/2022 23:59 | ||
- | * 5th Appello: ??/??/2022 00:00 - ??/??/2022 23:59 | ||
- | |||
**Exam Booking Agenda** | **Exam Booking Agenda** | ||
- | * Agenda Link: ??? | + | * Agenda Link: [[https://agende.unipi.it/ |
- | * 3rd Appello: starts ??/??/2022 | + | * 1st Appello: starts |
- | * 4th Appello: starts | + | * 2nd Appello: starts |
- | * 5th Appello: starts | + | ===== Exam DM1 ====== |
- | * Important! If you book an agenda slot on a day between ??/??/2022 and ??/??/2022, you MUST be registered for the 3rd appello; if you book a slot on a day between ??/??/2022 and ??/??/2022, you must be registered for the 4th appello; if you book a slot on a day after ??/??/2022, you must be registered for the 5th appello. | + |
- | + | ||
- | The link to the agenda for booking a slot for the exam is displayed at the end of the registration. | + | |
- | During the exam the camera must remain on and you must be able to share your screen. The exam may require the use of the Miro platform (https:// | + |
The exam is composed of two parts: | The exam is composed of two parts: | ||
- | * A **project**, that consists in employing | + | * An **oral exam**, that includes: (1) discussing |
- | * An **oral exam**, that includes: (1) discussing topics presented during the classes, including | + | * A **project**, that consists in exercises requiring the use of data mining tools for analysis of data. Exercises include: data understanding, |
+ | |||
+ | * **Dataset** | ||
+ | - Assigned: 15/ | ||
+ | - MidTerm Submission: **28/ | ||
+ | - Final Submission: **31/ | ||
+ | - Dataset: {{: | ||
+ | - Link original pages: [[https:// | ||
- | | + | ** DM1 Project Guidelines |
- | * Data can be downloaded here ??? | + | See {{ :dm:dm1_project_guidelines_22_23.pdf | Project Guidelines}}. |
- | * Submission Draft 1: ??/??/2022 23:59 Italian Time (we expect Module 1 and Module 2) | + | |
- | * Submission Draft 2: ??/??/2022 23:59 Italian Time (we expect Module 3) | + | |
- | * Final Submission: one week before the oral exam. | + | |
- | ** Project Guidelines ** | ||
- | * **Module 1 - Introduction, | ||
- | - Explore and prepare the dataset. You are allowed to take inspiration from the associated GitHub repository and define your own research perspective (from choosing a subset of variables to the class to predict…). You are welcome to create new variables and to perform all the pre-processing steps the dataset needs. | ||
- | - Define one or more (simple) classification tasks and solve it with Decision Tree and KNN. You decide the target variable. | ||
- | - Identify the top 1% outliers: adopt at least three different methods from different families (e.g., density-based, | ||
- | - Analyze the value distribution of the class to predict with respect to point 2; if it is unbalanced leave it as it is, otherwise turn the dataset into an imbalanced version (e.g., 96% - 4%, for binary classification). Then solve the classification task using the Decision Tree or the KNN by adopting various techniques of imbalanced learning. | ||
- | - Draw your conclusions about the techniques adopted in this analysis. | ||
- | * **Module 2 - Advanced Classification Methods** | ||
- | - Solve the classification task defined in Module 1 (or define new ones) with the other classification methods analyzed during the course: Naive Bayes Classifier, Logistic Regression, Rule-based Classifiers, | ||
- | - Besides the numerical evaluation draw your conclusions about the various classifiers, | ||
- | - Select two continuous attributes, define a regression problem and try to solve it using different techniques reporting various evaluation measures. Plot the two-dimensional dataset. Then generalize to multiple linear regression and observe how the performance varies. | ||
- | * **Module 3 - Time Series Analysis** | + | |
- | - Select the feature(s) you prefer and use it (them) as a time series. You can use the temporal information provided by the authors’ datasets, but you are also welcome to explore the .mp3 files to build your own dataset of time series according to your purposes. You should prepare a dataset on which you can run time series clustering; motif/ | + | ===== Exam DM2 ====== |
- | - On the dataset created, compute clustering based on Euclidean/ | + | |
- | - Analyze the dataset for finding motifs and/or anomalies. Visualize and discuss them and their relationship with other features. | + | |
- | - Solve the classification task on the time series dataset(s) and evaluate each result. In particular, you should use shapelet-based classifiers. Analyze the shapelets retrieved and discuss if there are any similarities/ | + | |
- | * **Module 4 - Sequential Patterns and Advanced Clustering** | + | The exam is composed |
- | - Sequential Pattern Mining: Convert the time series into a discrete format (e.g., by using SAX) and extract the most frequent sequential patterns (of at least length 3/4) using different values of support, then discuss the most interesting sequences. | + | |
- | - Advanced Clustering: On a dataset already prepared for one of the previous tasks in Module 1 or Module 2, run at least one clustering algorithm presented in the advanced clustering lectures (e.g., X-Means, Bisecting K-Means, OPTICS). Discuss the results that you find, analyzing the clusters and reporting internal validation measures (e.g., SSE, silhouette). | + |
- | - Transactional Clustering: By using categorical features, or by turning a dataset with continuous variables into a dataset with categorical variables (e.g., by using binning), run at least one clustering algorithm presented in the transactional clustering lectures (e.g., K-Modes, ROCK). Discuss the results that you find, analyzing the clusters and reporting internal validation measures (e.g., SSE, silhouette). | + |
- | * **Module 5 - Explainability (optional)** | + | * An **oral exam**, that includes: |
- | - Try to use one or more explanation methods | + | |
+ | * A **project**, | ||
+ | |||
+ | * **Dataset** | ||
+ | - Assigned: 20/02/2023 | ||
+ | - MidTerm Submission: **20/ | ||
+ | - Final Submission: **31/ | ||
+ | - Dataset: {{ : | ||
+ | - Link original pages: [[https:// | ||
- | + | ** DM2 Project Guidelines ** | |
- | + | See {{ : | |
- | N.B. When " | + | |
Line 272: | Line 300: | ||
===== Exam Sessions ===== | ===== Exam Sessions ===== | ||
- | ^ Session ^ Date ^ Time | + | ^ Session ^ Date ^ Room ^ Notes ^ Marks ^ |
- | |1.|16.01.2019| 14:00 - 18:00| [[https://teams.microsoft.com/l/team/19%3aeebd8a88148d433582ca36bc54d6e441%40thread.tacv2/conversations?groupId=adba5ac4-f242-40be-b8aa-e375da1d4f2c& | + | |1.|10.01.2023| | Please, use the system for registration: https:// |
+ | |2.|31.01.2023| | Please, use the system for registration: | ||
+ | |3.|?? | ||
+ | |4.|??.??.2023| | Please, use the system for registration: | ||
+ | |5.|?? | ||
+ | |6.|?? | ||
===== Past Exams ===== | ===== Past Exams ===== | ||
* Past exam texts can be found on the old pages of the course. Please do not treat these exercises as the only way of testing your knowledge. Exercises can be changed and updated every year and will be published together with the slides of the lectures. | * Past exam texts can be found on the old pages of the course. Please do not treat these exercises as the only way of testing your knowledge. Exercises can be changed and updated every year and will be published together with the slides of the lectures. | ||
Line 291: | Line 323: | ||
* Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics {{: | * Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics {{: | ||
* Peter Sondergaard, | * Peter Sondergaard, | ||
- | |||
* Towards Effective Decision-Making Through Data Visualization: | * Towards Effective Decision-Making Through Data Visualization: | ||
====== Previous years ====== | ====== Previous years ====== | ||
+ | * [[dm.2021-22ds]] | ||
* [[dm.2020-21]] | * [[dm.2020-21]] | ||
- | * [[dm.2019-20]] | + | |
- | | + | * [[dm.2018-19]] |
- | | + | * [[dm.2017-18]] |
* [[dm.2016-17]] | * [[dm.2016-17]] | ||
* [[dm.2015-16]] | * [[dm.2015-16]] | ||
Line 305: | Line 337: | ||
* [[dm.2012-13]] | * [[dm.2012-13]] | ||
* [[dm.2011-12]] | * [[dm.2011-12]] | ||
- | * [[dm.2010-11]] | ||
- | * [[dm.2009-10]] | ||
- | * [[dm.2008-09]] | ||
- | * [[dm.2007-08]] | ||
- | * [[dm.2006-07]] | ||
- | * [[PhDWorkshop2011]] | ||
- | * [[SNA.Ingegneria2011]] | ||
- | * [[SNA.IMT.2011]] | ||
- | * [[MAINS.SANTANNA.2011-12]] | ||
- | * [[MAINS.SANTANNA.DM4CRM.2012]] | ||
- | * [[MAINS.SANTANNA.DM4CRM.2016]] | ||
- | * [[MAINS.SANTANNA.DM4CRM.2017 | Data Mining for Customer Relationship Management 2017]] | ||
- | * [[MAINS.SANTANNA.DM4CRM.2018]] | ||
- | * [[MAINS.SANTANNA.DM4CRM.2019]] | ||
- | * [[SDM2018 | Instructions for camera ready and copyright transfer]] | ||
- | * [[DM-SAM | Storie dell' | ||
- | * [[DM-I40 | Master Industry 4.0]] | ||