
Data Mining A.A. 2023/24

DM1 - Data Mining: Foundations (6 CFU)

DM2 - Data Mining: Advanced Topics and Applications (6 CFU)

News

  • [24.05.2024] When registering for the oral exam please specify in the notes DM1 if you do not want to do DM2 (that is assumed by default). After having booked it please contact Prof. Pedreschi to agree on the exam date (put Prof. Guidotti and Andrea Fedele in cc). There will be no agenda for DM1.
  • [03.05.2024] The next DM2 lecture will be held as usual on Monday 06/05 from 9 to 11 in room C.
  • [19.01.2024] DM2 lectures will start on Mon 19/02; for that lecture only, the time will be 14-16 instead of 9-11.
  • [13.10.2023] To schedule a meeting with the Teaching Assistant you can use: https://calendly.com/andreafedele/
  • [20.09.2023] Recordings of the lectures can be found on the web pages of the course for the years 2020/2021 and 2021/2022 (see links at the bottom of this page)
  • [20.09.2023] Thursday 21 September there will be no lecture.
  • [11.09.2023] Lectures will start on Monday 18 September 2023 at 11:00 in room C1.
  • [11.09.2023] Lectures will be held in person only. Recordings of the lectures of past years can be found at the bottom of this web page.
  • [11.09.2023] Project Groups link
  • [11.09.2023] MS Teams link

Learning Goals

  • DM1
    • Fundamental concepts of data knowledge and discovery.
    • Data understanding
    • Data preparation
    • Clustering
    • Classification
    • Pattern Mining and Association Rules
    • Sequential Pattern Mining
  • DM2
    • Outlier Detection
    • Dimensionality Reduction
    • Regression
    • Advanced Classification and Regression
    • Time Series Analysis
    • Transactional Clustering
    • Explainability

Hours and Rooms

DM1

Classes

Day of Week Hour Room
Monday 11:00 - 13:00 C1
Wednesday 11:00 - 13:00 C1

Office hours:

  • Prof. Pedreschi
    • Monday 16:00 - 18:00
    • Online
  • Prof. Guidotti
    • Tuesday 16:00 - 18:00 or by appointment via email
    • Room 363 Dept. of Computer Science or MS Teams

DM2

Classes

Day of Week Hour Room
Monday 09:00 - 11:00 C
Wednesday 11:00 - 13:00 C

Office hours:

  • Tuesday 15:00 - 17:00 or by appointment via email
  • Room 363 Dept. of Computer Science or MS Teams

Learning Material

Textbook

  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
  • Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
  • Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed., 2017.

Slides

Software

  • Python - Anaconda (>3.7): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
  • Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
  • Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page

Other software for Data Mining

Class Calendar (2023/2024)

First Semester (DM1 - Data Mining: Foundations)

Day Time Room Topic Material Lecturer
01. 18.09.2023 11-13 C1 Overview, Introduction Intro Pedreschi
20.09.2023 11-13 No Lecture
02. 25.09.2023 11-13 C1 Lab. Introduction to Python Python Basic Guidotti
03. 27.09.2023 11-13 C1 Lab. Data Understanding Data Understanding Guidotti
04. 02.10.2023 11-13 C1 Data Understanding Data Understanding Guidotti
05. 04.10.2023 11-13 C1 Data Understanding & Preparation Data Understanding, Data Preparation Pedreschi
06. 09.10.2023 11-13 C1 Data Preparation & Data Similarity Data Preparation, Data Similarity Pedreschi
07. 11.10.2023 11-13 C1 Data Similarity & Lab. Data Understanding Data Similarity, Data Understanding Pedreschi
08. 16.10.2023 11-13 C1 Introduction to Clustering, K-Means Intro_Clustering, K-Means Pedreschi
09. 18.10.2023 11-13 C1 Clustering Validation, Hierarchical Clustering Intro_Clustering, Hierarchical Pedreschi
10. 23.10.2023 11-13 C1 Density-based Clustering Density-based Clustering Pedreschi
11. 25.10.2023 11-13 C1 Lab. Clustering Clustering Guidotti
12. 30.10.2023 11-13 C1 Ex. Clustering ExClustering Guidotti
01.11.2023 11-13 No Lecture
13. 06.11.2023 11-13 C1 Intro Classification, kNN (video) Intro_Classification, kNN Guidotti
14. 08.11.2023 11-13 C1 Naive Bayes, Exercises Naive Bayes Guidotti
15. 13.11.2023 11-13 C1 Model Evaluation Model Evaluation Guidotti
16. 15.11.2023 11-13 C1 Model Evaluation Exercises & Lab Classification Guidotti
20.11.2023 11-13 No Lecture
17. 22.11.2023 11-13 C1 Decision Tree Classifier Decision Tree Pedreschi
18. 27.11.2023 11-13 C1 Decision Tree Classifier Decision Tree Pedreschi
19. 29.11.2023 11-13 C1 Exercises and Lab. Decision Tree Classifier Decision Tree Guidotti
20. 04.12.2023 11-13 C1 Decision Tree Classifier, Exercises and Lab Decision Tree Pedreschi
21. 06.12.2023 11-13 C1 Intro Regression & Lab. Regression Regression, Regression Guidotti
22. 11.12.2023 11-13 C1 Intro Pattern Mining and Apriori Pattern Mining Pedreschi
23. 13.12.2023 16-18 C1 Apriori & Lab. Pattern Mining Pattern Mining, Pattern Mining Pedreschi
24. 18.12.2023 11-13 C FP-Growth and Exercises Pattern Mining Guidotti

Second Semester (DM2 - Data Mining: Advanced Topics and Applications)

Day Time Room Topic Material Lecturer
01. 19.02.2024 14-16 C Overview, Rule-based Models Introduction, Guidelines, Rule-based Models Guidotti
21.02.2024 No Lecture
26.02.2024 No Lecture
02. 28.02.2024 11-13 C Sequential Pattern Mining Sequential Pattern Mining, GSP Guidotti
03. 04.03.2024 9-11 C Sequential Pattern Mining Sequential Pattern Mining, GSP Guidotti
04. 06.03.2024 11-13 C Transactional Clustering Transactional Clustering Guidotti
05. 11.03.2024 9-11 C Time Series Similarity Time Series Similarity, TS_Load, TS_Similarity Guidotti
06. 13.03.2024 11-13 C Time Series Approximation Time Series Clustering, TS_Approx_Clustering Guidotti
07. 18.03.2024 9-11 C Time Series Clustering & Motifs Time Series Motifs, TS_Motifs Guidotti
08. 20.03.2024 11-13 C Time Series Classification Time Series Classification, TS_Classification Guidotti
09. 25.03.2024 9-11 C Imbalanced Learning Imbalanced Learning, ImbLearn Guidotti
10. 27.03.2024 11-13 C Dimensionality Reduction Dimensionality Reduction, DimRed Guidotti
11. 03.04.2024 11-13 C Outlier Detection Outlier Detection Guidotti
12. 08.04.2024 9-11 C Outlier Detection Outlier Detection, OutlierDetection Guidotti
13. 10.04.2024 11-13 C Outlier Detection Outlier Detection, OutlierDetection Guidotti
14. 15.04.2024 14-16 C Gradient Descent, MLE GD, MLE Guidotti
15. 17.04.2024 11-13 C Odds, LogOdds, Logistic Regression Odds, LogReg, LogReg Guidotti
16. 22.04.2024 9-11 C Support Vector Machine SVM, SVM Guidotti
17. 24.04.2024 11-13 C Perceptron, Neural Networks Perceptron Guidotti
18. 29.04.2024 9-11 C Deep Neural Networks Deep Neural Networks, NN Guidotti
19. 06.05.2024 9-11 C CNN, RNN, DL-TS, Ensemble Intro DNN, TSC-DNN, Ensemble Guidotti
20. 08.05.2024 11-13 C Ensemble, Boosting, Adaboost Ensemble, LabEnsemble Guidotti
21. 13.05.2024 9-11 C Ensemble-TS, Gradient Boosting Gradient Boosting Machines, LabEnsemble Guidotti
22. 15.05.2024 11-13 C Extreme Gradient Boosting Gradient Boosting Machines, LabEnsemble Guidotti
23. 20.05.2024 9-11 C1 eXplainable Artificial Intelligence XAI, LabXAI Guidotti
24. 22.05.2024 11-13 C1 eXplainable Artificial Intelligence XAI, LabXAI Guidotti

Exams

How and Where: The exam is oral only and takes place in the teacher's office or in a previously designated classroom. It will be held online, on the 420AA Data Mining course channel, only at the student's request and in accordance with current legislation.

When: The start dates of the three exams are/will be published on the online platform https://esami.unipi.it/. Within each session, we will identify dates and slots in order to distribute the oral exams. The dates and slots will be published on the course page by the end of May. Each student must also register on https://esami.unipi.it/. The exam can be taken only after the project has been delivered, and the project must be delivered one week before the date on which you want to take the exam. Oral exams taken together by the members of a project group are preferred, so that the discussion of the project can be shared, but it is not mandatory to take the oral exam together with the other members of the group. If the oral exam is not passed, it cannot be retaken for 20 days. If the project is not considered sufficient, it must be carried out again on a new dataset or on a substantially updated version of the current one.
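
The two timing rules above (project delivered one week before the oral, 20 days of waiting after a failed oral) can be checked with a few lines of Python. This is only a sketch: the function names and the example date are hypothetical, and it assumes that the 20 days are counted from the date of the failed attempt, which the page does not state explicitly.

```python
from datetime import date, timedelta

# Assumed reading of the rules; names below are illustrative, not official.
PROJECT_LEAD_TIME = timedelta(weeks=1)  # project is due one week before the oral
RETAKE_WAIT = timedelta(days=20)        # waiting period after a failed oral


def project_deadline(oral_date: date) -> date:
    """Latest project delivery date for an oral exam held on `oral_date`."""
    return oral_date - PROJECT_LEAD_TIME


def earliest_retake(failed_oral_date: date) -> date:
    """First date for a retake, counting 20 days from the failed attempt."""
    return failed_oral_date + RETAKE_WAIT


oral = date(2024, 6, 10)  # example date only, not an official exam slot
print(project_deadline(oral))  # 2024-06-03
print(earliest_retake(oral))   # 2024-06-30
```

The actual slots are the ones published on the course page and on https://esami.unipi.it/; the sketch only helps to plan around them.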

What: The oral test will evaluate the practical understanding of the algorithms. The exam will evaluate three aspects.

  1. Understanding of the theoretical aspects of the topics addressed during the course. The student may be required to write formulas or pseudocode. During the explanations, the student can use pen and paper.
  2. Understanding of the algorithms illustrated during the course and their practical implementation. You will be asked to perform one or more simple exercises. The text will be shown on the teacher's screen and/or copied to Miro. The student will have to use pen and paper (or, if online, Miro https://miro.com/) to show how the exercise is solved.
  3. Discussion of the project with questions from the teacher regarding unclear aspects, questionable steps or choices.

Final Mark: for the 12-credit exam, the final mark will be obtained as the average of the DM1 and DM2 marks.
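
As a small worked example of the rule above, the average can be computed as follows. The function name and the sample marks are hypothetical, and the sketch deliberately applies no rounding, since the page does not say how a fractional average is rounded.

```python
def final_mark(dm1: float, dm2: float) -> float:
    """Plain average of the DM1 and DM2 marks (no rounding applied)."""
    return (dm1 + dm2) / 2


# Sample marks chosen only for illustration.
print(final_mark(28, 30))  # 29.0
print(final_mark(27, 30))  # 28.5
```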

Exam Booking Periods

  • Exam portal link: here
  • 1st Appello: from 09/01/2024 to 31/12/2024
  • 2nd Appello: from 01/02/2024 to 17/02/2024
  • 3rd Appello: from 05/05/2024 to 30/05/2024
  • 4th Appello: from 02/06/2024 to 27/06/2024
  • 5th Appello: from 19/06/2024 to 14/07/2024
  • 6th Appello:

Exam Booking Agenda

When registering for the oral exam please specify in the notes DM1 if you do not want to do DM2 (that is assumed by default). After having booked for DM1 please contact Prof. Pedreschi to agree on the exam date (put Prof. Guidotti and Andrea Fedele in cc). There will be no agenda for DM1.

Do not forget to fill in the course evaluation!

Exam DM1

The exam is composed of two parts:

  • An oral exam that includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory and practical exercises.
  • A project consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, pattern mining, and classification (guidelines will be provided for more details). The project must be carried out by a group of 2 to 3 people, using Python or any other data mining software. The results of the different tasks must be reported in a single paper of at most 20 pages of text, including figures. The paper must be emailed to andrea [dot] fedele [at] phd [dot] unipi [dot] it and riccardo [dot] guidotti [at] unipi [dot] it. Please use “[DM1 2023-2024] Project” in the subject.
  • Dataset
    1. Assigned: 25/09/2023
    2. MidTerm Submission: 15/11/2023 (+0.5) (half project required, i.e., Data Understanding & Preparation and Clustering)
    3. Final Submission: 31/12/2023 (+0.5); otherwise one week before the oral exam (complete project required).
    4. Dataset: STD

DM1 Project Guidelines: see Project Guidelines.

Exam DM2

The exam is composed of two parts:

  • An oral exam that includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory and practical exercises.
  • A project consisting of exercises that require the use of data mining tools for data analysis. Exercises include: imbalanced learning, dimensionality reduction, outlier detection, advanced classification/regression methods, time series analysis/clustering/classification (guidelines will be provided for more details). The project must be carried out by a group of 1 to 3 people, using Python or any other data mining software. The results of the different tasks must be reported in a single paper of at most 30 pages of text, including figures. The paper must be emailed to andrea [dot] fedele [at] phd [dot] unipi [dot] it and riccardo [dot] guidotti [at] unipi [dot] it. Please use “[DM2 2023-2024] Project” in the subject.
  • Dataset
    1. Assigned: 19/02/2024
    2. MidTerm Submission: 07/05/2024 (Modules 1 and 2; for TS classification, non-DL-based models only)
    3. Final Submission: one week before the oral exam (complete project required, also with DL-based models for TS classification).
    4. Dataset: STD

DM2 Project Guidelines: see Project Guidelines.

Past Exams

  • Past exam texts can be found on the old pages of the course. Please do not consider these exercises as the only way of testing your knowledge. Exercises can be changed and updated every year and will be published together with the slides of the lectures.

Reading About the "Data Scientist" Job

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

  • Data, data everywhere. The Economist, Feb. 2010 download
  • Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  • Welcome to the yotta world. The Economist, Sept. 2011 download
  • Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  • Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  • Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  • Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  • Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Previous years

Last modified: 24/05/2024 at 11:03 by Riccardo Guidotti