Big Data Analytics A.A. 2021/22
All lectures will also be available remotely, through the Teams team named “599AA 21/22 - BIG DATA ANALYTICS [WDS-LM]”
Instructors:
Luca Pappalardo
Fosca Giannotti
KDD Laboratory, ISTI-CNR, Università di Pisa, and Scuola Normale Superiore, Pisa
Tutor:
Timetable
Dataset assignment: datasets have been assigned to teams; find your dataset here: https://bit.ly/2YalEtI
Instructions for MidTerm 1: The first mid-term presentation (Data Understanding and Project Proposal) will be on October 20th (half of the teams) and October 22nd (rest of the teams).
presentation: prepare a presentation describing the data understanding and a proposal of the problem you aim to solve. Motivate your decisions and choices (e.g., which variables you deleted, how you dealt with missing values and noise, which new variables you created, whether you integrated your data with external datasets, etc.). The presentation should last at most 20 minutes (+ 10 minutes for questions) and must be given by running a Colab notebook “live”;
code: provide the link to the notebook on Jovian with the code you used for all computations and plots. Document your notebooks adequately using Markdown. The notebook should run without errors on Google Colab, so include cells with instructions to install additional libraries (if any) and a description of the format the datasets must have for the code to run correctly.
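As an illustration of the kind of choice the presentation should motivate, here is a minimal sketch of counting and imputing missing values. It uses only the Python standard library; the records, the column names, and the helper functions `missing_counts` and `impute_median` are hypothetical and not part of any course dataset. In a real notebook the same steps are commonly done with a data-frame library such as pandas.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 2100.0},
    {"age": None, "income": 1800.0},
    {"age": 29, "income": None},
    {"age": 41, "income": 2500.0},
]

def missing_counts(rows):
    """Count missing (None) values per column."""
    columns = rows[0].keys()
    return {c: sum(1 for r in rows if r[c] is None) for c in columns}

def impute_median(rows, column):
    """Return new rows where missing values in `column` are replaced by the median of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

counts = missing_counts(records)
cleaned = impute_median(impute_median(records, "age"), "income")

print("missing before:", counts)
print("missing after:", missing_counts(cleaned))
```

Median imputation is only one option; dropping rows or columns, or using a model-based imputer, may suit a given dataset better, and the choice is exactly what the presentation is expected to justify.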
Instructions for MidTerm 2: The second mid-term presentation (model(s) implementation and evaluation) will be on November 17th (half of the teams) and November 19th (rest of the teams).
presentation: prepare a presentation describing the models you tried (e.g., Decision Trees, SVMs, etc.), the baselines used as a comparison (e.g., DummyClassifiers), how you performed the hyper-parameter tuning, the evaluation technique used (e.g., holdout, repeated holdout, cross-validation), and the metrics chosen to evaluate the performance of the models. Motivate your decisions and choices (e.g., which evaluation metrics you chose, how you dealt with class imbalance in the dataset). The presentation should last at most 20 minutes (+ 10 minutes for questions) and must be given by running a Colab notebook “live”;
code: provide the link to the notebook on Jovian with the code you used for all computations and plots. Document your notebooks adequately using Markdown. The notebook should run without errors on Google Colab, so include cells with instructions to install additional libraries (if any) and a description of the format the datasets must have for the code to run correctly. Also upload the link to the notebook of the previous mid-term, with the suggested modifications.
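To see why the choice of baseline and metric matters on an unbalanced dataset, consider the following minimal sketch. It is written in plain Python so that it runs anywhere; the labels are made-up toy data, and the helpers `majority_baseline` and `binary_metrics` are hypothetical names introduced only for this illustration. A majority-class baseline of this kind is what scikit-learn's `DummyClassifier` with `strategy="most_frequent"` provides.

```python
from collections import Counter

def majority_baseline(y_train, n_predictions):
    """Predict the most frequent training label for every test item."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_predictions

def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up toy labels: 1 is the rare positive class (10% of the items).
y_train = [0] * 18 + [1] * 2
y_test = [0] * 9 + [1]

baseline_pred = majority_baseline(y_train, len(y_test))
baseline_scores = binary_metrics(y_test, baseline_pred)
print(baseline_scores)
```

On this toy split the baseline reaches high accuracy while never detecting a positive item, so its recall and F1 are zero. A model is worth presenting only if it beats the baseline on a metric that reflects the rare class.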
Instructions for MidTerm 3: The third mid-term presentation (model interpretation and explanation) will be on December 15th and December 17th.
Paper presentations: scheduled for December 1st, 3rd, and 10th.
Each student will present a paper on Big Data Analytics in a talk of at most 7 minutes. During the presentation (with slides), you should highlight the following aspects: the dataset used, the feature engineering and/or selection (if any), the problem addressed, the models/algorithms used to solve the problem, and finally the explanations of the constructed model (if any).
Learning goals
In our digital society, every human activity is mediated by information technologies and therefore leaves digital traces behind. These massive traces are stored in some repository, public or private: phone call records, movement trajectories, soccer logs, and social media records are all examples of “Big Data”, a novel and powerful “social microscope” for understanding the complexity of our societies. The analysis of big data sources is a complex task that requires knowledge of several technological and methodological tools.
This course has three objectives:
introducing students to the emerging field of big data analytics and social mining;
introducing students to the technological landscape of big data, such as programming tools to analyze big data, query NoSQL databases, and perform predictive modeling;
guiding students in the development of an open-source and reproducible big data analytics project, based on the analysis of real-world datasets.
Module 1: Big Data Analytics and Social Mining
In this module, analytical methods and processes are presented through exemplary case studies in challenging domains, organized according to the following topics:
Module 2: Big Data Analytics Technologies
This module will provide students with the technologies to collect, manipulate, and process big data. In particular, the following tools will be presented:
Python for Data Science
The Jupyter Notebook: developing open-source and reproducible data science
MongoDB: fast querying and aggregation in NoSQL databases
GeoPandas: analyze geo-spatial data with Python
Scikit-learn: machine learning in Python
Keras: deep learning in Python
Module 3: Laboratory for Interactive Project Development
During the course, teams of students will be guided in the development of a big data analytics project. The projects will be based on real-world datasets covering several thematic areas. Discussions and presentations will take place in class at different stages of the project:
1st Mid Term: Data Understanding and Project Formulation
2nd Mid Term: Model(s) construction and evaluation
3rd Mid Term: Model interpretation/explanation
Exam: Final Project results
Calendar
15/09 (Mod. 1) Introduction to the course, The Big Data scenario lesson1_introduction_to_the_course_2021.pdf
17/09 (Mod. 2) Python for Data Science and the Jupyter Notebook: developing open-source and reproducible data science
22/09 (Mod. 2) Data Exploration and Understanding practice in Python
24/09 (Mod. 3) Presentation of datasets for the project bda21_22_datasets_1_.pdf
29/09 (Mod. 2) Scikit-learn: programming tools for data mining (part 1) https://jovian.ai/jonpappalord/classification
01/10 (Mod. 2) Scikit-learn: programming tools for data mining (part 2) https://jovian.ai/jonpappalord/clustering
06/10 (Mod. 2) Geopandas and scikit-mobility: managing geographic data in Python (part 1)
08/10 (Mod. 2) Geopandas and scikit-mobility: managing geographic data in Python (part 2)
13/10 (Mod. 1) Case Study 1: Injury prediction, how to deal with unbalanced datasets, and how to perform feature selection bda_2122_injury_forecasting.pdf
15/10 (Mod. 2) Feature selection in Python
20/10 (Mod. 3) MidTerm1
BigData-Islanders
WeMine
cpu_in_flames
22/10 (Mod. 3) MidTerm1
How I Met Your Big Data
SLM
The Missing Values
27/10 (Mod. 3) Comments and discussion on Mid Term 1 tips_mid_1_bda2122.pdf
29/10 (Mod. 1) Case Study 2: How to use Data Science to nowcast well-being bda_wellbeing.pdf
03/11 (Mod. 1) Case Study 3: Performance evaluation in sports
05/11 NO LESSON
10/11 (Mod. 2) Interpretations and Explanations 1: https://jovian.ai/jonpappalord/explanations
12/11 (Mod. 2) Interpretations and Explanations 2: https://jovian.ai/jonpappalord/explanations2
17/11 (Mod. 3) Mid Term2
How I Met Your Big Data
WeMine
The Missing Values
19/11 (Mod. 3) Mid Term2
BigData-Islanders
SLM
cpu_in_flames
01/12 (Mod. 3) Paper presentations
03/12 (Mod. 3) Paper presentations
cpu_in_flames
The Missing Values
10/12 (Mod. 3) Paper presentations
How I Met Your Big Data
WeMine
15/12 (Mod. 3) Mid Term 3
BigData-Islanders
SLM
cpu_in_flames
17/12 (Mod. 3) Mid Term 3
How I Met Your Big Data
WeMine
The Missing Values
Exam (Appelli)
Previous Big Data Analytics websites