This is an old revision of the document!


====== Big Data Analytics A.A. 2020/21 ======

WARNING: All lectures of the First Semester of the academic year 2020/21, until 31/12/2020, will be provided exclusively remotely, through the Teams team named “599AA 20/21 - BIG DATA ANALYTICS [WDS-LM]” (https://bit.ly/35yJ65c).

Instructors:
  * Luca Pappalardo, Fosca Giannotti
  * KDD Laboratory, Università di Pisa and ISTI-CNR, Pisa
  * http://www-kdd.isti.cnr.it
  * luca [dot] pappalardo [at] isti [dot] cnr [dot] it
  * fosca [dot] giannotti [at] isti [dot] cnr [dot] it

Timetable (http://bit.ly/unipi_timetable_2020)
  * Monday 16:15 - 18:00, Aula WDS/1
  * Tuesday 16:15 - 18:00, Aula WDS/1

Team Registration: form teams of 3 or 4 students and register your team here by September 27th: https://forms.gle/rbsV4dF6RuAnCBWz9

For students without a team: send an email to Luca Pappalardo by September 30th to let him know that you are without a team.

Registered teams only: express your preference for the datasets by September 30th: https://forms.gle/HVheaScCgQJw4o616

Dataset assignment: at the following link, each team can find the dataset assigned for the project: https://bit.ly/33eTfC9

Instructions for mid term 1: The first mid term presentation (data understanding and project proposal) will be on October 19th (BigProblem, Global, MMG, I TeamIDI) and October 20th (Bei Dati Acrobatici, Malucs, AMS Group).
  * presentation: prepare a presentation describing the data understanding and proposing the problem you want to solve. Motivate your decisions and choices (e.g., which variables you deleted, how you dealt with missing values and noise, which new variables you created, whether you integrated your data with external datasets). The presentation should last 20 minutes (+ 10 minutes for questions) and must be sent in PDF format through the Google form (see below).
  * report: the report must be written in LaTeX, using this template: latex_template_bda.zip. It must be at most 5 pages long. Summarize the data understanding, and describe and motivate your project proposal. A zipped folder (.zip file) containing the .tex file, the .cls file, the .pdf file, and the files of all figures must be sent through the Google form. In the report, include the title of your project and the names of the members of your team.
  * code: the Python code used to generate the computations and the plots, in .ipynb (Jupyter Notebook) or .py format, must be sent through the Google form. Please document your notebooks adequately using Markdown.
  * Google form: upload the material by October 18th using this form: https://forms.gle/h2SAKFmkdXv4itiU6
  * name the files using the format midterm1_teamname_type, where teamname is the name of the team (lowercase only, no spaces) and type is the type of the file (i.e., presentation, report, or code). Examples: midterm1_iteamidi_presentation.pdf, midterm1_beidatiacrobatici_report.zip, midterm1_amsgroup_code.ipynb

Instructions for mid term 2: The second mid term presentation (model(s) implementation and evaluation) will be on November 16th (BigProblem, Global, MMG, I TeamIDI) and November 17th (Bei Dati Acrobatici, Malucs, AMS Group).
  * presentation: prepare a presentation describing the models you tried (e.g., Decision Trees, SVMs), the baselines used as a comparison (e.g., DummyClassifiers), how you performed the hyper-parameter tuning, the evaluation technique used (i.e., holdout, repeated holdout, cross validation), and the metrics chosen to evaluate the performance of the models. Motivate your decisions and choices (e.g., which evaluation metrics you chose, how you dealt with class imbalance in the dataset). The presentation should last 20 minutes (+ 10 minutes for questions) and must be sent in PDF format through the Google form (see below).
  * report: the report must be written in LaTeX, using the same template as the first mid term: latex_template_bda.zip. It must be at most 10 pages long, including the part of the report regarding mid term 1. Summarize the model construction and evaluation and motivate your choices. A zipped folder (.zip file) containing the .tex file, the .cls file, the .pdf file, and the files of all figures must be sent through the Google form. In the report, include the title of your project and the names of the members of your team.
  * code: the Python code used to generate the computations and the plots, in .ipynb (Jupyter Notebook) or .py format, must be sent through the Google form. Please document your notebooks adequately using Markdown.
  * Google form: upload the material by November 15th using this form: https://forms.gle/kr3uq2PyyMqraRn78
  * name the files using the format midterm2_teamname_type, where teamname is the name of the team (lowercase only, no spaces) and type is the type of the file (i.e., presentation, report, or code). Examples: midterm2_iteamidi_presentation.pdf, midterm2_beidatiacrobatici_report.zip, midterm2_amsgroup_code.ipynb

Paper presentation:
  * each student will present a paper on Big Data Analytics in a talk of at most 7 minutes (+ 3 minutes for questions). The paper presentations are scheduled on November 23rd and 24th.
  * Express your preference for 5 papers here: https://forms.gle/B9rCmpJ8jnQ4vzWN8 (we'll take your preferences into account as much as possible). Fill in the form by Nov 3rd. I'll assign you a paper by Nov 5th.
  * During the presentation (with slides) you should highlight the following aspects: the data set used, the feature engineering and/or selection (if any), the problem addressed, the models/algorithms used to solve the problem, and finally the explanations of the model constructed (if any).
  * The paper assigned to each student, and the date of presentation, are here: https://bit.ly/2I10Uw2

Instructions for mid term 3: The third mid term presentation (model(s) interpretation and explanation) will be on December 7th (BigProblem, Global, MMG, I TeamIDI) and December 8th (Bei Dati Acrobatici, Malucs, AMS Group).
  * presentation: prepare a presentation in which you show how to interpret the model(s) you created and how to explain the reasoning of the model. Examples: if you use a decision tree (or similar), you can show the feature importance and show its structure to describe the rules it is composed of; if you use a (logistic or linear) regressor, you can show the values of the computed weights; if you use a geometric model, you can perform a dimensionality reduction and try to interpret it in two or three dimensions; if you use a non-interpretable model (e.g., a neural network), you can use explainability methods (e.g., SHAP, LIME). Also provide examples of how to interpret the model on specific records that are correctly classified and on records that are incorrectly classified by your model. The presentation should last 20 minutes (+ 10 minutes for questions) and must be sent in PDF format through the Google form (see below).
  * report: the report must be written in LaTeX, using the same template as the other mid terms. It must be at most 15 pages long. Extend/modify the previous report. A zipped folder (.zip file) containing the .tex file, the .cls file and the files of all figures/plots must be sent through the Google form. In the report, include the title of your project and the names of the members of your team.
  * code: the Python code used to generate the computations and the plots, in .ipynb (Jupyter Notebook) or .py format, must be sent through the Google form. Please document your notebooks adequately using Markdown.
  * Google form: upload the material by December 6th using this form: https://forms.gle/65rRgYiNwz9ofdkn8
  * name the files using the format midterm3_teamname_type, where teamname is the name of the team (lowercase only, no spaces) and type is the type of the file (i.e., presentation, report, or code). Examples: midterm3_iteamidi_presentation.pdf, midterm3_beidatiacrobatici_report.zip, midterm3_amsgroup_code.ipynb

Examples of projects from past years:
  * Credit Risk Prediction, final report: credit_risk_prediction_heloc_case.pdf
  * Ted Talks, final report: dataworms_project_ted_talks_report.pdf

LESSON OF DECEMBER 8th: December 8th is a holiday in Italy. We can decide whether to hold the lesson that day anyway or to move it to another 2-hour slot next week. Please fill in this doodle by Fri Dec 4th (only students whose project presentation slot is on Dec 8th should fill it in): https://doodle.com/poll/nxbia82mf3e48bqw?utm_source=poll&utm_medium=link

====== Learning goals ======

In our digital society, every human activity is mediated by information technologies and therefore leaves digital traces behind. These massive traces are stored in public or private repositories: phone call records, movement trajectories, soccer logs and social media records are all examples of “Big Data”, a novel and powerful “social microscope” to understand the complexity of our societies. The analysis of big data sources is a complex task that requires knowledge of several technological and methodological tools. This course has three objectives:
  * introducing the emerging field of big data analytics and social mining;
  * introducing the technological scenario of big data, such as programming tools to analyze big data, query NoSQL databases, and perform predictive modeling;
  * guiding students in the development of an open-source and reproducible big data analytics project, based on the analysis of real-world datasets.
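
To make the predictive-modeling objective concrete, the sketch below shows one way a step of the kind requested for the second mid term (a baseline, hyper-parameter tuning, cross validation) might look with scikit-learn. It is only an illustration and not part of the course material: the dataset (`load_breast_cancer`), the decision tree, the parameter grid, and the macro-averaged F1 metric are all assumptions chosen for brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: a small built-in dataset stands in for the project dataset.
X, y = load_breast_cancer(return_X_y=True)

# Stratified 5-fold cross validation, seeded so the run is reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_f1 = cross_val_score(baseline, X, y, cv=cv, scoring="f1_macro").mean()

# Hyper-parameter tuning of a decision tree over a small, illustrative grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]},
    cv=cv,
    scoring="f1_macro",
)
search.fit(X, y)
tree_f1 = search.best_score_

print(f"baseline macro-F1: {baseline_f1:.3f}")
print(f"tuned tree macro-F1: {tree_f1:.3f} with {search.best_params_}")
```

Because the grid search selects the hyper-parameters on the same folds it reports, `tree_f1` is an optimistic estimate; a held-out test set or nested cross validation would give a fairer comparison with the baseline.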

====== Module 1: Big Data Analytics and Social Mining ======

In this module, analytical methods and processes are presented through exemplary case studies in challenging domains, organized according to the following topics:
  * The Big Data Scenario and the new questions to be answered
  * Sport Analytics
    - Soccer data landscape and injury prediction
    - Analysis and evolution of sports performance
  * Mobility Analytics
    - Mobility data landscape and mobility data mining methods
    - Understanding Human Mobility with vehicular sensors (GPS)
    - Mobility Analytics: Novel Demography with mobile-phone data
  * Social Media Mining
    - The social media data landscape: Facebook, LinkedIn, Twitter, Last.fm
    - Sentiment analysis: an example from human migration studies
    - Discussion on ethical issues of Big Data Analytics
  * Well-being & Nowcasting
    - Nowcasting influenza with retail market data
    - Predicting well-being from human mobility patterns
  * Paper presentations by students

====== Module 2: Big Data Analytics Technologies ======

This module will provide students with the technologies to collect, manipulate and process big data. In particular, the following tools will be presented:
  * Python for Data Science
  * The Jupyter Notebook: developing open-source and reproducible data science
  * MongoDB: fast querying and aggregation in NoSQL databases
  * GeoPandas: analyze geo-spatial data with Python
  * Scikit-learn: machine learning in Python
  * Keras: deep learning in Python

====== Module 3: Laboratory for Interactive Project Development ======

During the course, teams of students will be guided in the development of a big data analytics project. The projects will be based on real-world datasets covering several thematic areas. Discussions and presentations will take place in class at different stages of the project:
  * 1st Mid Term: Data Understanding and Project Formulation
  * 2nd Mid Term: Model(s) construction and evaluation
  * 3rd Mid Term: Model interpretation/explanation
  * Exam: Final Project results

====== Calendar ======

  * 14/09 (Mod. 1) Introduction to the course, The Big Data scenario: lesson1_introduction_to_the_course_bda2021.pdf
  * 15/09 (Mod. 2) Python for Data Science and the Jupyter Notebook: developing open-source and reproducible data science
    * How to install Jupyter notebook: https://jupyter.readthedocs.io/en/latest/install.html
    * Python notebooks: http://bit.ly/bda2021_notebooks_1
  * 21/09 No Lesson (Election Day in Italy)
  * 22/09 (Mod. 3) Presentation of datasets for projects: bda20_21_datasets_1_.pdf
  * 28/09 (Mod. 2) Scikit-learn: programming tools for data mining (part 1): http://bit.ly/bda_notebooks_2
  * 29/09
    * (Mod. 2) Scikit-learn: programming tools for data mining (part 2): http://bit.ly/bda_notebooks_2
    * (Mod. 1) Reproducing and Explaining Human Evaluations of Soccer Performance with Artificial Intelligence: evaluting_soccer_performance_1_.pdf
  * 05/10 No Lesson (SocInfo2020 conference)
  * 06/10 No Lesson (SocInfo2020 conference)
  * 12/10 (Mod. 2) GeoPandas and scikit-mobility: managing geographic data in Python (part 1): bda2021_geopandas.zip
  * 13/10 (Mod. 2) GeoPandas and scikit-mobility: managing geographic data in Python (part 2): https://github.com/scikit-mobility/tutorials/tree/master/mda_masterbd2020
  * 19/10 (Mod. 3) 1st Mid Term - first group of teams
  * 20/10 (Mod. 3) 1st Mid Term - second group of teams
  * 26/10 (Mod. 3) Discussion and group working on projects
  * 27/10 (Mod. 3) Discussion and group working on projects
  * 02/11 (Mod. 1) Nowcasting well-being with big data: bda_wellbeing.pdf
  * 03/11 (Mod. 1) Injury prediction in sports with AI: bda_2020_injury_forecasting.pdf
  * 09/11 (Mod. 3) Discussion and group working on projects
  * 10/11 (Mod. 1) Trustworthy data mining and Explainable AI: parti1.explainableai-10.11.2020.pdf
  * 16/11 (Mod. 3) 2nd Mid Term - first group of teams
  * 17/11 (Mod. 3) 2nd Mid Term - second group of teams
  * 23/11 (Mod. 3) Discussion and group working on projects
  * 24/11 No Lesson
  * 30/11 (Mod. 3) Paper presentation
  * 01/12 (Mod. 3) Paper presentation
  * 07/12 (Mod. 3) 3rd Mid Term - first group of teams
  * 08/12 (Mod. 3) 3rd Mid Term - second group of teams

===== Exam (Appelli) =====

  * January 14th, 2021
  * February 4th, 2021

====== Previous Big Data Analytics websites ======

  * Big Data Analytics A.A. 2019/20
  * Big Data Analytics A.A. 2018/19
  * Big Data Analytics A.A. 2017/18
  * Big Data Analytics A.A. 2016/17
  * Big Data Analytics A.A. 2015/16

bigdataanalytics/bda/start.1607338711.txt.gz · Last modified: 07/12/2020 at 10:58 (4 years ago) by Luca Pappalardo
