====== Big Data Analytics, Academic Year 2017/18 ======

Instructors:

  * **Fosca Giannotti, Roberto Trasarti**
  * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
  * [[http://www-kdd.isti.cnr.it]]
  * [[fosca.giannotti@isti.cnr.it]]
  * [[roberto.trasarti@isti.cnr.it]]

====== Learning goals ======

In our digital society, every human activity is mediated by information technologies. Every activity therefore leaves digital traces that can be stored in some repository: phone call records, transaction records, web search logs, movement trajectories, social media texts and tweets. Every minute, humans produce, consciously or not, an avalanche of “big data” that represents a novel, accurate digital proxy of social activities at global scale. Big data provide an unprecedented “social microscope”, a novel opportunity to understand the complexity of our societies, and a paradigm shift for the social sciences.

The objective of the course is twofold:

  * an introduction to the emergent field of big data analytics and social mining, aimed at acquiring and analyzing big data from multiple sources in order to discover the patterns and models of human behavior that explain social phenomena;
  * an introduction to the technological scenario of scalable analytics.

=== Intro lectures ===

  * Lecture 1: Course presentation, course organization, Big Data landscape: opportunities, risks, big data sources, challenges. Slides: [[https://goo.gl/WztPDg]]

=== Technologies lectures ===

  * Lecture 1: Overview/Recall of parallel computing. Slides: [[https://goo.gl/eCwz7G]]
  * Lecture 2: Introduction to Hadoop and Map-Reduce patterns. Slides: [[https://goo.gl/kukSQx]] [[https://goo.gl/efVLKD]]
  * Lecture 3: HDFS and Spark (LAB).
  * Lecture 3 slides: [[https://goo.gl/eD5p6c]]
  * Lecture 4-5-6: Data Analytics with Spark (LAB) (last slides of Lecture 3, with exercises): [[https://goo.gl/AQJXhD]]
  * Lecture 7-8-9: Data Mining with Spark and MLlib (LAB). Slides: [[https://goo.gl/HJEQwT]], Materials: [[https://goo.gl/VxAEhi]]

=== Methodological scenarios lectures ===

  * Lecture 1-2: What is it possible to observe with Mobile Phone Data? Formulation of novel questions to be answered: estimating population, understanding city dynamics, estimating unemployment or gender distribution, wellbeing; the complexity of feature construction; model construction; new mining algorithms; validation strategies. Slides: [[https://goo.gl/fULiAu]], [[https://goo.gl/UZEPdu]]
  * Lecture 3-4: What is it possible to observe with GPS data? Formulation of novel questions to be answered: understanding human mobility; the complexity of feature construction; new model construction; new mining algorithms; validation strategies. Slides: [[https://goo.gl/ztUvLd]]
  * Lecture 5-6: What is it possible to observe with Social Media Data? Formulation of novel questions to be answered: understanding sentiment, wellbeing, happiness; the complexity of feature construction; new model construction; new mining algorithms; validation strategies.
  * Lecture 7: What is it possible to observe with IoT Data? Formulation of novel questions to be answered: understanding performance in sport; the complexity of feature construction; new model construction; new mining algorithms; validation strategies.

====== Datasets ======

  * The datasets overview: [[https://goo.gl/fyAjth]]
  * The datasets folder: [[https://goo.gl/nPd6HT]]

Solutions for the tech mid-terms are in the exercises folder of the datasets.

====== Calendar ======

  * **18/09** - (Intro) Course Presentation, Big Data Landscape
  * **22/09** - (Tech) Overview/Recall of parallel computing
  * **25/09** - (Method) What is it possible to observe with Mobile Phone Data? (i)
  * **29/09** - (Method) What is it possible to observe with Mobile Phone Data? (ii)
  * **02/10** - (Tech) Introduction to Hadoop and Design Patterns (Lab)
  * **06/10** - Cancelled!
  * **09/10** - (Tech) Managing HDFS and Introduction to Spark (Lab) and Datasets Presentation
  * **13/10** - (Tech) Data Analytics with Spark (Lab)
  * **16/10** - (Tech) Data Analytics with Spark (Lab)
  * //**20-23/10** - No Class (Time to practice!)//
  * **27/10** - (Tech) Data Analytics with Spark (Lab)
  * //**30/10** - Mid-term Tech I - starts at 16:30, you will have 1 hour and 30 minutes//
  * **06/11** - (Tech) Data Mining with Spark and MLlib (Lab) (i)
  * **10/11** - (Method) What is it possible to observe with GPS data? (i)
  * **13/11** - (Tech) Data Mining with Spark and MLlib (Lab) (ii)
  * **17/11** - (Method) What is it possible to observe with GPS data? (ii)
  * //**20/11** - Discussing the final project proposal - Collective discussion (not evaluated)//
  * **24/11** - (Tech) Data Mining with Spark and MLlib (Lab) (iii)
  * **27/11** - (Method) What is it possible to observe with Social Media Data? (i)
  * **01/12** - (Method) What is it possible to observe with Social Media Data? (ii)
  * **04/12** - (Method) What is it possible to observe with GPS data? (iii)
  * **11/12** - Cancelled due to weather
  * **15/12** - Discussing the final project proposal - Collective discussion (not evaluated), and (Method) What is it possible to observe with IoT data? Examples from sport
  * //**18/12** - Mid-term Tech II//
  * //**12/01** - 14:00 @ CNR (Entrance 20 - Room C36b) - Mid-term Tech part I and/or II (second chance; send an e-mail before 07/01 if you want to take it)//
  * //**22/01 - 16/02** - Final Project and Discussion: 14:00 @ CNR (Entrance 20 - Room C40)//

===== Exam =====

The two mid-terms count for 40% of the final grade; the remaining 60% is the evaluation of the Project and the Discussion (prepare some slides to present your project). If the mid-terms are not sufficient, there is the possibility to take a final test on technologies.
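As a rough illustration of the weighting described above, the final grade can be sketched as a weighted average. This is only a sketch under assumptions that this page does not state: that the two mid-terms weigh equally within their 40% share and that all marks are on the same scale. The function name ''final_grade'' and the sample marks are hypothetical.

```python
def final_grade(midterm_1, midterm_2, project):
    """Combine marks using the 40% / 60% split described above.

    Assumptions (not stated on the course page): the two mid-terms
    weigh equally within their 40% share, and all marks share one scale.
    """
    midterm_average = (midterm_1 + midterm_2) / 2.0
    return 0.4 * midterm_average + 0.6 * project


if __name__ == "__main__":
    # Hypothetical marks, for illustration only.
    print(final_grade(28, 30, 27))
```

Under these assumptions the project mark dominates: one point more or less on the project moves the result by 0.6, while one point on a single mid-term moves it by 0.2.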
The following table describes the expected content of a project:

{{:bigdataanalytics:bda:project.png?800|}}

===== Laboratories =====

Students should bring their own laptop (especially for technology lectures).

Software & Links

  * Python website: http://www.python.it/download/ (install version 2.x; do not install 3.x). Instructions: [[https://goo.gl/yBRjkG]]
  * Installing Hadoop single node on your machine (without VM): https://goo.gl/KGME9t (Linux/OS) https://goo.gl/7Bkcnr (Win)
  * Spark: http://spark.apache.org/downloads.html (can be installed without Hadoop)

Virtual Machines:

  * http://hortonworks.com/products/hortonworks-sandbox/#install (Hortonworks VM, root/hadoop, http://127.0.0.1:8888 or ssh root@127.0.0.1 -p 2222)
  * http://www.cloudera.com/downloads/quickstart_vms/5-8.html (Cloudera VM, cloudera/cloudera)
  * https://www.virtualbox.org/ (VirtualBox - virtual machine manager)