====== Big Data Analytics Academic Year 2016/17 ======

Instructors:

  * **Fosca Giannotti, Roberto Trasarti**
  * KDD Laboratory, Università di Pisa and ISTI-CNR, Pisa
  * [[http://www-kdd.isti.cnr.it]]
  * [[fosca.giannotti@isti.cnr.it]]
  * [[roberto.trasarti@isti.cnr.it]]

**Exam days: 23-24 January / 16-20 February. Send an email to register, and remember to submit the report 7 days before the exam.**

====== Learning goals ======

In our digital society, every human activity is mediated by information technologies. Therefore, every activity leaves digital traces behind that can be stored in some repository: phone call records, transaction records, web search logs, movement trajectories, social media texts and tweets. Every minute, an avalanche of “big data” is produced by humans, consciously or not, that represents a novel, accurate digital proxy of social activities at global scale. Big data provide an unprecedented “social microscope”, a novel opportunity to understand the complexity of our societies, and a paradigm shift for the social sciences.

The objective of the course is twofold: (1) an introduction to the emergent field of big data analytics and social mining, aimed at acquiring and analyzing big data from multiple sources in order to discover the patterns and models of human behavior that explain social phenomena; and (2) an introduction to the technological scenario of scalable analytics.

====== Course structure ======

The course is organized into three intertwined modules:

**Module1**: Big Data Analytics and Social Mining. The focus is on what can be learnt from big data in different domains (mobility and transportation, urban planning, demographics, economics, social relationships, opinion and sentiment, etc.) and on the analytical and mining methods that can be used.

**Module2**: Scalable Data Analytics Technologies.
The focus is on managing the pipeline of the analytical process to build scalable, robust data science applications: introduction to Hadoop, Spark and Mahout. Managing scalability: real case examples.

**Module3**: Student Activities. Students are required to participate actively with individual seminars and team projects.

===== Software & Links =====

  * Python website: http://www.python.it/download/ (install version 2.x; do not install 3.x). Instructions: [[https://goo.gl/yBRjkG]]
  * Scrapy webpage: http://scrapy.org/
  * Installing Hadoop single node on your machine (without VM): https://goo.gl/KGME9t (Linux/OS) https://goo.gl/7Bkcnr (Win)
  * Spark: http://spark.apache.org/downloads.html (can be installed without Hadoop)

Virtual Machine:

  * http://hortonworks.com/products/hortonworks-sandbox/#install (Hortonworks VM root/hadoop; http://127.0.0.1:8888 or ssh root@127.0.0.1 -p 2222)
  * http://www.cloudera.com/downloads/quickstart_vms/5-8.html (Cloudera VM cloudera/cloudera)
  * https://www.virtualbox.org/ (VirtualBox, virtual machine manager)

Other resources:

  * [[http://www.d4d.orange.com/fr/content/download/43453/406503/version/1/file/D4DChallengeSenegal_Book_of_Abstracts_Scientific_Papers.pdf| D4D challenge paper]]
  * [[https://drive.google.com/file/d/0B_IBzUGc9jCPM3NYbnJickpVXzA/view?usp=sharing|Curricular internship (Tirocinio Curriculare)]] (under 29 years old) (http://kdd.di.unipi.it/dichiarazione.pdf)

====== Hours and Rooms ======

  * Monday 16:00 - 18:00, Room Fib N1
  * Friday 9:00 - 11:00, Room Fib L1

^ ^ Day ^ Topic ^ Materials ^ Notes ^ Instructor ^
|1.| Mon 26/09 | Course Presentation; Module1: Big Data Landscape: Opportunities, risks, big data sources, challenges. | http://goo.gl/b2syFA | 1st student assignment: “Big Data, Data Analyst, Crowdsourcing, Crowdsensing” (at least one) | Giannotti/Trasarti |
|2.| Fri 30/09 | Module2: Introduction to Hadoop | https://goo.gl/0UiFg8 | | Trasarti |
|3.| Mon 3/10 | Module1: Big Data Analytics scenario: New questions to be answered | | Round table | Giannotti |
|4.| Fri 7/10 | Understanding the dynamics of society with Mobile Phone Traces | https://goo.gl/kEt3m3 | Project assignment https://goo.gl/DkGMJg | Giannotti |
|5.| Mon 10/10 | Module2: Design Patterns | https://goo.gl/ksJQDJ | 2nd student assignment: Papers & Technologies | Trasarti |
|6.| Fri 14/10 | Module2: Analyzing Big Data with Spark | https://goo.gl/On4B77 https://goo.gl/luhYzB | | Trasarti |
|--| Mon 17/10 | | | | |
|--| Fri 21/10 | | | | |
|7.| Mon 24/10 | Module2: Data Mining with Spark | https://goo.gl/IWWkkc | | Trasarti |
|8.| Fri 28/10 | Module3: Project formulation | Formulations here: https://goo.gl/0iyKAM | | Trasarti |
|9.| Mon 7/11 | Module2: Technologies highlights I | | Student seminars (technologies 1/6) Technologies here: https://goo.gl/9tU1bT | Trasarti |
|10| Fri 11/11 | Module2: Technologies highlights II | | Student seminars (technologies 2/6) & Master Big Data presentation, [[http://masterbigdata.it/it/event/registration/midterm2016 |register here!]] | Trasarti |
|11| Mon 14/11 | Module3: Mid-term Project presentations | Slides here: https://goo.gl/2rCX4l | | Trasarti |
|12| Fri 18/11 | Module1: Understanding Human Mobility with Big data | | | Giannotti |
|13| Mon 21/11 | Module1: Novel Demography with Phone Data | | Resources: https://goo.gl/LiQkbn | Giannotti |
|14| Fri 25/11 | Module1: Deep Learning | https://goo.gl/WUNR4S | Round table | Giannotti |
|15| Mon 28/11 | Module2: Realizing a scalable sociometer (ASAP Project) | https://goo.gl/2kQiBJ | Student seminars (papers 3/6) Papers here: https://goo.gl/gKm4w2 | Trasarti |
|16| Fri 02/12 | Module2: Realizing a classifier for GPS traces (Navionics Project) | https://goo.gl/esAhxd | Student seminars (papers 4/6) | Trasarti |
|--| Mon 05/12 | | | | |
|17| Fri 09/12 | Module1: Social media mining - Sentiment analysis | | Student seminars (papers 5/6) | Giannotti |
|18| Mon 12/12 | | | Student seminars (papers 6/6) | Trasarti |
|19| Fri 16/12 | Module3: Student pre-final Project Presentations | https://goo.gl/rerdUv | Groups 1,2,4,5,9 | Giannotti |
|20| Mon 19/12 | Module3: Student pre-final Project Presentations | https://goo.gl/rerdUv | Groups 3,6,7,8 | Giannotti |

====== Exam ======

The exam consists of three parts:

  * The **assignments** during the course (papers and technologies) (20%)
  * A **project**: the work done should be summarized in a report (max. 10 pages), to be sent to the instructors at least a week before the oral exam (project discussion) (50%)
  * An **oral exam**: the discussion of the project with a group presentation (30 minutes for the whole group) (30%)

In addition, students may ask for additional questions on the course content to improve their grade.

=== Round Table ===

Each team of 2 students prepares a 3-minute presentation (~1-2 slides) about the round table topic. The presentation must be sent before the round table and should contain: the names of the students, the resources used (Web pages, books, papers, etc.) and the students' opinion about the topic.

Grades: (A) Excellent, (B) Very Good, (C) Good, (D) Sufficient, (E) Not Sufficient

=== Papers & Technologies assignment ===

^ Title ^ Students ^ Seminar day ^ Grade ^
| 1 - Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale | Giannella Raffaele | 3 | B |
| 2 - Petuum: A New Platform for Distributed Machine Learning on Big Data | Francesco Scigliuzzo | 3 | D/E |
| 3 - Forecasting Fine-Grained Air Quality Based on Big Data | Francesco La Perna | 3 | A+ |
| 4 - Untangling performance from success | Filippo Delle Macchie | 6 | A/B |
| 5 - Panther: Fast Top-k Similarity Search on Large Networks | Giacinto Trafficante | 3 | A |
| 6 - The Effectiveness of Marketing Strategies in Social Media: Evidence from Promotional Events | Maurizio Quintini | 3 | A/B |
| 9 - Online Topic-based Social Influence Analysis for the Wimbledon Championships | Matteo Borghi | 4 | A |
| 10 - E-commerce in Your Inbox: Product Recommendations at Scale | Nunzio Spontella | 4 | B |
| 11 - Gender and Interest Targeting for Sponsored Post Advertising at Tumblr | Nicolò Dossena | 4 | A/B |
| 12 - Traffic Measurement and Route Recommendation System for Mass Rapid Transit (MRT) | Tommaso Furlan | 5 | B |
| 13 - Discovering Collective Narratives of Theme Parks from Large Collections of Visitors’ Photo Streams | Benedetta Iavarone | 6 | A |
| 14 - Early Identification of Violent Criminal Gang Members | Baltakiene Margarita | 5 | A |
| 16 - Building Discriminative User Profiles for Large-scale Content Recommendation | Rossi Maria Teresa | 6 | A/B |
| 17 - An analytical framework to nowcast well-being using mobile phone data. | Pietro Gianluca Calamia | 6 | A |
| 18 - Mobile Communication Signatures of Unemployment | Ada Gentile | 6 | A/B |
| 19 - Dataveillance and the False-Positive Paradox | Fabrizio Rizzi | 6 | A |
| 20 - On the Dominant Role of Returners’ Human Mobility Networks on Urban Energy Consumption | Giuseppe Di Modugno | 6 | A |
| 22 - Do Street Fairs Boost Local Businesses? A Quasi-Experimental Analysis Using Social Network Data | Emiliano Fuccio | 6 | B |
| 23 - No place to hide? The ethics and analytics of tracking mobility using mobile phone data | Martina Miliani | 6 | A |
| T1 - Hive: https://hive.apache.org/ | Antonio Loconte | 1 | A/B |
| T2 - Scala: http://www.scala-lang.org/ | Simona Ortolani | 1 | B |
| T4 - HBase: http://hbase.apache.org/ | Maria Francesca Montisci | 2 | B |
| T5 - Flume: https://flume.apache.org/ | Cristian Criscolo | 2 | D |
| T7 - Oozie: http://oozie.apache.org/ | Lapo Chirici | 6 | A+ |
| T9 - ZooKeeper: https://zookeeper.apache.org/ | Andrea Meini | 1 | B/C |
| T11 - Julia: http://julialang.org/ | Maurizio Deidda | 2 | D |
| T12 - Docker: https://www.docker.com/ | Alessandro Romano | 1 | A |

Repository for the papers: http://goo.gl/5BQ50o

Each student prepares a presentation of 10 minutes (~5 slides) for papers or 15 minutes (~8 slides) for technologies:

  * Paper presentations should contain: data description, problem statement, data manipulation, the analytical process and validation
  * Technology presentations should contain: technology objectives, features provided, limitations, examples of usage and documentation references

At the end of the presentation there will be 5 minutes of discussion and questions.

Students must use this link https://goo.gl/7QzR2V to express their preferences (do not change papers or technologies that are already taken), and they will be allocated to one of the seminar days. Deadline for expressing preferences: 13/10.

=== Project requirements ===

Each team of 3-4 students selects a dataset from the proposed ones and formulates the objectives of its project. The progress of the work is presented during the course in three stages: formulation, mid-term and pre-final. These presentations are intended to collect feedback from the other students and the instructors in order to improve the final project. During the exam, the team gives the final presentation of the project.
The final project presentation/report should include:

  * Formulation of the problem to be solved (also inspired by the proposed papers)
  * Data acquisition/pre-processing and data exploration
  * Formulation of the problem to be solved as a data mining problem
  * Implementation of the proposed solution in a big data platform
  * Model construction and validation
  * Discussion of result exploitation
  * Ethical and privacy issues

To express your preferences about the dataset and the composition of the groups, please use the following link: https://goo.gl/N9WiD1

**Remember to add your email address to receive the NDA to sign. I will share the dataset only after I receive the signed NDA from all the students in a group!**

=== Shared Folder ===

All the documents, presentations or other materials produced by the students must be uploaded to the following shared folder. Create a folder with your surname(s) and put the files inside it.

https://drive.google.com/drive/folders/0B_IBzUGc9jCPUmNCY0xudUtiR28?usp=sharing

====== Big Data Analytics 2015/2016 website ======

http://didawiki.di.unipi.it/doku.php/bigdataanalytics/bda/bda2015