Laboratory on Algorithms for Big Data 2014/2015

Some useful infos on the course:

Teacher: Paolo Ferragina and Rossano Venturini
CFU: 6
Period: First semester
Language: English
Lectures schedule: Tuesday 9-11 (N1) and Thursday 16-18 (C).
Question time: After lectures or by appointment.
Allowed Degree: Master degree in Computer Science, or Computer Science and Networking, or Business Informatics
News: about this course will be distributed via a Twitter-channel.
Official lectures schedule: The schedule and content of the lectures is available below and with the official registro.

Goals and opportunities for students

The course consists of a first part of lectures describing advanced algorithms and data structures (3 CFU), and a laboratory in the second part (3 CFU) in which the students will deploy these techniques to develop a software project. The students will select their projects among a set of proposals by some IT companies which are challenging from an algorithmic perspective. These companies will also contribute to identify/construct significant datasets that will help in testing the proposed algorithmic solutions.

The course will provide the opportunity of

facing with difficult algorithmic problems of practical interest involving big data;
evaluating the impact of efficient algorithmic solutions in the design of software managing big data;
implementing advanced software by using powerful and sophisticated libraries;
getting in touch with some companies for internships, scholarships, or thesis proposals.

Exam

Date	Room	Text
22/01/2015, 09:00	L1	text
13/02/2015, 09:00	L1	text
05/06/2015, 09:00	L1	text
29-06-2015, 09:00	L1	text
20-07-2015, 09:00	L1	text
10-09-2015, 09:00	L1	text

The exam consists of three parts: Project 70%, Written/oral test 20%, Project presentation 10%. Students can attend the written/oral test before the presentation/development of the project.

As far as the two projects are concerned, we list below the datasets for each of them. These are warm-up datasets of moderate size, yet sufficient to start designing your solutions and be concerned with time/space efficiency issues. We will provide you larger datasets to test the final codes you'll produce.

Project 1. The dataset consists of the following parts:

Dictionary of English terms (3.5Mb, 354.984 terms);
Dictionary of titles of Wikipedia articles (116 Mb, 6.971.668 titles). This dataset has been modified by introducing errors;
Queries obtained by introducing errors in the dictionary of terms above (2.126.355 queries);
Queries obtained by taking the Wikipedia titles (376.304 queries).

Few related papers.

Project 2. The dataset consists of 2.8Gb of 7zipped web pages drawn from the UK-GOV2 collection, and in WARC format. Some techniques that you can use for your project are described in these papers. Software libraries are indicated here, plus few additional softwares from students listed below:

WARC parser (by Andrea Cardaci).

Background

If you wish to refresh your mind on basic Algorithms and Data Structures, we suggest you to look at the well-known book Introduction to Algorithms by Cormen-Leiserson-Rivest-Stein, third edition.

Moreover, you can look at the content of the Algorithm Engineering course by Paolo Ferragina. Anyway, depending on the topics addressed in the lectures you'll be pointed out to the suitable chapters of Ferragina's notes.

Lectures

We'll often do not project slides for teaching, but use mainly the old-fashioned blackboard. Most of the content of the course will be covered by notes, sometime we'll use parts of papers/books.

Date	Lecture	Biblio
24/09/2014	Introduzione al corso: temi, progetti e svolgimento dell'esame.	progetti assegnati, slides
25/09/2014	(Compact) Trie, Ternary Search Trie, e Patricia Trie.	Sezioni 7.1, 7.4 e 7.5 di queste note. Questo articolo su Dr.Dobb's.
30/09/2014	Strutture dati succinte: rank/select, Elias-Fano, alberi (LOUDS e BP).	Lucidi sufficienti per lo studio.
02/10/2014	Strutture dati per l'indicizzazione compressa di testi: Suffix Array, Trasformata di Burrows-Wheeler, FM-index (count, locate, extract).	Lucidi sufficienti per lo studio.
07/10/2014	Compressed Permuterm Index e Dictionary search with prefix, suffix, substring, prefixSuffix match. Hashing: chaining (with time/space complexity), Universal hashing, Minimal ordered perfect hash.	Lucidi1 e Lucidi2 sufficienti per lo studio.
09/10/2014	Alberi succinti DFUDS. Descrizione Progetto 1 e possibili soluzioni.	Lucidi
14/10/2014	Permuting e Sorting: I/O-bottleneck, binary mergesort, multi-way mergesort.	Lucidi e capitolo 5 note di P. Ferragina (no 5.1.2, e no 5.2 in poi).
16/10/2014	Near-duplicate document detection: problem definition and comments, Karp-Rabin-fingerprint, Shingling, Jaccard similarity of sets, document sketches, locality sensitive hashing, the detection process.	Slides e parte di capitolo.
21/10/2014	An introduction to data compression: Gamma/Delta codes, Huffman code, Lempel-Ziv 1977, and Burrows-Wheeler transform.	Slides. Sect. 9.1 of these notes, Sect. 10.1 of these notes, and Sect. 11.1 of these notes.
23/10/2014	Discussion on the projects.
28/10/2014	Clustering: soft/hard clustering, bottom-up and top-down approaches, various metrics, K-means. Discussion on Project 2	Slides on clustering. Slides on the second project, and papers to read.
30/10/2014	Discussion on the projects.
11/11/2014	Discussion on the projects: To generate k random positions, p(i), one could use the computation p(i) = h1(x) + i h2(x), where x is a random number and h1/h2 are MurmurHash functions.
13/11/2014	Discussion on the projects.
18/11/2014	Discussion on the projects: Venturini's office.
20/11/2014	Discussion on the projects: Venturini's office.
25/11/2014	Discussion on the projects: Venturini's office.
27/11/2014	Discussion on the projects: Ferragina's office.
02/12/2014	Discussion on the projects: Ferragina's office.
04/12/2014	Discussion on the projects: Ferragina's office.
09/12/2014	Discussion on the projects: Ferragina's office.
11/12/2014	Discussion on the projects: Venturini's office.
16/12/2014	Discussion on the projects: Ferragina's office.

DidaWiki

Strumenti Utente

Strumenti Sito

Indice

Laboratory on Algorithms for Big Data 2014/2015

Goals and opportunities for students

Exam

Background

Lectures

Strumenti Pagina