Strumenti Utente

Strumenti Sito


magistraleinformatica:lad:lad14:start

Questa è una vecchia versione del documento!


Laboratory on Algorithms for Big Data 2014/2015

Some useful infos on the course:

  • CFU: 6
  • Period: First semester
  • Language: English
  • Lectures schedule: Tuesday 9-11 (N1) and Thursday 16-18 (C).
  • Question time: After lectures or by appointment.
  • Allowed Degree: Master degree in Computer Science, or Computer Science and Networking, or Business Informatics
  • News: about this course will be distributed via a Twitter-channel.
  • Official lectures schedule: The schedule and content of the lectures is available below and with the official registro.


Goals and opportunities for students

The course consists of a first part of lectures describing advanced algorithms and data structures (3 CFU), and a laboratory in the second part (3 CFU) in which the students will deploy these techniques to develop a software project. The students will select their projects among a set of proposals by some IT companies which are challenging from an algorithmic perspective. These companies will also contribute to identify/construct significant datasets that will help in testing the proposed algorithmic solutions.

The course will provide the opportunity of

  • facing with difficult algorithmic problems of practical interest involving big data;
  • evaluating the impact of efficient algorithmic solutions in the design of software managing big data;
  • implementing advanced software by using powerful and sophisticated libraries;
  • getting in touch with some companies for internships, scholarships, or thesis proposals.

Exam

Date Room Text
22/01/2015, 09:00 L1 text
13/02/2015, 09:00 L1 text

The exam consists of three parts: Project 70%, Written/oral test 20%, Project presentation 10%. Students can attend the written/oral test before the presentation/development of the project.

As far as the two projects are concerned, we list below the datasets for each of them. These are warm-up datasets of moderate size, yet sufficient to start designing your solutions and be concerned with time/space efficiency issues. We will provide you larger datasets to test the final codes you'll produce.

Project 1. The dataset consists of the following parts:

  • Dictionary of English terms (3.5Mb, 354.984 terms);
  • Dictionary of titles of Wikipedia articles (116 Mb, 6.971.668 titles). This dataset has been modified by introducing errors;
  • Queries obtained by introducing errors in the dictionary of terms above (2.126.355 queries);
  • Queries obtained by taking the Wikipedia titles (376.304 queries).

Few related papers.

Project 2. The dataset consists of 2.8Gb of 7zipped web pages drawn from the UK-GOV2 collection, and in WARC format. Some techniques that you can use for your project are described in these papers. Software libraries are indicated here, plus few additional softwares from students listed below:

Background

If you wish to refresh your mind on basic Algorithms and Data Structures, we suggest you to look at the well-known book Introduction to Algorithms by Cormen-Leiserson-Rivest-Stein, third edition.

Moreover, you can look at the content of the Algorithm Engineering course by Paolo Ferragina. Anyway, depending on the topics addressed in the lectures you'll be pointed out to the suitable chapters of Ferragina's notes.

Lectures

We'll often do not project slides for teaching, but use mainly the old-fashioned blackboard. Most of the content of the course will be covered by notes, sometime we'll use parts of papers/books.

Date Lecture Biblio
24/09/2014 Introduzione al corso: temi, progetti e svolgimento dell'esame. progetti assegnati, slides
25/09/2014 (Compact) Trie, Ternary Search Trie, e Patricia Trie. Sezioni 7.1, 7.4 e 7.5 di queste note. Questo articolo su Dr.Dobb's.
30/09/2014 Strutture dati succinte: rank/select, Elias-Fano, alberi (LOUDS e BP). Lucidi sufficienti per lo studio.
02/10/2014 Strutture dati per l'indicizzazione compressa di testi: Suffix Array, Trasformata di Burrows-Wheeler, FM-index (count, locate, extract). Lucidi sufficienti per lo studio.
07/10/2014 Compressed Permuterm Index e Dictionary search with prefix, suffix, substring, prefixSuffix match. Hashing: chaining (with time/space complexity), Universal hashing, Minimal ordered perfect hash. Lucidi1 e Lucidi2 sufficienti per lo studio.
09/10/2014 Alberi succinti DFUDS. Descrizione Progetto 1 e possibili soluzioni. Lucidi
14/10/2014 Permuting e Sorting: I/O-bottleneck, binary mergesort, multi-way mergesort. Lucidi e capitolo 5 note di P. Ferragina (no 5.1.2, e no 5.2 in poi).
16/10/2014 Near-duplicate document detection: problem definition and comments, Karp-Rabin-fingerprint, Shingling, Jaccard similarity of sets, document sketches, locality sensitive hashing, the detection process. Slides e parte di capitolo.
21/10/2014 An introduction to data compression: Gamma/Delta codes, Huffman code, Lempel-Ziv 1977, and Burrows-Wheeler transform. Slides. Sect. 9.1 of these notes, Sect. 10.1 of these notes, and Sect. 11.1 of these notes.
23/10/2014 Discussion on the projects.
28/10/2014 Clustering: soft/hard clustering, bottom-up and top-down approaches, various metrics, K-means. Discussion on Project 2 Slides on clustering. Slides on the second project, and papers to read.
30/10/2014 Discussion on the projects.
11/11/2014 Discussion on the projects: To generate k random positions, p(i), one could use the computation p(i) = h1(x) + i h2(x), where x is a random number and h1/h2 are MurmurHash functions.
13/11/2014 Discussion on the projects.
18/11/2014 Discussion on the projects: Venturini's office.
20/11/2014 Discussion on the projects: Venturini's office.
25/11/2014 Discussion on the projects: Venturini's office.
27/11/2014 Discussion on the projects: Ferragina's office.
02/12/2014 Discussion on the projects: Ferragina's office.
04/12/2014 Discussion on the projects: Ferragina's office.
09/12/2014 Discussion on the projects: Ferragina's office.
11/12/2014 Discussion on the projects: Venturini's office.
16/12/2014 Discussion on the projects: Ferragina's office.
magistraleinformatica/lad/lad14/start.1424358680.txt.gz · Ultima modifica: 19/02/2015 alle 15:11 (9 anni fa) da Paolo Ferragina