Questa è una vecchia versione del documento!
Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:
Week Schedule | ||
---|---|---|
Day | Time Slot | Room |
Monday | 11:00 - 13:00 | L |
Tuesday | 9:00 - 11:00 | L1 |
The exam will consist of a written test plus an oral discussion on the exercises.
Date | Room | Text | Notes |
---|---|---|---|
29 oct 2018, 16:00-18:00 | C1, L1 | First Midterm exam: text, results, solution | Students that got a rank >= 16 can participate to the second midterm exam. Friday 9th November, hr 16:00, room L1, correction of the written exam. |
19 dec 2018, 16:00-18:00 | A1, L1 | Second Midterm exam: text, results, solution | |
18 jan 2018, 09:00-11:00 | A1 | text, results, solution | |
15 feb 2018, 09:00-11:00 | A1 | text, results, solution |
Date | Argument | Refs |
---|---|---|
17.09.2018 | Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. | Slides. Chapter 1 of [MRS] |
18.09.2018 | Web search engine: its structure, difficulties in their design and their epochs. The Web graph: some useful structural properties (such as Bow Tie). Crawling: problems and algorithmic structure. An example: Mercator. | Slides. Sections 19.1, 19.2, 19.4, 20.1, 20.2 of [MRS]. |
24.09.2018 | Few useful algorithmic techniques for crawling the Web (and not only that!): Bloom Filter and Spectral Bloom Filter. | Slides. For doubts on Bloom Filter see paper. |
25.09.2018 | Other useful algorithmic techniques for crawling the Web (and not only that!): Consistent Hashing. Web graph properties and its compressed storage. | Slides of the previous lecture. Sect 19.1 and 19.2, 20.3 and 20.4 of [MRS]. |
01.10.2018 | Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. | Slides |
02.10.2018 | Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing (with prob property), LSH on integer vectors, cosine-similarity among vectors of real-features. | slides. Sect 19.6 of [MRS] |
08/10/2018 | The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. | slides. Chapter 4 of [MRS]. |
09/10/2018 | Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync), and Set Reconciliation. | Slides. Reading a paper. |
15/10/2018 | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. | Slides. Sect. 2.1, 2.2 and 5.1 of [MRS]. |
16/10/2018 | Keyword extraction: statistical, chi-square test, Rake tool. Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. | |
22/10/2018 | Exercises. | |
23/10/2018 | Exercises. | |
Midterm exam | ||
05/11/2018 | Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). Overlap measure with k-gram index. Edit distance with k-gram index. One-error match. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. | Slides (new version). Chap 3 of [MRS]. |
06/11/2018 | Query processing: soft-AND, skip pointers, caching, phrase queries. Tiered index. Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta and Elias-Fano. | Slide query, Slide integer compressors. Sect. 2.3 and 2.4 and 5.3 of [MRS] and Ferragina's notes (only the coders presented in class). |
09/11/2018 | Correction of the midterm exam | |
12/11/2018 | Rank and Select data structures, two approaches: the case of B untouched and extra o(B) bits, and the case of Elias-Fano's approach with B compressed. | Slides. |
13/11/2018 | Succinct representation of binary trees and its navigational operations: heap like notation and LOUDS. | Slides. |
19/11/2018 | Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. | Slides. Sect 6.2 and 6.3 and 7 from [MRS]. |
20/11/2018 | Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. | Chap 8 and 9 of [MRS] |
26/11/2018 | Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank, CoSim rank. HITS. | Slides. Chap 21 of [MRS] |
27/11/2018 | Exercises | |
3/12/2018 | Projections to smaller spaces: Latent Semantic Indexing (LSI). Random Projections: Johnson-Linderstauss Lemma and its applications. | Slides. Chap 18 from [MRS]. |
4/12/2018 | Semantic-annotation tools: basics, Wikipedia structure, TAGME and other annotators. Various approaches to text representation and their applications. Exercises | Slides |
10/12/2018 | A glimpse on Sport Analytics, and discussion of possible topics for a master thesis. | Slides and Thesis ideas. |
11/12/2018 | Exercises. |