Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir21:start

Information Retrieval - Academic Year 2020/2021

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 289AA
  • CFU: 6 (first semester)
  • Language: English
  • Question time: Monday 15-17, or appointment (given COVID-19 situation this will occur via video-conference in the virtual room of the course)
  • Official Lecture's Log: Here it is the registro.
  • News about this course will be distributed via a Telegram channel


Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

  • study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;
  • dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;
  • describe few other IR tools which are used either as a component of a search engine or as independent tools and build up the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.


Schedule of the Lectures

Week Schedule
Day Time Slot Room
Monday 11:00 - 13:00 Room A1 (Polo Fibonacci), and the virtual room of the course
Tuesday 9:00 - 11:00 Room A1 (Polo Fibonacci), and the virtual room of the course

Exams

The exam will consist of a written test plus an oral discussion.

As last year, I'm planning to have one midterm exam in November and one in December. The two midterms (if >= 18 as vote) will be combined with an oral exam to be given in January, which can increase or decrease the final vote (by few points, given that they will somewhat “average” with the vote reported in the written exam).

Date Room Text Notes
09/11/21, start at 09:00 room A1 and virtually text, results, solution The midterm exam will consist of a set of exercises, and will last for 45mins. The part of the program for the exercises will be detailed in the list of lectures below.
14/12/2021, start at 09:00 room C and virtual text, results, solution The FinalTerm exam will have the same structure as the other one, but can participate only the students who got >=18 rank in the first MidTerm. Students have to register at the following form by December 7th, 2021.
17/01/2022, start at 09:00 room C1 text, results, solution
07/02/2022, start at 09:00 room C1 text, results, solution

Materials for study

  • [MRS] C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • Some copies of papers or notes (linked below).


Lectures

Students that are not able to attend the lectures, can refer to the video-lectures of the last academic year of the last academic year (click on “Videos”, and then sort them by name). For the video-lectures of this year, please look at the following agenda.

Date Argument Refs
13.09.2021 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. Slides and Video
Chapter 1 of [MRS]
14.09.2021 Skip pointers, Zone indexes, Web search engine: its structure, difficulties in their design and their epochs. The Web graph: some useful structural properties (such as Bow Tie). Slides and video.
Sections 19.1, 19.2, 19.4 of [MRS].
20.09.2021 Crawling: problems and algorithmic structure. An example: Mercator. The bloom filter: definition, time/space complexity and error bound. Slide and video.
Sections 20.1, 20.2 of [MRS].
For doubts on Bloom Filter see paper.
21.09.2021 Spectral Bloom Filter. video
27.09.2021 Consistent Hashing. Web graph compression: properties of the web, power laws, and compressing the adjacency lists. Sect 19.1 and 19.2 of [MRS] and this page and note for consistent hashing.
Sect 20.3 and 20.4 of [MRS].
video
28.09.2021 Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Use in an off-line and in an on-line setting. Comparison between LSH and K-means for the clustering problem. Slides and video
04.10.2021 Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing (with prob property), LSH on integer vectors. Cosine-similarity among vectors of real-features. Slides and video.
Sect 19.6 of [MRS]
05.10.2021 Exercises on LSH and shingling. Video.
Video of the last year (part 1 and part 2).
11.10.2021 The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. Slides.
Chapter 4 of [MRS].
Video part 1 and part 2
12.10.2021 Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Dictionary compression: Front coding. Slides one and two.
Sect. 2.1, 2.2, 5.1 and 5.2 of [MRS].
Video part 1 and part 2
19.10.2021 Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). One-error match. Overlap measure with k-gram index. Edit distance with k-gram index. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. Slides and video (part 1 and part 2).
Chap 3 of [MRS].
25.10.2021 Keyword extraction: statistical, chi-square test, Rake tool. Query processing: soft-AND, skip pointers, caching, phrase queries. Tiered index. Slide 1 and Slide 2.
Sect. 2.3 and 2.4 of [MRS]
Video.
26.10.2021 Exercises Video
02.11.2021 Exercises Video
08.11.2021 Exercises Video
09.11.2021 MidTerm Exam (see above for details, and topics are the ones up to this point in the Agenda).
15.11.2021 Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync). Suggest reading a paper.
Slides, Video.
16.11.2021 Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta, Elias-Fano indexing. Text-based ranking: dice, jaccard, tf-idf. Vector space model and cosine similarity doc-doc and query-doc. Sect. 5.3 of [MRS] and Ferragina's notes (only the coders presented in class).
Slides and Video.
22.11.2021 Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Sect 6.2 and 6.3 and 7 from [MRS].
Slides
23.11.2021 Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. Sect 8.1-8.3 and 9 [MRS].
Video
29.11.2021 Elastic Search, with lab: Students are required to bring their own laptop in class, with already installed Docker and then the image of ElasticSearch via docker pull (i.e. first step of “Pulling the image”).
30.11.2021 Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank. Chap 21 of [MRS]. Slides
06.12.2021 HITS. Projections to smaller spaces: Latent Semantic Indexing (LSI). Chap 18 from [MRS].
Slides.
07.12.2021 Exercises
13.12.2021 Exercises
14.12.2021 FinalTerm exam. Topics will be the ones that we have dealt with after the MidTerm exam. Students have to register at the following form by December 7th, 2021.
magistraleinformatica/ir/ir21/start.txt · Ultima modifica: 26/11/2021 alle 13:35 (36 ore fa) da Paolo Ferragina