Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir22:start

Information Retrieval - Academic Year 2022/2023

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 289AA
  • CFU: 6 (first semester)
  • Language: English
  • Question time: Monday 15-17 by appointment. Meeting will occur via video-conference in the virtual room of the course.
  • Official Lecture's Log: Here it is the registro.
  • News about this course will be distributed via a Telegram channel


Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

  • study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;
  • dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;
  • describe few other IR tools which are used either as a component of a search engine or as independent tools and build up the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.


Schedule of the Lectures

Week Schedule
Day Time Slot Room
Monday 11:00 - 13:00 Room C (Polo Fibonacci)
Tuesday 9:00 - 11:00 Room C (Polo Fibonacci)

Exams

The exam will consist of a written test which will include two parts: exercises and “oral” questions. To pass it you should report a sufficient score to both of them.

Date Room Text Notes
00/00/21, start at 09:00 room C text, results, solution We will see whether we will schedule a midterm exam

Materials for study

  • [MRS] C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • Some copies of papers or notes (linked below).


Lectures

Video-lectures of last year are available at the link and they are linked just for reference, if you wish to re-check something you listened in class. This year, lectures are in presence and the program of the course could be different.

Date Argument Refs
19.09.2022 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. Slides.
Chapter 1 of [MRS]
20.09.2022 Skip pointers, Zone indexes, Web search engine: its structure, difficulties in their design and their epochs. The Web graph: some useful structural properties (such as Bow Tie). Slides.
Sections 19.1, 19.2, 19.4 of [MRS].
27.09.2022 Crawling: problems and algorithmic structure. An example: Mercator. The bloom filter: definition, time/space complexity and error bound. Spectral Bloom Filter. Slide.
Sections 20.1, 20.2 of [MRS].
For doubts on Bloom Filters see paper.
28.09.2022 Consistent Hashing. Web graph compression: properties of the web, power laws, and compressing the adjacency lists. Sect 19.1 and 19.2 of [MRS] and this page and note for consistent hashing.
Sect 20.3 and 20.4 of [MRS].

Lectures (of the last year)

Date Argument Refs
00.00.2022 Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Use in an off-line and in an on-line setting. Comparison between LSH and K-means for the clustering problem. Slides.
00.00.2022 Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing (with prob property), LSH on integer vectors. Cosine-similarity among vectors of real-features. Slides.
Sect 19.6 of [MRS]
00.00.2022 Exercises on LSH and shingling.
00.00.2022 The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. Slides.
Chapter 4 of [MRS].
00.00.2022 Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Dictionary compression: Front coding. Slides.
Sect. 2.1, 2.2, 5.1 and 5.2 of [MRS].
00.00.2022 Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). One-error match. Overlap measure with k-gram index. Edit distance with k-gram index. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. Slides.
00.00.2022 Keyword extraction: statistical, chi-square test, Rake tool. Query processing: soft-AND, skip pointers, caching, phrase queries. Tiered index. Slide.
Sect. 2.3 and 2.4 of [MRS].
00.00.2022 Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync). Suggest reading a paper.
Slides.
00.00.2022 Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta, Elias-Fano indexing. Text-based ranking: dice, jaccard, tf-idf. Vector space model and cosine similarity doc-doc and query-doc. Sect. 5.3 of [MRS] and Ferragina's notes (only the coders presented in class).
Slides.
00.00.2022 Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Sect 6.2 and 6.3 and 7 from [MRS].
Slides
00.00.2022 Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. Sect 8.1-8.3 and 9 [MRS].
00.00.2022 Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank. Application to Text Summarization. Chap 21 of [MRS]. Slides.
00.00.2022 HITS. Projections to smaller spaces: Latent Semantic Indexing (LSI). Sketch of the ideas underlying Entity Linkers and Knowledge Graphs. Chap 18 from [MRS].
Slides.
00.00.2022 Elastic Search, with lab: Students are required to bring their own laptop in class, with already installed Docker and then the image of ElasticSearch via docker pull (i.e. first step of “Pulling the image”).
00.00.2022 Exercises
00.00.2022 Exercises
00.00.2022 FinalTerm exam. Topics will be the ones that we have dealt with after the MidTerm exam. Students have to register at the following form by December 7th, 2021.
magistraleinformatica/ir/ir22/start.txt · Ultima modifica: 27/09/2022 alle 08:20 (5 giorni fa) da Paolo Ferragina