Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir20:start

Information Retrieval - Academic Year 2020/2021

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 289AA
  • CFU: 6 (first semester)
  • Language: English
  • Question time: Monday 15-17, or appointment (given COVID-19 situation this will occur via video-conference in the virtual room of the course)
  • Official Lecture's Log: Here it is the registro.
  • News about this course will be distributed via a Telegram channel


Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

  • study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;
  • dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;
  • describe few other IR tools which are used either as a component of a search engine or as independent tools and build up the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.


Schedule of the Lectures

Week Schedule
Day Time Slot Room
Monday 11:00 - 13:00 Virtual Room of the course
Tuesday 9:00 - 11:00 Virtual Room of the course (as above)


Exams

The exam will consist of a written pre-test plus an oral discussion.

News: We will have one midterm exam in November and one in December, this will be done virtually via MS forms with a set of questions and a short time to answer. At the end of the midterm, you'll have also to upload the papers in which you sketched the calculations/simulations that allowed you to get your solutions/answers. The two midterms will be combined with an oral exam to be given in December/January.

Date Room Text Notes
16.11.2020, start at 11:00 virtual MidTerm exam: text, results, solution (add to ex 3 the dyn prog) The exam will consists of questions to be solved in 45 mins, students have to submit their draft paper with the explanation for the provided answers. There will be a final term exam in December and the votes of the two exams will be averaged. In January, the students who got an average >= 18 will need to pass an oral exam, which can increase or decrease the final vote (by few points).
15.12.2020, start at 9:00 virtual Final Term exam: text, results, solution The exam will consist of questions to be solved in 45 mins, students have to submit their draft paper with the explanation for the provided answers. In January, the students who got an average >= 18 will need to pass an oral exam, which can increase or decrease the final vote (by few points).
The date of the oral exam is 8 and 11 January 2021, students have to register themselves at the following agenda (order will be defined at the beginning of the time slot). Students can attend the oral in any next exam date of the Winter exam session..
15.01.2021, start at 11:00 virtual (details on esami.unipi.it) text, results, solution Students that have to perform the written exam have to connect to the virtual room of the course at the date and hour specified here on the side.
Students that have passed previous written exams or the midterms and wish to take the oral exam have to register at agende.
All communications will occur also via Telegram channel of the course (see above).
05.02.2021, start at 11:00 virtual (details on esami.unipi.it) text, results, solution Students that have to perform the written exam have to connect to the virtual room of the course at the date and hour specified here on the side.
Students that have passed previous written exams or the midterms have to book the agenda for their participation to the oral exam. Moreover, these students have also “register” on the ESAMI portal writing in the field “notes” that they have to give only the oral exam: “only oral”.
All communications will occur also via Telegram channel of the course (see above).
09.06.2021, start at 14:00 virtual (details on esami.unipi.it) text, results, solution The written exam will occur at room O, in Polo Fibonacci. Students that cannot physically attend, are allowed to take the written exam via Teams in the virtual room of the course (as occurred in the previous months) during the same time slot as the physical exam.
Oral exams may also occur virtually or physically (in room O too).
Please specify in the “notes” when you subscribe to the exam whether you'll attend physically or virtually, distinguishing both the cases of oral and of written exam.
Students that have to perform just the oral exam have ALSO to “register” at the written exam-date on the ESAMI portal by writing in the field “notes”: “only oral” and specify whether they'll attend it physically or virtually. Then they have to book their oral also on the agenda.
All communications will occur also via Telegram channel of the course (see above).
09.07.2021, start at 09:00 virtual (details on esami.unipi.it) text, results, solution see above.
08.09.2021, start at 14:00 presence+virtual (details on esami.unipi.it) text, solution see above.

Materials for study

  • [MRS] C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • Some copies of papers or notes (linked below).


Lectures

Date Argument Refs
14.09.2020 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. Slides.
Chapter 1 of [MRS]
15.09.2020 Skip pointers, Zone indexes, Web search engine: its structure, difficulties in their design and their epochs. The Web graph: some useful structural properties (such as Bow Tie). Slides.
Sections 19.1, 19.2, 19.4 of [MRS].
21.09.2020 Lecture canceled because of elections
22.09.2020 Crawling: problems and algorithmic structure. An example: Mercator. The bloom filter: definition, time/space complexity and error bound. Slide.
Sections 20.1, 20.2 of [MRS].
For doubts on Bloom Filter see paper.
28.09.2020 Spectral Bloom Filter. Some other uses of Bloom Filter: Finding the k-topmost frequent items, or speeding up virus detectors, or speeding up DBs. Consistent Hashing. Exercises on Bloom Filter and Consistent hashing. Slides of the previous lecture.
Sect 19.1 and 19.2 of [MRS] and this page and note for consistent hashing.
29.09.2020 Web graph compression: properties of the web, power laws, and compression the adjacency lists. Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Sect 20.3 and 20.4 of [MRS]. Slides and notes on LSH
05/10/20 Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing (with prob property), LSH on integer vectors. Slides.
Sect 19.6 of [MRS]
06/10/20 Cosine-similarity among vectors of real-features. Exercises on LSH and shingling.
12/10/20 The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Slides.
Chapter 4 of [MRS]
13/10/20 Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. Compressed storage of documents: LZ-based compression. Slides.
Chapter 4 of [MRS].
19.10.2020 Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync). Reading a paper.
20.10.2020 Exercises
26.10.2020 Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Keyword extraction: statistical, chi-square test, Rake tool. Slides.
Sect. 2.1, 2.2 and 5.1 of [MRS].
27.10.2020 Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). One-error match. Slide.
02.11.2020 Overlap measure with k-gram index. Edit distance with k-gram index. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. Chap 3 of [MRS].
Topics for the first midterm exam stop here.
03.11.2020 Query processing: soft-AND, skip pointers, caching, phrase queries. Tiered index. Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta. Sect. 2.3 and 2.4 and 5.3 of [MRS] and Ferragina's notes (only the coders presented in class). Slides 1 and slides 2.
09.11.2020 Exercises for midterm exam.
10.11.2020 Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Sect 6.2 and 6.3 and 7 from [MRS].
Slides
16.11.2020 Midterm exam, fully online
17.11.2020 No lecture
23.11.2020 Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Exact Top-K: WAND and blocked-WAND. Chap 8 and 9 of [MRS]
24.11.2020 Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness.
30.11.2020 Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank. Chap 21 of [MRS]. Slides
01.12.2020 HITS. Projections to smaller spaces: Latent Semantic Indexing (LSI). Chap 18 from [MRS]. Slides.
07.12.2020 Exercises
11.12.2020 Friday at 18:00 :: Exercises
15.12.2020 Final term exam, fully online
magistraleinformatica/ir/ir20/start.txt · Ultima modifica: 18/08/2022 alle 08:16 (21 mesi fa) da Paolo Ferragina