magistraleinformatica:ir:ir22:start

**Teacher**: Paolo Ferragina**Course ID**: 289AA**CFU:**6 (first semester)**Language:**English**Question time:**Monday 15-17 by appointment. Meeting will occur via video-conference in the virtual room of the course.**Official Lecture's Log:**Here it is the registro.- News about this course will be distributed via a Telegram channel

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

- study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;

- dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;

- describe few other IR tools which are used either as a component of a search engine or as independent tools and build up the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.

Week Schedule | ||
---|---|---|

Day | Time Slot | Room |

Monday | 11:00 - 13:00 | Room C (Polo Fibonacci) |

Tuesday | 9:00 - 11:00 | Room C (Polo Fibonacci) |

The exam will consist of a **written test which will include two parts: exercises and “oral” questions**. To pass it you should report a sufficient score to both of them.

Date | Room | Text | Notes |
---|---|---|---|

00/00/21, start at 09:00 | room C | text, results, solution | We will see whether we will schedule a midterm exam |

**[MRS]**C.D. Manning, P. Raghavan, H. Schutze.*Introduction to Information Retrieval*. Cambridge University Press, 2008. [ link ]- Some copies of papers or notes (linked below).

Video-lectures of last year are available at the link and they are linked just for reference, if you wish to re-check something you listened in class. This year, lectures are in presence and the program of the course could be different.

Date | Argument | Refs |
---|---|---|

19.09.2022 | Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. | Slides. Chapter 1 of [MRS] |

20.09.2022 | Skip pointers, Zone indexes, Web search engine: its structure, difficulties in their design and their epochs. The Web graph: some useful structural properties (such as Bow Tie). | Slides. Sections 19.1, 19.2, 19.4 of [MRS]. |

27.09.2022 | Crawling: problems and algorithmic structure. An example: Mercator. The bloom filter: definition, time/space complexity and error bound. Spectral Bloom Filter. | Slide. Sections 20.1, 20.2 of [MRS]. For doubts on Bloom Filters see paper. |

28.09.2022 | Consistent Hashing. Web graph compression: properties of the web, power laws, and compressing the adjacency lists. | Sect 19.1 and 19.2 of [MRS] and this page and note for consistent hashing. Sect 20.3 and 20.4 of [MRS]. |

Date | Argument | Refs |
---|---|---|

00.00.2022 | Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Use in an off-line and in an on-line setting. Comparison between LSH and K-means for the clustering problem. | Slides. |

00.00.2022 | Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing (with prob property), LSH on integer vectors. Cosine-similarity among vectors of real-features. | Slides. Sect 19.6 of [MRS] |

00.00.2022 | Exercises on LSH and shingling. | |

00.00.2022 | The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. | Slides. Chapter 4 of [MRS]. |

00.00.2022 | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Dictionary compression: Front coding. | Slides. Sect. 2.1, 2.2, 5.1 and 5.2 of [MRS]. |

00.00.2022 | Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). One-error match. Overlap measure with k-gram index. Edit distance with k-gram index. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. | Slides. |

00.00.2022 | Keyword extraction: statistical, chi-square test, Rake tool. Query processing: soft-AND, skip pointers, caching, phrase queries. Tiered index. | Slide. Sect. 2.3 and 2.4 of [MRS]. |

00.00.2022 | Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync). | Suggest reading a paper. Slides. |

00.00.2022 | Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta, Elias-Fano indexing. Text-based ranking: dice, jaccard, tf-idf. Vector space model and cosine similarity doc-doc and query-doc. | Sect. 5.3 of [MRS] and Ferragina's notes (only the coders presented in class). Slides. |

00.00.2022 | Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. | Sect 6.2 and 6.3 and 7 from [MRS]. Slides |

00.00.2022 | Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. | Sect 8.1-8.3 and 9 [MRS]. |

00.00.2022 | Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank. Application to Text Summarization. | Chap 21 of [MRS]. Slides. |

00.00.2022 | HITS. Projections to smaller spaces: Latent Semantic Indexing (LSI). Sketch of the ideas underlying Entity Linkers and Knowledge Graphs. | Chap 18 from [MRS]. Slides. |

00.00.2022 | Elastic Search, with lab: Students are required to bring their own laptop in class, with already installed Docker and then the image of ElasticSearch via docker pull (i.e. first step of “Pulling the image”). | |

00.00.2022 | Exercises | |

00.00.2022 | Exercises | |

00.00.2022 | FinalTerm exam. Topics will be the ones that we have dealt with after the MidTerm exam. Students have to register at the following form by December 7th, 2021. |

magistraleinformatica/ir/ir22/start.txt · Ultima modifica: 27/09/2022 alle 08:20 (5 giorni fa) da Paolo Ferragina