Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir11:start

Questa è una vecchia versione del documento!


Information Retrieval - Academic Year 2011/2012

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 289AA
  • CFU: 6 (first semester)
  • Language: English
  • Lectures Schedule: Wednesday 9-11 and 16-18, both in room N1
  • Question time: by appointment.
  • Official Lecture's Log: The schedule and content of the lectures is available with the official registro.
  • News about this course will be distributed via a Tweeter-channel

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will:

  • describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer;
  • present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph;
  • introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching.

Exam

Project assigned during the course, plus an oral discussion concerning with the project and the course.

Date Text of the exam
07/12/2011, hr 9, room N1 pre-exam

Books, notes, ...

  • C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • Chapter 2 on “Text compression” of Managing Gigabytes, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999.

Content of the Lectures

Date Argument Refs
21/09/2011, morning Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model. Chapter 1 of [MRS]
21/09/2011, afternoon Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Sect. 2.1, 2.2 and 5.1 of [MRS]. Slides of this day
28/09/2011, morning Index construction: multi-way mergesort, BSBI and SPIMI, distributed indexing, dynamic indexing. Chapter 4 of [MRS]
28/09/2011, afternoon Query processing: skip pointers, caching, phrase queries, wild-cards (permuterm, k-gram, dynamic programming) Sect. 2.3, 2.4, 3.2, 3.3 of [MRS]. Slides of this day
05/10/2011, morning Zone indexes. Posting list compression, codes: gamma, delta, variable bytes, PForDelta. Document compression: Entropy, Prefix-free codes, Huffman. Chapter 6.1, 5.3 of [MRS], and this paper.
05/10/2011, afternoon Document compression: Arithmetic coding, Lempel-Ziv77, gzip. Exact string search: hashing with chaining, cuckoo hashing. Prefix string search: trie, front-compression, 2-level indexing. pag. 21-36, 52-56, 74-79 of [MG]. Sect 3.1 and 5.2 from [MRS]. Paper on cuckoo, just description (no proof). Slides of this day
12/10/2011, morning Bloom filter. Text-based ranking: dice/jaccard measures, tf-idf, vector space model, cosine similarity. Sect 6.2, 6.3 from [MRS]. Paper on bloom filter.
12/10/2011, afternoon Top-k retrieval: high idf, champion lists, many-terms, fancy hits, clustering. Relevance feedback (Rocchio), pseudo-relevance feedback, query expansion. Quality of a search engine: precision, recall, F1. A sketch of recommendation systems and XML search engines. Chap 7, 8, 9, 10 from [MRS]. Slides of this day
19/10/2011, afternoon Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE, the storage of the web-graph. Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. Slides of this day
26/10/2011, morning Crawling and consisten hashing. Link-based ranking: PageRank, HITS, and weighted variants. Chap 20.1, 20.2 and 21 from [MRS].
26/10/2011, afternoon Latent Semantic Indexing and Random projections. Exact and near-document duplication: shingling, fingerprinting and min-wise permutations. Chap 18 from [MRS]. Slides of this day
27/10/2011, afternoon Lucene in action, and the project on AIRPIM Slides on lucene and Slides on the Airpim's project
09/11/2011, morning The other project on Twitter. Slides on the Twitter's project, also in pdf.
09/11/2011, afternoon Lab at home: to collaborate in the project development with your colleagues please use the following facebook-group
16/11/2011, morning Discussion on the Twitter's project and the assignment of papers to InfoUma students.
16/11/2011, afternoon Lab at home
23/11/2011, morning Lab at home
23/11/2011, afternoon Lab at home
30/11/2011, morning Lab at home
30/11/2011, afternoon Discussion on the projects, current state
magistraleinformatica/ir/ir11/start.1322506540.txt.gz · Ultima modifica: 28/11/2011 alle 18:55 (13 anni fa) da Paolo Ferragina