====== Information Retrieval - Academic Year 2011/2012 ====== ===== General Information ===== * ** Teacher **: [[http://www.di.unipi.it/~ferragin/|Paolo Ferragina]] * **Course ID**: 289AA * **CFU:** 6 (first semester) * **Language:** English * **Lectures Schedule:** Wednesday 9-11 and 16-18, both in room N1 * **Question time:** by appointment. * **Official Lecture's Log:** The schedule and content of the lectures is available with the official [[http://unimap.unipi.it/registri/dettregistriNEW.php?re=61153::::&ri=4673|registro]]. * News about this course will be distributed via a [[http://twitter.com/FerraginaTeach | Tweeter-channel]] ===== Goals ===== Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will: * describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer; * present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph; * introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching. ===== Exam ===== Project assigned during the course, plus an oral discussion concerning with the project and the course. ^ Date ^ Text of the exam ^ Results ^ | 07/12/2011 | {{:magistraleinformatica:ir:ir11:ir111207.doc|pre-exam}} | | | 11/01/2012 | {{:magistraleinformatica:ir:ir11:ir120111.doc|text}}| | | 1/02/2012 | {{:magistraleinformatica:ir:ir11:ir120201.doc|text}} | | | 8/06/2012 | {{:magistraleinformatica:ir:ir11:ir120608.doc|text}} | | | 23/07/2012 | {{:magistraleinformatica:ir:ir11:ir120723.doc|text}} | | | 03/09/2012 | {{:magistraleinformatica:ir:ir11:ir120903.doc|text}} | | ===== Books, notes, ... ===== * C.D. Manning, P. Raghavan, H. Schutze. //Introduction to Information Retrieval//. Cambridge University Press, 2008. [ [[http://nlp.stanford.edu/IR-book/information-retrieval-book.html| link]] ] * Chapter 2 on "Text compression" of //Managing Gigabytes//, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999. ===== Content of the Lectures ===== ^ Date ^ Argument ^ Refs ^ | | | | | 21/09/2011, morning | Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model. | Chapter 1 of [MRS] | | 21/09/2011, afternoon | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. | Sect. 2.1, 2.2 and 5.1 of [MRS]. {{:magistraleinformatica:ir:ir11:lecture01.ppt|Slides}} of this day | | 28/09/2011, morning | Index construction: multi-way mergesort, BSBI and SPIMI, distributed indexing, dynamic indexing. | Chapter 4 of [MRS] | | 28/09/2011, afternoon | Query processing: skip pointers, caching, phrase queries, wild-cards (permuterm, k-gram, dynamic programming) | Sect. 2.3, 2.4, 3.2, 3.3 of [MRS]. {{:magistraleinformatica:ir:ir11:lecture02.ppt|Slides}} of this day | | 05/10/2011, morning | Zone indexes. Posting list compression, codes: gamma, delta, variable bytes, PForDelta. Document compression: Entropy, Prefix-free codes, Huffman. | Chapter 6.1, 5.3 of [MRS], and this {{:magistraleinformatica:ir:ir11:reading-integercodes.pdf|paper}}. | | 05/10/2011, afternoon | Document compression: Arithmetic coding, Lempel-Ziv77, gzip. Exact string search: hashing with chaining, cuckoo hashing. Prefix string search: trie, front-compression, 2-level indexing. | pag. 21-36, 52-56, 74-79 of [MG]. Sect 3.1 and 5.2 from [MRS]. Paper on {{:magistraleinformatica:ir:ir11:reading-cuckoo.pdf|cuckoo}}, just description (no proof). {{:magistraleinformatica:ir:ir11:lecture03.ppt|Slides}} of this day | | 12/10/2011, morning | Bloom filter. Text-based ranking: dice/jaccard measures, tf-idf, vector space model, cosine similarity. | Sect 6.2, 6.3 from [MRS]. Paper on {{:magistraleinformatica:ir:ir11:reading-bloomfilter.pdf|bloom filter}}.| | 12/10/2011, afternoon | Top-k retrieval: high idf, champion lists, many-terms, fancy hits, clustering. Relevance feedback (Rocchio), pseudo-relevance feedback, query expansion. Quality of a search engine: precision, recall, F1. A sketch of recommendation systems and XML search engines. | Chap 7, 8, 9, 10 from [MRS]. {{:magistraleinformatica:ir:ir11:lecture04.ppt|Slides}} of this day | | 19/10/2011, afternoon | Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE, the storage of the web-graph. | Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. {{:magistraleinformatica:ir:ir11:lecture05.ppt|Slides}} of this day | | 26/10/2011, morning | Crawling and consistent hashing. Link-based ranking: PageRank, HITS, and weighted variants. | Chap 20.1, 20.2 and 21 from [MRS]. | | 26/10/2011, afternoon | Latent Semantic Indexing and Random projections. Exact and near-document duplication: shingling, fingerprinting and min-wise permutations. | Chap 18 from [MRS]. {{:magistraleinformatica:ir:ir11:lecture06.ppt|Slides}} of this day | | 27/10/2011, afternoon | Lucene in action, and the project on AIRPIM | {{:magistraleinformatica:ir:ir11:lucene.pptx|Slides on lucene}} and {{:magistraleinformatica:ir:ir11:project1.pptx|Slides on the Airpim's project}} | | 09/11/2011, morning | The other project on Twitter. | {{:magistraleinformatica:ir:ir11:twitterprojectir11.pptx|Slides on the Twitter's project}}, also in {{:magistraleinformatica:ir:ir11:twitterprojectir11.pdf|pdf}}. | | 09/11/2011, afternoon | Lab at home: to collaborate in the project development with your colleagues please use the following [[https://www.facebook.com/#!/groups/210549442348679/ | facebook-group]] | | | 16/11/2011, morning | Discussion on the Twitter's project and the assignment of papers to InfoUma students. | | | 16/11/2011, afternoon | Lab at home | | | 23/11/2011, morning | Lab at home | | | 23/11/2011, afternoon | Lab at home | | | 30/11/2011, morning | Lab at home | | | 30/11/2011, afternoon | Discussion on the projects, current state | | | 07/12/2011, morning | Exercises on the theory part | | | 07/12/2011, afternoon | Lab at home | |