====== Information Retrieval - Academic Year 2014/2015 ====== ===== General Information ===== * ** Teacher **: [[http://www.di.unipi.it/~ferragin/|Paolo Ferragina]] * **Course ID**: 289AA * **CFU:** 6 (first semester) * **Language:** English * **Lectures Schedule:** tuesday 14-16 (N1), thursday 14-16 (C) * **Question time:** by appointment. * **Official Lecture's Log:** Here it is the [[http://unimap.unipi.it/registri/registri.php?ri=9142&tmplt=principale.tpl&aa=2015|registro]]. * News about this course will be distributed via a [[http://twitter.com/FerraginaTeach | Tweeter-channel]] ===== Goals ===== Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will: * describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer; * present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph; * introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching. ===== Exam ===== The exam will consist of a written test, plus an oral discussion on the exercises. ^ Date ^ Room ^ Text ^ | 16-01-2015 | C1 | {{:magistraleinformatica:ir:ir14:ir150116.docx|text}} | | 09-02-2015 | A1 | {{:magistraleinformatica:ir:ir14:ir150209.docx|text}} | | 05-06-2015 | L1 | {{:magistraleinformatica:ir:ir14:ir150605.docx|text}} | | 29-06-2015 | L1 | {{:magistraleinformatica:ir:ir14:ir150629.docx|text}} | | 20-07-2015 | L1 | {{:magistraleinformatica:ir:ir14:ir150720.docx|text}} | | 10-09-2015 | L1 | {{:magistraleinformatica:ir:ir14:ir150910.docx|text}} | ===== Books ===== * **[MRS]** C.D. Manning, P. Raghavan, H. Schutze. //Introduction to Information Retrieval//. Cambridge University Press, 2008. [ [[http://nlp.stanford.edu/IR-book/information-retrieval-book.html| link]] ] * **[MG]** {{:magistraleinformatica:ir:ir12:reading-mgcompress.pdf|Chapter 2}} "Text compression" of //Managing Gigabytes//, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, 1999. ===== Content of the Lectures ===== ^ Date ^ Argument ^ Refs ^ | 23-09-2014 | Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. | [[https://dl.dropboxusercontent.com/u/7999075/00-Ratti-video.ppt|Slides]] and Chapt 1 of [MRS] | | 25-09-2014 | The issue of hierarchical memories: I/O-model. Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. | Sect. 2.1, 2.2 and 5.1 of [MRS]. {{:magistraleinformatica:ir:ir13:01-intro.ppt|Slides of this week}}. | | | 30-09-2014 | Index construction: multi-way mergesort, BSBI and SPIMI. Distributed and dynamic indexing. | Chapter 4 of [MRS]. {{:magistraleinformatica:ir:ir13:02-construction.ppt|Slides.}} | | 02-10-2014 | Query processing: skip pointers (with solution based on dynamic programming), caching, phrase queries. | {{:magistraleinformatica:ir:ir14:03-query-part_a.ppt|Slides}} | | 07-10-2014 | Wild-card queries (permuterm, k-gram). Edit distance: kgram index, dynamic programming. Case of 1-error. Zone indexes. | Chapter 4 and Sect. 3.2 and 3.3 of [MRS]. {{:magistraleinformatica:ir:ir14:03-query-part_b.zip|Slides}} | | 14-10-2014 | Exact search: hashing with chaining, univeral hashing, cuckoo hashing. Prefix search: compacted trie, front coding, 2-level indexing. Exact searching with errors: Bloom filter. | {{:magistraleinformatica:ir:ir12:reading-cuckoo.pdf|Paper}} on cuckoo, just description (no proof). Paper on {{:magistraleinformatica:ir:ir12:reading-bloomfilter.pdf|bloom filter}} (just things looked in class). {{:magistraleinformatica:ir:ir14:04a-dictsearch.ppt|Slides}}.| | 16-10-2014 | Auto-completion search: definition, trie, top-k, cartesian tree, RMQ. | {{:magistraleinformatica:ir:ir14:4b-dictsearch_-_autocompletion_-_part_1.pdf|Slides}}. You should read [[http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2012/chap8.pdf|Sect 8.4.1]] and Sect 2 and 3 of this [[http://www.cs.sunysb.edu/~bender/newpub/BenderFa00-lca.pdf|paper]]. | | 21-10-2014 | Notion of Entropy and prefix-free codes. Huffman and Arithmetic coding, with comparisons with Entropy and comment on their performance. | Sect. 6.1 of [MRS]. Read pag. 21-36, 52-56 of [MG]. {{:magistraleinformatica:ir:ir14:05a-compression_text.ppt|Slides}}. | | 28-10-2014 | Arithmetic coding: bound on the code length, relation with entropy. LZ77, LZSS, gzip and snappy. Posting list compression, codes: gamma, delta, variable bytes. | Sez 5.3 of [MRS] and pag. 74-79 of [MG]. | | 30-10-2014 | PForDelta. Rank and Select primitives: definition and their use. Elias-Fano code and its use for postings compression. | {{:magistraleinformatica:ir:ir14:05b-compression_integers_vers1.ppt|Slides}}.| | 11-11-2014 | More on Rank and Select on binary arrays. Rank and Select on general arrays: the Wavelet Tree. Binary tree encoding and navigation. | {{:magistraleinformatica:ir:ir14:05c-succinct-tree_and_wavelet.ppt|Slides}} | | 13-11-2014 | Suffix arrays: data structure and search operations. Text mining over suffix arrays. Move-to-Front and Run-length-encoding and Burrows-Wheeler Transform: bzip, how to construct, how to decompress entirely or just a substring. | {{:magistraleinformatica:ir:ir14:05d-bwt_e_fmi.pptx|Slides}} | | 18-11-2014 | Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion.| Sect 6.2, 6.3 from [MRS]. Chap 7 and 9 from [MRS]. {{:magistraleinformatica:ir:ir14:06-ranking.ppt|Slides}} | | 25-11-2014 | Link-based ranking: pagerank and HITS and weighted variants. | Chap 21 from [MRS]. {{:magistraleinformatica:ir:ir13:08b-web-ranking.ppt|Slides}}. | | 27-11-2014 | Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE. | Chap. 8, 19.1 and 19.2, 19.5 from [MRS]. {{:magistraleinformatica:ir:ir14:07-theweba.ppt|Slides}} | | 28-11-2014 | The storage of the web graph. Latent Semantic Indexing. | Chap 18 and 20.4 from [MRS]. {{:magistraleinformatica:ir:ir14:09-lsi.ppt|Slides}}. | | 02-12-2014 | Random projections. Crawling and consistent hashing. | Chap 20.1, 20.2 and 20.3. {{:magistraleinformatica:ir:ir14:08a-web-crawling.ppt|Slides}}. | | 03-12-2014 | Recommendation systems and Web advertising. | {{:magistraleinformatica:ir:ir14:11-applications.ppt|Slides}}. | | 04-12-2014 | Semantic-annotation tools: basics and TAGME. | | | 09-12-2014 | Semantic-annotation tools: advanced and some applications. | {{:magistraleinformatica:ir:ir14:12-topic_annotators.pptx|Slides}} | | 10-12-2014 | Extra lecture: 9-11, M1: Clustering: flat, hierarchical, soft, hard. K-means, optimal bisect, hierarchical - max, min, avg, centroid. | Chap 16 and 17 of [MRS], {{:magistraleinformatica:ir:ir14:10-clustering.ppt|slides}}. | | 11-12-2014 | Exercises | | | 12-12-2014 | Q&A time on theory and exercises | | | 16-12-2014 | Exercises | | | 18-12-2014 | Q&A time on theory and exercises (meeting requested by students) | |