ICT for BI & CRM - Part III: Data Mining 2012
- Dino Pedreschi Università di Pisa, Knowledge Discovery and Data Mining Lab pedre [at] di [dot] unipi [dot] it
News
- Exercises 1 and 2 are online. The deadline for both assignments is December 13, 2011. Send both reports in .pdf format by email to pedre [at] di [dot] unipi [dot] it with the tag [DM-MAINS] in the subject line.
Goals
Organizations and businesses are overwhelmed by the flood of data continuously collected into their data warehouses and arriving from external sources, the Web above all. Traditional exploratory techniques may fail to make sense of the data because of its inherent complexity and size. Data mining and knowledge discovery techniques emerged as an alternative approach, aimed at revealing patterns, rules and models hidden in the data, and at supporting analysts in developing descriptive and predictive models for a number of business problems, notably in the CRM domain.
Syllabus
- Basic concepts of data mining and the knowledge discovery process.
- Data and data sources.
- Exploratory data analysis.
- Fundamental data mining tasks and methods: clustering, classification and prediction, patterns and association rules.
- Hints on descriptive and predictive analytics for CRM tasks: customer segmentation, churn analysis, promo redemption, product recommendation, market basket analysis.
- Discussion of industrial data mining projects for CRM in retail, both traditional and online.
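Market basket analysis, listed in the syllabus above, is usually framed in terms of two rule measures: support (how often an itemset occurs) and confidence (how often the consequent occurs when the antecedent does). The following is a minimal Python sketch under stated assumptions: the baskets are made-up toy data, and the helper names `count`, `support` and `confidence` are hypothetical, not taken from the course material or from Weka.

```python
# Made-up toy transactions for illustration only (not course data).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions containing every item of itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


# Measures for the candidate rule {diapers} -> {beer}.
rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is normally kept only if both measures exceed user-chosen thresholds; the choice of thresholds is problem-dependent and is not fixed here.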
Textbooks
- Slides (see Calendar).
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- Gordon S. Linoff and Michael J. Berry. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Wiley, 2011.
Reading about the "data analyst" job
Calendar
No. | Date | Topic | Learning material |
---|---|---|---|
1. | 22.11.2011 - 11:00-13:00 and 16:00-18:00 | Introduction to Data Mining and the Knowledge Discovery Process | slides - Textbook: chapt. 1 |
2. | 23.11.2011 - 09:00-11:00 | Data understanding. Introduction to Weka | slides - Textbook: chapt. 2 (2.1, 2.2) and chapt. 3 (3.1, 3.2, 3.3) |
3. | 28.11.2011 - 11:00-13:00 and 14:00-16:00 | Clustering Analysis | slides - Textbook: chapt. 8 (8.1, 8.2, 8.5) |
4. | 29.11.2011 - 11:00-13:00 and 16:00-18:00 | Classification and predictive analysis | slides - Textbook: chapt. 4 (4.1, 4.2, 4.3, 4.4, 4.5) |
5. | 30.11.2011 - 16:00-18:00 | Pattern discovery and association rule mining | slides - Textbook: chapt. 6 (6.1, 6.2) |
6. | 05.12.2011 - 09:00-13:00 | CRM applications. Big data and social network analysis. Data mining and privacy | |
Exercises
- Clustering: Russian Companies dataset. Download the zipped .arff dataset at russiancompanies.zip, describing 1438 Russian companies. The following properties of each company are provided for the years 1996 and 1997: number of employees (emp), total amount of wages (wage), total revenues (output), log-transformed versions of these variables (respectively, ln = ln(emp), lw = ln(wage/emp), ly = ln(output)), the production sector (sector: 1 = industry, 2 = construction, 3 = trade), and the type of ownership (owntype: 1 = public, 2 = private, 3 = mixed). Provide a clustering analysis of the dataset with respect to a selected subset of variables, and explain the obtained clusters, also taking into account the nominal variables sector and owntype. Describe your findings in a short report (up to three pages of text, excluding figures, in either English or Italian) illustrating the key features of the dataset, how you conducted the clustering analysis, and the interpretation of the obtained clusters.
- Classification: Adult Census dataset. Download the zipped .arff dataset at adult.census.zip, describing demographic information about 32561 persons extracted from US census data. The available attributes are: age, workclass, education, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, and a binary class attribute income (> $50K, <= $50K). Provide a concise, accurate and readable decision tree for the classification problem of predicting the income class variable given (all or some of) the other variables. Describe your findings in a short report (up to three pages of text, excluding figures and charts, in either English or Italian) illustrating the key features of the dataset, how you conducted the classification analysis, and the interpretation of the obtained tree.
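For the classification exercise, it may help to see how a decision tree learner can rank candidate attributes for a split. The sketch below computes information gain (entropy reduction), one common impurity-based criterion; it is an illustration only, not necessarily the criterion applied by the Weka learner you choose. The records are made-up toy data, not rows of the Adult Census dataset, and the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter

# Made-up toy records (NOT from the Adult Census data):
# (education, sex, income class).
records = [
    ("Bachelors", "Male", ">50K"),
    ("Bachelors", "Female", ">50K"),
    ("HS-grad", "Male", "<=50K"),
    ("HS-grad", "Female", "<=50K"),
    ("Bachelors", "Male", ">50K"),
    ("HS-grad", "Male", ">50K"),
    ("HS-grad", "Female", "<=50K"),
    ("Bachelors", "Female", ">50K"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr_index, class_index=-1):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    labels = [row[class_index] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[class_index])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


gain_education = information_gain(records, 0)
gain_sex = information_gain(records, 1)

# The attribute with the larger gain would be preferred for the split.
best_attribute = "education" if gain_education > gain_sex else "sex"
print(best_attribute, round(gain_education, 3), round(gain_sex, 3))
```

On the real dataset, numeric attributes such as age or hours-per-week would first need a threshold or a discretization before a gain of this kind can be computed; that step is left out here.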
Exams
For MAINS master students (one-year degree), the exam for the Data Mining module consists of the evaluation of the two reports for exercises 1 and 2 above. For students of the two-year LM-MAINS degree, the exam consists of the evaluation of the two reports for exercises 1 and 2 above, plus an individual oral exam devoted to discussing aspects emerging from the exercises. The evaluation of the reports is the same for all members of a group (max 3 students per group). The date of the first oral exam session for LM-MAINS students will be set by appointment, within January 2012.