IT642: Data warehousing and data mining course

Spring 2002

Instructor: Sunita Sarawagi
Timings (as of now): Slot 3, Wed 11:30-1, Thu 11:30-1
Office hours:
TA: B. Anuradha, Pranav

Approximate credit structure:
Midsem exam, 20%: Mar 6
Endsem exam, 30%: Apr 20-30
Project, 25%: announced last week of Feb; design report: 7 Mar; final report and demo: end of April
Three short surprise quizzes, 10%: the best two of the three quizzes are used for grading. All quizzes are unannounced, and missed quizzes will not be compensated except under very special circumstances.
Two homeworks, 15%: announced mid-Jan and first week of Feb

Grading for audit students: Audit students need to get at least 60% marks in n-1 of the n quizzes offered in class.
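
As an illustration of how the approximate credit structure and the audit rule above combine, here is a minimal sketch. It is not an official grading script. It assumes that every score is given as a fraction between 0 and 1, that the best two quizzes and the two homeworks are each averaged with equal weight, and that a missed quiz counts as 0. The function names are hypothetical.

```python
# Weights (in percent) taken from the approximate credit structure.
WEIGHTS = {"midsem": 20, "endsem": 30, "project": 25, "quizzes": 10, "homeworks": 15}


def quiz_component(quiz_scores):
    """Average of the best two quiz scores (assumption: equal weight each)."""
    best_two = sorted(quiz_scores, reverse=True)[:2]
    return sum(best_two) / 2


def course_total(midsem, endsem, project, quiz_scores, homework_scores):
    """Weighted course total in percent; all inputs are fractions in [0, 1]."""
    homeworks = sum(homework_scores) / len(homework_scores)  # assumption: equal weight
    return (
        WEIGHTS["midsem"] * midsem
        + WEIGHTS["endsem"] * endsem
        + WEIGHTS["project"] * project
        + WEIGHTS["quizzes"] * quiz_component(quiz_scores)
        + WEIGHTS["homeworks"] * homeworks
    )


def audit_passes(quiz_scores, threshold=0.6):
    """Audit rule: at least 60% marks in n-1 of the n quizzes offered."""
    n = len(quiz_scores)
    passed = sum(1 for score in quiz_scores if score >= threshold)
    return passed >= n - 1


if __name__ == "__main__":
    # A student who missed the second quiz (counted as 0 here).
    print(course_total(0.5, 0.6, 0.8, [0.9, 0.0, 0.7], [1.0, 0.5]))
    print(audit_passes([0.7, 0.5, 0.9]))
```

Under these assumptions a single missed quiz does not lower the quiz component, since only the best two of three count.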

Slides for some of the lectures

Eligibility Requirement

The course is open to IT and CS MTech students up to a limit of 30. The course assumes strong algorithms and programming skills. MTechs of other departments should approach the instructor for permission only if they meet the necessary prerequisites. CS BTechs will be allowed only if the course has relevance to their projects. BTechs of other departments can credit the course under special circumstances. Up to five SOM students can audit the course at their own risk.

Prerequisites

Upper-level undergraduate course(s) in algorithms and data structures, and a basic course on probability and statistics. A DBMS course is helpful but not necessary.

Schedule

Topic Classes Dates
Overview: Data warehousing, OLAP and Data mining [FPSSU96, CD97] 1 Jan
Classification: 7
Decision tree learning: construction, performance [chapter 3, Mit97; SAM96] 1 Jan
Issues: tree pruning methods, missing values, continuous classes etc. [chapter 3, Mit97], Domingos' paper on Occam's Razor 1 2 Feb
Instance-based learning [chapter 8,Mit97]: K-NN, Similarity indexing 1
Bayesian learning (Text classification) [chapter 6, Mit97] 1
Support Vector Machines, SVM Applet 1
Neural networks: Tom Mitchell's book Chapter 4 1 Feb
Meta learning: Dietterich's survey article, Bagging 1 9 Feb
Discussion of Projects 1 1 Feb
Automatic Information extraction: Rule-based methods 1 16 Feb
Evaluating learning methods, choosing between different models (Lift curves, cost-based etc) 1
Clustering: 3
Clustering methods (Chapter from book in IT folder and 6.11.6 from Mitchell's book for the EM algorithm) 1
Ways of scaling clustering algorithms [BFR98], Birch Clustering paper 1
Semi-supervised and active learning 1
Case study for classification learning: KDD Cup 1 Feb
Interesting itemset mining: 2
Basic framework and algorithms [AMS+96] 16 March
Variants for sequential patterns [SA96] and temporal mining [CSD98] 21 March
Interesting case study: Intrusion detection [LSM99], Mining telecommunications data 1 27 March
Warehousing: data warehousing overview [CD97], web warehousing 1 28 March
Schema integration and data cleaning, Deduplication, Merge/Purge paper, class notes, active learning in deduplication 1.5 28 Mar/3 Apr
Data marts: Multidimensional databases (OLAP) 1 4 Apr
Advanced topics: Integrating OLAP and mining [Sar99], Online aggregation 1 4/10 Apr
Recap, future and visions. 1 17 April
Project presentations - 18 April
Project demos - 20 April

Reading Material

Useful textbooks (copies will be available in the school library):
[Mit97]
T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[Hand01]
D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
[Han00]
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[FPSSU96]
Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.


List of papers:

[AMS+96]
Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328. AAAI/MIT Press, 1996.
[BFR98]
Paul Bradley, Usama Fayyad, and Cory Reina. Scaling EM (Expectation-Maximization) Clustering to Large Databases. Microsoft Research Technical Report MSR-TR-98-35, November 1998 (revised February 1999).
[BMS97]
Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proc. of the ACM SIGMOD Conference on Management of Data, May 1997.
[CD97]
S. Chaudhuri and U. Dayal. An overview of data warehouse and OLAP technology. ACM SIGMOD Record, March 1997.
[CSD98]
S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising temporal patterns. In Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), Aug 1998.
[DL97]
M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131-156, 1997.
[JD88]
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[LSM99]
W. Lee, S. Stolfo, and K. Mok. Mining in a data flow environment: experiences in network intrusion detection. In KDD, 1999.
[RG98]
R. Kohavi and G. John. The Wrapper Approach. In Huan Liu and Hiroshi Motoda, editors, Feature Selection for Knowledge Discovery and Data Mining. Kluwer International Series in Engineering and Computer Science, 1998.
[SA96]
Ramakrishnan Srikant and Rakesh Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
[SAM96]
John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 22nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.
[Sar99]
S. Sarawagi. Explaining differences in multidimensional aggregates. In Proc. of the 25th Int'l Conference on Very Large Databases (VLDB), 1999.
[Toi96]
Hannu Toivonen. Sampling large databases for association rules. In Proc. of the 22nd Int'l Conference on Very Large Databases, pages 134-145, Mumbai (Bombay), India, September 1996.



