IT 655 Advanced topics in data mining (3 - 0 - 0 - 3)

Beyond record data mining: Prediction with richer structures (sequences, trees, and graphs)

Conventional data mining techniques assume record data that consists of a fixed, uniform set of attributes, where the classification problem is to predict a single label. In this course we will cover papers that go beyond record data to perform learning and inference on richer structures such as sequences, trees, and graphs. The primary focus will be on information extraction, an application where such problems abound. This will be a research-oriented course covering papers from recent data mining and machine learning conferences and journals.

Timings

(Slot 20) Tuesdays 3:30-4:55, Fridays 2:00-3:30. The first class will meet on 27 Jul. Remember that this is a half-semester course; it will run in the first half of the semester.

Instructor Office hours

Tuesdays: 2:40 to 3:30 PM
Fridays: 3:30 to 4:30 PM
I encourage students to use my office hours to ask questions or make general comments about the class. Please make an appointment if you wish to drop by outside office hours.

Announcements

Answers to the final exam are available here.
Collect your answer sheets from the TA. Distribution of grades will be posted soon.
Remember, the project deadline is Oct 15th.

Pre-requisites:

Foundations part II or Introduction to probability, statistics and random processes or some other equivalent course. Send email to the instructor if in doubt.

Instructor

Sunita Sarawagi

TA

Manoj Kumar Chinnakotla

Syllabus/Schedule

Review of record-oriented mining techniques: Introduction, Record mining operators 27 Jul
Analyzing sequence data. Slides 3 Aug
Information Extraction
HMMs for text segmentation 10 Aug
Automatic text segmentation for extracting structured records. Borkar, Deshmukh, and Sarawagi Slides
Maximum entropy taggers 12 Aug
Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition, Borthwick et al.
A Maximum Entropy Model for Part-Of-Speech Tagging by Adwait Ratnaparkhi
Use of Support Vector Machines in Extended Named Entity Recognition, Takeuchi and Collier 17 Aug
Instructor out of town 19, 24, 26 Aug
Global discriminative models
Shallow parsing with conditional random fields. Sha and Pereira. 31 Aug
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, Collins. 3 Sept
Submission of project proposals 6 Sept
Classifying linked objects
Enhanced Hypertext Categorization Using Hyperlinks, Chakrabarti, Dom, Indyk. SIGMOD 1998. 3 Sept
Link-based Classification, Q. Lu and L. Getoor. ICML 2003. Sep 7/8
Discriminative Probabilistic Models for Relational Data, B. Taskar et al. Sep 7/8
Graphical models, Jordan Sep 10
Relational data mining
Learning Probabilistic Relational Models, Lise Getoor, Nir Friedman, Daphne Koller, and Avi Pfeffer. Sep 22
Neville, J., M. Rattigan and D. Jensen (2003). Statistical relational learning: Four claims and a survey. Sep 24
(optional reading) Individuals, relations and structures in probabilistic models, James Cussens
Final project presentations and reports due 15 Oct

Grading

Students are expected to read the papers in advance and submit an independent one-page summary of the paper(s) before the class starts (30%)
There will be a single two-hour exam based on the papers (30%)
Groups of two or three students need to do a final project (40%)

Audit students need to accrue 20% of the marks through any of the above three mechanisms. The preferred way is to write decent summaries of 6-7 papers.

Further readings