Rahul Gupta
Email : grahul _AT_ cse.iitb _DOT_ ac.in
Snail : K R Building, IIT Bombay, Powai, Mumbai - 400076, India
CV (outdated): here
Note: Working at Google Research, Mountain View since Oct 2010.


Research: I have worked on various aspects of Conditional Random Fields (CRFs) in my thesis. CRFs are discriminatively trained conditional models that have been established as the state of the art for various structured text-mining tasks such as POS tagging, named entity recognition and segmentation, NP-chunking, and information extraction in general (a toy sketch of the model appears after the list below). My thesis aims at tackling three key issues that arise when CRFs are applied to real data on the web:
  1. Creating imprecise databases from CRFs: We present methods for storing the probabilistic output of a CRF in an imprecise database. The CRF has an exponential output model, which does not match the representation-friendly model of an imprecise DB. We consider the mixture-of-multinomials imprecise data model, and give algorithms to approximate the CRF's distribution with that of the data model, without computing even the top-1 labeling of the CRF (a toy single-component sketch appears below the list).
  2. Collective inference with cardinality-based clique potentials: Real data often exhibits content overlap across various data sources. When different CRFs are deployed on these sources, it is natural to demand that their outputs agree on the shared inputs. We present various cardinality-based clique potentials that model this agreement without making inference horribly intractable. We propose that collective inference be done using message passing on the cluster graph, and we present combinatorial algorithms to compute messages for these special clique potentials. Our algorithms provide better approximation ratios than the alternatives while enjoying a runtime sub-quadratic in the clique size (a toy binary sketch appears below the list).

    We further show that the cluster message passing framework also captures richer forms of overlap defined by first-order properties of the labelings. We use such properties while applying collective inference to transfer learning across related scopes in the same domain (e.g. different bibliographic styles inside the publications domain), and obtain significant gains over other approaches.

  3. Collective training: Content overlap across different data sources can also be exploited during training. However, the nature of the overlap is arbitrary, and most previous work has focused on just two sources. We use an agreement-based learning framework and its approximations, which vary in tractability, robustness against noise, and the extent of agreement enforced on the overlapping data. We also present a principled scheme to generate low-noise agreement sets across multiple sources. Through extensive experiments we establish that our method of additively rewarding agreement over maximal segments of text provides the best trade-offs, and also scores over alternatives such as collective inference, staged training, and multi-view learning (a toy version of the agreement reward appears below the list).
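
For the curious, here is a minimal, self-contained sketch of the exponential model behind a linear-chain CRF, with brute-force normalization that is feasible only on toy inputs. The features, weights, and input below are invented for illustration and are not from the thesis:

    import itertools, math

    # Toy linear-chain CRF: p(y|x) = exp(sum_t w.f(y_{t-1}, y_t, x, t)) / Z(x).
    LABELS = ["B", "I", "O"]

    def features(prev_y, y, x, t):
        # Indicator features: (current word, label) and (label bigram).
        return [("word-label", x[t], y), ("transition", prev_y, y)]

    def score(y_seq, x, w):
        # Unnormalized log-score of a labeling: sum of feature weights.
        s, prev = 0.0, "START"
        for t, y in enumerate(y_seq):
            s += sum(w.get(f, 0.0) for f in features(prev, y, x, t))
            prev = y
        return s

    def prob(y_seq, x, w):
        # Brute-force partition function Z(x); exponential in len(x),
        # so this is for toy-sized inputs only.
        Z = sum(math.exp(score(list(c), x, w))
                for c in itertools.product(LABELS, repeat=len(x)))
        return math.exp(score(y_seq, x, w)) / Z

    x = ["Rahul", "Gupta", "IIT", "Bombay"]
    w = {("word-label", "Rahul", "B"): 2.0,
         ("word-label", "Gupta", "I"): 2.0,
         ("transition", "B", "I"): 1.0}
    print(prob(["B", "I", "O", "O"], x, w))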
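
And a sketch of the representation gap from item 1: an imprecise DB stores something like a mixture of per-position multinomials. In the single-component case the KL-optimal approximation simply copies the per-position marginals, as below; the thesis's algorithms handle general mixtures, computed directly from the CRF. The toy distribution here is hand-written rather than a CRF's output:

    from collections import defaultdict

    # Hand-written toy distribution over length-3 labelings; a CRF's
    # output distribution would play this role in practice.
    P = {("B", "I", "O"): 0.5,
         ("B", "I", "I"): 0.3,
         ("O", "B", "I"): 0.2}

    def per_position_marginals(P, length):
        # The single-component multinomial approximation: for each
        # position t, marginals[t][y] = sum of P over labelings with
        # label y at t. This product distribution minimizes KL(P || Q).
        marginals = [defaultdict(float) for _ in range(length)]
        for labeling, p in P.items():
            for t, y in enumerate(labeling):
                marginals[t][y] += p
        return marginals

    for t, m in enumerate(per_position_marginals(P, 3)):
        print("position", t, dict(m))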

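A toy version of the cardinality idea from item 2, restricted to binary labels: when the clique potential depends only on how many variables take label 1, sorting the nodes once and sweeping over all cardinalities finds the exact MAP in O(n log n). The thesis's algorithms handle richer multi-label potentials and full message computation; the scores and potential below are made up:

    # Joint score of a clique: sum_i s_i[y_i] + phi(#{i : y_i = 1}).

    def map_with_cardinality_potential(scores, phi):
        # scores[i] = (score of label 0, score of label 1) for node i.
        gains = sorted(((s1 - s0, i) for i, (s0, s1) in enumerate(scores)),
                       reverse=True)
        prefix = sum(s0 for s0, _ in scores)   # all nodes at label 0
        best_val, best_k = None, 0
        for k in range(len(scores) + 1):
            val = prefix + phi(k)              # best score with exactly k ones
            if best_val is None or val > best_val:
                best_val, best_k = val, k
            if k < len(scores):
                prefix += gains[k][0]          # flip the k-th best node to 1
        labeling = [0] * len(scores)
        for _, i in gains[:best_k]:
            labeling[i] = 1
        return labeling, best_val

    scores = [(0.2, 1.0), (0.6, 0.5), (0.1, 0.4), (0.9, 0.0)]
    phi = lambda k: 0.5 * max(k, len(scores) - k)  # rewards near-unanimity
    print(map_with_cardinality_potential(scores, phi))
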
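Finally, a toy version of the kind of agreement reward used in item 3: two models that both label a shared segment earn log sum_y p1(y) p2(y), which is highest when their predictive distributions concentrate on the same label. The thesis adds such rewards over maximal shared segments during training; the two distributions below are invented:

    import math

    def agreement_reward(p1, p2):
        # Log of the probability that independent draws from the two
        # models agree; a common form of agreement-based objective.
        return math.log(sum(p1[y] * p2[y] for y in p1))

    agree    = {"Title": 0.9, "Author": 0.1}
    disagree = {"Title": 0.1, "Author": 0.9}
    print(agreement_reward(agree, agree))     # ~ -0.20 (near agreement)
    print(agreement_reward(agree, disagree))  # ~ -1.71 (disagreement)
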
Here is a link to my CV (pdf) that contains a detailed list of problems that I have worked on. Here is my thesis [pdf].
My unofficial research agenda

My brief bio -- enacted by Homer Simpson (Flash plugin needed; click on each slide to move to the next)
Psst! It is not machine learning but these problems that keep me awake at night. Go ahead and read them, just do not be a tattle-tale and tell my guide.

I am known for various things -- as a YouTube stress tester, a video-game analyzer, a cricket-website spider, a quality assurer of obscure Wikipedia articles, and a movie leecher. In between doing all these things, I find, or try to find, time to do my PhD, which began a long long time ago, in a city far far away. In a previous avatar, I was known for selling out to IBM India Research Lab (I was young and I needed the money), and before that, I was a dorky undergrad at IIT Delhi.

But I am told that website-visitors do not like such informal descriptions, so the rest of the page is in a boringly familiar format. Read on...


Education and experience: Started in July 2005. I am too lazy to update this info every year.
Courses:
Conference and Journal Publications
Thesis / Tech Reports
Ancient/other papers
Obligatory academic links:
Misc Links:
Hobbies and Interests