Email : grahul _AT_ cse.iitb _DOT_ ac.in
Snail : K R Building, IIT Bombay, Powai, Mumbai - 400076, India
CV (outdated): here
Note: Working at Google Research, Mountain View since Oct 2010.
Research: I have worked on various aspects of Conditional Random Fields (CRF) in my thesis. CRFs are discriminatively trained
conditional models, which have been established as the state of the art for various structured text mining tasks such as POS tagging,
named entity recognition and segmentation, NP-chunking, and information extraction in general. My thesis aims at tackling three
key issues that arise when CRFs are applied to real data on the web:
- Creating imprecise databases from CRFs : We present methods for storing the probabilistic output of a CRF in an imprecise database.
The CRF has an exponential output model, which does not match the representation-friendly model of an imprecise DB. We consider the mixture-of-multinomials imprecise data model,
and give algorithms to approximate the CRF's distribution with that of the data model, without computing even the top-1
labeling of the CRF.
- Collective inference with cardinality-based clique potentials: Real data often exhibits content overlap across various data sources.
When different CRFs are deployed on these sources, it is natural to demand that their outputs agree on the shared inputs.
We present various cardinality-based clique potentials that model this agreement without making inference horribly intractable. We propose that
collective inference be done using message passing on the cluster graph, and we present combinatorial algorithms to compute messages for these
special clique potentials. Our algorithms provide better approximation ratios than the alternatives while enjoying a
runtime sub-quadratic in the clique size.
We further show that the cluster message passing framework also captures richer forms of overlap defined by first-order properties of the labelings.
We use such properties while applying collective inference to transfer learning across related scopes in the same domain
(e.g. different bibliographic styles inside the publications domain), and obtain significant gains over other approaches.
- Collective training : Content overlap across different data sources can also be exploited during training.
However, the nature of the overlap is arbitrary, and most of the previous work has focused on just two sources.
We use an agreement-based learning framework and its approximations that vary in tractability, robustness against noise, and extent
of agreement enforced on the overlapping data.
We also present a principled scheme to generate low-noise agreement sets across multiple sources.
Through extensive experiments we establish that our method of additively rewarding agreement over maximal segments of text
provides the best trade-offs, and also scores over alternatives such as collective inference, staged
training, and multi-view learning.
Here is a link to my CV (pdf) that contains a detailed list of problems that I have worked on. Here is my thesis [pdf].
My unofficial research agenda
My brief bio -- enacted by Homer Simpson (flash-plugin needed, click on each slide to move to the next)
Psst! It is not machine learning but these problems that keep me awake at night. Go ahead and read them, just do not be a tattle-tale
and tell my guide.
I am known for various things -- as a YouTube stress tester, video games analyzer, cricket-website spider, obscure Wikipedia articles quality assurer, and a movie leecher.
In between doing all these things, I find, or try to find, time to do my PhD, which began a long long time ago, in a city far far away. In a previous avatar, I was known
for selling out to IBM India Research Lab (I was young and I needed the money), and before that, I was a dorky undergrad at IIT Delhi.
But I am told that website-visitors do not like such informal descriptions, so the rest of the page is in a boringly familiar format. Read on...
Education and experience:
- Nth year PhD student (started in July 2005; I am too lazy to update this info every year) at the Dept. of Computer Science and Engg., IIT Bombay, India.
My advisor is Prof. Sunita Sarawagi, and my areas of interest are graphical models, text mining and machine learning.
- Research Intern with the Community Systems Group at Yahoo! Research, Santa Clara, California, US. (May 2008 - Aug 2008)
- Technical Staff Member at IBM India Research Lab (June 2001 - Dec 2007)
- Bachelor of Technology (B.Tech) in Computer Science and Engg. from the Indian Institute of Technology (IIT), Delhi. (2001)
My major project was on maintaining 2-edge connectivity in dynamic graphs.
- Intern at the Embedded Systems Group, Dept. of Computer Science, University of Dortmund, Germany. (May - July, 2000)
I implemented the Loop Invariant Code Motion optimization for a C compiler specifically meant for embedded systems.
Conference and Journal Publications
- Rahul Gupta, Sunita Sarawagi and Ajit A. Diwan. Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials. In JMLR Volume 11, 2010. [pdf]
- Rahul Gupta and Sunita Sarawagi. Joint Training for Open-domain Extraction on the Web: Exploiting Overlap when Supervision is Limited. In WSDM 2011. [pdf]
- Rahul Gupta and Sunita Sarawagi. Answering table augmentation queries from unstructured lists on the web. In VLDB, 2009. [pdf]
- Rahul Gupta and Sunita Sarawagi. Domain adaptation of information extraction models. In SIGMOD Record, 2008. [pdf]
- Sunita Sarawagi and Rahul Gupta. Accurate max-margin training for structured output spaces. In ICML 2008. [pdf]
- Rahul Gupta, Ajit A. Diwan, and Sunita Sarawagi. Efficient inference with cardinality-based clique potentials. In ICML 2007. [pdf]
- Rahul Gupta and Sunita Sarawagi. Creating probabilistic databases from information extraction models. In VLDB 2006. [pdf]
- Rahul Gupta and Sunita Sarawagi. MAP estimation in MRFs via rank aggregation. In Workshop on Open Problems in Statistical Relational Learning (co-presented with Workshop on Learning in Structured Output Spaces), ICML 2006. [pdf]
Thesis / Tech Reports
- Rahul Gupta. Training, Inference and Imprecise Storage Methods for Collections of Conditional Random Fields. PhD Thesis [pdf].
- Rahul Gupta, Sunita Sarawagi and Ajit A. Diwan. Generalized Collective Inference with Symmetric Clique Potentials. [arXiv page]. [pdf]
- A somewhat outdated survey (written in November 2005) on Conditional Random Fields. [pdf]
- Manish Bhide, Ajay Gupta, Rahul Gupta, Prasan Roy, Mukesh K. Mohania, and Zenita Ichhaporia. LIPTUS: associating structured and unstructured information in a banking environment. In SIGMOD 2007 (Industrial Track). Pages 915-923. [pdf]
- Rajkishore Barik, Christian Grothoff, Rahul Gupta, Vinayaka Pandit and Raghavendra Udupa. Optimal Bitwise Register Allocation using Integer Linear Programming. In LCPC 2006. [pdf]
- T. V. Ashwin, Rahul Gupta and Sugata Ghosal. Adaptable Similarity Search using Non-Relevant Information. In VLDB 2002 [pdf].
- A shorter version : Leveraging non-relevant images to enhance image retrieval performance. In ACM Multimedia 2002 [pdf]
- N. Agrawal, R. Ananthanarayanan, R. Gupta, S. Joshi, R. Krishnapuram, S. Negi. The eShopmonitor: A comprehensive data extraction tool for monitoring Web sites. In IBM Journal of Research and Development, Vol 48, No. 5/6, 2004.
- A shorter version: EShopMonitor: A Web Content Monitoring Tool. In ICDE 2004 (Industrial Track). [Publisher page].
Obligatory academic links:
Hobbies and Interests
- Not much into playing sports, although I am a big fan of trekking and jogging. If you haven't trekked in the Himalayas, then, my friend, you might have trekked in the Alps or the Appalachians. Sorry, I meant to say that if you haven't roamed around in the Himalayas, then you have missed something non-trivial.
- An avid watcher and even more enthusiastic reader as far as cricket is concerned.
- I am very pessimistic when it comes to movies. I believe that the best movies are behind us, and we can
expect no more than one good movie per year. Some of my favorites --- Mr. Smith Goes to Washington, Rope,
The Wages of Fear, Once Upon a Time in the West, It Happened One Night, The Prestige. Then come the
generally great cult movies like Matrix, LOTR, Dark Knight. After wearing some protective gear, I would like to state that I think Star Wars is at best an okay-ish sci-fi soap opera.
- PS2 and PC games : AoC, Quake, GTA, Burnout, FIFA, Crysis, Mass Effect, Prince of Persia series, God of War, Assassin's Creed, and the endless list goes on...
- And don't forget the golden games from Nintendo et al. --- Super Mario, Contra, Road Fighter, Islander, Goonies, Tetris, Bomberman...
- Music: Primarily an old-school rock fan.
Particularly like Pink Floyd, Dire Straits, Bruce Springsteen, Guns n Roses, Deep Purple, Led Zeppelin
and Pearl Jam.
- PhD Comics, Slashdot, Wikipedia, Dilbert, Calvin & Hobbes, User Friendly, Xkcd
- South Park, Simpsons, Seinfeld (best show ever. Ever.)
- Books - A. C. Clarke, Tolkien, Asimov, books on cricket, Dave Barry. Three best books ever -- Silmarillion,
2001: A Space Odyssey, Hitchhiker's Guide to the Galaxy.