WWT: Table queries on the World Wide Web.

The goal of the WWT project is to answer queries over partially structured record sets on the web.

Publications

Rahul Gupta and Sunita Sarawagi. Joint training for open-domain extraction on the web: Exploiting overlap when supervision is limited. In WSDM, 2011.
[ bib | .pdf ]

We consider the problem of jointly training structured models for extraction from multiple web sources whose records enjoy partial content overlap. This has important applications in open-domain extraction, e.g. a user materializing a table of interest from multiple relevant unstructured sources; or a site like Freebase augmenting an incomplete relation by extracting more rows from web sources. Such applications require extraction over arbitrary domains, so one cannot use a pre-trained extractor or demand a huge labeled dataset. We propose to overcome this lack of supervision by using content overlap across the related web sources. Existing methods of exploiting overlap have been developed under settings that do not generalize easily to the scale and diversity of overlap seen on Web sources. We present an agreement-based learning framework that jointly trains the models by biasing them to agree on the agreement regions, i.e. shared text segments. We present alternatives within our framework to trade-off tractability, robustness to noise, and extent of agreement enforced; and propose a scheme of partitioning agreement regions that leads to efficient training while maximizing overall accuracy. Further, we present a principled scheme to discover low-noise agreement regions in unlabeled data across multiple sources. Through extensive experiments over 58 different extraction domains, we establish that our framework provides significant boosts over uncoupled training, and scores over alternatives such as collective inference, staged training, and multi-view learning.

Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB), 2010.
[ bib | .pdf ]

Tables are a universal idiom to express relational data, even for human consumption. Billions of tables on Web pages express entity references, attributes and relationships. This representation of relational world knowledge is usually considerably better than completely unstructured free-format text. At the same time, unlike manually-created knowledge bases, relational information mined from "organic" Web tables need not be constrained by availability of precious editorial time. Unfortunately, in the absence of any formal, uniform schema imposed on Web tables, Web search cannot take advantage of these high-quality sources of relational information. In this paper we propose new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns are seeking to express. We propose a new graphical model for making all these labeling decisions for each table simultaneously, rather than make separate local decisions for entities, types and relations. Experiments using the YAGO catalog, DBPedia, tables from Wikipedia, and over 25 million HTML tables from a 500 million page Web crawl uniformly show the superiority of our approach. We also evaluate the impact of better annotations on a prototype relational Web search tool. We demonstrate clear benefits of our annotations beyond indexing tables in a purely textual manner.
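
To make the idea of deciding column types and cell entities together more concrete, here is a minimal toy sketch. It is not the graphical model from the paper: the catalog, the string-similarity score, the `type_bonus` weight, and the exhaustive search over column types are all assumptions made purely for illustration. For each candidate column type, every cell picks the entity that maximizes string similarity plus a bonus when the entity has that type, and the column type with the highest total score wins.

```python
from difflib import SequenceMatcher

# Hypothetical toy catalog (not YAGO or DBPedia): entity name -> set of types.
CATALOG = {
    "Paris": {"city"},
    "Paris Hilton": {"person"},
    "London": {"city"},
    "Jack London": {"person"},
    "Berlin": {"city"},
}


def similarity(cell, entity):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, cell.lower(), entity.lower()).ratio()


def annotate_column(cells, catalog, type_bonus=0.5):
    """Pick one column type and one entity per cell by maximizing a joint score.

    The score of a (cell, entity) pair is its string similarity, plus
    type_bonus (an assumed weight) when the entity has the column type.
    """
    types = sorted({t for ts in catalog.values() for t in ts})
    best = None
    for col_type in types:
        total, picks = 0.0, []
        for cell in cells:
            # Best entity for this cell under the current column type.
            score, entity = max(
                (similarity(cell, e) + (type_bonus if col_type in ts else 0.0), e)
                for e, ts in catalog.items()
            )
            total += score
            picks.append(entity)
        if best is None or total > best[0]:
            best = (total, col_type, picks)
    return best[1], best[2]


col_type, entities = annotate_column(["Paris", "London", "Berlin"], CATALOG)
print(col_type, entities)
```

Because the type bonus couples the cells of a column, an ambiguous mention is pulled toward the entity that is consistent with the rest of the column, which is the intuition behind labeling a table simultaneously rather than cell by cell.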

Rahul Gupta and Sunita Sarawagi. Answering table augmentation queries from unstructured lists on the web. In Proc. of the 35th Int'l Conference on Very Large Databases (VLDB), 2009.
[ bib | .pdf ]

We present the design of a system for assembling a table from a few example rows by harnessing the huge corpus of information-rich but unstructured lists on the web. We developed a totally unsupervised end-to-end approach that, given the sample query rows, (a) retrieves HTML lists relevant to the query from a pre-indexed crawl of web lists, (b) segments the list records and maps the segments to the query schema using a statistical model, (c) consolidates the results from multiple lists into a unified merged table, and (d) presents to the user the consolidated records ranked by their estimated membership in the target relation.

The key challenges in this task include construction of new rows from very few examples, and an abundance of noisy and irrelevant lists that swamp the consolidation and ranking of rows. We propose modifications to statistical record segmentation models, and present novel consolidation and ranking techniques that can process input tables of arbitrary schema without requiring any human supervision. Experiments with Wikipedia target tables and 16 million unstructured lists show that even with just three sample rows, our system is very effective at recreating Wikipedia tables, with a mean runtime of around 20s.
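
The four stages (a) to (d) described above can be pictured with a small toy pipeline. This is only a sketch under strong assumptions, not the system from the paper: the function names are hypothetical, retrieval is a plain substring match instead of a pre-indexed crawl, segmentation splits on a fixed separator (an assumed `" - "`) rather than using a statistical model, and ranking simply counts how many lists support a row.

```python
from collections import defaultdict


def retrieve(lists, query_rows, min_hits=1):
    """(a) Keep lists that mention at least min_hits distinct query cell values."""
    cells = {c.lower() for row in query_rows for c in row}
    relevant = []
    for records in lists:
        blob = " ".join(records).lower()
        if sum(1 for c in cells if c in blob) >= min_hits:
            relevant.append(records)
    return relevant


def segment(records, query_rows, sep=" - "):
    """(b) Split records on sep and map segment positions to query columns.

    The mapping is learned from the first record containing a full sample row.
    """
    split = [[s.strip() for s in r.split(sep)] for r in records]
    mapping = None
    for segs in split:
        lowered = [s.lower() for s in segs]
        for row in query_rows:
            if all(c.lower() in lowered for c in row):
                mapping = [lowered.index(c.lower()) for c in row]
                break
        if mapping is not None:
            break
    if mapping is None:
        return []  # no record matched a sample row; the list is skipped
    return [tuple(segs[i] for i in mapping)
            for segs in split if len(segs) > max(mapping)]


def consolidate(tables):
    """(c) Merge rows across lists, counting how many lists contain each row."""
    support = defaultdict(int)
    canonical = {}
    for table in tables:
        seen = {}
        for row in table:
            seen.setdefault(tuple(c.lower() for c in row), row)
        for key, row in seen.items():
            support[key] += 1
            canonical.setdefault(key, row)
    return [(canonical[k], n) for k, n in support.items()]


def rank(merged):
    """(d) Order rows by support (descending), breaking ties alphabetically."""
    return sorted(merged, key=lambda item: (-item[1], item[0]))


def answer_query(lists, query_rows):
    tables = [segment(r, query_rows) for r in retrieve(lists, query_rows)]
    return rank(consolidate(tables))


lists = [
    ["Paris - France", "Rome - Italy", "Berlin - Germany"],
    ["France - Paris", "Italy - Rome"],
    ["Apples - Red", "Bananas - Yellow"],
]
for row, n in answer_query(lists, [("Paris", "France")]):
    print(row, n)
```

In this toy run the second list stores its columns in the opposite order, and the sample row is what lets the segmenter align it with the query schema; the unrelated third list is dropped at the retrieval stage.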

Datasets

Annotation data: Annotated tables with entity, type, and relationship tags used in the above VLDB 2010 paper.

Talk Slides

Overview: Query-driven relation extraction from the semi-structured Web
Annotating Web tables with entity, type, relationship links
Exploiting overlap for extraction from multiple sources
Extractions from lists

Group members

Sunita Sarawagi
Rahul Gupta
Girija Limaye
Rakesh Pimplikar
Prashant Barole
Copyright IIT Bombay.