Kanwal Rekhi School of Information Technology
IIT Bombay


Students' Projects

Web based Named Entity Recognition

Sandesh Tawari


Web-NER aims to extract entities of interest from web pages. The scale, lack of structure, and diversity of the web make named entity recognition (NER) on web pages challenging. Traditionally, rule-based techniques such as wrapper induction systems have been used for this task, but these techniques are site-specific and not robust. We intend to use statistical learning approaches instead. The rich HTML structure that encloses web content provides strong visual and spatial cues in addition to textual information. Further, entities on web pages often stand in spatial relationships to one another. For instance, on web pages describing products, the product title is almost always found above the product image. A web page is a 2D layout of irregularly placed blocks of varying sizes, and capturing contextual interactions (spatial dependencies) between blocks on such a layout is a challenging task.
In this project, our aim is to build a framework that assists entity extraction from web pages by exploiting textual, visual, and spatial properties. We concentrate mostly on entities composed of several sub-entities that are dispersed across a web page. In our initial attempts, we have used conditional random fields (CRFs) and support vector machines (SVMs) with simple textual, spatial, and visual features. In our experiments, SVMs performed better than CRFs.
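
To make the idea of "simple textual, spatial and visual features" more concrete, here is a minimal sketch of how a page block might be turned into a feature dictionary. It is an illustration only and not the project's actual implementation: the Block class, its fields, the is_above helper, and the particular features chosen are all assumptions, and the sample page is invented. The coordinates assume a rendered layout in which y grows downward.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """A rendered page block. All fields are hypothetical, for illustration."""
    text: str
    x: float            # left edge in pixels
    y: float            # top edge in pixels (y grows downward)
    width: float
    height: float
    font_size: float = 12.0
    is_image: bool = False


def is_above(a: Block, b: Block) -> bool:
    """True if a ends at or above the top of b and they overlap horizontally."""
    overlaps = a.x < b.x + b.width and b.x < a.x + a.width
    return overlaps and a.y + a.height <= b.y


def block_features(block: Block, page: List[Block]) -> Dict[str, float]:
    """Compute a few textual, visual and spatial features for one block."""
    words = block.text.split()
    capitalized = sum(w[0].isupper() for w in words)
    return {
        # Textual features
        "num_words": float(len(words)),
        "frac_capitalized": capitalized / len(words) if words else 0.0,
        # Visual features
        "font_size": block.font_size,
        "area": block.width * block.height,
        # Spatial feature: does this block sit directly above some image?
        "above_image": float(any(
            o.is_image and is_above(block, o) for o in page if o is not block
        )),
    }


# An invented product page: title, image, then a short description.
page = [
    Block("Acme Travel Kettle", x=20, y=10, width=300, height=30, font_size=18),
    Block("", x=20, y=50, width=300, height=200, is_image=True),
    Block("in stock, ships in two days", x=20, y=260, width=300, height=20),
]
features = [block_features(b, page) for b in page]
```

Feature dictionaries of this kind could then be vectorized and passed to a classifier such as an SVM. The "above_image" feature mirrors the product-title example given earlier, where the title block is expected to lie above the product image.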

Copyright © 2004 KReSIT, IIT Bombay. All rights reserved.