
Details of Grant 

EPSRC Reference: EP/J019488/1
Title: Lodie, Web Scale Information Extraction via Linked Open Data
Principal Investigator: Ciravegna, Professor F
Other Investigators:
Researcher Co-Investigators: Dr A Gentile, Dr Z Zhang
Project Partners: Knowledge Now Ltd, Yahoo! Research
Department: Computer Science
Organisation: University of Sheffield
Scheme: Standard Research
Starts: 01 September 2012
Ends: 30 November 2015
Value (£): 540,483
EPSRC Research Topic Classifications: Information & Knowledge Mgmt
EPSRC Industrial Sector Classifications: No relevance to Underpinning Sectors
Related Grants:
Panel History:
Panel Date: 06 Jun 2012
Panel Name: EPSRC ICT Responsive Mode - Jun 2012
Outcome: Announced
Summary on Grant Application Form
The World Wide Web provides access to tens of billions of pages. These pages contain information that is largely unstructured and intended only for human readers; however, we rely on computers "reading" these pages in order to find the information we need. The proposed research intends to develop technologies to radically improve the billions of searches performed every day by fulfilling Tim Berners-Lee's initial vision of a Web where webpage content is readable by both humans and machines. This vision, disregarded during the initial development of the Web, has now come back in the form of the Web of Data, or Linked Open Data (LOD), where billions of pieces of information are linked together and made available for automated processing. There is, however, a lack of interconnection between the information in webpages and that in LOD. A number of initiatives, like RDFa (supported by W3C) or Microformats (used by schema.org and supported by major search engines), are trying to enable machines to make sense of the information contained in human-readable pages by providing the ability to annotate webpage content with links into LOD.

The current state of the art in Web Information Extraction (IE) relies on domain-specific training data or generic extraction patterns. By leveraging LOD, the proposed research aims to develop IE methodologies and technologies that provide pervasive, user-driven, Web-scale information extraction, where the target of the IE is defined by the user's information needs and aimed at the billions of available Web documents covering an unlimited number of domains.

In this research we aim to develop models and algorithms to create a continuum between LOD and the human-readable Web. The approach will utilise the wealth of facts available from LOD and the limited number of pages annotated with RDFa/Microformats to learn to connect unannotated webpage content to the LOD cloud. This will provide two reciprocal advantages: (i) enabling the search of Web pages via the unambiguous LOD instances and concepts, and (ii) extending the LOD with the wealth of information available from webpage content.

The key challenge is the development of efficient, Web-scale, semi-supervised, iterative learning methods that use the initial "seed" data and annotations to generate models which exploit: (i) the local and global information regularities (e.g. structured information in tables, as well as page- and site-wide regularities); (ii) the redundancy (or repetition) of information; (iii) any ontological restrictions available in LOD. As the learning methods iterate from known interconnections to infer new connections, they must cope with the massive amount of noise generated by the number and variety of documents, domains and facts available.
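
As an illustration only, the iterative use of seed data and redundancy described above can be sketched as a toy bootstrapping loop. This is not the project's method: the function names (learn_patterns, apply_patterns, bootstrap), the min_support threshold and the example sentences are all hypothetical, patterns are simply the text found between two known terms, and the ontological restrictions in (iii) are not modelled.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus and seed fact, invented for this sketch.
DOCS = [
    "Paris is the capital of France.",
    "Rome is the capital of Italy.",
    "Guide: Rome is the capital of Italy, as everyone knows.",
    "Oslo is the capital of Norway.",
]
SEEDS = {("Paris", "France")}


def learn_patterns(docs, facts):
    """Collect the short text found between a known subject and object."""
    patterns = set()
    for doc in docs:
        for subj, obj in facts:
            m = re.search(re.escape(subj) + r"(.{1,30}?)" + re.escape(obj), doc)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(docs, patterns):
    """Map each candidate (subject, object) pair to the documents supporting it."""
    support = defaultdict(set)
    for i, doc in enumerate(docs):
        for pat in patterns:
            # Assumption: subjects and objects are single capitalised words.
            regex = r"(\b[A-Z]\w+)" + re.escape(pat) + r"([A-Z]\w+)\b"
            for m in re.finditer(regex, doc):
                support[(m.group(1), m.group(2))].add(i)
    return support


def bootstrap(docs, seeds, min_support=2, max_iters=5):
    """Grow the fact set from the seeds, keeping only redundantly attested pairs."""
    facts = set(seeds)
    for _ in range(max_iters):
        patterns = learn_patterns(docs, facts)
        support = apply_patterns(docs, patterns)
        # Redundancy filter: a new pair must appear in at least min_support documents.
        new = {pair for pair, ids in support.items() if len(ids) >= min_support} - facts
        if not new:
            break
        facts |= new
    return facts


if __name__ == "__main__":
    print(sorted(bootstrap(DOCS, SEEDS)))
```

With this toy data the pair (Rome, Italy) is accepted because two documents repeat it, while (Oslo, Norway) is held back because only one document supports it. The redundancy filter stands in, in a very reduced form, for the noise handling the paragraph calls for.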

In addition to publishing the research and its findings, the IE methods developed will be tested on the task of extracting information relevant to schema.org (a task currently promoted by large search engine companies such as Google and Bing) as well as in international public evaluations. As part of such evaluations, the project will generate at least one publicly available, Web-scale IE task (inclusive of corpora, linked resources, etc.) to enable comparison of research results by other researchers.

The project aims to impact the fields of Natural Language Processing, Machine Learning, Information Retrieval, and Web and Semantic Technologies by exploring the extraction of information in Web-scale, user-driven tasks. Success in the project will enable new ways of creating and using the LOD, and will provide a paradigm shift in the way information can be retrieved from the Web: away from a reliance on keywords and towards the search and exploration of the concepts and meaning (semantics) embedded in those words.

Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Organisation Website: http://www.shef.ac.uk