
Details of Grant 

EPSRC Reference: EP/J020664/1
Title: CROSS: Real-time Story Detection Across Multiple Massive Streams
Principal Investigator: Osborne, Dr M
Other Investigators:
Macdonald, Professor C
Ounis, Professor I
Researcher Co-Investigators:
Project Partners:
Department: Sch of Informatics
Organisation: University of Edinburgh
Scheme: Standard Research
Starts: 01 May 2012
Ends: 30 April 2013
Value (£): 209,736
EPSRC Research Topic Classifications:
Information & Knowledge Mgmt
EPSRC Industrial Sector Classifications:
Aerospace, Defence and Marine
Information Technologies
Related Grants:
Panel History:
Panel Date     Panel Name                        Outcome
09 Feb 2012    Data Intensive Systems (DaISy)    Announced
Summary on Grant Application Form
The world is rapidly becoming more connected, with people communicating over multiple streams (social media, newswire, Wikipedia, etc.) on a bewildering range of topics and at a furious rate. Twitter alone receives more than 250 million new posts every day (Tsotsis 2011). This massive interconnection means that content can appear and quickly spread through and across different streams. For example, in the recent London riots, many tweets reported the rioting events as they happened in real time. However, not all posted content is of good quality or factually correct, which complicates the job of monitoring such streams for any purpose. An example of this happened when a comedian spread false rumours on Twitter about Osama Bin Laden watching his television show (Lineham 2011). Communication streams are also known to spread rumours, outright misinformation and content with malicious intent. For instance, during the same riots, radicalising posts were spread calling for participation in the so-called "cyber-jihad" (BBC 2011). Systems that can identify such posts are of paramount importance for security monitoring purposes.

On the other hand, not all information spread on media such as Twitter is accurate or interesting. This is compounded by the peculiarities of messages on modern social media (short length, jargon, dependence on social context, etc.), where biased, incomplete, inaccurate and misleading messages are common. These peculiarities make it extremely challenging to automatically identify, in real time, events worth monitoring for security purposes.

We propose a distributed infrastructure to automatically identify important new events (aka stories) in real time by combining and comparing multiple message streams. For many applications, the faster a story is detected, the more valuable the detection is. A security agency using our system would be better prepared when dealing with fast-moving events as they unfold. Indeed, in this project, the notion of importance will be defined within a security context. Because streams are often biased and not everything in them can be trusted, a key requirement of the system is minimising false positives (uninteresting stories that are discovered). Moreover, the effective management and efficient processing of multiple streams of real-time data pose new technological and scientific challenges:

Challenge 1: Identify interesting new stories and not drown in a sea of false positives, yet reduce the effects of bias and rumour.

Challenge 2: Minimise system latency, so that new stories are detected in real time.

We tackle the first challenge from the novel perspective of processing multiple streams, exploiting the fact that stories reported multiple times across several streams can cancel out stream-specific bias and errors. For example, if a story is true, then it is more likely to appear both on Twitter and as an update to a Wikipedia article. Alternatively, a story might appear on Twitter and also in a governmental cable. The more often a story occurs within and across streams, the more likely it is to be interesting. This is the cornerstone of our proposal, which we tackle by building upon modern first story detection techniques, adapted to account for bias and rumours.
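
The cross-stream idea can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the project's implementation: the class name CrossStreamDetector, the corroboration function, the bag-of-words vectors, the linear nearest-neighbour scan and the 0.5 similarity threshold are all hypothetical choices made for brevity. A message that is not similar enough to any known story starts a new story; otherwise it is attached to the closest story, and the number of distinct streams reporting that story serves as a simple corroboration score.

```python
import math
from collections import Counter


def vectorise(text):
    """Turn a short message into a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CrossStreamDetector:
    """Hypothetical first-story detector that tracks which streams report each story."""

    def __init__(self, threshold=0.5):
        # Assumed similarity cut-off below which a message counts as a new story.
        self.threshold = threshold
        self.stories = []  # each story: {"vector": Counter, "streams": set of stream names}

    def process(self, stream, text):
        """Return (story, is_new) for a message arriving on the named stream."""
        vec = vectorise(text)
        best, best_sim = None, 0.0
        # Linear scan over known stories; a real system would need something faster.
        for story in self.stories:
            sim = cosine(vec, story["vector"])
            if sim > best_sim:
                best, best_sim = story, sim
        if best is None or best_sim < self.threshold:
            story = {"vector": vec, "streams": {stream}}
            self.stories.append(story)
            return story, True
        best["streams"].add(stream)
        return best, False


def corroboration(story):
    """Number of distinct streams in which the story has appeared."""
    return len(story["streams"])
```

In this sketch, a story seen on both a "twitter" stream and a "wikipedia" stream scores 2, while a story seen on only one stream scores 1, so single-stream rumours can be ranked below corroborated stories.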

For the second challenge, we ensure low-latency story detection by using a distributed real-time data processing architecture (e.g. S4 or Storm), similar to MapReduce but better suited to real-time operation. Real-time architectures for dealing with massive-scale data are in their infancy, hence CROSS will present a first concrete application, with a corresponding development of best practices for such architectures.

Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.ed.ac.uk