
Details of Grant 

EPSRC Reference: EP/M005089/1
Title: SIPHS: Semantic interpretation of personal health messages for generating public health summaries
Principal Investigator: Collier, Professor N
Other Investigators:
Researcher Co-Investigators:
Project Partners:
Connecting Orgs for Reg Disease Surv; EMBL Group; Linguamatics Ltd
Public Health England; UCL; University of California, San Diego
University of Utah; University of Zurich
Department: English and Applied Linguistics
Organisation: University of Cambridge
Scheme: EPSRC Fellowship
Starts: 09 February 2015
Ends: 08 February 2020
Value (£): 971,955
EPSRC Research Topic Classifications:
Artificial Intelligence; Bioinformatics
Comput./Corpus Linguistics; Information & Knowledge Mgmt
EPSRC Industrial Sector Classifications:
Related Grants:
Panel History:
Panel Date     Panel Name                                         Outcome
17 Jul 2014    EPSRC ICT Prioritisation Panel - July 2014         Announced
04 Sep 2014    ICT Fellowships Interview Meeting - 4 Sept 2014    Announced
Summary on Grant Application Form
Open online data such as microblogs and discussion board messages have the potential to be an incredibly valuable source of information about health in populations. Such data has been rapidly growing, is low cost, real-time and seems likely to cover a significant proportion of the population. To take two examples, PatientsLikeMe has enjoyed 10% growth and now has over 200,000 users covering over 1,500 health conditions; the generic Twitter service is expanding at a rate of 30% annually with over 200 million active users. Going beyond simple keyword search and harnessing this data for public health represents both an opportunity and a challenge to natural language processing (NLP). This fellowship proposal is about helping health experts leverage social media for their own clinical and scientific studies through automatic techniques that encode messages according to a machine-understandable semantic representation. There are three major challenges this project seeks to address: (1) knowledge brokering: to develop algorithms to identify and code the informal descriptions of conditions, treatments, medications, behaviours and attitudes to standard ontologies such as the UMLS; (2) knowledge management: to create a structured resource of patient vocabulary used in blog texts and link it to existing coding systems; and (3) adding insight to evidence: to work with domain experts to use the coded information to automatically generate meaningful summaries for follow-up investigation. At the technological level the fellowship seeks to pioneer new methods for NLP and machine learning (ML). Social media remains a challenging area for NLP for a variety of reasons: short de-contextualised messages, high levels of ambiguity and out-of-vocabulary words, use of slang and an evolving vocabulary, as well as an inherent bias towards sensational topics.
The fellowship seeks to harness the progress made so far in NLP for social media analysis in the commercial domain and develop it further to provide meaningful public health evidence. One key aspect not previously addressed is the clinical coding of patient messages. Although knowledge brokering systems exist for clinical and scientific texts (e.g. the NLM's MetaMap), their performance on social media messages has been poor. The fellowship will utilise the rich availability of ontological resources in biomedicine together with ML on annotated message data to disambiguate informal language. Research will also aim to understand the communicative function of messages, for example whether the message reports direct experience or is related to news, humour or marketing. If these problems are successfully overcome, an important barrier to data integration with other types of clinical data will be removed. The advantage of providing health coding for social media reports is its potential for studying very large-scale cohorts and for real-time early alerting of aberrations. In the fellowship I will research the potential for multivariate time series alerting from semantically coded features, working with domain experts to evaluate across a range of metrics (e.g. sensitivity, timeliness, false alerting rates). A variety of approaches will be explored to generate real-time risk summaries across social media sources. Two real-world applications have been chosen to take this forward: early alerting for adverse drug reactions (ADRs) and infectious disease surveillance (IDS). Project outcomes will include fundamental technologies as well as open source algorithms, data sets and ontology. An exciting aspect of this fellowship is inter-disciplinary collaboration across stakeholders at all levels: scientists, public health experts and industry. Finally, participation will be opened up to the international community through the release of open source data.
Colleagues working on social media technologies will be invited to participate in discussions with users at a new challenge evaluation workshop.
Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.cam.ac.uk