EPSRC logo

Details of Grant 

EPSRC Reference: EP/K024043/1
Title: Statistical Natural Language Processing Methods for Computer Program Source Code
Principal Investigator: Sutton, Dr C
Other Investigators:
Researcher Co-Investigators:
Project Partners:
Microsoft
Department: Sch of Informatics
Organisation: University of Edinburgh
Scheme: Standard Research
Starts: 01 October 2013 Ends: 31 March 2017 Value (£): 375,602
EPSRC Research Topic Classifications:
Artificial Intelligence Comput./Corpus Linguistics
Fundamentals of Computing Software Engineering
EPSRC Industrial Sector Classifications:
Information Technologies
Related Grants:
Panel History:
Panel DatePanel NameOutcome
21 Nov 2012 EPSRC ICT Responsive Mode - Nov 2012 Announced
Summary on Grant Application Form
Complex software systems involve many components and make use of many external libraries. Programmers who work on such software must remember the protocols for using all of those components correctly, and the process of learning to use a new component can be time consuming and a source of bugs.

We believe that there is a major untapped resource that can help address this problem. Billions of lines of code are readily available on the Internet, much of which are of professional quality. Hidden within this code is a large amount of knowledge about good coding practices, for example, about avoiding error-prone constructs or about the best protocol for using a particular library. We envision a new type of programming tool, which could be called data-driven development tools, that aggregate knowledge about programming from a large corpus of mature software projects, for presentation within the development environment. Just as the current generation of IDEs helps developers to manage their code, the next generation of IDEs will help developers to learn how to write better code.

Fortunately, there is a research field that has already developed a large body of sophisticated tools for analyzing large amounts of text: namely, statistical natural language processing. The long-term strategic goal of this project is to develop new natural language processing techniques aimed at analyzing computer program source code, in order to help programmers learn coding techniques from the code of others. There is a large area for research here that has been almost completely unexplored.

As a first step in this research area, in this project we will focus on automatically identifying short code fragments, which we call idioms, that occur repeatedly across different software projects. An example of an idiom is the typical construct for iterating over an array in Java. Although they are ubiquitous in source code, idioms of this form have not to our knowledge been systematically studied, and we are unaware of any techniques for automatically identifying idioms. The main objective of this project is to develop new statistical NLP methods with the goal of automatically identifying idioms from a corpus of source code text. We call this research problem idiom mining, and it is to our knowledge a new research problem.

This is an interdisciplinary project that draws from statistical NLP, machine learning, and software engineering. The research work of this project is primarily in statistical NLP and machine learning, and will involve developing new statistical methods for finding idioms in programming language text.

Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.ed.ac.uk