Details of Grant

EPSRC Reference:

EP/M507076/1

Title:

Data Exploration and Predictive Analytics for Music Publishing

Principal Investigator:

Anagnostopoulos, Dr C

Other Investigators:

Researcher Co-Investigators:

Project Partners:

Department:

Mathematics

Organisation:

Imperial College London

Scheme:

Technology Programme

Starts:

25 September 2014

Ends:

24 January 2016

Value (£):

116,361

EPSRC Research Topic Classifications:

Information & Knowledge Mgmt

Statistics & Appl. Probability

EPSRC Industrial Sector Classifications:

Creative Industries

Related Grants:

Panel History:

Summary on Grant Application Form

The PDRA will liaise with the developers at Sentric Music to ensure a broad array of diverse data sources is linked and

preprocessed in a statistically sound manner, and that the final version of the data is in a format conducive to

machine learning and statistical inference (e.g., unstructured data will need to be pre-parsed into structured data). The

PDRA will need to use a broad suite of "data science" skills to achieve this, including computing skills as well as statistical

expertise.

The second objective will involve representing the problem from a statistical viewpoint, as a problem of predicting the future

value of a quantity of interest (in this case earnings), on the basis of attributes about the artist and/or their songs, such as

past earnings, genre, fan-base, etc. To choose an appropriate model, two types of considerations come into play: the

format of the data, as well as our expectations about the types of relationships we are trying to capture. We discuss both in

turn.

With regard to data format, this particular application is likely to give rise to a large number of attributes, of various types

(e.g., each song, or artist, will be represented in numeric ways, placed into categories, or rated according to possibly

different scales, etc.). Automatic feature selection techniques will be required to ensure that information-poor attributes are

excluded from consideration to avoid contaminating the results. Moreover, there is a natural hierarchical structure to this

problem, introduced by the relationship between an artist and their songs. Both these aspects challenge off-the-shelf

statistical models, and require a bespoke model.
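
As a rough illustration of what excluding information-poor attributes can look like, the sketch below applies a simple correlation filter to toy data. This is a stand-in technique chosen for brevity, not the bespoke model the project proposes: it only handles numeric attributes and linear relationships, and it ignores the artist/song hierarchy. The attribute names, the toy values, and the 0.3 threshold are all assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(columns, target, threshold=0.3):
    """Keep the names of attributes whose |correlation| with the target
    is at least `threshold` (an arbitrary, illustrative cut-off)."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical toy data: three numeric attributes for five artists.
columns = {
    "past_earnings": [10.0, 20.0, 30.0, 40.0, 50.0],
    "fan_base": [1.0, 3.0, 2.0, 5.0, 4.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
}
future_earnings = [12.0, 19.0, 33.0, 41.0, 48.0]

selected = select_features(columns, future_earnings)
print(selected)
```

On this toy data the filter keeps the two attributes that track future earnings and drops the one that does not. Categorical or rated attributes would need a different relevance measure.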

With regard to the choice of model, it is known that typically in Big Data, as the dataset size increases, so does the

heterogeneity in the data, and failing to account for this can lead to over-confident and inaccurate predictions. One solution

is to employ a "divide and conquer" approach by using decision trees, which segment the initial dataset and fit a separate

statistical model in each segment. This approach achieves flexibility without compromising on computational efficiency.

Notably, the output of such models remains interpretable by the end user because it closely resembles the manual

segmentation already used extensively in marketing and, currently, by Sentric. The difference is that the segmentation

rules are extracted from the data in a principled, automatic fashion. Another consideration in choosing the model is its

ability to report the confidence of its own predictions. A model that cannot do so introduces risk, since only confident

predictions should be used for decision-making. Adopting a Bayesian framework is a natural way to achieve this objective.
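
The two ideas in this paragraph, segmenting the data and attaching a confidence to each prediction, can be sketched together in a few lines. The example below is a deliberately minimal stand-in, not the Bayesian Dynamic Trees framework: it fits a single-split tree on one numeric attribute and, in each segment, a conjugate Normal model for the mean with a known noise variance. The prior parameters, the noise variance, and the toy data are assumptions made for illustration.

```python
import math


def leaf_posterior(ys, prior_mean=0.0, prior_var=100.0, noise_var=1.0):
    """Normal-Normal conjugate update for a segment mean (noise variance
    assumed known). Returns the predictive mean and predictive variance
    for a new observation falling in this segment."""
    n = len(ys)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(ys) / noise_var)
    return post_mean, post_var + noise_var


def sse(ys):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def fit_stump(xs, ys):
    """Choose the threshold on x that minimises the total within-segment
    SSE, then fit a separate Bayesian model in each segment."""
    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("need at least two distinct x values to split")
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    _, t, left, right = best
    return {
        "threshold": t,
        "left": leaf_posterior(left),
        "right": leaf_posterior(right),
    }


def predict(stump, x):
    """Return (predictive mean, predictive standard deviation) for x."""
    mean, var = stump["left"] if x <= stump["threshold"] else stump["right"]
    return mean, math.sqrt(var)


# Hypothetical toy data with two clearly separated segments.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]

stump = fit_stump(xs, ys)
mean_lo, sd_lo = predict(stump, 2.0)
mean_hi, sd_hi = predict(stump, 11.0)
print(stump["threshold"], (mean_lo, sd_lo), (mean_hi, sd_hi))
```

The learned threshold plays the role of a segmentation rule extracted from the data, and the predictive standard deviation is what a decision-maker could use to set aside low-confidence forecasts. A segment with fewer observations would yield a wider predictive distribution.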

Our favoured approach overall is the framework of Bayesian Dynamic Trees, which combines flexibility, statistical

soundness, scalability using cutting-edge methods, as well as a built-in ability to adapt to data evolution at no extra

computational cost [Anagnostopoulos, 2013]. This framework will have to be extended for this problem to handle the

hierarchical relationship between artists and their songs, the diversity of available attributes, and the need to produce

forecasts over possibly longer time horizons.

Finally, the PDRA will supervise and contribute to the deployment of the model within Sentric, as well as the design of the

User Interface that will be made available to the artists. The former will involve scalability considerations, and the latter will

involve innovation in visualisation, and communication of uncertainty.

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description	This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website:

http://www.imperial.ac.uk