The PDRA will liaise with the developers at Sentric Music to ensure a broad array of diverse data sources is linked and
preprocessed in a statistically sound manner, and ensuring the final version of the data are in a format conducive to
machine learning and statistical inference (e.g., unstructured data will need to be pre-parsed into structured data). The
PDRA will need to use a broad suite of "data science" skills to achieve this - including computing skills, as well as statistical
expertise.
The second objective will involve representing the problem from a statistical viewpoint, as a problem of predicting the future
value of a quantity of interest (in this case earnings), on the basis of attributes about the artist and/or their songs, such as
past earnings, genre, fan-base, etc. To choose an appropriate model, two types of considerations come into play: the
format of the data, as well as our expectations about the types of relationships we are trying to capture. We discuss both in
turn.
With regards to data format, this particular application is likely to give rise to a large number of attributes, of various types
(e.g., each song, or artist, will be represented in numeric ways, placed into categories, or rated according to possibly
different scales, etc.). Automatic feature selection techniques will be required to ensure that information-poor attributes are
excluded from consideration to avoid contaminating the results. Moreover, there is a natural hierarchical structure to this
problem, introduced by the relationship between an artist and their songs. Both these aspects challenge off-the-shelf
statistical models, and require a bespoke model.
With regards to the choice of model, it is known that typically in Big Data, as the data set size increases, so does the
heterogeneity in the data, and failing to account for this can lead to over-confident and inaccurate predictions. One solution
is to employ a "divide and conquer" approach by using decision trees, which segment the initial dataset and fit a separate
statistical model in each segment. This approach achieves flexibility without compromising on computational efficiency.
Notably, the output of such models remains interpretable by the end user because it closely resembles the manual
segmentation already used extensively in marketing and, currently, by Sentric. The difference is that the segmentation
rules are extracted from the data in a principled, automatic fashion. Another consideration in choosing the model is the
ability for it to output the confidence of its own predictions. Failure to do so can introduce risks since only confident
predictions should be used for decision-making. Adopting a Bayesian framework is a natural way to achieve this objective.
Our favored approach overall is the framework of Bayesian Dynamic Trees, which combines flexibility, statistical
soundness, scalability using cutting-edge methods, as well as a built-in ability to adapt to data evolution at no extra
computational cost [Anagnostopoulos, 2013]. This framework will have to be extended to handle this problem, to handle the
hierarchical relationship between artists and their songs; the diversity of available attributes; and the need to produce
forecasts over possibly longer-time horizons.
Finally, the PRDA will supervise and contribute to the deployment of the model within Sentric, as well as the design of the
User Interface that will be made available to the artists. The former will involve scalability considerations, and the latter will
involve innovation in visualisation, and communication of uncertainty.
|