
Details of Grant 

EPSRC Reference: EP/Y004167/1
Title: Navigating Chemical Space with Natural Language Processing and Deep Learning
Principal Investigator: Pang, Dr J
Other Investigators:
Vulic, Dr I
Researcher Co-Investigators:
Project Partners:
Department: Pharm., Chem. & Environmental Sci., FES
Organisation: University of Greenwich
Scheme: Discipline Hopping Awards
Starts: 01 February 2024
Ends: 31 January 2026
Value (£): 89,566
EPSRC Research Topic Classifications:
Artificial Intelligence
Chemical Biology
Computational Linguistics
EPSRC Industrial Sector Classifications:
Pharmaceuticals and Biotechnology
Related Grants:
Panel History:
Panel Date: 25 Sep 2023
Panel Name: EPSRC ICT Prioritisation Panel Sept 2023
Outcome: Announced
Summary on Grant Application Form
Natural language processing (NLP) lies at the intersection of linguistics and computer science and aims to process and analyse human language, typically provided as written text. NLP is now strongly focused on the use of machine learning for challenging tasks, and some revolutionary algorithms have been developed in the last few years. These algorithms now underpin a wide range of real-life applications, such as ChatGPT, virtual assistants and automatic text completion when we write emails.



Innovative research ideas often come from integrating techniques and concepts across disciplines. For this discipline-hopping grant, we would like to explore how Transformer models, which are built on a ground-breaking deep learning architecture developed by Google in 2017 and fuel the majority of current cutting-edge research in NLP, can be adapted to solve research challenges in chemistry.

Chemical structures are usually three-dimensional. However, they are also often converted into linear text sequences called SMILES (Simplified Molecular Input Line Entry System) strings. SMILES has a simple vocabulary of chemical element and bond symbols and a few grammatical rules governing how the elements are positioned. Owing to this direct analogy to text sequences, SMILES makes it possible to use NLP algorithms to analyse chemical structures in much the same way as they are used to analyse text.
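To make the analogy with text concrete, the following is a minimal sketch of how a SMILES string could be split into tokens and mapped to integer ids, the form in which a language model would receive it. The regular expression, the function names `tokenize_smiles` and `build_vocab`, and the coverage of the pattern (a common subset of SMILES symbols, not the full specification) are illustrative assumptions and are not taken from the grant itself.

```python
import re

# Hypothetical token pattern covering a common subset of SMILES symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"              # two-letter elements, tried before one-letter ones
    r"|[BCNOPSFI]"         # one-letter aliphatic atoms
    r"|[bcnops]"           # aromatic atoms
    r"|%\d{2}"             # two-digit ring-closure labels
    r"|\d"                 # one-digit ring-closure labels
    r"|[=#$:/\\\-+().~*]"  # bonds, branches, disconnection dot, wildcard
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, rejecting unrecognised characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Assign an integer id to each distinct token, in sorted token order."""
    tokens = sorted({t for s in smiles_list for t in tokenize_smiles(s)})
    return {token: index for index, token in enumerate(tokens)}


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
vocab = build_vocab([aspirin])
token_ids = [vocab[t] for t in tokenize_smiles(aspirin)]
```

Here the vocabulary plays the role of a word list and the ring-closure digits and parentheses play the role of grammar, which is why sequence models designed for text can be applied with little modification.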



For the proposed research, Dr Pang, a chemist, will work with Dr Vulic, an NLP and machine learning expert, to get up to speed with the latest developments in NLP and to examine their applicability to her domain of expertise.

We will explore and utilise a concept that is now pervasive in machine learning and NLP, termed transfer learning, which 1) pretrains large general-purpose models and 2) fine-tunes (i.e., specialises) those general models for specific tasks and applications where labelled data are expensive to create (as they require expert knowledge and complex annotation protocols) and thus inherently scarce.

Specifically, we will pretrain Transformer models to learn a latent representation of the chemical space defined by tens of millions of SMILES strings. During fine-tuning, this learned latent representation can then be used to predict molecular properties for a given chemical structure. The advantage of this type of approach is that the resulting machine learning models rely less on so-called labelled data (molecules with experimentally determined properties), which are time-consuming or even impossible to generate in chemistry given the associated cost and experimental challenges. We will aim to make the Transformer models more computationally efficient and accurate using two recent machine learning techniques, termed sentence encoding and contrastive learning. We hope that this new molecular representation can complement existing molecular representation methods and provide an alternative approach to relating molecular structures to their properties, a task that underpins many research and development activities in the chemical and pharmaceutical industries.
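As an illustration of the contrastive learning idea mentioned above, the sketch below computes an InfoNCE-style loss over pairs of embeddings. The summary does not specify which contrastive objective the project will use, so the choice of InfoNCE, the function names `cosine` and `info_nce`, the temperature value and the toy two-dimensional embeddings are all assumptions made for illustration. One plausible source of positive pairs, also assumed here, is two different valid SMILES strings for the same molecule (for example `OCC` and `CCO` both denote ethanol).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the matching view of anchors[i].

    Every other entry of `positives` acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings standing in for encoder outputs (hypothetical values).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b_aligned = [[0.9, 0.1], [0.1, 0.9]]
view_b_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(view_a, view_b_aligned)
loss_swapped = info_nce(view_a, view_b_swapped)
```

The loss is small when each embedding is closest to its own partner and large when partners are mismatched, which is the signal that would push an encoder to place different written forms of the same molecule close together without needing experimentally labelled data.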

Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.gre.ac.uk