EPSRC logo

Details of Grant 

EPSRC Reference: EP/W001012/1
Title: Transparent Compression for General-Purpose Programming Languages
Principal Investigator: Pirk, Dr H
Other Investigators:
Researcher Co-Investigators:
Project Partners:
Oracle Corporation
Department: Computing
Organisation: Imperial College London
Scheme: New Investigator Award
Starts: 01 September 2022 Ends: 31 August 2024 Value (£): 293,360
EPSRC Research Topic Classifications:
Fundamentals of Computing Information & Knowledge Mgmt
EPSRC Industrial Sector Classifications:
No relevance to Underpinning Sectors
Related Grants:
Panel History:
Panel DatePanel NameOutcome
21 Jun 2021 EPSRC ICT Prioritisation Panel 22-23 June 2021 Announced
Summary on Grant Application Form
Lossless compression is a key optimisation technique when storing, transmitting or processing datasets of any size and kind. Compression allows managing datasets larger than primary storage medium and reduces the bandwidth need for data access in applications ranging from physics simulations to machine learning models to sensor data management. Some compression schemes allow computation to be performed directly on compressed data, often reducing computational complexity compared to uncompressed data - such schemes are commonly called "lightweight". For example, summing values in a Run-Length-Encoded (RLE) array requires effort in the order of the size of the compressed input, i.e., usually significantly less than the uncompressed input. Similarly, relational selections and joins can be prefiltered on the dictionary of a dictionary-compressed relation or the minimum-value in a frame-of-reference encoded column.

Implementing algorithms to work directly on compressed data is challenging: the computation has to be tightly integrated with the (de-)compression code and inefficiencies in the implementation can easily outweigh benefits in data transfer or processing performance. Consequently, most lightweight compression schemes are bespoke, i.e., developed and tuned for specific, well-understood domains such as relational databases, image processing or linear algebra. While the state-of-the-art is to implement them manually, virtually all lightweight compression schemes can be expressed as sequences of primitive transformations such as Run-Length-Encoding (RLE), dictionary compression or Huffman-coding. Examples of such "compression pipelines" are PFOR-delta (in the Vector DBMS), Vertipaq (in Microsoft SQL Server) or the Compressed-Sparse-Row matrix representation (in many linear algebra packages).

However, there are three fundamental problems with the state of the art: first, there is no principled way to assemble these pipelines. Second, these schemes are tied to a specific data/processing model (relational algebra, linear algebra, etc.). Third, and most importantly, the implementation effort is high as every application needs to implement compression from scratch. Unsurprisingly many applications that could benefit from compression shy away from that implementation effort: in particular for domain-scientists writing code in languages like Python or R, low implementation effort takes precedence over efficiency.

Our vision is to make the benefits of performance-oriented compression available to applications beyond the mentioned few. For that purpose, we will develop an algebraic framework for the representation and optimisation of bespoke compression schemes in general-purpose programming languages. Instead of "weaving" hundreds of lines of compression-related code into an application's logic, developers will express compression schemes as annotations on collections. The backend transparently transforms code that operates on the vector to take advantage of the compression strategy. This allows even non-experts to implement bespoke compression schemes. Simplifying the interface even further, we will implement a fully automated approach that determines the most appropriate compression scheme for a program, dataset and hardware platform using cost-based optimisation rather than requiring to have it explicitly specified.

Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.imperial.ac.uk