
I often see the claim that various data warehouse/analytical database systems derive significant performance benefits from compressing their data stores. On the face of it, though, this seems to be an absurd claim:

  • Fast datastore reads in a database come from indexing (in the "array index" sense, not necessarily the SQL INDEX sense): the ability to selectively read only the relevant data in a random-access manner.
  • Data compression does not work on individual data points; it is a bulk operation which is highly stateful and of varying efficiency depending on the data being compressed and the history of data that has previously been compressed.
  • Therefore, indexing into a compressed data store is impossible. Reads will necessarily require a decompression step, possibly over a lot of data you don't care about, in order to get at the data you want.
  • Therefore, data compression must significantly slow down a database.

This seems obvious, and yet data warehouses claim that compression gives them performance boosts. So... what am I missing?

(Note: Making the data store smaller will obviously result in lower storage costs. But that's not the claim I'm questioning here; data warehouses claim that compression helps them run queries faster and lower compute costs.)

  • Database storage engines don't handle individual data points, but larger pages. Those can be compressed. You'd have to load the entire page anyway when reading a data point from it. Compression potentially lets you store more data per page. Time-series databases might use delta encoding between records as a kind of compression. Decompression is also very fast compared to network and disk speeds, so it is often quite literally a free win. If you can read storage at 1Gbps, compression might allow a full table scan to effectively run at 3Gbps, making better use of your CPU resources. (A rough sketch of the page arithmetic follows these comments.) Commented Oct 23, 2024 at 19:05
  • @amon: that sounds like the start to a good answer. Commented Oct 23, 2024 at 19:12
  • Indexing compressed data is possible, depending on the algorithm. Commented Oct 23, 2024 at 19:15
  • I just googled for "OLAP compression", and could not find any claim that OLAP systems gain significant performance benefits from compression - those articles and research papers talk about "introducing compression without compromising performance". So please can you give us some references to "the claim you see so often"? Were those claims only made in some high-gloss marketing flyers, or can you point us to some serious source? Commented Oct 23, 2024 at 21:18
  • I recommend editing these references into the question. What I think is interesting: both of your examples are systems with column-oriented storage. Following this explanation I think it is easy to understand how compression can speed up queries over large numbers of rows, because fewer I/O operations are required. Commented Oct 24, 2024 at 5:32
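
To put rough numbers on the page argument in the first comment above, here is a minimal sketch; every figure in it (page size, row width, compression ratio) is invented for illustration and is not a measurement of any real engine:

    # Hypothetical figures: an 8 KiB page, 100-byte rows, a 3:1 compression ratio.
    PAGE_BYTES = 8 * 1024
    ROW_BYTES = 100
    COMPRESSION_RATIO = 3.0

    rows_per_page_plain = PAGE_BYTES // ROW_BYTES                             # 81
    rows_per_page_compressed = int(rows_per_page_plain * COMPRESSION_RATIO)   # 243

    rows_to_scan = 1_000_000
    pages_plain = -(-rows_to_scan // rows_per_page_plain)            # ceil -> 12,346 pages
    pages_compressed = -(-rows_to_scan // rows_per_page_compressed)  # ceil -> 4,116 pages

    print(pages_plain, pages_compressed)

Reading a single row still costs one page either way; the saving shows up when many rows are scanned.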

5 Answers

7

Therefore, data compression must significantly slow down a database.

Which concludes a masterstroke in logical fallacy!

One of the essential reasons why compression can make things faster is that the subsystem for durable storage is considerably slower and more serialised than the subsystem consisting of the central processing cores.

Durable storage is also usually not directly byte-addressable and random-access at that granularity, but consists of reading and writing whole blocks of sequential data to and from main memory.

Economising on the volume of raw data sent to and brought back from storage under these conditions can therefore improve performance by more than the cost of the modest amount of processing a compression algorithm entails.
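
As a back-of-envelope illustration of that trade, using deliberately invented figures rather than benchmarks of any real system: suppose a scan must produce 10 GB of uncompressed data, storage delivers 2 GB/s, the data compresses 4:1, and one core decompresses at 5 GB/s of output.

    # All figures are invented for illustration.
    uncompressed_gb = 10.0
    storage_gb_per_s = 2.0
    compression_ratio = 4.0        # compressed size = uncompressed / 4
    decompress_gb_per_s = 5.0      # output bytes produced per second on one core

    time_plain = uncompressed_gb / storage_gb_per_s                  # 5.00 s

    compressed_gb = uncompressed_gb / compression_ratio
    # Worst case: read the compressed data, then decompress it, with no overlap.
    time_serial = (compressed_gb / storage_gb_per_s
                   + uncompressed_gb / decompress_gb_per_s)          # 3.25 s
    # Best case: decompression overlaps the read, so the slower stage dominates.
    time_pipelined = max(compressed_gb / storage_gb_per_s,
                         uncompressed_gb / decompress_gb_per_s)      # 2.00 s

    print(time_plain, time_serial, time_pipelined)

Even in the unpipelined case, the I/O saving outweighs the extra CPU work in this toy model.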

It's also important to recognise that database engines are not optimised for fastest retrieval of a random single data point, with the entire machinery starting and finishing at idle, but for throughput of data in bulk under sustained load where access patterns are usually far from random.

It is possible for a particular machine or algorithm to be slower at the first whilst being considerably faster at the second because it better facilitates concurrent/overlapping processing.

  • The benefits of compression are pretty clear for platter drives but I'm less clear on benefits for systems using SSDs and it gets even murkier for me when talking about non-volatile memory designs. Commented Oct 23, 2024 at 20:16
  • @JimmyJamessupportsCanada NAND Flash chip geometry dictates the block readout. youtu.be/LsQZnTtAvcI?t=1680 Commented Mar 23 at 12:30
2

The benefit will depend on the type of access you are doing. If you only care about random access performance, compression will probably not help much. But other types of access patterns may see a much greater benefit.

You would typically compress data into blocks that can be individually decompressed, and add these blocks to an index so you can quickly find the block containing the required piece of data. Naturally there is a tradeoff here: larger blocks compress more efficiently, but more unneeded data has to be decompressed per read. So the claim that compressed data cannot be indexed at all is clearly false. Just look at video compression, where seeking is more or less instant.
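
A minimal sketch of that layout in Python, with zlib standing in for whatever block codec a real engine would use (the block size and helper names are invented for illustration): rows are grouped into fixed-size blocks, each block is compressed independently, and a small index records the first row number of each block, so a point read only decompresses one block.

    import zlib

    BLOCK_ROWS = 1000  # rows per compressed block; a tunable tradeoff

    def build_store(rows):
        """Compress rows in blocks; index[i] is the first row number in blocks[i]."""
        blocks, index = [], []
        for start in range(0, len(rows), BLOCK_ROWS):
            chunk = rows[start:start + BLOCK_ROWS]
            blocks.append(zlib.compress("\n".join(chunk).encode()))
            index.append(start)
        return blocks, index

    def read_row(blocks, index, row_no):
        """Decompress only the block that holds row_no."""
        block_id = max(i for i, first in enumerate(index) if first <= row_no)
        payload = zlib.decompress(blocks[block_id]).decode()
        return payload.split("\n")[row_no - index[block_id]]

    rows = [f"row-{i},value-{i % 7}" for i in range(10_000)]
    blocks, index = build_store(rows)
    print(read_row(blocks, index, 4321))   # -> "row-4321,value-2"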

But the memory hierarchy also matters. I/O costs tend to dominate, and reading sequential data tends to be much faster than random access, so the cost of reading more data than needed may not be that significant. Decompression can also be fairly cheap: LZ4 claims 4970 MB/s per core for decompression, and simple delta compression may be even cheaper.

In some cases, compression may allow the data to be stored on smaller but faster SSDs, or even in memory, greatly reducing the IO latency and improving performance.

If there is a pattern to queries, like fetching multiple related rows at the same time, then compression may reduce the amount of I/O needed, and therefore help with overall performance. That is of course assuming that the data can be compressed in the first place. I do not think anyone is claiming that compression is suitable for all kinds of data and use cases.

1

Data warehouse/analytical database systems are optimizing for aggregate reporting queries that touch many rows to generate their result set. Things like 'total sales for the xth quarter' or 'average time spent on the site by people who bought a baz widget'. These queries tend to have relatively large amounts of data per index key. As such you can benefit from compression without sacrificing indexing, by compressing per index key, or per group of index keys. Then the index selects which compressed blobs the query needs, and the query processes the entire blob.
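
A toy sketch of that idea, assuming a simple key-to-blob layout (zlib is used only as a stand-in codec, and all names here are invented): rows are grouped and compressed per index key, and a query decompresses only the blobs for the keys it touches.

    import zlib
    from collections import defaultdict

    def build_per_key_store(rows):
        """Group (key, value) rows by key and compress each group into one blob."""
        groups = defaultdict(list)
        for key, value in rows:
            groups[key].append(str(value))
        return {key: zlib.compress("\n".join(vals).encode())
                for key, vals in groups.items()}

    def total_for_key(store, key):
        """Decompress only the blob for one key and aggregate it."""
        values = zlib.decompress(store[key]).decode().split("\n")
        return sum(float(v) for v in values)

    # Toy data: (quarter, sale amount) pairs.
    sales = [("2024-Q1", 10.0), ("2024-Q2", 5.5), ("2024-Q1", 7.25), ("2024-Q2", 3.0)]
    store = build_per_key_store(sales)
    print(total_for_key(store, "2024-Q1"))   # 17.25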

1

Let me first address the assertion that "Reads will necessarily require a decompression step." Some common aggregate expressions can be evaluated without decompression.

Say we have 1,000 rows which include a dollar-value column. For the sake of illustration we'll say the first 600 rows are all $5 and the remaining 400 are $7.

Uncompressed, getting the total dollar value requires loading 1,000 values and performing 999 additions. Similarly, a count requires touching all 1,000 values. These are very common analytical expressions.

With run-length encoding the column can be stored as [(600, $5), (400, $7)]. The total value can be calculated as (600 * 5) + (400 * 7) = $5,800 and the count as 600 + 400 = 1,000. Far fewer values are loaded and far fewer calculations performed. Repeat this across the very many compressed blocks an OLAP system may hold and the performance improvement can be noticeable.
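
As a minimal sketch (a toy representation, not any particular engine's format), aggregating directly over the run-length-encoded runs looks like this:

    # Run-length-encoded column: (run_length, value) pairs.
    rle_column = [(600, 5.0), (400, 7.0)]

    # SUM and COUNT are computed from the runs themselves,
    # without materialising the 1,000 individual values.
    total = sum(length * value for length, value in rle_column)   # 5800.0
    count = sum(length for length, _ in rle_column)               # 1000

    print(total, count)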

Usually the system will capture summary information per block, per column, as it is compressed. Minimum, maximum, total, row count & null count are typical. Using this metadata, stored in the DBMS catalog, some results can be found without even reading table data.
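
A sketch of how such per-block summaries (often called zone maps) can answer or prune queries from the catalog alone; the field names below are invented for illustration:

    # Invented per-block summary records, as might be kept in a catalog.
    block_metadata = [
        {"block": 0, "min": 5.0, "max": 7.0,  "total": 5800.0,  "rows": 1000, "nulls": 0},
        {"block": 1, "min": 9.0, "max": 12.0, "total": 10500.0, "rows": 1000, "nulls": 0},
    ]

    # SUM over the whole column: answered from metadata alone.
    grand_total = sum(m["total"] for m in block_metadata)            # 16300.0

    # WHERE value > 8: block 0's maximum is 7.0, so it can be skipped entirely.
    blocks_to_scan = [m["block"] for m in block_metadata if m["max"] > 8]

    print(grand_total, blocks_to_scan)                               # 16300.0 [1]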

Second, the assertion that "Fast datastore reads in a database come from ... the ability to selectively read only the relevant data in a random-access manner" is not true. Snowflake holds data internally in a columnar format. It does not allow for the creation of secondary indexes at all. It is widely held to be a capable OLAP system.

Further, there is a class of DBMSs built around GPUs. They (generally) forgo any indexing, instead relying on the device's large bandwidth and parallel processing to treat all access as a table scan. These have not gained wide popularity, but they exist.

Thirdly, the assertion that "Indexing into a compressed data store is impossible" is false. SQL Server, for one, permits clustered columnstore indexes (CCIs). "Clustered" means they constitute the on-disk sequencing of the base table's rows. As columnstores they are compressed. Secondary B-tree indexes can be defined on a table stored as a CCI.

Rows in a compressed block are still in some sequence. Each set of column values in the block is an array. Any compression on that array retains the original sequencing. A secondary index can, at its leaf level, hold a pointer to the block and an offset within the block. Compression algorithm design makes it simple to return the ith value of each compressed column.
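
A sketch of that positional lookup over the toy run-length-encoded block from earlier: a secondary index leaf conceptually stores a (block, offset) pair, and the storage layer returns the value at that offset without expanding the runs.

    def rle_value_at(rle_runs, offset):
        """Return the value at a given row offset within an RLE-encoded block."""
        position = 0
        for length, value in rle_runs:
            if offset < position + length:
                return value
            position += length
        raise IndexError("offset beyond block")

    rle_block = [(600, 5.0), (400, 7.0)]
    # A secondary index leaf entry might say: the row lives in this block at offset 750.
    print(rle_value_at(rle_block, 750))   # 7.0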


Compressed data is typically 20% the size of uncompressed. This means loading the data takes one fifth as much time. The corollary is that five times as much data can be retained in RAM, ready to answer the next query, improving performance.

Having multiple column values adjacent in memory facilitates moving them en masse through the memory hierarchy and onto cores, where they are amenable to SIMD techniques. When a predicate eliminates rows, this can be recorded in a related bitmap. Each predicate can be evaluated in parallel on separate cores and the bitmaps ANDed together at the end.
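
A small sketch of that bitmap combination, with plain Python integers standing in for the hardware-friendly bit vectors a real engine would use: each predicate produces one bitmap over a block's rows, and the bitmaps are ANDed to find rows satisfying every predicate.

    # Two column vectors for the same 8 rows (toy data).
    price    = [5, 7, 7, 5, 9, 7, 5, 9]
    quantity = [1, 0, 3, 2, 4, 0, 1, 5]

    def predicate_bitmap(column, predicate):
        """One bit per row: bit i is set if predicate(column[i]) holds."""
        bits = 0
        for i, v in enumerate(column):
            if predicate(v):
                bits |= 1 << i
        return bits

    # Each predicate could be evaluated on a separate core; combining them is a single AND.
    mask = predicate_bitmap(price, lambda v: v == 7) & predicate_bitmap(quantity, lambda v: v > 0)
    matching_rows = [i for i in range(len(price)) if mask & (1 << i)]
    print(matching_rows)   # [2]  (only row 2 has price == 7 and quantity > 0)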

  • To be honest, this information doesn't strike me as particularly useful, because aggregates are typically computed in a grouped form. So unless compression blocks match up exactly with the GROUP BY key for a query, such block-level summations don't provide any useful information. Commented Mar 17 at 20:27
-2

A new Mac can write to its SSD at about 3 GB/sec. Swapping is made faster by compressing the data to be swapped, so less data is written and then less data is read back, plus there is the effect of having more free memory. Compression is done in hardware; I think the processor has a not very well documented "compress 1024 bytes" instruction.

Now, the cheapest and biggest SSDs are much slower than that, and there the effect of compression is even more powerful.

  • "Compression is done in hardware. I think the processor has a not very well documented 'compress 1024 byte' instruction." Do you have any source on that? That doesn't sound like something that current ISAs are capable of doing in a single instruction. Commented Oct 24, 2024 at 20:35
  • Hidden somewhere in a dark corner. Together with encryption/decryption instructions that are also faster than SSD access. It's Apple's processor; they can add any instructions. Commented Oct 24, 2024 at 20:59
  • Throwing in a single number is quite useless when there is no other number to compare it with. "3GB/sec" - is that fast, is that slow compared to the throughput of decompressing from memory? I have no idea. Commented Oct 25, 2024 at 4:26
