Two chief reasons for subsampling or using a different event space are (i) computational and storage tractability and (ii) denoising/signal extraction.
Sergei's response seems to focus on the first issue, so I'll focus more on the latter. The two objectives can diverge: options and OTC markets, for example, can exhibit order-to-trade ratios in excess of 10,000:1, so sampling in trade space is highly efficient at reducing storage requirements but inappropriate for modeling.
Typical methods used by trading practitioners include the following (a few are sketched in code after the list):
- Trade (tick) space. Take a snapshot of the book when a trade occurs.
- Top of book change space. Take a snapshot of the book when a change to the top of the book occurs.
- Uptick/downtick space. Take a snapshot of the book when the price upticks or downticks.
- % ADV buckets. Predict the volume and then snap the book at a uniform interval in % daily volume.
- Level formation or depletion space. Take a snapshot of the book every time a new price level forms inside the market or a level is depleted.
- Time space. Take a snapshot of the book at a fixed interval, e.g. every second or minute.
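To make the list concrete, here is a minimal sketch of a few of these spaces, assuming an event stream of dicts shaped like `{"ts": ..., "type": "trade" or "book", "bid": ..., "ask": ...}`; the field names and single-level book are illustrative assumptions, not a real feed schema:

```python
def sample(events, space, dt=1.0):
    """Yield one snapshot trigger per event of the chosen space."""
    last_top = None    # for top-of-book change space
    last_mid = None    # for uptick/downtick space
    next_clock = None  # for time space
    for ev in events:
        if space == "trade":
            # trade (tick) space: snapshot whenever a trade occurs
            if ev["type"] == "trade":
                yield ev
        elif space == "top":
            # top-of-book change space: snapshot on any change to best bid/ask
            top = (ev["bid"], ev["ask"])
            if last_top is not None and top != last_top:
                yield ev
            last_top = top
        elif space == "tick_move":
            # uptick/downtick space: snapshot when the mid moves
            mid = (ev["bid"] + ev["ask"]) / 2.0
            if last_mid is not None and mid != last_mid:
                yield ev
            last_mid = mid
        elif space == "time":
            # time space: snapshot at fixed dt-second boundaries; the first
            # event past each boundary stands in for the clock tick
            if next_clock is None:
                next_clock = ev["ts"] + dt
            while ev["ts"] >= next_clock:
                yield ev
                next_clock += dt
```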
Contrary to a common dismissal of them, there is merit in time- and volume-based spaces: large participants often use them, so events in time and volume space become informative:
- Execution brokers and large quant funds often use % ADV buckets for multi-horizon optimization (sketched after this list). This is probably in part due to computational tractability, since even the fastest execution optimizers can take several seconds to compute the entire trajectory of weights, but it nevertheless creates a self-fulfilling effect where % ADV becomes important for everyone else.
- There are also large teams at top market making firms that do use minute-frequency data, especially in cash equities, despite having access to order book data, because many datasets used in their models are only available historically in subsampled time space.
- This is trivially evident when you model order book activity around the market open or FOMC announcements. Large macro participants don't necessarily trade in response to the first trade or book event after 9:30 AM or 2:30 PM Eastern; they usually trade on the dot at 9:30 AM or 2:30 PM.
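Since % ADV buckets recur in both lists, here is a minimal sketch of that scheme, assuming trades arrive as dicts with a `qty` field; `predicted_adv` and the 1% bucket width are illustrative placeholders, not recommendations:

```python
def adv_bucket_sample(trades, predicted_adv, bucket_pct=0.01):
    """Yield a snapshot trigger each time cumulative volume crosses
    another bucket_pct * predicted_adv boundary."""
    bucket_size = predicted_adv * bucket_pct
    cum_vol = 0.0
    next_boundary = bucket_size
    for tr in trades:
        cum_vol += tr["qty"]
        while cum_vol >= next_boundary:  # one large trade can cross several buckets
            yield tr                     # in a real system, snap the book here
            next_boundary += bucket_size
```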
There are variations of these methods that can be parameterized (see the sketch after this list). For example:
- Is the residual reset? If you are using 500 volume space and an aggressor comes in for 600, do you take a snapshot again when a trade arrives for 400 (i.e. does the 100 residual carry over toward the next bucket)?
- Do you flush the event multiple times? If you are using 1-tick movement space and a buy comes in and clears 3 levels of offers, do you print the event three times back-to-back with the same timestamp?
- What is the reference instrument? If you are constructing a cross-sectional design matrix with two instruments A and B, do you print events when there are changes in A, in B, or in A+B?
- Do you retain any memory? For example, trades in tick space of a FIFO instrument tend to be highly autocorrelated in their signs. On the other hand, trades in tick space of a pro rata instrument display weaker (or even negative) sign autocorrelation, because participants may scratch out their overfill. So perhaps you could up/downregulate the number of events to print on trades with repeating signs.
- How much do you subsample? I think people often use the term subsampling when they actually mean the choice of space. For example, instead of printing on every trade, you can print on every 10th trade.
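To make the first two knobs concrete, here is a sketch of a volume-space sampler with the residual-reset and multi-flush choices as explicit parameters; the dict shape and `bucket=500` default mirror the example above and are otherwise assumptions:

```python
def volume_space(trades, bucket=500, reset_residual=False, flush_multiple=True):
    """Yield volume-space events, parameterized by reset and flush behavior."""
    acc = 0
    for tr in trades:
        acc += tr["qty"]
        if acc < bucket:
            continue
        n = acc // bucket
        # flush once, or once per full bucket crossed, back-to-back with the
        # same timestamp
        for _ in range(n if flush_multiple else 1):
            yield tr
        # resetting drops the overshoot entirely (600 -> 0, so a following 400
        # does not fire); carrying keeps it (600 leaves 100, so 400 fires)
        acc = 0 if reset_residual else acc - n * bucket
```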
The principled way of determining the optimal sampling frequency is to treat it as a classical bias-variance tradeoff; examples can be found in the signal processing and denoising literature. The practical way is to pick a few schemes you have strong priors on (i.e. ones you've used at a previous trading firm) and cross-validate your model on out-of-sample data.
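As a minimal sketch of that practical loop, with `build_features`, `fit`, and `score` passed in as your own (hypothetical) pipeline pieces and `sample` being any event-space sampler like the ones above:

```python
def pick_sampling(candidate_spaces, train_events, test_events,
                  build_features, fit, score):
    """Return the best (oos_score, space) pair, judged strictly out-of-sample.
    build_features/fit/score are supplied by the caller's own pipeline;
    train_events/test_events are assumed re-iterable (e.g. lists)."""
    best = None
    for space in candidate_spaces:  # e.g. ["trade", "top", "time"]
        X_tr, y_tr = build_features(sample(train_events, space))
        X_te, y_te = build_features(sample(test_events, space))
        model = fit(X_tr, y_tr)
        oos = score(model, X_te, y_te)  # never pick on in-sample fit
        if best is None or oos > best[0]:
            best = (oos, space)
    return best
```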