Two chief reasons for subsampling or using a different event space are (i) computational and storage tractability and (ii) denoising/signal extraction.
Sergei's response seems to focus on the first issue, so I'll focus more on the latter. The two objectives can diverge: options and OTC markets, for example, can exhibit order-to-trade ratios in excess of 10,000:1, so sampling in trade space is highly efficient at reducing storage requirements but inappropriate for modeling.
Typical methods used by trading practitioners include the following (a few are sketched in code after the list):
- Trade (tick) space. Take a snapshot of the book when a trade occurs.
- Top of book change space. Take a snapshot of the book when a change to the top of the book occurs.
- Uptick/downtick space. Take a snapshot of the book when the price upticks or downticks.
- % ADV buckets. Predict the volume and then snap the book at a uniform interval in % daily volume.
- Level formation or depletion space. Take a snapshot of the book every time a new price level forms inside the market or a level is depleted.
- Time space. Take a snapshot of the book at a fixed interval, e.g. every second or minute.
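To make the list concrete, here is a minimal sketch of a few of these spaces, assuming an event stream of dicts shaped like `{"ts": ..., "type": "trade" or "book", "bid": ..., "ask": ...}`; the field names and single-level book are illustrative assumptions, not a real feed schema:

```python
def sample(events, space, dt=1.0):
    """Yield one snapshot trigger per event of the chosen space."""
    last_top = None    # for top-of-book change space
    last_mid = None    # for uptick/downtick space
    next_clock = None  # for time space
    for ev in events:
        if space == "trade":
            # trade (tick) space: snapshot whenever a trade occurs
            if ev["type"] == "trade":
                yield ev
        elif space == "top":
            # top-of-book change space: snapshot on any change to best bid/ask
            top = (ev["bid"], ev["ask"])
            if last_top is not None and top != last_top:
                yield ev
            last_top = top
        elif space == "tick_move":
            # uptick/downtick space: snapshot when the mid moves
            mid = (ev["bid"] + ev["ask"]) / 2.0
            if last_mid is not None and mid != last_mid:
                yield ev
            last_mid = mid
        elif space == "time":
            # time space: snapshot at fixed dt-second boundaries; the first
            # event past each boundary stands in for the clock tick
            if next_clock is None:
                next_clock = ev["ts"] + dt
            while ev["ts"] >= next_clock:
                yield ev
                next_clock += dt
```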
Contrary to a common dismissal of them, there is merit in time- and volume-based spaces: large participants often use them, so events in time and volume space become informative:
- Execution brokers and large quant funds often use % ADV buckets for multi-horizon optimization (sketched after this list). This is probably in part due to computational tractability, since even the fastest execution optimizers can take several seconds to compute the entire trajectory of weights, but it nevertheless creates a self-fulfilling effect where % ADV becomes important for everyone else.
- There are also large teams at top market making firms that do use minute-frequency data, especially in cash equities, despite having access to order book data, because many datasets used in their models are only available historically in subsampled time space.
- This is trivially evident when you model order book activity around the market open or FOMC announcements. Large macro participants don't necessarily trade in response to the first trade or book event after 9:30 AM or 2:30 PM Eastern; they usually trade on the dot at 9:30 AM or 2:30 PM.
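Since % ADV buckets recur in both lists, here is a minimal sketch of that scheme, assuming trades arrive as dicts with a `qty` field; `predicted_adv` and the 1% bucket width are illustrative placeholders, not recommendations:

```python
def adv_bucket_sample(trades, predicted_adv, bucket_pct=0.01):
    """Yield a snapshot trigger each time cumulative volume crosses
    another bucket_pct * predicted_adv boundary."""
    bucket_size = predicted_adv * bucket_pct
    cum_vol = 0.0
    next_boundary = bucket_size
    for tr in trades:
        cum_vol += tr["qty"]
        while cum_vol >= next_boundary:  # one large trade can cross several buckets
            yield tr                     # in a real system, snap the book here
            next_boundary += bucket_size
```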
There are variations of these methods that can be parameterized (see the sketch after this list). For example:
- Is the residual reset? If you are using 500 volume space and an aggressor comes in for 600, do you take a snapshot again when a trade arrives for 400 (i.e. does the 100 residual carry over toward the next bucket)?
- Do you flush the event multiple times? If you are using 1-tick movement space and a buy comes in and clears 3 levels of offers, do you print the event three times back-to-back with the same timestamp?
- What is the reference instrument? If you are constructing a cross-sectional design matrix with two instruments A and B, do you print events when there are changes in A, in B, or in A+B?
- Do you retain any memory? For example, trades in tick space of a FIFO instrument tend to be highly autocorrelated in their signs. On the other hand, trades in tick space of a pro rata instrument display weaker (or even negative) sign autocorrelation, because participants may scratch out their overfill. So perhaps you could up/downregulate the number of events to print on trades with repeating signs.
- How much do you subsample? I think people often use the term subsampling when they actually mean the choice of space. For example, instead of printing on every trade, you can print on every 10th trade.
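To make the first two knobs concrete, here is a sketch of a volume-space sampler with the residual-reset and multi-flush choices as explicit parameters; the dict shape and `bucket=500` default mirror the example above and are otherwise assumptions:

```python
def volume_space(trades, bucket=500, reset_residual=False, flush_multiple=True):
    """Yield volume-space events, parameterized by reset and flush behavior."""
    acc = 0
    for tr in trades:
        acc += tr["qty"]
        if acc < bucket:
            continue
        n = acc // bucket
        # flush once, or once per full bucket crossed, back-to-back with the
        # same timestamp
        for _ in range(n if flush_multiple else 1):
            yield tr
        # resetting drops the overshoot entirely (600 -> 0, so a following 400
        # does not fire); carrying keeps it (600 leaves 100, so 400 fires)
        acc = 0 if reset_residual else acc - n * bucket
```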
The principled way of determining the optimal sampling frequency is to treat it as a classical bias-variance tradeoff; examples can be found in the signal processing and denoising literature. The practical way is to pick a few schemes you have strong priors on (i.e. ones you've used at a previous trading firm) and cross-validate your model on out-of-sample data.
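As a minimal sketch of that practical loop, with `build_features`, `fit`, and `score` passed in as your own (hypothetical) pipeline pieces and `sample` being any event-space sampler like the ones above:

```python
def pick_sampling(candidate_spaces, train_events, test_events,
                  build_features, fit, score):
    """Return the best (oos_score, space) pair, judged strictly out-of-sample.
    build_features/fit/score are supplied by the caller's own pipeline;
    train_events/test_events are assumed re-iterable (e.g. lists)."""
    best = None
    for space in candidate_spaces:  # e.g. ["trade", "top", "time"]
        X_tr, y_tr = build_features(sample(train_events, space))
        X_te, y_te = build_features(sample(test_events, space))
        model = fit(X_tr, y_tr)
        oos = score(model, X_te, y_te)  # never pick on in-sample fit
        if best is None or oos > best[0]:
            best = (oos, space)
    return best
```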