I'm looking for a sanity check on my thinking around the memory layout of a key-value store.
We have a system that periodically polls and monitors host-level jobs. On every poll, a given job can emit multiple timeseries. A timeseries is a {timestamp -> double} mapping. Each timeseries is identified by a timeseries id (e.g. "cpu", "memory"), which can also be user-defined. These timeseries ids are not globally unique though - what's globally unique is the pair job_id + timeseries_id, which I'll call job_ts.
The cardinality of job ids is ~5M and the cardinality of timeseries ids is ~1B.
We want to store the datapoints of every job timeseries in a key-value store. I figured the key could be: hash(job_ts) + fixed_width_unix_timestamp. The value is the datapoint value (e.g. the cpu of the job at time T).
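Roughly what I have in mind for the key encoding, as a minimal Python sketch (hashlib.blake2b truncated to 16 bytes stands in for xxhash128 here since it's stdlib; the timestamp is packed big-endian so lexicographic key order matches time order):

```python
import hashlib
import struct

def encode_key(job_id: str, timeseries_id: str, unix_ts: int) -> bytes:
    """Key = 16-byte hash of job_ts + 8-byte big-endian unix timestamp.

    blake2b(digest_size=16) is just a stdlib stand-in for xxhash128.
    Big-endian packing keeps byte-wise key ordering equal to time
    ordering, so a prefix-bounded scan yields datapoints in time order.
    """
    job_ts = f"{job_id}\x00{timeseries_id}".encode()  # separator avoids ambiguity
    prefix = hashlib.blake2b(job_ts, digest_size=16).digest()
    return prefix + struct.pack(">Q", unix_ts)

# e.g. key for job "web-123", timeseries "cpu", at t=1700000000
key = encode_key("web-123", "cpu", 1_700_000_000)
assert len(key) == 24  # 16-byte hash + 8-byte timestamp
```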
This way I can easily write range queries (i.e. give me all datapoints from [T, T+24hours] for job=x and timeseries_id=cpu) by scanning the keys that share the hash prefix. The hash function I figured would work is xxhash128, which is 16 bytes per key; with the 8-byte timestamp and the 8-byte double value, that's ~32 bytes per entry.
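The range query would then just be a bounded scan over that prefix, something like the sketch below (`db.scan(start, end)` is a placeholder for whatever iterator the actual store provides, and end-bound inclusivity depends on that store's scan semantics):

```python
import struct

DAY = 24 * 3600

def range_bounds(prefix: bytes, t_start: int, t_end: int) -> tuple[bytes, bytes]:
    """Start/end keys covering one job_ts's datapoints in [t_start, t_end]."""
    return prefix + struct.pack(">Q", t_start), prefix + struct.pack(">Q", t_end)

# e.g. all "cpu" datapoints for one job over the last 24 hours;
# `prefix` is the 16-byte hash(job_ts) from the key encoding above,
# and `db.scan(...)` is hypothetical - substitute the real store's API.
# start, end = range_bounds(prefix, now - DAY, now)
# for key, value in db.scan(start, end):
#     ...
```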
Am I on the right track? What I want to do is optimize the memory footprint of the key-value store - what are some techniques/tradeoffs I should consider?