Have a look at HdrHistogram:
https://hdrhistogram.github.io/HdrHistogram/.
There are implementations for all kinds of languages.
What it effectively gives you is a history of latency distributions. So you could keep a latency distribution per second, and if you run a benchmark for 60 seconds you end up with 60 latency distributions of 1 second each. With HdrHistogram you can calculate the percentiles and other statistics per time window, and then you can add logic to detect anomalies in whatever statistic you care about, e.g. you can easily detect whether there are 10 consecutive windows with a too-high p99. You can also aggregate the histograms into latency distributions per minute/hour/day/week etc., so you don't need to deal with large quantities of tiny histograms.
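As a rough sketch (in Java, assuming latencies are recorded in microseconds; the 50 ms p99 threshold and the 10-window rule are made-up numbers for illustration), per-second windows can be captured with HdrHistogram's Recorder and checked as they come in:

```java
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

public class WindowedLatency {
    public static void main(String[] args) throws InterruptedException {
        // Track values up to 1 hour in microseconds, 3 significant digits.
        Recorder recorder = new Recorder(3_600_000_000L, 3);

        // ... the load generator's worker threads would call
        // recorder.recordValue(latencyMicros) while this loop runs ...

        int consecutiveBadWindows = 0;
        for (int second = 0; second < 60; second++) {
            Thread.sleep(1000);
            // Swap out and fetch the histogram covering the last 1-second window.
            Histogram window = recorder.getIntervalHistogram();
            long p99Micros = window.getValueAtPercentile(99.0);
            // Hypothetical anomaly rule: 10 consecutive windows with p99 > 50 ms.
            consecutiveBadWindows = (p99Micros > 50_000) ? consecutiveBadWindows + 1 : 0;
            if (consecutiveBadWindows >= 10) {
                System.out.println("p99 too high for 10 consecutive seconds at t=" + second + "s");
            }
        }
    }
}
```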
The nice thing is that you can merge these windows into a final latency distribution and determine e.g. your percentiles, or drop e.g. the warmup and cooldown. But you can also zoom into a particular region: if, say, a compaction is causing problems at a particular moment, you can zoom into exactly that section.
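A minimal sketch of that aggregation, assuming you kept the 60 per-second histograms in a list; the warmup/cooldown bounds and the compaction window below are made-up numbers:

```java
import java.util.List;
import org.HdrHistogram.Histogram;

public class Aggregate {
    // Merge the per-second histograms in [fromSecond, toSecond) into one distribution.
    static Histogram merge(List<Histogram> perSecond, int fromSecond, int toSecond) {
        Histogram total = new Histogram(3_600_000_000L, 3);
        for (int i = fromSecond; i < toSecond; i++) {
            total.add(perSecond.get(i)); // histogram counts merge exactly
        }
        return total;
    }

    // e.g. drop a 5-second warmup and cooldown:  merge(perSecond, 5, 55)
    // or zoom into a suspected compaction at t=30..35s: merge(perSecond, 30, 35)
}
```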
There are some other nice properties: because you have the full latency distribution, you can merge the latency distributions of multiple load generators. The common approach I see is that engineers average the percentiles of the two load generators, but that is mathematically incorrect.
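For example, merging the two histograms and then reading the percentile from the combined distribution gives the true overall p99, which in general is not the average of the two individual p99 values:

```java
import org.HdrHistogram.Histogram;

public class MergeLoadGenerators {
    static long combinedP99(Histogram fromGenerator1, Histogram fromGenerator2) {
        Histogram combined = new Histogram(3_600_000_000L, 3);
        combined.add(fromGenerator1);
        combined.add(fromGenerator2);
        // The correct combined percentile; (p99_a + p99_b) / 2 is not this value in general.
        return combined.getValueAtPercentile(99.0);
    }
}
```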
If you are doing a latency test, make sure you also deal correctly with coordinated omission. If you don't, the worst latencies in your benchmark are omitted and you will falsely assume your system is behaving better than it actually is. You can find some presentations by Gil Tene (the author of HdrHistogram) on this topic on YouTube.
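HdrHistogram has a built-in way to compensate while recording; a minimal sketch, assuming a hypothetical target rate of 1000 requests/second (i.e. one request every 1000 microseconds):

```java
import org.HdrHistogram.Histogram;

public class CoordinatedOmission {
    public static void main(String[] args) {
        long expectedIntervalMicros = 1_000; // hypothetical: 1000 requests/second
        Histogram histogram = new Histogram(3_600_000_000L, 3);

        long latencyMicros = 250_000; // one 250 ms stall observed by the load generator
        // Records the stall itself plus synthetic samples for the requests that
        // would have been issued (and delayed) during it, at the expected interval.
        histogram.recordValueWithExpectedInterval(latencyMicros, expectedIntervalMicros);

        // Without correction this stall would count as a single slow sample;
        // with correction the ~250 delayed requests it implies are also recorded.
        System.out.println("total samples: " + histogram.getTotalCount());
    }
}
```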