
I want to try to understand the performance of the OPTIMIZE query in ClickHouse.

I am planning on using it to remove duplicates right after a bulk insert into a MergeTree table, so I have two options:

OPTIMIZE TABLE db.table DEDUPLICATE

or

OPTIMIZE TABLE db.table FINAL DEDUPLICATE

I understand that the first statement only deduplicates the newly inserted data if it hasn't already been merged, whereas the second deduplicates the whole table. However, I am concerned about performance. From a rough analysis of OPTIMIZE TABLE db.table FINAL DEDUPLICATE on tables of different sizes, I can see it getting dramatically worse as the table grows (0.1s for 0.1M rows, 1s for 0.3M rows, 12s for 10M rows). I am assuming, however, that OPTIMIZE TABLE db.table DEDUPLICATE scales with the insert size rather than the whole table, so it should be more performant?

Can anyone point me to some documentation or literature on the performance of these operations?

In addition, do these problems go away if I replace the table with a ReplacingMergeTree? I imagine the same merge process happens under the hood, so it may not matter either way.

1 Answer


Are you sure that:

  • the ingestion pipeline couldn't be changed to avoid or reduce duplicates?
  • the duplicates are actually critical? Do they affect metric calculations or consume significantly more disk storage? (A quick check is sketched below.)
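
For instance, one rough way to gauge how bad the duplication actually is (assuming a hypothetical column id that identifies a logical row):

SELECT id, count() AS copies
FROM db.table
GROUP BY id
HAVING copies > 1
ORDER BY copies DESC
LIMIT 20;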

Calling

OPTIMIZE TABLE db.table FINAL DEDUPLICATE 

on a regular basis is definitely a bad approach (it optimizes the whole table). Consider restricting the scope of affected rows (see the PARTITION clause) or columns (see the DEDUPLICATE BY COLUMNS clause).
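
A sketch of the narrower variants (the partition value and column names here are hypothetical and depend on your schema; the DEDUPLICATE BY form requires a reasonably recent ClickHouse version):

-- merge and deduplicate only one partition instead of the whole table
OPTIMIZE TABLE db.table PARTITION '2023-01' FINAL DEDUPLICATE;

-- deduplicate by a subset of columns rather than the full row
OPTIMIZE TABLE db.table DEDUPLICATE BY id, event_time;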

I would consider using the ReplacingMergeTree engine instead, which was designed to deduplicate rows during 'native' background merges (rather than the manual merges triggered by OPTIMIZE).
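
A minimal sketch of what that could look like, assuming hypothetical columns id (the logical key) and updated_at (the version column that decides which duplicate survives):

CREATE TABLE db.table_rmt
(
    id UInt64,
    updated_at DateTime,
    value String
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;

-- background merges deduplicate eventually; to force deduplicated reads use FINAL
SELECT * FROM db.table_rmt FINAL WHERE id = 42;

Note that deduplication only applies to rows with the same ORDER BY key within the same partition, and it happens at an unspecified time during background merges, so queries that must never see duplicates still need SELECT ... FINAL (which has its own cost) or an explicit aggregation.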

