
I have created a table and tried inserting the same values multiple times to check for duplicates. I can see that the duplicates are inserted. Is there a way to avoid duplicates in a ClickHouse table?

```sql
CREATE TABLE sample.tmp_api_logs
(
    id UInt32,
    EventDate Date
) ENGINE = MergeTree(EventDate, id, (EventDate, id), 8192);

insert into sample.tmp_api_logs values (1,'2018-11-23'), (2,'2018-11-23');
insert into sample.tmp_api_logs values (1,'2018-11-23'), (2,'2018-11-23');

select * from sample.tmp_api_logs;
/*
┌─id─┬──EventDate─┐
│  1 │ 2018-11-23 │
│  2 │ 2018-11-23 │
└────┴────────────┘
┌─id─┬──EventDate─┐
│  1 │ 2018-11-23 │
│  2 │ 2018-11-23 │
└────┴────────────┘
*/
```
  • I just repeat what others wrote in their answers: deduplication is provided by any Replicated{_/Summing/..}MergeTree engine when inserting the same data block as before. The output format of system.table_engines was extended with extra columns, including supports_deduplication (github.com/ClickHouse/ClickHouse/pull/8830), which helps to survey all engines and their key abilities. Commented Feb 6, 2020 at 6:30
  • FYI: there is a PR (github.com/ClickHouse/ClickHouse/pull/8467) to support deduplication on non-replicated MergeTree tables. Hopefully it will be available soon. Commented Feb 11, 2020 at 0:04

2 Answers


Most likely ReplacingMergeTree is what you need, as long as duplicate records share the same sorting key. You can also try other MergeTree engines for different actions when a duplicate record is encountered. The FINAL keyword can be used in queries to ensure uniqueness.



From the doc, "The engine differs from MergeTree in that it removes duplicate entries with the same sorting key value (ORDER BY table section, not PRIMARY KEY)."
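To illustrate the answer above, a minimal sketch of the same table rebuilt with ReplacingMergeTree (the table name follows the question; the modern ORDER BY syntax is used instead of the deprecated engine-argument form):

```sql
-- ReplacingMergeTree collapses rows with the same sorting key
-- during background merges (eventually, not at insert time).
CREATE TABLE sample.tmp_api_logs
(
    id UInt32,
    EventDate Date
)
ENGINE = ReplacingMergeTree
ORDER BY (EventDate, id);

INSERT INTO sample.tmp_api_logs VALUES (1,'2018-11-23'), (2,'2018-11-23');
INSERT INTO sample.tmp_api_logs VALUES (1,'2018-11-23'), (2,'2018-11-23');

-- FINAL merges rows at query time, so duplicates are collapsed
-- even if the background merge has not yet run.
SELECT * FROM sample.tmp_api_logs FINAL;
```

Note that without FINAL (or an explicit OPTIMIZE TABLE ... FINAL), a plain SELECT may still return duplicates until the background merge happens.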

If raw data does not contain duplicates and they might appear only during retries of INSERT INTO, there's a deduplication feature in ReplicatedMergeTree. To make it work you should retry inserts of exactly the same batches of data (same set of rows in same order). You can use different replica for these retries and data block will still be inserted only once as block hashes are shared between replicas via ZooKeeper.

Otherwise, you should deduplicate data externally before inserts to ClickHouse or clean up duplicates asynchronously with ReplacingMergeTree or ReplicatedReplacingMergeTree.
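To illustrate the insert-retry deduplication described above, a sketch assuming a ReplicatedMergeTree table (the ZooKeeper path and replica macros are placeholders and depend on your cluster configuration):

```sql
CREATE TABLE sample.api_logs
(
    id UInt32,
    EventDate Date
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/api_logs', '{replica}')
ORDER BY (EventDate, id);

-- The first insert stores the data block and records its hash in ZooKeeper.
INSERT INTO sample.api_logs VALUES (1,'2018-11-23'), (2,'2018-11-23');

-- Retrying exactly the same batch (same rows, same order) is a no-op:
-- the block hash matches a recent one, so the rows are not inserted twice.
INSERT INTO sample.api_logs VALUES (1,'2018-11-23'), (2,'2018-11-23');
```

Inserting a different batch that merely overlaps with previous rows is not deduplicated; only byte-identical blocks are skipped.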

