I have a DolphinDB table with an array vector column. I need to remove duplicate rows based on subset relationships within that column.

Sample Input:

| sym | prices |
|---|---|
| a | [3,4,5,6] |
| a | [3,4,5] |
| a | [2,4,5,6] |
| a | [5,6] |
| a | [7,9] |
| a | [7,9] |

Expected Output:

| sym | prices |
|---|---|
| a | [3,4,5,6] |
| a | [2,4,5,6] |
| a | [7,9] |

Deduplication Logic:

- Subset Removal: If a row's `prices` array is a subset of (i.e., fully contained in) another row's `prices` array, remove the subset row. In the example, `[3,4,5]` is a subset of `[3,4,5,6]`, so it is removed; similarly, `[5,6]` is a subset of `[3,4,5,6]` and is removed.
- Full Duplicate Removal: If multiple rows have identical `prices` arrays, keep only one.

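To pin down the intended rule, here is a minimal reference sketch in Python (not DolphinDB script) that reproduces the expected output from the sample input. It rests on a few assumptions not stated above: rows are only compared with other rows that have the same `sym`, each `prices` array is treated as a set (so element order and repeated values are ignored), and the first occurrence of an exact duplicate is the one kept. The function name `dedupe_by_subset` is made up for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, List[int]]


def dedupe_by_subset(rows: List[Row]) -> List[Row]:
    """Drop rows whose prices are a proper subset of another row's prices
    (same sym), and keep only the first of any exact duplicates.

    Assumption: prices are compared as sets, ignoring order and repeats.
    """
    as_sets = [(sym, frozenset(prices)) for sym, prices in rows]
    seen = set()  # (sym, price set) pairs already kept
    kept: List[Row] = []

    for (sym, prices), (_, price_set) in zip(rows, as_sets):
        key = (sym, price_set)
        if key in seen:
            continue  # exact duplicate of a row already kept
        # `<` on frozensets tests for a *proper* subset, so equal sets
        # are not treated as subsets of each other here.
        if any(sym == other_sym and price_set < other_set
               for other_sym, other_set in as_sets):
            continue  # fully contained in a larger row
        seen.add(key)
        kept.append((sym, prices))

    return kept


rows = [
    ("a", [3, 4, 5, 6]),
    ("a", [3, 4, 5]),
    ("a", [2, 4, 5, 6]),
    ("a", [5, 6]),
    ("a", [7, 9]),
    ("a", [7, 9]),
]

result = dedupe_by_subset(rows)
for sym, prices in result:
    print(sym, prices)
```

This compares every row against every other row, so it is quadratic in the number of rows per `sym`; it is meant only as a statement of the logic, not as an efficient solution.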
What I've Tried:
I considered using `group by` to remove exact duplicates, but that approach cannot detect subset relationships between rows.
Core Question:
How can I perform this subset-based deduplication in DolphinDB?