
I have a DataFrame which has some unique IDs in two of its columns, for example:

S.no.  Column1  Column2
1      00001x   00002x
2      00003j   00005k
3      00002x   00001x
4      00004d   00008e
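For reference, a minimal sketch of this example data as a pandas DataFrame (column names assumed from the table above):

import pandas as pd

# Reproduce the example data shown above.
df = pd.DataFrame({
    'S.no.': [1, 2, 3, 4],
    'Column1': ['00001x', '00003j', '00002x', '00004d'],
    'Column2': ['00002x', '00005k', '00001x', '00008e'],
})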

The values can be any strings. I want to compare the two columns in such a way that only one of rows S.no. 1 and 3 remains, as these IDs contain the same information; only their order is different.

Basically, if for one row the value in Column1 is X and in Column2 is Y, and for another row the value in Column1 is Y and in Column2 is X, then only one of the rows should remain.

Is that possible in Python?

  • Since you refer to the columns as containing unique IDs, you might want to consider using Pandas MultiIndex. You could then use the sorted tuples from @mozway's answer to index your data. Commented Sep 15, 2021 at 9:05
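A rough sketch of that MultiIndex suggestion (not from the answer below; the level names 'id_a' and 'id_b' are made up for illustration, and df is assumed to be the question's DataFrame):

import pandas as pd

# Key each row by its sorted ID pair so that swapped pairs share the same
# (hypothetical) MultiIndex entry.
keys = df.filter(like='Column').apply(lambda x: tuple(sorted(x)), axis=1)
indexed = df.set_index(pd.MultiIndex.from_tuples(keys.tolist(), names=['id_a', 'id_b']))
# Keeping only the first occurrence of each index key drops the swapped duplicates.
deduped = indexed[~indexed.index.duplicated()]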

1 Answer


You can convert your columns to a frozenset per row.

This gives a common, order-insensitive value per row on which to apply duplicated.

Finally, slice the rows using the previous output as a mask:

mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
df[~mask]
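To illustrate why this works (not part of the original answer): each row collapses to an order-insensitive frozenset, so rows 0 and 2 map to the same value and only the second occurrence is flagged.

# Rows 0 and 2 both become frozenset({'00001x', '00002x'}), so duplicated()
# marks only the second one.
pairs = df.filter(like='Column').apply(frozenset, axis=1)
print(pairs.duplicated().tolist())   # expected: [False, False, True, False]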

Previous answer, using set:

mask = df.filter(like='Column').apply(lambda x: tuple(set(x)), axis=1).duplicated()
df[~mask]

NB: using set or sorted requires converting to a tuple (lambda x: tuple(sorted(x))), as the duplicated function hashes the values, which is not possible with mutable objects.
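For completeness, a sketch of that sorted-tuple variant:

# Same idea with order-normalized, hashable tuples instead of frozensets.
mask = df.filter(like='Column').apply(lambda x: tuple(sorted(x)), axis=1).duplicated()
df[~mask]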

output:

   S.no. Column1 Column2
0      1  00001x  00002x
1      2  00003j  00005k
3      4  00004d  00008e

1 Comment

You can use apply(frozenset, axis=1) to shorten it ;)
