I have a table like this:

event  value          time
seed   57             2021-08-01 09:49:23
ghy    869            2021-08-02 09:50:12
repo   5324           2021-09-03 10:49:23
repo   null           2021-09-03 11:49:23
harv   12             2021-09-05 09:43:23
weig   5,37,12        2021-09-06 09:25:12
repo   null,null,4,8  2021-09-07 09:12:23
repo   4,8,null,null  2021-09-07 10:49:23
repo   null,null,4,8  2021-09-08 17:49:23
repo   4,8,1,3        2021-09-09 12:12:23
repo   1356           2021-09-10 12:49:23

Sometimes the value column has the following pattern: null, null, x, y, where x and y are any natural numbers.

How can I delete every record of the form x, y, null, null or null, null, x, y that follows the first occurrence of such a pattern?

The expected output is:

event  value          time
seed   57             2021-08-01 09:49:23
ghy    869            2021-08-02 09:50:12
repo   5324           2021-09-03 10:49:23
repo   null           2021-09-03 11:49:23
harv   12             2021-09-05 09:43:23
weig   5,37,12        2021-09-06 09:25:12
repo   null,null,4,8  2021-09-07 09:12:23
repo   4,8,1,3        2021-09-09 12:12:23
repo   1356           2021-09-10 12:49:23

When I use the code from one of the answers:

import numpy as np

df['value'] = df['value'].apply(lambda x: ','.join(np.sort(x.split(','))))
df.drop_duplicates(['value'], keep='first')

I get:

event  value          time
seed   57             2021-08-01 09:49:23
ghy    869            2021-08-02 09:50:12
repo   5324           2021-09-03 10:49:23
repo   null           2021-09-03 11:49:23
harv   12             2021-09-05 09:43:23
weig   12,37,5        2021-09-06 09:25:12
repo   4,8,null,null  2021-09-07 09:12:23
repo   4,8,1,3        2021-09-09 12:12:23
repo   1356           2021-09-10 12:49:23

Some of the values in the 'value' column change their positions: 5,37,12 becomes 12,37,5, and null,null,4,8 becomes 4,8,null,null.

Do you have an idea how to fix it?

  • Are the rows removed because null,null,4,8, 4,8,null,null and null,null,4,8 share the same numbers, here 4,8? And for null,null,4,8, 1,0,null,null and 4,8,null,null, would only the last row, 4,8,null,null, be removed? Commented Oct 4, 2021 at 10:37
  • @jezrael yes, exactly like that, they should only be removed if they have the same x, y numbers Commented Oct 4, 2021 at 10:41
  • Is each value in the value column a list or a string? Commented Oct 4, 2021 at 10:42
  • value is a string Commented Oct 4, 2021 at 10:46

1 Answer

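Because each element in the value column is a string, you can .split() it on commas, sort the parts with np.sort, join them back into a string, and then use drop_duplicates() on the result, as below. Writing the sorted string to a separate column (value2) leaves the original value column unchanged.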

Try this:

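import numpy as np

# Build an order-independent key in a new column so 'value' keeps its original order
df['value2'] = df['value'].apply(lambda x: ','.join(np.sort(x.split(','))))
# drop_duplicates returns a new DataFrame; assign it to keep the result
df = df.drop_duplicates(['value2'], keep='first')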
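For reference, here is a minimal self-contained sketch of the same idea, run on the sample data from the question. It is only an illustration under a few assumptions: the helper name order_free_key is made up here, the built-in sorted stands in for np.sort, and the key is kept in a separate Series instead of a value2 column.

```python
import pandas as pd

# Sample data copied from the question; every value is a string.
df = pd.DataFrame({
    "event": ["seed", "ghy", "repo", "repo", "harv", "weig",
              "repo", "repo", "repo", "repo", "repo"],
    "value": ["57", "869", "5324", "null", "12", "5,37,12",
              "null,null,4,8", "4,8,null,null", "null,null,4,8",
              "4,8,1,3", "1356"],
    "time": ["2021-08-01 09:49:23", "2021-08-02 09:50:12",
             "2021-09-03 10:49:23", "2021-09-03 11:49:23",
             "2021-09-05 09:43:23", "2021-09-06 09:25:12",
             "2021-09-07 09:12:23", "2021-09-07 10:49:23",
             "2021-09-08 17:49:23", "2021-09-09 12:12:23",
             "2021-09-10 12:49:23"],
})


def order_free_key(value: str) -> str:
    """Return the comma-separated parts of `value` in sorted (string) order.

    Hypothetical helper: "null,null,4,8" and "4,8,null,null" get the same key.
    """
    return ",".join(sorted(value.split(",")))


# Comparison key held in its own Series, so df["value"] is never modified.
key = df["value"].map(order_free_key)

# Keep only the first row for each key; later rows with the same key are dropped.
result = df[~key.duplicated(keep="first")]

print(result)
```

Note that this compares only the value strings: any two rows whose values contain the same parts count as duplicates, whatever their event or position in the table. If that is too broad for your data, the key would need to be restricted to rows matching the null,null,x,y pattern.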

1 Comment

Ah! It was a good idea, now it works fine :) Thanks a lot
