How to remove specific records based on column pattern

Question

I have a table like this:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	5,37,12	2021-09-06 09:25:12
repo	null,null,4,8	2021-09-07 09:12:23
repo	4,8,null,null	2021-09-07 10:49:23
repo	null,null,4,8	2021-09-08 17:49:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

Sometimes the value column has the following pattern: null, null, x, y, where x and y are any natural numbers.

How can I delete the records that repeat this pattern immediately after its first occurrence, that is, x, y, null, null followed again by null, null, x, y?

I mean the expected output should be:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	5,37,12	2021-09-06 09:25:12
repo	null,null,4,8	2021-09-07 09:12:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

When I use the code from one of the answers:

import numpy as np
df['value'] = df['value'].apply(lambda x : ','.join(np.sort(x.split(','))))
df.drop_duplicates(['value'], keep='first')

I get:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	12,37,5	2021-09-06 09:25:12
repo	4,8,null,null	2021-09-07 09:12:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

Some of the values in the 'value' column change their order: 5,37,12 becomes 12,37,5, and null,null,4,8 becomes 4,8,null,null.

Do you have an idea how to fix it?

Are the rows removed because null,null,4,8, 4,8,null,null and null,null,4,8 contain the same numbers (here 4,8)? And for null,null,4,8, 1,0,null,null and 4,8,null,null, should only the last 4,8,null,null be removed? – jezrael, Commented Oct 4, 2021 at 10:37
@jezrael Yes, exactly. They should only be removed if they have the same x, y numbers. – sdom, Commented Oct 4, 2021 at 10:41

Mahdi F. · Accepted Answer · 2021-10-04 11:47:24Z

The elements of the value column are strings, so you can .split() each one on commas, sort the parts with np.sort, join them back into a string, and then call drop_duplicates() on the result, as shown below.
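To illustrate the split, sort and join step on individual strings, here is a minimal sketch using only the standard library. The function name `order_insensitive_key` is hypothetical and used only for illustration. Note that the parts are compared as strings, so the order is lexicographic rather than numeric.

```python
def order_insensitive_key(value: str) -> str:
    """Return the comma-separated parts of `value` in sorted (string) order."""
    return ",".join(sorted(value.split(",")))


# Both orderings of the null/x,y pattern map to the same key.
print(order_insensitive_key("null,null,4,8"))  # 4,8,null,null
print(order_insensitive_key("4,8,null,null"))  # 4,8,null,null

# String comparison, not numeric: "12" < "37" < "5".
print(order_insensitive_key("5,37,12"))  # 12,37,5
```

This is also why overwriting the original column with the key reorders values such as 5,37,12: the key is useful for comparison, but it is not the original value.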

Try this:

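import numpy as np

# Build the sorted key in a helper column so the original 'value' strings stay untouched.
df['value2'] = df['value'].apply(lambda x : ','.join(np.sort(x.split(','))))
# Keep the first row for each key, then drop the helper column.
df = df.drop_duplicates(['value2'], keep='first').drop(columns='value2')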
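Note that drop_duplicates on a sorted key removes every later row with the same key anywhere in the table, including plain values such as 57 if they occur twice. If only null/x,y rows that directly follow a row with the same pair should be dropped, a stricter variant could look like the sketch below. It rests on assumptions not stated in the question: "immediately after" is read as "in the row directly above", the order of x and y within the pair is ignored, and the function name `drop_repeated_null_pairs` is hypothetical.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "event": ["seed", "ghy", "repo", "repo", "harv", "weig",
              "repo", "repo", "repo", "repo", "repo"],
    "value": ["57", "869", "5324", "null", "12", "5,37,12",
              "null,null,4,8", "4,8,null,null", "null,null,4,8",
              "4,8,1,3", "1356"],
    "time": ["2021-08-01 09:49:23", "2021-08-02 09:50:12",
             "2021-09-03 10:49:23", "2021-09-03 11:49:23",
             "2021-09-05 09:43:23", "2021-09-06 09:25:12",
             "2021-09-07 09:12:23", "2021-09-07 10:49:23",
             "2021-09-08 17:49:23", "2021-09-09 12:12:23",
             "2021-09-10 12:49:23"],
})

# Matches exactly "null,null,x,y" or "x,y,null,null" with numeric x and y.
PATTERN = r"^(?:null,null,\d+,\d+|\d+,\d+,null,null)$"


def drop_repeated_null_pairs(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop pattern rows whose x,y pair equals that of the row directly above."""
    is_pattern = frame["value"].str.match(PATTERN)
    # Order-insensitive key, used only for comparison; 'value' is not modified.
    key = frame["value"].apply(lambda v: ",".join(sorted(v.split(","))))
    repeats_previous = (
        is_pattern
        & is_pattern.shift(fill_value=False)  # the row above is also a pattern row
        & (key == key.shift())                # and it carries the same pair
    )
    return frame[~repeats_previous]


result = drop_repeated_null_pairs(df)
print(result)
```

On the sample data this keeps the first null,null,4,8 row and drops the two rows that follow it, leaving the other values, including 5,37,12, unchanged.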
