
Say we have a DataFrame set up as follows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['ID'] = [432, 601, 601, 383, 887, 887, 944, 68, 195, 724, 408, 351]
df['Details'] = [362, 85, 338, 332, 712, 932, 797, 365, 837, 66, 721, 695]
df['Tests'] = [769, np.nan, np.nan, np.nan, 988, 496, 7, 408, np.nan, 417, 287, 723]
df['Size'] = [877, np.nan, np.nan, np.nan, 550, 967, 646, 654, 76, 185, np.nan, 635]
df['GroupID'] = 0
unique_ids = df.drop_duplicates(['ID']).index
df.loc[unique_ids, 'GroupID'] = 1
df['GroupID'] = df['GroupID'].cumsum()
```

resultant df:

```
     ID  Details  Tests   Size  GroupID
0   432      362  769.0  877.0        1
1   601       85    NaN    NaN        2
2   601      338    NaN    NaN        2
3   383      332    NaN    NaN        3
4   887      712  988.0  550.0        4
5   887      932  496.0  967.0        4
6   944      797    7.0  646.0        5
7    68      365  408.0  654.0        6
8   195      837    NaN   76.0        7
9   724       66  417.0  185.0        8
10  408      721  287.0    NaN        9
11  351      695  723.0  635.0       10
```

How can I find the groups where `['Tests', 'Size']` are NaN for all members of the group (i.e. all rows with the same GroupID)? For this example the answer should be GroupID = (2, 3), or ID = 601, 383.

My data is mainly of dtype object - so mostly strings (so Tests and Size would be strings).
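Since the real columns are object dtype, string placeholders such as `'NaN'` or empty strings won't register as missing for `isnull()`. One option (a sketch with hypothetical string values, not the questioner's actual data) is to coerce the columns to numeric first so that missing entries become real NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical object-dtype columns where "missing" is encoded as strings
df = pd.DataFrame({'Tests': ['769', 'NaN', ''], 'Size': ['877', None, 'NaN']})

# Coerce to numeric; unparseable values become real NaN
for col in ['Tests', 'Size']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df.isnull().sum().tolist())  # [2, 2]
```

After this conversion, `isnull()`, `count()`, and `dropna()` all treat the placeholders as missing.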

  • I am thinking of something along the lines of: `all_null = dflow[['Tests','Size']].isnull().all(axis=1)` then `dflow['all_null'] = all_null` Commented Jan 19, 2018 at 16:31
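The per-row mask in the comment above can be reduced per group with a groupby: a group qualifies only if *every* one of its rows is all-NaN. A minimal sketch on a subset of the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [432, 601, 601, 383],
    'Tests': [769, np.nan, np.nan, np.nan],
    'Size': [877, np.nan, np.nan, np.nan],
    'GroupID': [1, 2, 2, 3],
})

# Per-row flag: True where both columns are NaN on that row
all_null = df[['Tests', 'Size']].isnull().all(axis=1)

# Reduce per group: True only if every row in the group is all-NaN
group_all_null = all_null.groupby(df['GroupID']).all()
bad_groups = group_all_null[group_all_null].index.tolist()
print(bad_groups)  # [2, 3]
```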

2 Answers


Another way:

```python
df_out = df[df.groupby('GroupID')[['Tests','Size']].transform('count').sum(axis=1).eq(0)]
```

And use the same logic as below to get the GroupID or ID values.

Note: `count` does not include NaN values, so we sum the per-group counts across both columns and check for zero to see whether every value in that group is NaN.
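On a small subset of the example data, the count-based mask works out as follows (`transform('count')` broadcasts each group's non-NaN count back onto every row of that group):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [432, 601, 601, 383, 887],
    'Tests': [769, np.nan, np.nan, np.nan, 988],
    'Size': [877, np.nan, np.nan, np.nan, 550],
    'GroupID': [1, 2, 2, 3, 4],
})

# Row-wise sum of the broadcast counts is 0 only when the whole
# group is NaN in both columns
mask = df.groupby('GroupID')[['Tests', 'Size']].transform('count').sum(axis=1).eq(0)
df_out = df[mask]
print(df_out.GroupID.unique().tolist())  # [2, 3]
print(df_out.ID.unique().tolist())       # [601, 383]
```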


One way is to use:

```python
df_out = df.groupby('GroupID').filter(lambda x: x[['Tests','Size']].isnull().all().all())
```

```
    ID  Details  Tests  Size  GroupID
1  601       85    NaN   NaN        2
2  601      338    NaN   NaN        2
3  383      332    NaN   NaN        3
```

Then,

```python
df_out.ID.unique().tolist()
```

Output:

```
[601, 383]
```

OR

```python
df_out.GroupID.unique().tolist()
```

Output:

```
[2, 3]
```

3 Comments

That is a good way to do it, but the filter method is really slow for my dataset
@AH See the mod above, see if that is any better.
Much better, 9.54 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 100 loops each). Thanks!

You can use `dropna` with the `thresh` parameter, which specifies how many non-NaN values a row must have in order to be kept.

```python
df.GroupID[~df.GroupID.isin(df.dropna(thresh=df.shape[1]-1).GroupID)].unique()
```

```
array([2, 3], dtype=int64)
```
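To spell out the thresh logic on a subset of the example data: with 5 columns, `thresh=df.shape[1]-1` (i.e. 4) keeps rows with at most one missing value, so only rows where *both* Tests and Size are NaN get dropped; groups with no surviving row are the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [432, 601, 601, 383, 195],
    'Details': [362, 85, 338, 332, 837],
    'Tests': [769, np.nan, np.nan, np.nan, np.nan],
    'Size': [877, np.nan, np.nan, np.nan, 76],
    'GroupID': [1, 2, 2, 3, 7],
})

# Keep rows with at least (n_columns - 1) non-NaN values, then flag the
# groups that have no row left
kept_groups = df.dropna(thresh=df.shape[1] - 1).GroupID
result = df.GroupID[~df.GroupID.isin(kept_groups)].unique()
print(result.tolist())  # [2, 3]
```

Note that group 7 survives: its row has only one NaN (Tests), so it clears the threshold.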

1 Comment

@Nae added :-) ~
