Say we have a DataFrame set up as follows:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame()
    df['ID'] = [432, 601, 601, 383, 887, 887, 944, 68, 195, 724, 408, 351]
    df['Details'] = [362, 85, 338, 332, 712, 932, 797, 365, 837, 66, 721, 695]
    df['Tests'] = [769, np.nan, np.nan, np.nan, 988, 496, 7, 408, np.nan, 417, 287, 723]
    df['Size'] = [877, np.nan, np.nan, np.nan, 550, 967, 646, 654, 76, 185, np.nan, 635]

    # Mark the first occurrence of each ID with 1, then number the groups with a running sum.
    df['GroupID'] = 0
    unique_ids = df.drop_duplicates(['ID']).index
    df.loc[unique_ids, 'GroupID'] = 1
    df['GroupID'] = df['GroupID'].cumsum()

Resulting df:

         ID  Details  Tests   Size  GroupID
    0   432      362  769.0  877.0        1
    1   601       85    NaN    NaN        2
    2   601      338    NaN    NaN        2
    3   383      332    NaN    NaN        3
    4   887      712  988.0  550.0        4
    5   887      932  496.0  967.0        4
    6   944      797    7.0  646.0        5
    7    68      365  408.0  654.0        6
    8   195      837    NaN   76.0        7
    9   724       66  417.0  185.0        8
    10  408      721  287.0    NaN        9
    11  351      695  723.0  635.0       10

How can I find the groups where ['Tests', 'Size'] are NaN for all members of the group (i.e. all rows that share the same GroupID)? For this example the answer should be GroupID = (2, 3), or ID = 601, 383.
My real data is mainly of dtype object, so mostly strings (Tests and Size would be strings).

    all_null = dflow[['Tests','Size']].isnull().all(axis=1)
    dflow['all_null'] = all_null
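
The row-wise mask above only says whether a single row has both columns missing. One possible way to lift it to whole groups is to group that boolean mask by GroupID and require it to hold for every row, as in the sketch below. This is a minimal sketch, not a definitive solution: it runs on the example `df` rather than the real data, the names `row_all_null`, `group_all_null`, `null_groups` and `null_ids` are made up for illustration, and it assumes missing values are real NaN/None rather than strings such as `'nan'` (which `isnull()` would not detect in object columns).

```python
import numpy as np
import pandas as pd

# Rebuild the relevant columns of the example frame.
df = pd.DataFrame({
    'ID': [432, 601, 601, 383, 887, 887, 944, 68, 195, 724, 408, 351],
    'Tests': [769, np.nan, np.nan, np.nan, 988, 496, 7, 408, np.nan, 417, 287, 723],
    'Size': [877, np.nan, np.nan, np.nan, 550, 967, 646, 654, 76, 185, np.nan, 635],
})

# Same GroupID numbering as in the setup code, written compactly:
# a new group starts at the first occurrence of each ID.
df['GroupID'] = (~df['ID'].duplicated()).cumsum()

# Row-wise: True where both Tests and Size are missing.
row_all_null = df[['Tests', 'Size']].isnull().all(axis=1)

# Group-wise: True only if every row in the group is all-null.
group_all_null = row_all_null.groupby(df['GroupID']).all()

# Group labels and IDs for which the condition holds.
null_groups = group_all_null[group_all_null].index.tolist()
null_ids = df.loc[df['GroupID'].isin(null_groups), 'ID'].unique().tolist()

print(null_groups)
print(null_ids)
```

If a per-row flag is more convenient than a list of groups, `row_all_null.groupby(df['GroupID']).transform('all')` should broadcast the group result back onto every row.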