Say I have a DataFrame called Data with shape (71067, 4):
       StartTime            EndDateTime          TradeDate   Values
    0  2018-12-31 23:00:00  2018-12-31 23:30:00  2019-01-01  -44.676
    1  2018-12-31 23:30:00  2019-01-01 00:00:00  2019-01-01  -36.113
    2  2019-01-01 00:00:00  2019-01-01 00:30:00  2019-01-01  -19.229
    3  2019-01-01 00:30:00  2019-01-01 01:00:00  2019-01-01  -23.606
    4  2019-01-01 01:00:00  2019-01-01 01:30:00  2019-01-01  -25.899
    ...  ...                ...                  ...         ...
       2023-01-30 20:30:00  2023-01-30 21:00:00  2023-01-30  -27.198
       2023-01-30 21:00:00  2023-01-30 21:30:00  2023-01-30  -13.221
       2023-01-30 21:30:00  2023-01-30 22:00:00  2023-01-30  -12.034
       2023-01-30 22:00:00  2023-01-30 22:30:00  2023-01-30  -16.464
       2023-01-30 22:30:00  2023-01-30 23:00:00  2023-01-30  -25.441

    71067 rows × 4 columns

When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:
    Data.isna().sum().sum()
    > 1391

Shown here:
    Data[Data['Values'].isna()].reset_index(drop=True).sort_values(by='StartTime')

    0     2019-01-01 03:30:00  2019-01-01 04:00:00  2019-01-01  NaN
    1     2019-01-04 02:30:00  2019-01-04 03:00:00  2019-01-04  NaN
    2     2019-01-04 03:00:00  2019-01-04 03:30:00  2019-01-04  NaN
    3     2019-01-04 03:30:00  2019-01-04 04:00:00  2019-01-04  NaN
    4     2019-01-04 04:00:00  2019-01-04 04:30:00  2019-01-04  NaN
    ...   ...                  ...                  ...         ...
    1386  2022-12-06 13:00:00  2022-12-06 13:30:00  2022-12-06  NaN
    1387  2022-12-06 13:30:00  2022-12-06 14:00:00  2022-12-06  NaN
    1388  2022-12-22 11:00:00  2022-12-22 11:30:00  2022-12-22  NaN
    1389  2023-01-25 11:00:00  2023-01-25 11:30:00  2023-01-25  NaN
    1390  2023-01-25 11:30:00  2023-01-25 12:00:00  2023-01-25  NaN

Is there any way of replacing each NaN value in the dataset with the mean value of the corresponding half hour, computed across the 70,000-plus rows? See below:
    Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
    Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)  # Only showing first 10 means

       HH        Values
    0  00:00:00   5.236811
    1  00:30:00   2.056571
    2  01:00:00   4.157455
    3  01:30:00   2.339253
    4  02:00:00   2.658238
    5  02:30:00   0.230557
    6  03:00:00   0.217599
    7  03:30:00  -0.630243
    8  04:00:00  -0.989919
    9  04:30:00  -0.494372

For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?
Any help greatly appreciated.
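
One way this kind of fill might be done, as a minimal sketch rather than a definitive answer: compute the per-half-hour mean with groupby(...).transform('mean'), which returns a Series aligned row-for-row with the original frame, and pass that to fillna. The small frame below is made up for illustration and only stands in for Data. The name hh_mean is hypothetical, and the sketch assumes StartTime is already (or can be parsed as) a datetime column.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for Data; the values are illustrative only.
Data = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 04:00:00", "2019-01-02 04:00:00", "2019-01-03 04:00:00",
        "2019-01-01 04:30:00", "2019-01-02 04:30:00",
    ]),
    "Values": [1.0, np.nan, 3.0, -2.0, np.nan],
})

# Half-hour-of-day key, as in the question.
Data["HH"] = Data["StartTime"].dt.time

# Mean of Values within each half-hour slot, broadcast back to every row.
# NaN entries are skipped when the mean is computed.
hh_mean = Data.groupby("HH")["Values"].transform("mean")

# Replace only the missing entries with the mean of their own slot.
Data["Values"] = Data["Values"].fillna(hh_mean)

print(Data)
```

Note that if every value in a given half-hour slot were NaN, the mean for that slot would itself be NaN and those rows would stay unfilled.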