Replacing NaNs with Mean Value using Pandas

Question

Say I have a Dataframe called Data with shape (71067, 4):

           StartTime            EndDateTime          TradeDate   Values
    0      2018-12-31 23:00:00  2018-12-31 23:30:00  2019-01-01  -44.676
    1      2018-12-31 23:30:00  2019-01-01 00:00:00  2019-01-01  -36.113
    2      2019-01-01 00:00:00  2019-01-01 00:30:00  2019-01-01  -19.229
    3      2019-01-01 00:30:00  2019-01-01 01:00:00  2019-01-01  -23.606
    4      2019-01-01 01:00:00  2019-01-01 01:30:00  2019-01-01  -25.899
    ...    ...                  ...                  ...         ...
           2023-01-30 20:30:00  2023-01-30 21:00:00  2023-01-30  -27.198
           2023-01-30 21:00:00  2023-01-30 21:30:00  2023-01-30  -13.221
           2023-01-30 21:30:00  2023-01-30 22:00:00  2023-01-30  -12.034
           2023-01-30 22:00:00  2023-01-30 22:30:00  2023-01-30  -16.464
           2023-01-30 22:30:00  2023-01-30 23:00:00  2023-01-30  -25.441

    71067 rows × 4 columns

When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:

    Data.isna().sum().sum()
    > 1391

Shown here:

    Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')

    0     2019-01-01 03:30:00  2019-01-01 04:00:00  2019-01-01  NaN
    1     2019-01-04 02:30:00  2019-01-04 03:00:00  2019-01-04  NaN
    2     2019-01-04 03:00:00  2019-01-04 03:30:00  2019-01-04  NaN
    3     2019-01-04 03:30:00  2019-01-04 04:00:00  2019-01-04  NaN
    4     2019-01-04 04:00:00  2019-01-04 04:30:00  2019-01-04  NaN
    ...   ...                  ...                  ...         ...
    1386  2022-12-06 13:00:00  2022-12-06 13:30:00  2022-12-06  NaN
    1387  2022-12-06 13:30:00  2022-12-06 14:00:00  2022-12-06  NaN
    1388  2022-12-22 11:00:00  2022-12-22 11:30:00  2022-12-22  NaN
    1389  2023-01-25 11:00:00  2023-01-25 11:30:00  2023-01-25  NaN
    1390  2023-01-25 11:30:00  2023-01-25 12:00:00  2023-01-25  NaN

Is there any way of replacing each NaN value in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? See below:

    Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
    # Only showing first 10 means
    Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)

       HH        Values
    0  00:00:00   5.236811
    1  00:30:00   2.056571
    2  01:00:00   4.157455
    3  01:30:00   2.339253
    4  02:00:00   2.658238
    5  02:30:00   0.230557
    6  03:00:00   0.217599
    7  03:30:00  -0.630243
    8  04:00:00  -0.989919
    9  04:30:00  -0.494372

For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?

Any help greatly appreciated.

Shubham Sharma · Accepted Answer · 2023-01-31 16:05:53Z

Let's group the dataframe by HH, then transform Values with mean to broadcast each group's mean back to the shape of the original column, and finally use fillna to fill the null values:

    avg = Data.groupby('HH')['Values'].transform('mean')
    Data['Values'] = Data['Values'].fillna(avg)
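To see the approach end to end, here is a minimal self-contained sketch on made-up data. The toy frame, its six rows, and the lowercase names `data` and `filled` are assumptions for illustration only, not taken from the question's dataset; only the `groupby` / `transform('mean')` / `fillna` pattern comes from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three days, two half-hour slots each, two gaps.
data = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 00:00:00", "2019-01-01 00:30:00",
        "2019-01-02 00:00:00", "2019-01-02 00:30:00",
        "2019-01-03 00:00:00", "2019-01-03 00:30:00",
    ]),
    "Values": [1.0, 10.0, 3.0, np.nan, np.nan, 20.0],
})

# Key each row by its time of day, as in the question.
data["HH"] = data["StartTime"].dt.time

# transform('mean') returns a Series aligned with the original index,
# holding each row's group mean; NaNs are skipped when averaging.
avg = data.groupby("HH")["Values"].transform("mean")

# fillna with a Series fills each NaN from the value at the same index.
data["Values"] = data["Values"].fillna(avg)

filled = data["Values"].tolist()
print(filled)  # [1.0, 10.0, 3.0, 15.0, 2.0, 20.0]
```

The missing 00:30 value becomes 15.0 (the mean of 10.0 and 20.0) and the missing 00:00 value becomes 2.0 (the mean of 1.0 and 3.0). Note that if every value in a half-hour slot were NaN, that slot's mean would itself be NaN and those rows would stay unfilled.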
