Say I have a DataFrame called Data with shape (71067, 4):

                 StartTime          EndDateTime   TradeDate   Values
    0  2018-12-31 23:00:00  2018-12-31 23:30:00  2019-01-01  -44.676
    1  2018-12-31 23:30:00  2019-01-01 00:00:00  2019-01-01  -36.113
    2  2019-01-01 00:00:00  2019-01-01 00:30:00  2019-01-01  -19.229
    3  2019-01-01 00:30:00  2019-01-01 01:00:00  2019-01-01  -23.606
    4  2019-01-01 01:00:00  2019-01-01 01:30:00  2019-01-01  -25.899
    ...                ...                  ...         ...      ...
       2023-01-30 20:30:00  2023-01-30 21:00:00  2023-01-30  -27.198
       2023-01-30 21:00:00  2023-01-30 21:30:00  2023-01-30  -13.221
       2023-01-30 21:30:00  2023-01-30 22:00:00  2023-01-30  -12.034
       2023-01-30 22:00:00  2023-01-30 22:30:00  2023-01-30  -16.464
       2023-01-30 22:30:00  2023-01-30 23:00:00  2023-01-30  -25.441

    71067 rows × 4 columns

When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:

    Data.isna().sum().sum()
    > 1391

Shown here:

    Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')

    0     2019-01-01 03:30:00  2019-01-01 04:00:00  2019-01-01  NaN
    1     2019-01-04 02:30:00  2019-01-04 03:00:00  2019-01-04  NaN
    2     2019-01-04 03:00:00  2019-01-04 03:30:00  2019-01-04  NaN
    3     2019-01-04 03:30:00  2019-01-04 04:00:00  2019-01-04  NaN
    4     2019-01-04 04:00:00  2019-01-04 04:30:00  2019-01-04  NaN
    ...                   ...                  ...         ...  ...
    1386  2022-12-06 13:00:00  2022-12-06 13:30:00  2022-12-06  NaN
    1387  2022-12-06 13:30:00  2022-12-06 14:00:00  2022-12-06  NaN
    1388  2022-12-22 11:00:00  2022-12-22 11:30:00  2022-12-22  NaN
    1389  2023-01-25 11:00:00  2023-01-25 11:30:00  2023-01-25  NaN
    1390  2023-01-25 11:30:00  2023-01-25 12:00:00  2023-01-25  NaN

Is there any way to replace each NaN value in the dataset with the mean value of the corresponding half hour, taken across the 70,000-plus rows? The per-half-hour means look like this:

    # Extract the time of day (half-hour slot) from each start timestamp
    Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
    # Mean of Values per half-hour slot
    Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10) # Only showing first 10 means

             HH    Values
    0  00:00:00  5.236811
    1  00:30:00  2.056571
    2  01:00:00  4.157455
    3  01:30:00  2.339253
    4  02:00:00  2.658238
    5  02:30:00  0.230557
    6  03:00:00  0.217599
    7  03:30:00 -0.630243
    8  04:00:00 -0.989919
    9  04:30:00 -0.494372

For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?

Any help greatly appreciated.


1 Answer


Group the DataFrame by HH and use transform('mean') on Values, which broadcasts each group's mean back to the shape of the original column. Then pass that result to fillna so that only the null values are filled, each with the mean of its own half hour:

    avg = Data.groupby('HH')['Values'].transform('mean')
    Data['Values'] = Data['Values'].fillna(avg)
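To check the approach on something small, here is a minimal sketch on a toy frame. The six rows and their values are made up for illustration and are not taken from the question's dataset; only the column names StartTime, Values and HH follow the question.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): three days, two half-hour slots each,
# with one missing value in each slot.
Data = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 00:00", "2019-01-01 00:30",
        "2019-01-02 00:00", "2019-01-02 00:30",
        "2019-01-03 00:00", "2019-01-03 00:30",
    ]),
    "Values": [1.0, 10.0, np.nan, 30.0, 3.0, np.nan],
})

# Half-hour slot of each row, as in the question
Data["HH"] = Data["StartTime"].dt.time

# Per-slot mean, aligned row by row with the original frame
avg = Data.groupby("HH")["Values"].transform("mean")

# Fill only the missing entries; existing values are left untouched
Data["Values"] = Data["Values"].fillna(avg)

print(Data["Values"].tolist())  # [1.0, 10.0, 2.0, 30.0, 3.0, 20.0]
```

The missing 00:00 value becomes 2.0 (the mean of 1.0 and 3.0) and the missing 00:30 value becomes 20.0 (the mean of 10.0 and 30.0), because the mean skips NaN by default. A slot whose values are all missing would stay NaN, since it has no mean to fill with.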