use groupby() and for loop to count column values with conditions

Question

The logic of what I am trying to do I think is best explained with code:

import pandas as pd
import numpy as np
from datetime import timedelta

# seed NumPy's generator, since the random draws below come from np.random
np.random.seed(365)

# some data
start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
end_date = [start + timedelta(days=np.random.exponential(scale=100)) for start in start_date]
df = pd.DataFrame({"start_date": start_date, "end_date": end_date})

# randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac=0.7).reset_index(drop=True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")

I first create a pd.Series with the 1st day of every month in the entire history of the data:

dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time

What I then want to do, for each month in the series, is count the number of rows where df["start_date"] is earlier than the 1st day of that month and df["end_date"] is null (recorded as NaT).

I would think I would use a for loop to do this and somehow groupby the dates series so that the resulting output looks something like this:

month_start	count
2015-01-01	5
2015-02-01	10
2015-03-01	35

The count column is the number of df rows where df["start_date"] is earlier than the 1st of the month and df["end_date"] is null. This is computed for every value in the series.

Here is the logic of what I am trying to do:

df.groupby(by = dates)[["start_date", "end_date"]].apply(
    lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull == True
)

Galo do Leste · Accepted Answer · 2023-02-04 01:47:18Z


Is this what you want?

df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))
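Since the question mentions a for loop, the same count can also be written as an explicit loop. This is only a minimal sketch on a tiny hand-made frame; the `toy` data, `toy_dates`, and the `count_open_before` name are illustrative assumptions, not part of the question.

```python
import pandas as pd

def count_open_before(df, dates):
    """For each date, count rows with start_date < date and a missing end_date."""
    # Keep only the rows whose end_date is missing (NaT).
    open_rows = df[df["end_date"].isna()]
    # One comparison per date; summing the boolean mask gives the count.
    counts = [int((open_rows["start_date"] < d).sum()) for d in dates]
    return pd.DataFrame({"month_start": list(dates), "count": counts})

# Hypothetical data: three rows have no end date, one does.
toy = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-21", "2015-02-02", "2015-03-10"]),
    "end_date": pd.to_datetime([None, "2015-02-15", None, None]),
})
toy_dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01"]))

result = count_open_before(toy, toy_dates)
print(result)
```

On this toy frame the counts come out as 0, 1, 2, 3: the row that has an end date is never counted, and each open row is counted from the first month start after it began.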


1 Comment

JoMcGee Over a year ago

Yes, this is exactly it, thank you! Follow-up question: how would I obtain the counts where start_date is less than the first of the month and end_date is greater than the first of the month?

mozway · Accepted Answer · 2023-02-04 01:47:40Z

IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:

(df['end_date'].isna()
   .groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
   .sum()
   .cumsum()
)

Output:

start_date
2015-02-01      0
2015-03-01      0
2015-04-01      0
2015-05-01      0
2015-06-01      0
             ...
2022-06-01    122
2022-07-01    127
2022-08-01    133
2022-09-01    138
2022-10-01    140
Name: end_date, Length: 93, dtype: int64
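To see why shifting by one month and accumulating gives the same numbers as a direct count, here is a small check on hand-made data. The `toy` frame is an assumption for illustration only. One thing to keep in mind: the grouped result only has a row for a month if the preceding month contains at least one start_date, so with sparser data some months would be absent.

```python
import pandas as pd

# Hypothetical data: three rows have no end date, one does.
toy = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-21", "2015-02-02", "2015-03-10"]),
    "end_date": pd.to_datetime([None, "2015-02-15", None, None]),
})

# Label each row with the first day of the month after its start_date.
next_month = toy["start_date"].dt.to_period("M").add(1).dt.start_time

# Missing end dates per label, then a running total across labels.
cumulative = toy["end_date"].isna().groupby(next_month).sum().cumsum()

# Direct count for comparison: open rows that started before each label.
open_rows = toy[toy["end_date"].isna()]
direct = pd.Series(
    [int((open_rows["start_date"] < d).sum()) for d in cumulative.index],
    index=cumulative.index,
)

print(cumulative)
print((cumulative == direct).all())
```

A row that starts in January is "before the 1st" of February and of every later month, which is exactly what the running total expresses.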

Thank you for your response. What if I wanted to get the counts of rows where start_date is less than the first of the month, and the end_date is greater than the first of the month?
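Neither answer addresses this follow-up in the thread. One possible way to approach it, sketched under assumptions, is to combine both comparisons in a single boolean mask per month start. The `toy` data and the `count_active_at` name below are hypothetical. Because any comparison against NaT evaluates to False, rows with a missing end_date are not counted here; if they should count as still active, the mask would need an extra `| df["end_date"].isna()` term.

```python
import pandas as pd

def count_active_at(df, dates):
    """For each date, count rows with start_date < date < end_date."""
    counts = [
        # NaT end dates compare as False, so open rows are excluded.
        int(((df["start_date"] < d) & (df["end_date"] > d)).sum())
        for d in dates
    ]
    return pd.DataFrame({"month_start": list(dates), "count": counts})

# Hypothetical data: one row has no end date.
toy = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-21", "2015-02-02", "2015-03-10"]),
    "end_date": pd.to_datetime(["2015-03-20", "2015-02-15", None, "2015-05-01"]),
})
toy_dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01"]))

active = count_active_at(toy, toy_dates)
print(active)
```

On this toy frame the counts are 0, 2, 1, 1. Unlike the first case, these counts can go down from one month to the next, so a plain cumulative sum over start months would not be enough on its own.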
