
I think the logic of what I am trying to do is best explained with code:

    import pandas as pd
    import numpy as np
    from datetime import timedelta

    # seed NumPy's global random generator for reproducibility
    np.random.seed(365)

    # some data
    start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
    end_date = [d + timedelta(days=np.random.exponential(scale=100)) for d in start_date]

    df = pd.DataFrame({"start_date": start_date, "end_date": end_date})

    # randomly remove some end dates
    df["end_date"] = df["end_date"].sample(frac=0.7).reset_index(drop=True)
    df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")

I first create a pd.Series with the 1st day of every month in the entire history of the data:

dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time 

What I then want to do is count, for each month start in the series, the rows where df["start_date"] is earlier than that date and df["end_date"] is null (recorded as NaT).

I assume I would need a for loop, somehow grouping by the dates series, so that the resulting output looks something like this:

month_start count
2015-01-01 5
2015-02-01 10
2015-03-01 35

The count column is the number of df rows where df["start_date"] is earlier than the 1st of that month and df["end_date"] is null, computed for every value in the series.

Here is the logic of what I am trying to do (this does not run as written; it only sketches the intent):

    df.groupby(by=dates)[["start_date", "end_date"]].apply(
        lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull()
    )

2 Answers


Is this what you want?

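    # keep only the rows with no end date
    df2 = df[df['end_date'].isnull()]

    # for each month start, count the open rows that started before it
    dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())

    print(pd.concat([dates, dates_count], axis=1))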
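On larger frames the same counts can probably be obtained without a Python-level lambda. The following is a minimal sketch, assuming a tiny hand-made frame (the values are illustrative, not the question's random data) and using `numpy.searchsorted` on the sorted start dates of the open rows; `side="left"` keeps the comparison strict, matching `start_date < x`.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not the question's random frame).
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-15", "2015-02-01", "2015-02-20"]),
    "end_date": pd.to_datetime(["2015-03-01", None, None, None]),
})
dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01"]))

# Sorted start dates of the rows whose end_date is NaT.
open_starts = np.sort(df.loc[df["end_date"].isna(), "start_date"].to_numpy())

# With side="left", each result is the number of values strictly less than the month start.
counts = np.searchsorted(open_starts, dates.to_numpy(), side="left")

result = pd.DataFrame({"month_start": dates, "count": counts})
print(result)
```

Sorting once and searching per month start avoids re-filtering the frame for every date, which is the main reason to prefer this shape when `dates` is long.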

1 Comment

Yes, this is exactly it, thank you! Follow-up question: how would I obtain the counts where start_date is less than the first of the month and end_date is greater than the first of the month?

IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:

    (df['end_date'].isna()
       .groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
       .sum()
       .cumsum()
    )

Output:

    start_date
    2015-02-01      0
    2015-03-01      0
    2015-04-01      0
    2015-05-01      0
    2015-06-01      0
                 ...
    2022-06-01    122
    2022-07-01    127
    2022-08-01    133
    2022-09-01    138
    2022-10-01    140
    Name: end_date, Length: 93, dtype: int64

1 Comment

Thank you for your response. What if I wanted to get the counts of rows where start_date is less than the first of the month, and the end_date is greater than the first of the month?
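One possible approach to this follow-up, offered as a sketch rather than a tested answer for the question's data: compare every row against every month start with NumPy broadcasting and sum down the rows. The small frame below is made up for illustration. Note the assumption that rows with a missing end_date should not be counted; comparisons against NaT evaluate to False, so they drop out automatically.

```python
import pandas as pd

# Illustrative data (an assumption, not the question's random frame).
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-15", "2015-02-01", "2015-02-20"]),
    "end_date": pd.to_datetime(["2015-03-01", "2015-02-10", None, "2015-04-05"]),
})
dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01"]))

starts = df["start_date"].to_numpy()[:, None]  # shape (n_rows, 1)
ends = df["end_date"].to_numpy()[:, None]      # shape (n_rows, 1)
months = dates.to_numpy()[None, :]             # shape (1, n_months)

# True where a row started before the month start and ended after it.
# Any comparison involving NaT is False, so rows without an end date are excluded.
active = (starts < months) & (ends > months)

result = pd.DataFrame({"month_start": dates, "count": active.sum(axis=0)})
print(result)
```

The intermediate boolean array has one cell per (row, month) pair, so this suits frames of moderate size; if still-open rows should count as active, `| df["end_date"].isna().to_numpy()[:, None]` could be added to the end-date condition.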
