What I am trying to do is best explained with code:

    import pandas as pd
    import numpy as np
    from datetime import timedelta

    # seed NumPy's generator, which both np.random and DataFrame.sample use
    np.random.seed(365)

    # some data
    start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
    end_date = [
        start + timedelta(days=np.random.exponential(scale=100))
        for start in start_date
    ]

    df = pd.DataFrame({"start_date": start_date, "end_date": end_date})

    # randomly remove some end dates (rows that are not sampled become NaT)
    df["end_date"] = df["end_date"].sample(frac=0.7)
    # drop the time component, keeping the datetime64[ns] dtype
    df["end_date"] = df["end_date"].dt.normalize()

I first create a pd.Series with the first day of every month in the entire history of the data:

    dates = pd.Series(
        df["start_date"].dt.to_period("M").sort_values().unique()
    ).dt.start_time

What I then want to do is, for each month start in the series, count the rows of df whose df["start_date"] is earlier than that date and whose df["end_date"] is null (recorded as NaT).
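To make the condition concrete for a single month start, here is a minimal sketch on a tiny hand-made frame. The names toy and month_start and all of the values are hypothetical and are not part of the data above.

```python
import pandas as pd

# Tiny hand-made frame (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2015-01-09", "2015-01-15", "2015-02-20", "2015-03-05"]
    ),
    "end_date": pd.to_datetime(["2015-02-01", None, None, "2015-04-01"]),
})

# One month start to test against (hypothetical)
month_start = pd.Timestamp("2015-03-01")

# Rows that started before the month start and have no end date recorded
mask = (toy["start_date"] < month_start) & toy["end_date"].isna()
open_count = int(mask.sum())
print(open_count)
```

Repeating this check for every value in dates would give one count per month start.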
I assume I need a for loop, and somehow a groupby on the dates series, so that the resulting output looks something like this:
| month_start | count |
|---|---|
| 2015-01-01 | 5 |
| 2015-02-01 | 10 |
| 2015-03-01 | 35 |
The count column is the number of df rows where df["start_date"] is earlier than the month start and df["end_date"] is null, computed for every value in the series.
Here is the logic of what I am trying to do, as rough pseudocode rather than working code:

    df.groupby(by=dates)[["start_date", "end_date"]].apply(
        lambda x: [x["start_date"] < date for date in dates]
        & x["end_date"].isnull == True
    )
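One possible way to produce that table without a loop or a groupby, offered as a sketch under assumptions rather than a definitive solution: keep only the rows with a missing end date, sort their start dates, and let np.searchsorted count how many fall strictly before each month start. The helper name count_open_before and the toy data are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd


def count_open_before(df, dates):
    """For each value in `dates`, count rows of `df` with
    start_date < date and a missing (NaT) end_date."""
    # Start dates of rows that have no end date, in ascending order
    open_starts = (
        df.loc[df["end_date"].isna(), "start_date"].sort_values().to_numpy()
    )
    # side="left" gives the number of sorted values strictly below each date
    counts = np.searchsorted(open_starts, dates.to_numpy(), side="left")
    return pd.DataFrame({"month_start": dates.to_numpy(), "count": counts})


# Tiny hand-made example (hypothetical values)
toy = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2015-01-09", "2015-01-15", "2015-02-20", "2015-03-05"]
    ),
    "end_date": pd.to_datetime(["2015-02-01", None, None, "2015-04-01"]),
})
toy_dates = pd.Series(
    pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01"])
)

result = count_open_before(toy, toy_dates)
print(result)
```

With the real data the call would be count_open_before(df, dates). Sorting once and using a binary search per month start avoids comparing every row against every date.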