181

Consider a CSV file:

string,date,number a string,2/5/11 9:16am,1.0 a string,3/5/11 10:44pm,2.0 a string,4/22/11 12:07pm,3.0 a string,4/22/11 12:10pm,4.0 a string,4/29/11 11:59am,1.0 a string,5/2/11 1:41pm,2.0 a string,5/2/11 2:02pm,3.0 a string,5/2/11 2:56pm,4.0 a string,5/2/11 3:00pm,5.0 a string,5/2/14 3:02pm,6.0 a string,5/2/14 3:18pm,7.0 

I can read this in, and reformat the date column into datetime format:

b = pd.read_csv('b.dat') b['date'] = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p') 

I have been trying to group the data by month. It seems like there should be an obvious way of accessing the month and grouping by that. But I can't seem to do it. Does anyone know how?

What I am currently trying is re-indexing by the date:

b.index = b['date'] 

I can access the month like so:

b.index.month 

However I can't seem to find a function to lump together by month.

1
  • 6
    If you are struggling with applying any of the answer, please keep in mind that in this question (and therefore in the answers) the Datetime value is assigned to the index of the Dataframe. A quick tip/reminder could be the following: if you have a Datetime column, you can actually access the single Yeay/Month/Day/Hour/Minute value just by doing my_df.my_column.dt.month Commented Nov 30, 2020 at 13:03

5 Answers 5

276

Managed to do it:

b = pd.read_csv('b.dat') b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p') b.groupby(by=[b.index.month, b.index.year]) 

Or

b.groupby(pd.Grouper(freq='M')) # update for v0.21+ 
Sign up to request clarification or add additional context in comments.

6 Comments

I think the more pandonic ways are to either use resample (when it provides the functionality you need) or use a TimeGrouper: df.groupby(pd.TimeGrouper(freq='M'))
to get the result DataFrame sum or average, df.groupby(pd.TimeGrouper(freq='M')).sum() or df.groupby(pd.TimeGrouper(freq='M')).mean()
pd.TimeGrouper has been deprecated in favor of pd.Grouper, which is a bit more flexible but still takes freq and level arguments.
@ely The answer implicitly relies on the lines in the original question where b is given an index after being read from CSV. Add b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p') after the line b = pd.read_csv('b.dat'). [I've edited the answer just now too.]
To group the months in chronological order, you need to swap the month and year index. The resulting command for the grouping being b.groupby(by=[b.index.year, b.index.month]).
|
126

(update: 2018)

Note that pd.Timegrouper is depreciated and will be removed. Use instead:

 df.groupby(pd.Grouper(freq='M')) 

4 Comments

Find the Grouper docs here and the frequency specifications (freq=...) here. Some examples are freq=D for days, freq=B for business days, freq=W for weeks or even freq=Q for quarters.
I found it useful to use 'key' to avoid having to reindex the df, as follows: df.groupby(pd.Grouper(key='your_date_column', freq='M'))
Does this work if you're grouping by two columns, only one of which is datetime value column?
speeding up the further research for those who want to group by specific column (or more): df.groupby(['col1', pd.Grouper(key='date_col', freq='1M')]).agg({ 'col2': 'sum', 'col3': 'max' })
33

To groupby time-series data you can use the method resample. For example, to groupby by month:

df.resample(rule='M', on='date')['Values'].sum() 

The list with offset aliases you can find here.

1 Comment

Thanks! This should be the accepted answer: groupby by time is called resample.
19

One solution which avoids MultiIndex is to create a new datetime column setting day = 1. Then group by this column.

Normalise day of month

df = pd.DataFrame({'Date': pd.to_datetime(['2017-10-05', '2017-10-20', '2017-10-01', '2017-09-01']), 'Values': [5, 10, 15, 20]}) # normalize day to beginning of month, 4 alternative methods below df['YearMonth'] = df['Date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1) df['YearMonth'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day-1, unit='D') df['YearMonth'] = df['Date'].map(lambda dt: dt.replace(day=1)) df['YearMonth'] = df['Date'].dt.normalize().map(pd.tseries.offsets.MonthBegin().rollback) 

Then use groupby as normal:

g = df.groupby('YearMonth') res = g['Values'].sum() # YearMonth # 2017-09-01 20 # 2017-10-01 30 # Name: Values, dtype: int64 

Comparison with pd.Grouper

The subtle benefit of this solution is, unlike pd.Grouper, the grouper index is normalized to the beginning of each month rather than the end, and therefore you can easily extract groups via get_group:

some_group = g.get_group('2017-10-01') 

Calculating the last day of October is slightly more cumbersome. pd.Grouper, as of v0.23, does support a convention parameter, but this is only applicable for a PeriodIndex grouper.

Comparison with string conversion

An alternative to the above idea is to convert to a string, e.g. convert datetime 2017-10-XX to string '2017-10'. However, this is not recommended since you lose all the efficiency benefits of a datetime series (stored internally as numerical data in a contiguous memory block) versus an object series of strings (stored as an array of pointers).

5 Comments

See this answer for the proper way to utilize offsets when there are already day=1 values : stackoverflow.com/a/45831333/9987623.
@AlexK, does pd.tseries.offsets have an advantage over pd.tseries.MonthBegin ?
sorry, I don't know enough to tell those apart. I just added the comment because your df['YearMonth'] = df['Date'] - pd.offsets.MonthBegin(1) code above changes any date that is already the first of the month to the first of the previous month.
@AlexK, Good spot, have updated answer accordingly.
For beginning of month with Grouper, see also stackoverflow.com/a/56280791/10495893
11

Slightly alternative solution to @jpp's but outputting a YearMonth string:

df['YearMonth'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}-{month}'.format(year=x.year, month=x.month)) res = df.groupby('YearMonth')['Values'].sum() 

2 Comments

Helpful in the case where you don't want to set your date column as an index
Can also generate strings using dt.strfrtime: df['Date'].dt.strftime('%Y %b.')

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.