Skip to content

BUG: timezone lost in groupby-agg with cython functions #15426

@munierSalem

Description

@munierSalem

xref #10668 (for more examples)

Hello!

I'm running into some odd behavior trying to group rows of a pandas dataframe by ID and then selecting out max/min datetimes (w/ timezones). This is with python 2.7, pandas 0.18.1 and numpy 1.11.1 (I saw in earlier posts a similar problem was apparently fixed w/ pandas 0.15).

Specifically, if I try:
print orders.groupby('OrderID')['start_time'].agg(np.min).iloc[:5]

I get:

OrderID O161101XVS100000044 2016-11-01 12:03:12.920000-04:00 O161101XVS100000047 2016-11-01 12:03:36.693000-04:00 O161101XVS100000098 2016-11-01 12:09:08.330000-04:00 O161101XVS100000122 2016-11-01 12:09:59.950000-04:00 O161101XVS100000152 2016-11-01 12:11:29.790000-04:00 Name: start_time, dtype: datetime64[ns, US/Eastern] 

Where the raw data had times closer to 8 am (US/Eastern). In other words, it reverted back to UTC times, even though it says it's eastern times, and has UTC-4 offset.

But if I instead try:
print orders.groupby('OrderID')['start_time'].agg(lambda x: np.min(x)).iloc[:5]

I now get:

OrderID O161101XVS100000044 2016-11-01 08:03:12.920000-04:00 O161101XVS100000047 2016-11-01 08:03:36.693000-04:00 O161101XVS100000098 2016-11-01 08:09:08.330000-04:00 O161101XVS100000122 2016-11-01 08:09:59.950000-04:00 O161101XVS100000152 2016-11-01 08:11:29.790000-04:00 Name: start_time, dtype: datetime64[ns, US/Eastern] 

Which is the behavior I intended. This second method is vastly slower, and I would have assumed the two approaches would yield identical results ...

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions