Description
I think there may be a bug with the row-wise handling of numpy.timedelta64 data types when using DataFrame.apply. As a check, the problem does not appear when using DataFrame.applymap. The problem may be related to #4532, but I'm unsure. I've included an example below.
This is only a minor problem for my use-case, which is cross-checking timestamps from a counter/timer card. I can easily work around the issue with DataFrame.itertuples etc.
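A minimal sketch of that workaround, assuming a recent pandas release and the same start time and elapsed microseconds as the example below. The names `start`, `elapsed_us`, and `rows` are only illustrative, and the trailing `Z` is dropped from the start time to keep the timestamps timezone-naive. `DataFrame.itertuples` builds each row from the individual columns, so each field keeps the type of its own column:

```python
import pandas as pd

# Illustrative names; values copied from the example further down.
start = pd.Timestamp("2014-05-31T01:23:19.9600345")
elapsed_us = [30053400, 40053249, 50053098]

# Build the datetimes column, then the differences between consecutive rows.
df = pd.DataFrame({"datetimes": start + pd.to_timedelta(elapsed_us, unit="us")})
df["differential_timedeltas"] = df["datetimes"].diff()

# itertuples() yields one namedtuple per row; each field is read from its own
# column rather than from a row that has been coerced to a single dtype.
rows = list(df.itertuples(index=False))
for row in rows:
    print(row.datetimes, type(row.differential_timedeltas).__name__)
```

The first row has no predecessor, so its difference is `NaT`; the remaining rows should come back as timedeltas rather than timestamps.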
Thank you for your time and for making such a useful package!
Example
Version
Import and check versions.
    $ date
    Thu Jul 17 16:28:38 CDT 2014
    $ conda update pandas
    Fetching package metadata: ..
    # All requested packages already installed.
    # packages in environment at /Users/harrold/anaconda:
    #
    pandas 0.14.1 np18py27_0
    $ ipython
    Python 2.7.8 |Anaconda 2.0.1 (x86_64)| (default, Jul 2 2014, 15:36:00)
    Type "copyright", "credits" or "license" for more information.

    IPython 2.1.0 -- An enhanced Interactive Python.
    Anaconda is brought to you by Continuum Analytics.
    Please check out: http://continuum.io/thanks and https://binstar.org
    ? -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help -> Python's own help system.
    object? -> Details about 'object', use 'object??' for extra details.

    In [1]: from __future__ import print_function
    In [2]: import numpy as np
    In [3]: import pandas as pd
    In [4]: pd.util.print_versions.show_versions()

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 2.7.8.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 11.4.2
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: en_US.UTF-8

    pandas: 0.14.1
    nose: 1.3.3
    Cython: 0.20.1
    numpy: 1.8.1
    scipy: 0.14.0
    statsmodels: 0.5.0
    IPython: 2.1.0
    sphinx: 1.2.2
    patsy: 0.2.1
    scikits.timeseries: None
    dateutil: 1.5
    pytz: 2014.4
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.3.1
    matplotlib: 1.3.1
    openpyxl: 1.8.5
    xlrd: 0.9.3
    xlwt: 0.7.5
    xlsxwriter: 0.5.5
    lxml: 3.3.5
    bs4: 4.3.1
    html5lib: 0.999
    httplib2: 0.8
    apiclient: 1.2
    rpy2: None
    sqlalchemy: 0.9.4
    pymysql: None
    psycopg2: None

Create test data
Using a subset of the original raw data as an example.
    In [5]: datetime_start = np.datetime64(u'2014-05-31T01:23:19.9600345Z')
    In [6]: timedeltas_elapsed = [30053400, 40053249, 50053098]

Compute datetimes from elapsed timedeltas, then create differential timedeltas from datetimes. All elements are either type numpy.datetime64 or numpy.timedelta64.
    In [7]: df = pd.DataFrame(dict(datetimes = timedeltas_elapsed))
    In [8]: df = df.applymap(lambda elt: np.timedelta64(elt, 'us'))
    In [9]: df = df.applymap(lambda elt: np.datetime64(datetime_start + elt))
    In [10]: df['differential_timedeltas'] = df['datetimes'] - df['datetimes'].shift()
    In [11]: print(df)
                          datetimes  differential_timedeltas
    0 2014-05-31 01:23:50.013434500                      NaT
    1 2014-05-31 01:24:00.013283500          00:00:09.999849
    2 2014-05-31 01:24:10.013132500          00:00:09.999849

Expected behavior
With element-wise handling using DataFrame.applymap, all elements are correctly identified as datetimes (timestamps) or timedeltas.
    In [12]: print(df.applymap(lambda elt: type(elt)))
                              datetimes     differential_timedeltas
    0  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>
    1  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>
    2  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>

Bug
With row-wise handling using DataFrame.apply, all elements are type pandas.tslib.Timestamp. I expected 'differential_timedeltas' to be type numpy.timedelta64 or another type of timedelta, not a type of datetime (timestamp).
    In [13]: # For 'datetimes':
    In [14]: print(df.apply(lambda row: type(row['datetimes']), axis=1))
    0    <class 'pandas.tslib.Timestamp'>
    1    <class 'pandas.tslib.Timestamp'>
    2    <class 'pandas.tslib.Timestamp'>
    dtype: object
    In [15]: # For 'differential_timedeltas':
    In [16]: print(df.apply(lambda row: type(row['differential_timedeltas']), axis=1))
    0      <class 'pandas.tslib.NaTType'>
    1    <class 'pandas.tslib.Timestamp'>
    2    <class 'pandas.tslib.Timestamp'>
    dtype: object
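The comparison above could be condensed into a small check, sketched here for a recent pandas release rather than the 0.14.1 session shown. The helper names `column_wise_type_names` and `row_wise_type_names` are hypothetical, and the frame is rebuilt with vectorized calls instead of `applymap`, so this is an approximation of the reproduction and not the exact original steps. If the two lists disagree for the timedelta column, the row-wise path is changing the element type:

```python
import pandas as pd


def column_wise_type_names(frame, column):
    """Type name of each element when iterating a single column directly."""
    return [type(value).__name__ for value in frame[column]]


def row_wise_type_names(frame, column):
    """Type name of each element as seen inside DataFrame.apply(axis=1)."""
    return list(frame.apply(lambda row: type(row[column]).__name__, axis=1))


# Same values as the test data above, built without applymap (assumption:
# the trailing 'Z' is dropped so the timestamps stay timezone-naive).
start = pd.Timestamp("2014-05-31T01:23:19.9600345")
frame = pd.DataFrame(
    {"datetimes": start + pd.to_timedelta([30053400, 40053249, 50053098], unit="us")}
)
frame["differential_timedeltas"] = frame["datetimes"].diff()

col = "differential_timedeltas"
column_wise = column_wise_type_names(frame, col)
row_wise = row_wise_type_names(frame, col)

print("column-wise:", column_wise)
print("row-wise:   ", row_wise)
print("types agree" if column_wise == row_wise else "row-wise types differ")
```

Returning type names as strings keeps the result of `apply` a plain Series of strings, which makes the two paths easy to compare in an assertion or a regression test.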