API: timestamp resolution inference: default to microseconds when possible #62031
Conversation
```python
tm.assert_index_equal(result, expected)


class TestToDatetimeInferUnit:
```
I assume this is already tested elsewhere (in various places?), but while developing I just wrote some simple tests for the different cases I encountered.
For example, I now see there are pandas/tests/tslibs/test_array_to_datetime.py::TestArrayToDatetimeResolutionInference and pandas/tests/tslibs/test_strptime.py::TestArrayStrptimeResolutionInference that are failing, so I can integrate those tests.
@jbrockmendel would you have time to give this a review?
Yes, but it's in line behind a few other reviews I owe.
```cython
    np.iinfo(np.int64).min + 1, NPY_DATETIMEUNIT.NPY_FR_us, &dts_us_min
)
pandas_datetime_to_datetimestruct(
    np.iinfo(np.int64).max, NPY_DATETIMEUNIT.NPY_FR_us, &dts_us_max
```
`util.INT64_MAX`
If you implement this in `np_datetime.pyx` you can reuse `_NS_MIN_DTS`, `_NS_MAX_DTS`, etc.
```cython
# Similar as above, but taking the actual datetime value in account,
# defaulting to 'us' if possible.
if reso == NPY_DATETIMEUNIT.NPY_FR_GENERIC:
    return NPY_DATETIMEUNIT.NPY_FR_ns
```
should this be FR_us?
```cython
# defaulting to 'us' if possible.
if reso == NPY_DATETIMEUNIT.NPY_FR_GENERIC:
    return NPY_DATETIMEUNIT.NPY_FR_ns
# if dts.ps != 0:
```
why are the .ps checks not necessary?
Small comments, no complaints about the approach. Haven't looked at the tests yet since I don't expect any surprises; will do so once green.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
```python
    }
)
if parser.engine == "pyarrow":
    expected["a"] = expected["a"].dt.as_unit("s")
```
I had to make the same change in my branch. Does this mean the pyarrow CSV parser is giving back second unit? I thought it standardized on "us".
pyarrow doesn't really have a default; essentially you always have to specify a precision yourself when manually creating an array / specifying the type.
But for string parsing in the CSV reader, it will indeed be data dependent, unfortunately (for example, for something like "2012-01-01" in a CSV file, it will also use a date type and not a timestamp).
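On the pandas side, `as_unit` is how the test above normalizes whatever resolution the parser produced. A small illustrative example (not from the PR itself, assuming pandas 2.x):

```python
import pandas as pd

# Illustrative only: whatever unit was inferred at parse time,
# Series.dt.as_unit("s") converts to second resolution, matching what
# the test does for the pyarrow engine.
s = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02"]))
s_sec = s.dt.as_unit("s")
assert str(s_sec.dtype) == "datetime64[s]"
assert s_sec[0] == pd.Timestamp("2012-01-01")
```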
```python
# GH 5869
# datetimelike dtype conversion from int
df = DataFrame({"A": Timestamp("20130101"), "B": np.arange(5)})
# TODO: can we retain second reso in .apply here?
```
is this comment no longer desirable?
I suppose there is still value in the question about retaining resolution in apply, but it no longer applies to this test: the test data will now default to microseconds and the result will be microseconds as well.
I'm having second thoughts about pushing forward on my branch, so posting the most-worth-salvaging parts:
Draft PR for #58989
This should already make sure that we consistently use 'us' when converting non-numeric data in `pd.to_datetime` and `pd.Timestamp`, but if we want to do this, this PR still requires updating lots of tests and docs (and whatsnew) and cleaning up.
Currently the changes here ensure that we use microseconds more consistently when inferring the resolution while creating datetime64 data. Exceptions: when the data don't fit in the 'us' range (either out of bounds, falling back to 'ms' or 's', or carrying nanoseconds or below, falling back to 'ns'), or when the input data already has a resolution defined (Timestamp objects, or numpy datetime64 data).
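A rough sketch of those inference rules through the public `Timestamp` API (illustrative only; the exact default for plain values depends on the pandas version, since this PR proposes changing it to 'us'):

```python
import numpy as np
import pandas as pd

# Inputs that already carry a resolution keep it:
ts = pd.Timestamp(np.datetime64("2012-01-01", "s"))
assert ts.unit == "s"

# A nanosecond component doesn't fit in 'us', so 'ns' is used:
ts_ns = pd.Timestamp("2012-01-01 00:00:00.123456789")
assert ts_ns.unit == "ns"

# Plain date-like strings: this PR proposes 'us' as the default;
# released pandas versions may infer 's' here instead.
ts_plain = pd.Timestamp("2012-01-01")
assert ts_plain.unit in ("s", "us")
```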