Skip to content

Conversation

@jorisvandenbossche
Copy link
Member

Draft PR for #58989

This should already make sure that we consistently use 'us' when converting non-numeric data in pd.to_datetime and pd.Timestamp, but if we want to do this, this PR still requires updating lots of tests and docs (and whatsnew) and cleaning up.

Currently the changes here will ensure that we use microseconds more consistently when inferring the resolution while creating datetime64 data. Exceptions: if the data don't fit in the range of us (either because out of bounds (use ms or s) or because it has nanoseconds or below (use ns)), or if the input data already has a resolution defined (for Timestamp objects, or numpy datetime64 data).

@jorisvandenbossche jorisvandenbossche added Datetime Datetime data dtype Non-Nano datetime64/timedelta64 with non-nanosecond resolution Timestamp pd.Timestamp and associated methods labels Aug 3, 2025
tm.assert_index_equal(result, expected)


class TestToDatetimeInferUnit:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is already tested elsewhere (in various places?), but while developing it just wrote some simple tests with the different cases I encountered.

For example now I see there is pandas/tests/tslibs/test_array_to_datetime.py::TestArrayToDatetimeResolutionInference and pandas/tests/tslibs/test_strptime.py::TestArrayStrptimeResolutionInference that are failing, so can integrate those tests.

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel would you have time to give this a review?

@jbrockmendel
Copy link
Member

Yes, but its in line behind a few other reviews i owe.

np.iinfo(np.int64).min + 1, NPY_DATETIMEUNIT.NPY_FR_us, &dts_us_min
)
pandas_datetime_to_datetimestruct(
np.iinfo(np.int64).max, NPY_DATETIMEUNIT.NPY_FR_us, &dts_us_max
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

util.INT64_MAX

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you implement this in np_datetime.pyx you can reuse _NS_MIN_DTS, _NS_MAX_DTS, etc

# Similar as above, but taking the actual datetime value in account,
# defaulting to 'us' if possible.
if reso == NPY_DATETIMEUNIT.NPY_FR_GENERIC:
return NPY_DATETIMEUNIT.NPY_FR_ns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be FR_us?

# defaulting to 'us' if possible.
if reso == NPY_DATETIMEUNIT.NPY_FR_GENERIC:
return NPY_DATETIMEUNIT.NPY_FR_ns
# if dts.ps != 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are the .ps checks not necessary?

@jbrockmendel
Copy link
Member

Small comments, no complaints about the approach. Haven't looked at the tests yet since i don't expect any surprises; will do so once green.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 15, 2025
}
)
if parser.engine == "pyarrow":
expected["a"] = expected["a"].dt.as_unit("s")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had to make the same change in my branch. does this mean the pyarrow csv parser is giving back second unit? i thought it standardized on "us"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyarrow doesn't really have a default, essentially you always have to specify a precision yourself when manually creating an array / specifying the type.
But for string parsing in the CSV reader, it will indeed be data dependent unfortunately (but for example, for something like "2012-01-01" in a CSV file, it will also use a date type and not timestamp)

# GH 5869
# datetimelike dtype conversion from int
df = DataFrame({"A": Timestamp("20130101"), "B": np.arange(5)})
# TODO: can we retain second reso in .apply here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment no longer desirable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there might still be value in the question about retaining resolution in apply, but it is no longer the case for this test here, since the test data will use a default of microseconds and then also result in microseconds

@jbrockmendel
Copy link
Member

im having second thoughts about pushing forward on my branch, so posting the most-worth-salvaging parts:

cdef NPY_DATETIMEUNIT get_supported_reso_for_dts( npy_datetimestruct* dts ): """ If the dts can be stored with NPY_FR_ns, do so. Otherwise find the nearest unit that it can be stored in. Raise if this entails lossy rounding. """ # First, find the highest resolution such that storing dts in it won't # overflow. if cmp_dtstructs(dts, &_NS_MIN_DTS, Py_GE) and cmp_dtstructs(dts, &_NS_MAX_DTS, Py_LE): # i.e. dts >= ns_dts_min and dts <= ns_dts_max: creso = NPY_FR_ns elif cmp_dtstructs(dts, &_US_MIN_DTS, Py_GE) and cmp_dtstructs(dts, &_US_MAX_DTS, Py_LE): creso = NPY_FR_us elif cmp_dtstructs(dts, &_MS_MIN_DTS, Py_GE) and cmp_dtstructs(dts, &_MS_MAX_DTS, Py_LE): creso = NPY_FR_ms else: creso = NPY_FR_s check_dts_bounds(dts, creso) # Next, check that storing in that resolution doesn't mean lossy rounding. if creso == NPY_FR_ns: pass elif creso == NPY_FR_us: if dts.ps // 1000 != 0: check_dts_bounds(dts, NPY_FR_ns) elif creso == NPY_FR_ms: if dts.ps // 1000 != 0 or dts.us % 1000 != 0: check_dts_bounds(dts, NPY_FR_us) else: if dts.ps // 1000 != 0 or dts.us != 0: check_dts_bounds(dts, NPY_FR_ms) return creso 
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Datetime Datetime data dtype Non-Nano datetime64/timedelta64 with non-nanosecond resolution Stale Timestamp pd.Timestamp and associated methods

2 participants