-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
ENH: Parse %z and %Z directive in format for to_datetime #19979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jreback left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need tests
pandas/core/tools/datetimes.py Outdated
| try: | ||
| result = array_strptime(arg, format, exact=exact, | ||
| errors=errors) | ||
| errors=errors)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this changed?
pandas/_libs/tslibs/strptime.pyx Outdated
| z = z[:3] + z[4:] | ||
| if len(z) > 5: | ||
| if z[5] != ':': | ||
| msg = "Unconsistent use of : in {0}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent
9f273c4 to 7592ed8 Compare Codecov Report
@@ Coverage Diff @@ ## master #19979 +/- ## ========================================== + Coverage 91.84% 91.84% +<.01% ========================================== Files 153 153 Lines 49506 49516 +10 ========================================== + Hits 45467 45477 +10 Misses 4039 4039
Continue to review full report at Codecov.
|
| Summary of the logic thus far and open to feedback:
|
| I should be able to transfer this logic into cython on the next iteration. |
a6c61d8 to 94641dc Compare pandas/_libs/tslibs/strptime.pyx Outdated
| if z == 'Z': | ||
| gmtoff = 0 | ||
| else: | ||
| if z[3] == ':': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a check on the len upfront here
pandas/_libs/tslibs/strptime.pyx Outdated
| seconds = int(z[5:7] or 0) | ||
| gmtoff = (hours * 60 * 60) + (minutes * 60) + seconds | ||
| gmtoff_remainder = z[8:] | ||
| # Pad to always return microseconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line
| else: | ||
| tz = value | ||
| break | ||
| elif parse_code == 19: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move this whole parse to a function and just all it here (and return the values as a tuple)
pandas/core/tools/datetimes.py Outdated
| try: | ||
| result = array_strptime(arg, format, exact=exact, | ||
| errors=errors) | ||
| parsing_tzname = '%Z' in format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa, what do you need all this for???
| | ||
| @pytest.mark.skipif(not PY3, | ||
| reason="datetime.timezone not supported in PY2") | ||
| def test_to_datetime_parse_timezone(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import at the top
| tm.assert_numpy_array_equal(result, expected) | ||
| | ||
| # %z and %Z parsing | ||
| dates = ['2010-01-01 12:00:00 UTC +0100'] * 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need more checking for invalid (partially formed such as +0, +1foo, UTCbar
| tm.assert_index_equal(result, expected) | ||
| | ||
| result = pd.to_datetime(dates, format=fmt, box=False) | ||
| expected = np.array(expected_dates, dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assert_index_equal always
94641dc to 9f38bda Compare | @jreback I was able to simplify a lot of my logic ( I underestimated how I created separate functions to parse the timezone directive and then box the result. |
pandas/_libs/tslibs/strptime.pyx Outdated
| days_to_week = week_0_length + (7 * (week_of_year - 1)) | ||
| return 1 + days_to_week + day_of_week | ||
| | ||
| cdef _parse_timezone_directive(object z): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be de-privatized (no leading _); these modules are all private.
pandas/_libs/tslibs/strptime.pyx Outdated
| | ||
| if z == 'Z': | ||
| gmtoff = 0 | ||
| gmtoff_fraction = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would just directly return for this case (0, 0)
then don't need the else
pandas/_libs/tslibs/strptime.pyx Outdated
| gmtoff = 0 | ||
| gmtoff_fraction = 0 | ||
| else: | ||
| if z[3] == ':': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might need to wrap this entire block in a try/except if the string is not long enough (or check lengths for each sub-section) and the raise the appropriate error
pandas/core/tools/datetimes.py Outdated
| ts = tslib.Timestamp(res) | ||
| ts = ts.tz_localize(tzoffset) | ||
| tz_results.append(ts) | ||
| tz_results = np.array(tz_results) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you need all of this for, this is jumping thru a lot of hoops here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This elif branch is:
- Creating a
pytz.FixedOffsetfrom the parsed offset - Creating a naive Timestamp, then localizing it to the
pytz.FixedOffset(can't do it directly likeTimezone(res, tz=pytz.FixedOffset(...))because of my realization from DOC: Clarify passing epoch timestamp to Timestamp with timezone. #20257)
| Index, DatetimeIndex, NaT, date_range, compat) | ||
| | ||
| if PY3: | ||
| from datetime import timezone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a fixture for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a fixture for timezone.utc, but I am testing parsing a custom timezone.
| def test_to_datetime_parse_tzname_or_tzoffset(self, box, const, | ||
| assert_equal, fmt, | ||
| dates, expected_dates): | ||
| # %z or %Z parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number
| assert_equal, dates, | ||
| expected_dates): | ||
| # %z and %Z parsing | ||
| fmt = '%Y-%m-%d %H:%M:%S %Z %z' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
return timedeltas as list return timedeltas in a numpy array some flake fixes Extend logic of parsing timezones address comment misspelling Add additional tests address timezone localization
39d1ba4 to 0e2a0cd Compare | I've changed a couple of things after your review @jreback
|
pandas/_libs/tslibs/strptime.pyx Outdated
| bint is_ignore = errors=='ignore' | ||
| bint is_coerce = errors=='coerce' | ||
| int ordinal | ||
| dict _parse_code_table = {'y': 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could make this a module level variable
| raise ValueError("Cannot pass a tz argument when " | ||
| "parsing strings with timezone " | ||
| "information.") | ||
| result, timezones = array_strptime( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would much rather do the error handling in the _return_parsed_timezone_results. This block is just very complicated and hard to grok
| lgtm @mroeschke merge on green. |
| New features are for 0.24.0, @mroeschke can you move the whatsnew? |
jorisvandenbossche left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice addition!
Added some comments
| [pd.Timestamp('2010-01-01 12:00:00', tz='UTC'), | ||
| pd.Timestamp('2010-01-01 12:00:00', tz='GMT'), | ||
| pd.Timestamp('2010-01-01 12:00:00', tz='US/Pacific')]], | ||
| ['%Y-%m-%d %H:%M:%S %z', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you one of them, eg this one, without the space before the tz?
| ['%Y-%m-%d %H:%M:%S %z', | ||
| ['2010-01-01 12:00:00 Z', '2010-01-01 12:00:00 Z'], | ||
| [pd.Timestamp('2010-01-01 12:00:00', | ||
| tzinfo=pytz.FixedOffset(0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be UTC or a fixed offset of 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytz coerces a fixed offset of 0 to UTC
In [2]: pytz.FixedOffset(0) Out[2]: <UTC> But making it explicit here that %z should return pytz.FixedOffset(0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the actual DatetimeIndex you get here has UTC timezone? OK, that's good! (but maybe add a small comment since I would not expect that)
| pd.Timestamp('2010-01-01 12:00:00', | ||
| tzinfo=pytz.FixedOffset(-60))]], | ||
| ['%Y-%m-%d %H:%M:%S %z', | ||
| ['2010-01-01 12:00:00 Z', '2010-01-01 12:00:00 Z'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also work with %Z?
It seems that with datetime.datetime.strptime it does not work with either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex I pulled from https://github.com/python/cpython/blob/master/Lib/_strptime.py has an option for 'Z' with %z:
But %Z only makes timezones found in the system local time available, i.e. no 'Z' option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK (that's probably a newer addition to python), then it makes sense to follow upstream python to be consistent
| pd.to_datetime(dates, format=fmt, box=box, utc=True) | ||
| | ||
| @pytest.mark.parametrize('offset', [ | ||
| '+0', '-1foo', 'UTCbar', ':10', '+01:000:01']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an empty string here as well?
| thanks @mroeschke nice patch! betcha didn't think it would be this long when you first put it up! hahah. tests and code look great! |
| ha no problem, thanks! |
git diff upstream/master -u -- "*.py" | flake8 --diffThe implimentiontion is near identical to https://github.com/python/cpython/blob/master/Lib/_strptime.py an currently works as
datetime.strptimewould:Currently, an offset needs to get passed (%z) in order for the tzname used be used (%Z).
I'd like to get feedback of what this function should return having parsed %z or %Z. It may be difficult to return a normal DatimeIndex/Series/array given the following edge cases:
[date] +0100, [date]. -0600, [date] +1530)[date] UTC, [date]. EST, [date] CET)I suppose the most agnostic thing to is to return an array of Timestamps?