Skip to content

Conversation

@jbrockmendel
Copy link
Member

Unifies apply_index implementations for MonthEnd/MonthBegin, plus extends them to BMonthEnd and BMonthBegin.

Unifies onOffset implementations for QuarterEnd/BQuarterEnd, plus extends them to QuarterBegin/BQuarterBegin.

Implements a cdef version of monthrange.

@jbrockmendel
Copy link
Member Author

(will run asv shortly)


from numpy cimport int64_t, int32_t


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call this calendar.pxd ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid overlap with stdlib names. The thought behind ccalendar is by analogy with chardet --> cchardet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think that's actually a problem, I guess ccalendar is ok.

@cython.boundscheck(False)
@cython.cdivision
cdef int dayofweek(int y, int m, int d) nogil:
"""Sakamoto's method, from wikipedia"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the link

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string

@cython.wraparound(False)
@cython.boundscheck(False)
cpdef monthrange(int64_t year, Py_ssize_t month):
cdef:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string



cdef int is_leapyear(int64_t year) nogil:
"""Returns 1 if the given year is a leap year, 0 otherwise."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string

from numpy cimport int64_t, int32_t


cpdef monthrange(int64_t year, Py_ssize_t month)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so monthrange is actually used in a few places. changing to use this one? (similar with other methods you defined), e.g. its defined in tslib.pyx now (as well as normalize_date)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. I figured I'd way to update the other usages until after you pointed it out, keep the diff smaller for the first round.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I'd rather see the far reaching changes (if any) now. Ideally in a PR you change a small number of things, but clean them up everywhere (e.g. replace usages of monthrange). sure its not always possible, but then its not 'half-done'

# ----------------------------------------------------------------------
# Constants

# Slightly more performant cython lookups than a 2D table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then add that the first 12 are non-leap years and second are.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@jreback jreback added Clean Frequency DateOffsets labels Nov 25, 2017
@jbrockmendel
Copy link
Member Author

OK, I've addressed all the comments locally, am running benchmarks (several times) and fixed the travis flake problem. Doing some extra cleanup and parametrization of the tests to make sure the affected offset methods are covered, will update later.

if self.isAnchored:
return self.rule_code
else:
month = liboffsets._int_to_month[self.n]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob should de-privatize these in offsets (_int_to_month)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@jbrockmendel
Copy link
Member Author

OK, just confirmed: the existing tests do not hit the new apply_index methods, and the speedup is pretty enormous.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2017

what is the perf issue here?

@jbrockmendel
Copy link
Member Author

what is the perf issue here?

I'm still running asvs to try to pin it down. AFAICT something in ccalendar must be less performant than the version in src/datetime

taskset 4 asv continuous -f 1.1 -E virtualenv master HEAD -b offset -b timeseries -b period -b timedelta [...] before after ratio [be66ef83] [b244bab9] + 2.50±0.3ms 4.81±0.4ms 1.93 period.Algorithms.time_drop_duplicates('series') + 2.81±0.02ms 5.27±0.07ms 1.88 period.Algorithms.time_value_counts('series') + 398±20ms 460±1ms 1.16 offset.SemiMonthOffset.time_begin_apply_index + 401±1ms 461±2ms 1.15 offset.SemiMonthOffset.time_end_apply_index + 406±6ms 458±2ms 1.13 offset.SemiMonthOffset.time_begin_incr_rng - 22.8±0.5μs 20.4±0.06μs 0.89 timeseries.DatetimeAccessor.time_dt_accessor - 126±6μs 112±0.3μs 0.89 timeseries.DatetimeIndex.time_unique - 1.56s 1.38s 0.89 offset.ApplyIndex.time_apply_series(<BusinessYearBegin: month=1>) - 68.0±3μs 60.1±0.2μs 0.88 period.PeriodUnaryMethods.time_now('min') - 79.1±0.1μs 68.6±0.2μs 0.87 period.PeriodProperties.time_start_time('M') - 25.7±2ms 22.0±0ms 0.86 timeseries.DatetimeIndex.time_to_pydatetime - 180±5ns 154±0.6ns 0.85 timedelta.TimedeltaProperties.time_timedelta_nanoseconds - 1.61s 1.37s 0.85 offset.ApplyIndex.time_apply_series(<BusinessYearEnd: month=12>) - 11.3±0.07μs 9.55±0.08μs 0.85 offset.YearBegin.time_timeseries_year_apply - 24.8±2μs 20.8±0.1μs 0.84 offset.SemiMonthOffset.time_begin_decr - 25.2s 21.1s 0.84 gil.nogil_datetime_fields.time_period_to_datetime - 26.9±0.2μs 22.1±0.09μs 0.82 period.PeriodUnaryMethods.time_now('M') - 1.74s 1.42s 0.82 offset.ApplyIndex.time_apply_series(<BusinessQuarterEnd: startingMonth=3>) - 1.04±0μs 848±2ns 0.81 period.PeriodProperties.time_minute('min') - 9.62s 7.76s 0.81 gil.nogil_datetime_fields.time_datetime_to_period - 9.17±0.3ms 7.31±0.09ms 0.80 timeseries.DatetimeIndex.time_dti_tz_factorize - 15.3±1ms 12.2±0.06ms 0.80 timeseries.DatetimeIndex.time_infer_freq_none - 1.46s 14.5±0.3ms 0.01 offset.ApplyIndex.time_apply_series(<BusinessMonthEnd>) - 1.48s 13.8±0.08ms 0.01 offset.ApplyIndex.time_apply_series(<BusinessMonthBegin>) - 1.50s 11.9±0.09ms 0.01 offset.ApplyIndex.time_apply_index(<BusinessMonthBegin>) - 1.54s 12.3±0.2ms 0.01 offset.ApplyIndex.time_apply_index(<BusinessMonthEnd>) 

b244bab9 is after updating all the cimports as suggested, except for one in fields that I reverted to try track down the regression.

@codecov
Copy link

codecov bot commented Nov 26, 2017

Codecov Report

Merging #18489 into master will decrease coverage by 0.01%.
The diff coverage is 95.23%.

Impacted file tree graph

@@ Coverage Diff @@ ## master #18489 +/- ## ========================================== - Coverage 91.33% 91.32% -0.02%  ========================================== Files 163 163 Lines 49801 49773 -28 ========================================== - Hits 45487 45454 -33  - Misses 4314 4319 +5
Flag Coverage Δ
#multiple 89.12% <95.23%> (-0.02%) ⬇️
#single 40.72% <28.57%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 96.94% <95.23%> (-0.09%) ⬇️
pandas/core/indexes/interval.py 92.69% <0%> (-0.53%) ⬇️
pandas/core/indexes/datetimes.py 95.52% <0%> (-0.2%) ⬇️
pandas/core/generic.py 95.73% <0%> (-0.06%) ⬇️
pandas/core/indexes/base.py 96.4% <0%> (-0.02%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.01%) ⬇️
pandas/core/strings.py 98.46% <0%> (-0.01%) ⬇️
pandas/core/internals.py 94.47% <0%> (-0.01%) ⬇️
pandas/core/panel.py 97.14% <0%> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d101064...d3ff628. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 26, 2017

Codecov Report

Merging #18489 into master will decrease coverage by 50.56%.
The diff coverage is 28.57%.

Impacted file tree graph

@@ Coverage Diff @@ ## master #18489 +/- ## =========================================== - Coverage 91.33% 40.77% -50.57%  =========================================== Files 163 163 Lines 49801 49796 -5 =========================================== - Hits 45487 20304 -25183  - Misses 4314 29492 +25178
Flag Coverage Δ
#multiple ?
#single 40.77% <28.57%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 44.04% <28.57%> (-52.99%) ⬇️
pandas/tools/hashing.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/io/formats/style.py 0% <0%> (-100%) ⬇️
pandas/parser.py 0% <0%> (-100%) ⬇️
pandas/lib.py 0% <0%> (-100%) ⬇️
pandas/io/json/json.py 0% <0%> (-100%) ⬇️
pandas/types/concat.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.08%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.28%) ⬇️
... and 112 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d101064...c503c53. Read the comment docs.

@jbrockmendel
Copy link
Member Author

Separating the ccalendar stuff from the new apply_index and onOffset methods. The newly-pushed commit should be an unambiguous win.

asv continuous -f 1.1 -E virtualenv master HEAD -b offset -b period -b timedelta -b timeseries [...] [d1010643] [d3ff6286] + 27.5±0.09μs 36.5±2μs 1.33 offset.CBDayHolidays.time_custom_bday_cal_decr + 499±1ms 612±20ms 1.23 offset.ApplyIndex.time_apply_series(<QuarterEnd: startingMonth=3>) + 18.5±0.04μs 22.2±0.8μs 1.20 offset.Day.time_timeseries_day_apply + 467±2ms 554±2ms 1.19 offset.SemiMonthOffset.time_end_apply_index + 22.1±0.1μs 25.0±0.2μs 1.13 offset.SemiMonthOffset.time_end_decr_n + 24.3±0.1μs 27.5±0.1μs 1.13 offset.CBDayHolidays.time_custom_bday_cal_incr_n + 911±4ns 1.02±0.06μs 1.12 period.PeriodProperties.time_is_leap_year('M') + 21.0±0.09μs 23.4±1μs 1.11 period.PeriodUnaryMethods.time_now('M') + 107±0.2ns 118±10ns 1.10 timedelta.TimedeltaProperties.time_timedelta_seconds - 1.73s 1.53s 0.89 timeseries.Iteration.time_iter_periodindex - 3.93±0.04ms 3.47±0.02ms 0.88 period.PeriodIndexConstructor.time_from_pydatetime('D') - 7.60±0.07μs 6.60±0.02μs 0.87 timeseries.DatetimeIndex.time_timestamp_tzinfo_cons - 1.62s 1.41s 0.87 offset.ApplyIndex.time_apply_series(<BusinessYearEnd: month=12>) - 530±9μs 459±0.6μs 0.87 period.Algorithms.time_value_counts('index') - 536±7ms 458±3ms 0.85 offset.SemiMonthOffset.time_begin_apply_index - 1.03±0.1μs 866±7ns 0.84 period.PeriodProperties.time_day('min') - 133±10μs 111±0.05μs 0.83 timeseries.DatetimeIndex.time_unique - 1.05±0.04μs 873±1ns 0.83 period.PeriodProperties.time_dayofyear('min') - 20.1±0.3μs 16.4±0.09μs 0.82 timeseries.AsOf.time_asof_single_early - 4.39±0.1ms 3.31±0.01ms 0.75 timeseries.ToDatetime.time_cache_true_with_unique_seconds_and_unit - 24.0s 14.2s 0.59 gil.nogil_datetime_fields.time_datetime_to_period - 1.46s 20.8±0.8ms 0.01 offset.ApplyIndex.time_apply_index(<BusinessMonthEnd>) - 1.55s 20.9±0.1ms 0.01 offset.ApplyIndex.time_apply_series(<BusinessMonthBegin>) - 1.61s 19.6±0.4ms 0.01 offset.ApplyIndex.time_apply_index(<BusinessMonthBegin>) - 1.79s 21.6±0.2ms 0.01 offset.ApplyIndex.time_apply_series(<BusinessMonthEnd>) - 668±2μs 6.93±0.02μs 0.01 offset.OnOffset.time_on_offset(<BusinessQuarterBegin: startingMonth=3>) - 628±2μs 6.10±0.04μs 0.01 offset.OnOffset.time_on_offset(<QuarterBegin: startingMonth=3>) 

Running again. The 100x improvements are real. The slowdowns I expect to be noise.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2017

what path was BusinessMonthEnd for example taking before that made it a lot slower? all in python?

@jbrockmendel
Copy link
Member Author

The base class applyindex raises, at which point it falls back to point wise addition

@jreback jreback added the Performance Memory or execution speed performance label Nov 27, 2017
@jreback jreback added this to the 0.22.0 milestone Nov 27, 2017
@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

this needs a note for performance in whats for 0.22.0. ping when pushed as this is green already.

@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

add an additional line (even though you have one for #18218). in general any PR that is perf related should get an entry (or you can include that PR number on an existing entry). cleaning/reorg PR's that don't touch the user don't need ones.

@jbrockmendel
Copy link
Member Author

ping

@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

thanks!

@jreback jreback merged commit f745e52 into pandas-dev:master Nov 27, 2017
@jbrockmendel jbrockmendel deleted the tslibs-offsets6 branch December 8, 2017 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Clean Frequency DateOffsets Performance Memory or execution speed performance

2 participants