@jbrockmendel
Member

Does what it says on the tin: DatetimeBlock.values is always DatetimeArray, and dt64tzblock.shape == dt64tzblock.values.shape in all cases. Similarly TimedeltaBlock.values is always TimedeltaArray.
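A minimal sketch of the user-visible side of this, using only public API and assuming a reasonably recent pandas: the frame, its column names, and the "UTC" timezone are hypothetical illustration data, not taken from the PR. It does not inspect the Block internals discussed above; it only checks that transposing a homogeneous tz-aware frame keeps the tz-aware dtype and the values.

```python
import pandas as pd

# Hypothetical example data: three tz-aware columns sharing one dtype.
idx = pd.date_range("2021-01-01", periods=4, tz="UTC")
df = pd.DataFrame(
    {name: idx + pd.Timedelta(days=i) for i, name in enumerate("abc")}
)

# DataFrame.transpose is the operation named in the final PR title.
transposed = df.T

# Every column of the transposed frame should keep the tz-aware dtype
# (assumption: a pandas version that preserves extension dtypes on transpose).
same_dtype = all(dtype == df["a"].dtype for dtype in transposed.dtypes)

# Row "a" of the transposed frame should hold the values of column "a".
same_values = transposed.loc["a"].tolist() == df["a"].tolist()

print(same_dtype, same_values)
```

Comparing against `df["a"].dtype` rather than a hard-coded dtype string keeps the check independent of the datetime resolution a given pandas version picks.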

Notes:

  • It is straightforward to extend this to work for PeriodDtype (I have a branch). Haven't tried it, but I expect it would be similarly easy to do the same for CategoricalDtype.

Things that I'm not yet fully happy with:

  • fillna method on 2D (I think @simonjayhawkins commented on this in another branch recently),
  • nargminmax with 2D and mask.any()
  • pytables kludge

ASVs: run repeatedly (vs master from yesterday) with --record-samples --append-samples, so I'm pretty confident these are stable (but they still include some nonsense; xref #40066)

       before           after         ratio
     [f4b67b5e]       [65792836]
     <master>         <ref-hybrid-3>
+      10.1±3ms         13.9±3ms      1.38  eval.Eval.time_add('python', 'all')
+   2.06±0.02ms      2.40±0.06ms      1.16  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 1000000)
+       227±2μs          263±2μs      1.15  groupby.GroupByMethods.time_dtype_as_field('datetime', 'head', 'transformation')
+       228±2μs          261±2μs      1.15  groupby.GroupByMethods.time_dtype_as_field('datetime', 'head', 'direct')
+       238±2μs          272±2μs      1.14  groupby.GroupByMethods.time_dtype_as_field('datetime', 'tail', 'transformation')
+       248±6μs          282±5μs      1.14  groupby.GroupByMethods.time_dtype_as_field('datetime', 'tail', 'direct')
+   3.92±0.03ms      4.37±0.01ms      1.11  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda> at 0x7fb1c0b40670>, 'cython', 'median')
+   2.83±0.02ms      3.14±0.06ms      1.11  io.hdf.HDFStoreDataFrame.time_store_info
-       275±4μs          248±4μs      0.90  groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'direct')
-   1.41±0.05ms      1.27±0.01ms      0.90  stat_ops.FrameOps.time_op('sum', 'int', 1)
-   1.13±0.06ms      1.02±0.07ms      0.90  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function ne>)
-       271±2μs          242±2μs      0.89  groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'transformation')
-       188±3μs          167±1μs      0.89  algos.isin.IsIn.time_isin_empty('datetime64[ns]')
-       192±2μs          170±2μs      0.89  algos.isin.IsIn.time_isin_mismatched_dtype('datetime64[ns]')
-       227±2μs          200±2μs      0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'direct')
-       226±2μs          199±1μs      0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'transformation')
-       227±2μs          199±1μs      0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'transformation')
-      895±60μs         785±80μs      0.88  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function ge>)
-    10.2±0.3ms       8.93±0.7ms      0.88  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 19, 'inside')
-       235±4μs          204±4μs      0.87  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'direct')
-   3.26±0.03μs      2.83±0.03μs      0.87  frame_methods.ToNumpy.time_to_numpy_tall
-   3.28±0.03μs      2.82±0.02μs      0.86  frame_methods.ToNumpy.time_to_numpy_wide
-    9.77±0.2ms       8.40±0.2ms      0.86  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-   2.94±0.05μs      2.52±0.02μs      0.86  frame_methods.ToNumpy.time_values_tall
-   2.95±0.03μs      2.52±0.02μs      0.85  frame_methods.ToNumpy.time_values_wide
-   2.09±0.02ms      1.77±0.01ms      0.85  groupby.FillNA.time_df_ffill
-   2.09±0.02ms      1.77±0.01ms      0.85  groupby.FillNA.time_df_bfill
-       204±3μs          168±2μs      0.82  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<Day>)
-       170±2μs          137±3μs      0.81  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'direct')
-       170±2μs          137±4μs      0.80  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'transformation')
-      29.1±3ms       22.9±0.4ms      0.79  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 20, 'outside')
-      26.0±1ms         19.2±2ms      0.74  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 20, 'inside')
-    26.2±0.2ms      18.0±0.07ms      0.69  index_object.SetOperations.time_operation('date_string', 'symmetric_difference')
-    11.6±0.1ms      7.32±0.08ms      0.63  reshape.ReshapeExtensionDtype.time_stack('datetime64[ns, US/Pacific]')
-    40.2±0.5μs       25.0±0.3μs      0.62  ctors.SeriesDtypesConstructors.time_dtindex_from_index_with_series
-   3.77±0.03ms      2.08±0.03ms      0.55  reshape.ReshapeExtensionDtype.time_unstack_slow('datetime64[ns, US/Pacific]')
-    32.1±0.5μs       17.0±0.2μs      0.53  ctors.SeriesDtypesConstructors.time_dtindex_from_series
-   1.11±0.03ms          408±7μs      0.37  categoricals.Constructor.time_datetimes
-    14.1±0.1μs      1.26±0.02μs      0.09  attrs_caching.SeriesArrayAttribute.time_extract_array_numpy('datetime64')
-    13.7±0.1μs      1.04±0.03μs      0.08  attrs_caching.SeriesArrayAttribute.time_extract_array('datetime64')
-    13.0±0.2μs         455±10ns      0.04  attrs_caching.SeriesArrayAttribute.time_array('datetime64')
-      73.8±1ms      1.66±0.03ms      0.02  reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
-    64.3±0.9ms          258±2μs      0.00  reshape.ReshapeExtensionDtype.time_transpose('datetime64[ns, US/Pacific]')
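For reading the ratio column: it is the "after" time divided by the "before" time once both are in the same unit, so values below 1 are speedups. A small sketch that recomputes two rows from the displayed figures; `parse_timing` and `ratio` are hypothetical helper names, not part of asv, and asv itself presumably works from unrounded samples, so recomputed values can differ in the last digit.

```python
import re

# Multipliers to seconds for the unit suffixes that appear in the table.
UNITS = {"ns": 1e-9, "μs": 1e-6, "ms": 1e-3, "s": 1.0}


def parse_timing(cell: str) -> float:
    """Return the central value of a cell such as '64.3±0.9ms', in seconds."""
    match = re.fullmatch(r"([\d.]+)±[\d.]+(ns|μs|ms|s)", cell)
    if match is None:
        raise ValueError(f"unrecognized timing cell: {cell!r}")
    value, unit = match.groups()
    return float(value) * UNITS[unit]


def ratio(before: str, after: str) -> float:
    """after / before, so values below 1 mean the branch is faster."""
    return parse_timing(after) / parse_timing(before)


# Figures copied from the time_transpose and eval.Eval.time_add rows.
transpose_ratio = ratio("64.3±0.9ms", "258±2μs")
eval_ratio = ratio("10.1±3ms", "13.9±3ms")

print(f"{transpose_ratio:.2f}")      # 0.00
print(f"{eval_ratio:.2f}")           # 1.38
print(f"{1 / transpose_ratio:.0f}x") # 249x
```

The 0.00 on the time_transpose row is a rounding artifact of the two-decimal display: the underlying ratio is about 0.004, roughly a 249x speedup.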

IIRC the groupby.GroupByMethods.time_dtype_as_field were heavily influenced by constructor overhead, which motivated #40054. Still need to try out @jorisvandenbossche's suggestion of non-cython optimization there.

@jbrockmendel
Member Author

@jreback would it help to split off the perf-improving part of this for a follow-up to further trim the diff?

@jreback
Contributor

jreback commented Apr 20, 2021

> @jreback would it help to split off the perf-improving part of this for a follow-up to further trim the diff?

sure

@jbrockmendel
Member Author

jbrockmendel commented Apr 21, 2021

> > @jreback would it help to split off the perf-improving part of this for a follow-up to further trim the diff?
>
> sure

Hmm, this is looking more involved than I expected. I can try it if it makes a difference; basically it would split off everything in frame.py

@jreback
Contributor

jreback commented Apr 21, 2021

let me look again

I think it would be good if you can reduce what is added to frame

jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Apr 21, 2021
@jbrockmendel
Member Author

broken in as close to half as possible (not that close) in #41082

@jbrockmendel
Member Author

fairly small diff now

@jbrockmendel jbrockmendel changed the title from "POC/REF: Back DatetimeTZBlock directly by (sometimes 2D) DTA" to "PERF: DataFrame.transpose with dt64tz" May 10, 2021
@jreback jreback added this to the 1.3 milestone May 17, 2021
@jreback
Contributor

jreback commented May 17, 2021

looks fine. can you rebase? pls add a whatsnew note (as this is a non-trivial perf increase). ping on green.

@jbrockmendel
Member Author

ping

@jreback jreback merged commit 93fb9d9 into pandas-dev:master May 17, 2021
@jreback
Contributor

jreback commented May 17, 2021

thanks!

@jbrockmendel jbrockmendel deleted the ref-hybrid-3 branch May 17, 2021 19:22
TLouf pushed a commit to TLouf/pandas that referenced this pull request Jun 1, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

Labels

Refactor Internal refactoring of code

6 participants