PERF: DataFrame.transpose with dt64tz by jbrockmendel · Pull Request #40149 · pandas-dev/pandas

Does what it says on the tin: DatetimeBlock.values is always DatetimeArray, and dt64tzblock.shape == dt64tzblock.values.shape in all cases. Similarly TimedeltaBlock.values is always TimedeltaArray.
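
For readers who want to see the user-facing operation this speeds up, here is a minimal sketch using only public pandas API. The frame size and column names are made up for illustration and are not those of the ASV benchmark; the sketch only shows that transposing an all-dt64tz frame keeps the tz-aware dtype, and does not inspect the internal Block objects discussed above.

```python
import pandas as pd

# Hypothetical small frame; the real benchmark uses a much larger one.
dti = pd.date_range("2016-01-01", periods=4, tz="US/Pacific")
df = pd.DataFrame({"a": dti, "b": dti + pd.Timedelta(days=1)})

# Transposing a frame whose columns all share one tz-aware dtype
# yields columns of that same dtype rather than object dtype.
transposed = df.T
print(transposed.shape)
print(transposed.dtypes.unique())

# Transposing again gives back the original values.
round_tripped = transposed.T
print((round_tripped["a"] == df["a"]).all())
```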

ASVs: run repeatedly (vs master from yesterday) with --record-samples --append-samples, so I'm pretty confident these are stable (but they still include some nonsense results, xref #40066)

       before           after         ratio
     [f4b67b5e]       [65792836]
     <master>         <ref-hybrid-3>
+        10.1±3ms         13.9±3ms     1.38  eval.Eval.time_add('python', 'all')
+     2.06±0.02ms      2.40±0.06ms     1.16  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 1000000)
+         227±2μs          263±2μs     1.15  groupby.GroupByMethods.time_dtype_as_field('datetime', 'head', 'transformation')
+         228±2μs          261±2μs     1.15  groupby.GroupByMethods.time_dtype_as_field('datetime', 'head', 'direct')
+         238±2μs          272±2μs     1.14  groupby.GroupByMethods.time_dtype_as_field('datetime', 'tail', 'transformation')
+         248±6μs          282±5μs     1.14  groupby.GroupByMethods.time_dtype_as_field('datetime', 'tail', 'direct')
+     3.92±0.03ms      4.37±0.01ms     1.11  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda> at 0x7fb1c0b40670>, 'cython', 'median')
+     2.83±0.02ms      3.14±0.06ms     1.11  io.hdf.HDFStoreDataFrame.time_store_info
-         275±4μs          248±4μs     0.90  groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'direct')
-     1.41±0.05ms      1.27±0.01ms     0.90  stat_ops.FrameOps.time_op('sum', 'int', 1)
-     1.13±0.06ms      1.02±0.07ms     0.90  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function ne>)
-         271±2μs          242±2μs     0.89  groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'transformation')
-         188±3μs          167±1μs     0.89  algos.isin.IsIn.time_isin_empty('datetime64[ns]')
-         192±2μs          170±2μs     0.89  algos.isin.IsIn.time_isin_mismatched_dtype('datetime64[ns]')
-         227±2μs          200±2μs     0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'direct')
-         226±2μs          199±1μs     0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'transformation')
-         227±2μs          199±1μs     0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'transformation')
-        895±60μs         785±80μs     0.88  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function ge>)
-      10.2±0.3ms       8.93±0.7ms     0.88  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 19, 'inside')
-         235±4μs          204±4μs     0.87  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'direct')
-     3.26±0.03μs      2.83±0.03μs     0.87  frame_methods.ToNumpy.time_to_numpy_tall
-     3.28±0.03μs      2.82±0.02μs     0.86  frame_methods.ToNumpy.time_to_numpy_wide
-      9.77±0.2ms       8.40±0.2ms     0.86  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-     2.94±0.05μs      2.52±0.02μs     0.86  frame_methods.ToNumpy.time_values_tall
-     2.95±0.03μs      2.52±0.02μs     0.85  frame_methods.ToNumpy.time_values_wide
-     2.09±0.02ms      1.77±0.01ms     0.85  groupby.FillNA.time_df_ffill
-     2.09±0.02ms      1.77±0.01ms     0.85  groupby.FillNA.time_df_bfill
-         204±3μs          168±2μs     0.82  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<Day>)
-         170±2μs          137±3μs     0.81  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'direct')
-         170±2μs          137±4μs     0.80  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'transformation')
-        29.1±3ms       22.9±0.4ms     0.79  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 20, 'outside')
-        26.0±1ms         19.2±2ms     0.74  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 20, 'inside')
-      26.2±0.2ms      18.0±0.07ms     0.69  index_object.SetOperations.time_operation('date_string', 'symmetric_difference')
-      11.6±0.1ms      7.32±0.08ms     0.63  reshape.ReshapeExtensionDtype.time_stack('datetime64[ns, US/Pacific]')
-      40.2±0.5μs       25.0±0.3μs     0.62  ctors.SeriesDtypesConstructors.time_dtindex_from_index_with_series
-     3.77±0.03ms      2.08±0.03ms     0.55  reshape.ReshapeExtensionDtype.time_unstack_slow('datetime64[ns, US/Pacific]')
-      32.1±0.5μs       17.0±0.2μs     0.53  ctors.SeriesDtypesConstructors.time_dtindex_from_series
-     1.11±0.03ms          408±7μs     0.37  categoricals.Constructor.time_datetimes
-      14.1±0.1μs      1.26±0.02μs     0.09  attrs_caching.SeriesArrayAttribute.time_extract_array_numpy('datetime64')
-      13.7±0.1μs      1.04±0.03μs     0.08  attrs_caching.SeriesArrayAttribute.time_extract_array('datetime64')
-      13.0±0.2μs         455±10ns     0.04  attrs_caching.SeriesArrayAttribute.time_array('datetime64')
-        73.8±1ms      1.66±0.03ms     0.02  reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
-      64.3±0.9ms          258±2μs     0.00  reshape.ReshapeExtensionDtype.time_transpose('datetime64[ns, US/Pacific]')
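
The ratio column is after/before, rounded to two decimals, which is why the time_transpose row shows 0.00. As a rough sketch (the helper names `parse_timing` and `ratio` are made up here and are not part of asv), the unrounded ratios can be recomputed from the timings printed in the table:

```python
import re

# Multipliers converting ASV's unit suffixes to seconds.
# Both the Greek mu and the micro sign are accepted.
UNIT_SECONDS = {"ns": 1e-9, "μs": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0}

_TIMING = re.compile(r"([\d.]+)±[\d.]+(ns|μs|µs|ms|s)")


def parse_timing(text):
    """Return the central value of a timing such as '64.3±0.9ms', in seconds."""
    match = _TIMING.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised timing: {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_SECONDS[unit]


def ratio(before, after):
    """Return after/before, the quantity shown in the ratio column."""
    return parse_timing(after) / parse_timing(before)


# Timings copied from the time_transpose and time_unstack_fast rows.
transpose_ratio = ratio("64.3±0.9ms", "258±2μs")
unstack_fast_ratio = ratio("73.8±1ms", "1.66±0.03ms")

print(f"time_transpose:    {transpose_ratio:.4f}")
print(f"time_unstack_fast: {unstack_fast_ratio:.4f}")
print(f"transpose speedup: {1 / transpose_ratio:.0f}x")
```

This ignores the ± spread, so it only reproduces the central estimate for each row.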

IIRC the groupby.GroupByMethods.time_dtype_as_field benchmarks were heavily influenced by constructor overhead, which motivated #40054. I still need to try out @jorisvandenbossche's suggestion of a non-cython optimization there.