PERF: Always using panda's hashtable approach, dropping np.in1d by realead · Pull Request #36611 · pandas-dev/pandas

The timings are below. Please use asv continuous -f 1.01 upstream/master HEAD~1 -b ^series_methods.IsInLongSeries to reproduce them, because I removed some of the tests in the last commit: it is nice to see where the cut could be made, but they otherwise take too much time. For comparison with earlier timings, note that IsInLongSeries was renamed to IsInLongSeriesLookUpDominates:

       before           after         ratio
     [8d1b8aba]       [ef49ca39]
+      11.1±0.1ms          130±1ms    11.70  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 1, 'monotone_misses')
+      18.9±0.2ms        203±0.6ms    10.73  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 2, 'monotone_misses')
+     18.9±0.08ms        176±0.3ms     9.34  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 2, 'random_misses')
+      11.2±0.1ms       90.3±0.4ms     8.10  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 1, 'random_misses')
+      19.0±0.2ms        126±0.6ms     6.66  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 2, 'random_hits')
+      34.6±0.2ms          224±3ms     6.49  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 2, 'monotone_misses')
+        43.3±2ms          265±2ms     6.13  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 5, 'monotone_misses')
+      43.0±0.5ms          249±1ms     5.78  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 5, 'random_misses')
+      34.6±0.2ms          193±2ms     5.58  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 2, 'random_misses')
+      26.2±0.7ms          145±1ms     5.52  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 1, 'monotone_misses')
+      42.7±0.5ms          218±2ms     5.10  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 5, 'random_hits')
+      57.8±0.3ms          280±1ms     4.85  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 5, 'monotone_misses')
+      11.0±0.2ms       51.0±0.8ms     4.65  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 1, 'monotone_hits')
+      57.6±0.1ms          265±2ms     4.60  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 5, 'random_misses')
+      11.3±0.1ms       50.9±0.4ms     4.50  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 1, 'random_hits')
+      23.0±0.2ms          102±2ms     4.42  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 2, 'monotone_misses')
+      13.1±0.2ms       53.0±0.2ms     4.06  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 1, 'random_hits')
+      13.3±0.3ms       53.1±0.2ms     4.00  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 1, 'monotone_hits')
+      19.0±0.4ms       74.6±0.3ms     3.93  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 2, 'monotone_hits')
+        36.3±3ms          141±2ms     3.89  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 2, 'random_hits')
+      27.3±0.9ms        106±0.7ms     3.88  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 1, 'random_misses')
+        60.7±1ms          234±2ms     3.86  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 5, 'random_hits')
+      13.2±0.4ms       50.0±0.7ms     3.80  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 1, 'monotone_misses')
+      12.8±0.1ms       47.7±0.3ms     3.73  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 1, 'random_misses')
+        84.1±1ms          273±1ms     3.25  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 10, 'random_misses')
+      83.6±0.9ms          271±1ms     3.25  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 10, 'monotone_misses')
+      83.0±0.5ms          248±2ms     2.99  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 10, 'random_hits')
+      38.8±0.9ms        115±0.9ms     2.96  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 2, 'monotone_misses')
+      98.5±0.7ms          287±1ms     2.91  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 10, 'monotone_misses')
+         100±2ms        287±0.9ms     2.87  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 10, 'random_misses')
+      42.7±0.8ms        116±0.7ms     2.72  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 5, 'monotone_hits')
+        98.5±1ms          262±2ms     2.66  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 10, 'random_hits')
+      34.4±0.4ms       90.5±0.9ms     2.63  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 2, 'monotone_hits')
+      53.0±0.3ms        137±0.4ms     2.59  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 5, 'random_misses')
+      26.2±0.3ms       66.5±0.6ms     2.54  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 1, 'monotone_hits')
+      26.6±0.2ms       66.4±0.4ms     2.49  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 1, 'random_hits')
+        53.2±1ms          133±1ms     2.49  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 5, 'monotone_misses')
+      28.5±0.2ms         69.3±1ms     2.43  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 1, 'monotone_hits')
+      29.0±0.2ms       68.0±0.7ms     2.35  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 1, 'random_hits')
+      23.0±0.2ms       52.9±0.4ms     2.30  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 2, 'monotone_hits')
+      23.2±0.2ms       52.7±0.2ms     2.27  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 2, 'random_hits')
+        58.6±1ms          133±1ms     2.27  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 5, 'monotone_hits')
+      28.6±0.6ms       64.7±0.5ms     2.26  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 1, 'monotone_misses')
+        68.3±2ms          153±1ms     2.24  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 5, 'random_misses')
+      28.5±0.4ms       61.5±0.7ms     2.16  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 1, 'random_misses')
+        70.2±2ms        149±0.8ms     2.13  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 5, 'monotone_misses')
+      23.0±0.1ms       47.2±0.8ms     2.05  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 2, 'random_misses')
+      38.3±0.5ms       68.9±0.8ms     1.80  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 2, 'monotone_hits')
+         132±3ms          231±2ms     1.75  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 16, 'monotone_misses')
+        38.8±1ms       67.6±0.5ms     1.74  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 2, 'random_hits')
+         144±1ms          246±2ms     1.70  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 16, 'monotone_misses')
+       103±0.7ms          173±2ms     1.68  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 10, 'monotone_misses')
+      38.6±0.5ms         61.7±1ms     1.60  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 2, 'random_misses')
+         117±1ms          187±1ms     1.59  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 10, 'monotone_misses')
+         132±1ms          200±1ms     1.52  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 16, 'random_misses')
+      83.8±0.8ms        124±0.6ms     1.48  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 10, 'monotone_hits')
+         147±1ms        213±0.3ms     1.45  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 16, 'random_misses')
+         131±1ms        189±0.7ms     1.45  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 16, 'random_hits')
+      97.7±0.6ms          139±1ms     1.43  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 10, 'monotone_hits')
+         147±3ms          205±2ms     1.39  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 16, 'random_hits')
+         374±3ms         518±10ms     1.39  series_methods.IsInLongSeriesValuesDominate.time_isin('float64', 'monotone')
+         390±4ms         506±10ms     1.30  series_methods.IsInLongSeriesValuesDominate.time_isin('float32', 'monotone')
+       102±0.3ms        118±0.9ms     1.16  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 10, 'random_misses')
-         147±1ms          130±1ms     0.88  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 16, 'monotone_hits')
-         177±3ms        155±0.5ms     0.87  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 16, 'monotone_misses')
-         133±1ms          115±1ms     0.87  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 16, 'monotone_hits')
-         162±3ms        139±0.9ms     0.86  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 16, 'monotone_misses')
-         342±5ms          277±1ms     0.81  series_methods.IsInLongSeriesValuesDominate.time_isin('int32', 'monotone')
-         326±4ms          264±2ms     0.81  series_methods.IsInLongSeriesValuesDominate.time_isin('int64', 'monotone')
-         542±3ms          357±4ms     0.66  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 100000, 'monotone_misses')
-         532±6ms          345±4ms     0.65  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 100000, 'monotone_misses')
-         522±2ms          298±1ms     0.57  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 100000, 'monotone_misses')
-         121±2ms       68.6±0.5ms     0.57  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 10, 'random_hits')
-         123±4ms       68.2±0.4ms     0.56  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 10, 'monotone_hits')
-         518±8ms          286±3ms     0.55  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 100000, 'monotone_misses')
-         104±2ms       53.5±0.6ms     0.52  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 10, 'monotone_hits')
-         104±1ms       52.8±0.1ms     0.51  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 10, 'random_hits')
-         418±5ms        200±0.9ms     0.48  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 50, 'monotone_misses')
-         536±4ms          246±2ms     0.46  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 1000, 'monotone_misses')
-         1.27±0s         577±20ms     0.45  series_methods.IsInLongSeriesValuesDominate.time_isin('float32', 'random')
-        408±30ms          183±1ms     0.45  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 50, 'monotone_misses')
-         418±4ms          187±1ms     0.45  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 50, 'random_misses')
-         535±7ms          229±2ms     0.43  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 1000, 'monotone_misses')
-         405±8ms        173±0.6ms     0.43  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 50, 'random_misses')
-         1.26±0s          528±7ms     0.42  series_methods.IsInLongSeriesValuesDominate.time_isin('float64', 'random')
-         180±2ms         69.2±2ms     0.38  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 16, 'monotone_hits')
-         179±2ms       67.9±0.2ms     0.38  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 16, 'random_hits')
-         1.09±0s         409±20ms     0.38  series_methods.IsInLongSeriesValuesDominate.time_isin('int32', 'random')
-         416±3ms          155±1ms     0.37  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 50, 'random_hits')
-      1.08±0.01s          391±5ms     0.36  series_methods.IsInLongSeriesValuesDominate.time_isin('int64', 'random')
-         399±3ms          142±2ms     0.35  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 50, 'random_hits')
-         181±1ms       61.4±0.3ms     0.34  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 16, 'random_misses')
-         436±2ms          147±2ms     0.34  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 100000, 'monotone_hits')
-       160±0.6ms       53.1±0.6ms     0.33  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 16, 'random_hits')
-       162±0.6ms       52.9±0.3ms     0.33  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 16, 'monotone_hits')
-         424±5ms          132±2ms     0.31  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 100000, 'monotone_hits')
-         514±2ms          158±2ms     0.31  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 1000, 'monotone_misses')
-         163±3ms       46.6±0.2ms     0.29  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 16, 'random_misses')
-         420±6ms        120±0.8ms     0.28  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 50, 'monotone_hits')
-         507±7ms        144±0.9ms     0.28  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 1000, 'monotone_misses')
-         442±3ms        124±0.4ms     0.28  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 1000, 'monotone_hits')
-         401±5ms        105±0.7ms     0.26  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 50, 'monotone_hits')
-         437±8ms        109±0.7ms     0.25  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 1000, 'monotone_hits')
-         812±3ms          201±1ms     0.25  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 100, 'random_misses')
-         815±4ms          197±1ms     0.24  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 100, 'monotone_misses')
-         796±7ms          187±2ms     0.23  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 100, 'random_misses')
-         519±5ms        120±0.6ms     0.23  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 50, 'monotone_misses')
-         803±8ms          184±2ms     0.23  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 100, 'monotone_misses')
-         499±8ms          105±1ms     0.21  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 50, 'monotone_misses')
-         816±4ms        151±0.6ms     0.18  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 100, 'random_hits')
-         405±2ms         73.7±1ms     0.18  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 100000, 'monotone_hits')
-         389±1ms       68.2±0.6ms     0.18  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 1000, 'monotone_hits')
-      2.07±0.02s          358±2ms     0.17  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 100000, 'random_misses')
-      1.94±0.01s          333±5ms     0.17  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 100000, 'random_misses')
-        796±10ms        136±0.3ms     0.17  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 100, 'random_hits')
-      2.04±0.02s          342±2ms     0.17  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 100000, 'random_misses')
-      1.94±0.03s          317±2ms     0.16  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 100000, 'random_misses')
-      2.06±0.01s          324±1ms     0.16  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 100000, 'random_hits')
-      2.04±0.01s          311±2ms     0.15  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 100000, 'random_hits')
-         814±5ms          120±1ms     0.15  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 100, 'monotone_hits')
-      1.71±0.01s        251±0.7ms     0.15  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 1000, 'random_misses')
-         399±9ms       56.9±0.4ms     0.14  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 100000, 'monotone_hits')
-         381±5ms       53.4±0.2ms     0.14  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 1000, 'monotone_hits')
-      1.71±0.02s          236±3ms     0.14  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 1000, 'random_misses')
-         800±7ms          106±1ms     0.13  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 100, 'monotone_hits')
-         522±4ms       68.5±0.5ms     0.13  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 50, 'monotone_hits')
-        532±20ms       67.9±0.2ms     0.13  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 50, 'random_hits')
-      1.01±0.01s          122±1ms     0.12  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 100, 'monotone_misses')
-         519±6ms       62.3±0.7ms     0.12  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 50, 'random_misses')
-         998±4ms          106±1ms     0.11  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 100, 'monotone_misses')
-         498±5ms       52.7±0.1ms     0.11  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 50, 'random_hits')
-         501±8ms       52.9±0.4ms     0.11  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 50, 'monotone_hits')
-      1.71±0.01s          168±1ms     0.10  series_methods.IsInLongSeriesLookUpDominates.time_isin('float32', 1000, 'random_hits')
-         500±7ms       46.8±0.7ms     0.09  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 50, 'random_misses')
-      1.70±0.01s          154±2ms     0.09  series_methods.IsInLongSeriesLookUpDominates.time_isin('float64', 1000, 'random_hits')
-      1.02±0.01s       68.4±0.6ms     0.07  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 100, 'random_hits')
-      1.02±0.01s         68.0±1ms     0.07  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 100, 'monotone_hits')
-      1.01±0.01s       61.5±0.6ms     0.06  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 100, 'random_misses')
-      1.00±0.01s       53.7±0.9ms     0.05  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 100, 'monotone_hits')
-         999±6ms       53.5±0.9ms     0.05  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 100, 'random_hits')
-         1.00±0s       47.1±0.5ms     0.05  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 100, 'random_misses')
-      1.92±0.01s         85.8±1ms     0.04  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 100000, 'random_hits')
-         1.59±0s       67.9±0.2ms     0.04  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 1000, 'random_hits')
-      1.58±0.01s       62.0±0.5ms     0.04  series_methods.IsInLongSeriesLookUpDominates.time_isin('int32', 1000, 'random_misses')
-      1.91±0.01s       70.0±0.3ms     0.04  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 100000, 'random_hits')
-      1.57±0.01s         55.6±2ms     0.04  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 1000, 'random_hits')
-      1.58±0.05s       47.0±0.2ms     0.03  series_methods.IsInLongSeriesLookUpDominates.time_isin('int64', 1000, 'random_misses')

When the look-up is dominated by the calculation of the hash function (small numbers), we see the disadvantage of #36729: it is costlier now (by almost a factor of 3). However, already for n around 100 we see the advantage of a more robust hash function: for some series we are almost 10 times faster (e.g. series_methods.IsInLongSeries.time_isin('float32', 100, 'random_misses'): 1.8s vs. 201±1ms).

The question is: is it worth keeping np.in1d for small inputs (len(values) < 16) to get the best possible performance, or should it be dropped completely?

Seeing the numbers, I would say "Yes", it is worth keeping, even if it makes the code harder to understand/maintain. What is your opinion @jreback @jbrockmendel @WillAyd ?
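
To make the trade-off concrete, here is a minimal sketch of what such a size-based dispatch could look like. It is not the code in this PR. The function name isin_sketch and the constant SMALL_VALUES_CUTOFF are hypothetical, the cutoff of 16 is simply taken from the question above, np.isin stands in for np.in1d (for 1-D input they give the same result), and a plain Python set stands in for pandas' hashtable, so the absolute timings of this sketch say nothing about the real implementation. NaN handling is also ignored here.

```python
import numpy as np

# Hypothetical cutoff, taken from the question above (len(values) < 16).
SMALL_VALUES_CUTOFF = 16


def isin_sketch(comps, values):
    """Return a boolean mask: which elements of comps occur in values.

    Illustrative only: dispatches on len(values) between a NumPy path
    and a hash-based path.
    """
    comps = np.asarray(comps)
    values = np.asarray(values)

    if len(values) < SMALL_VALUES_CUTOFF:
        # Few values: use NumPy's membership test (np.isin is the
        # current spelling of the 1-D np.in1d behaviour).
        return np.isin(comps, values)

    # Many values: build a hash set once, then do one look-up per element.
    # A Python set is only a stand-in for pandas' hashtable here.
    lookup = set(values.tolist())
    return np.fromiter(
        (x in lookup for x in comps.tolist()), dtype=bool, count=len(comps)
    )


comps = np.arange(10)
small_mask = isin_sketch(comps, [1, 3])                # NumPy path
large_mask = isin_sketch(comps, np.arange(0, 40, 2))   # hash path (20 values)
```

The cost of this design is the one mentioned above: two code paths with a magic threshold that has to be justified by benchmarks and kept in sync with them.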