ENH: Series constructor converts dicts with tuples in keys to MultiIndex by adgaudio · Pull Request #4805 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation128 Commits3 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

@adgaudio

closes #3323
related #4187
related #1848

Hi,

I'd love for Series (and DataFrame) initialized with a dict to automatically create the MultiIndex. See below example.

Desired Result:

In [1]: pandas.Series({('a', 'b'): 1, ('a', 'a'): 0, ('a', 'c'): 2, ('b', 'a'): 3, ('b', 'b'): 4})
Out[1]:
a  a    0
   b    1
   c    2
b  a    3
   b    4
dtype: int64

In [2]:

Current result:

In [1]: pandas.Series({('a', 'b'): 1, ('a', 'a'): 0, ('a', 'c'): 2, ('b', 'a'): 3, ('b', 'b'): 4})
Out[1]:
(a, a)    0
(a, b)    1
(a, c)    2
(b, a)    3
(b, b)    4
dtype: int64

In [2]:

Is this something I can go ahead and update tests for? It seems like we should have it.

Also, what are your thoughts about adding this to the DataFrame constructor as well?

Thanks,
Alex

@adgaudio

Previous discussion of:

hayd

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether it's worth considering the edge case where only the first element is a tuple, and just doing Index.

@hayd

You have some failing tests, because this isn't valid in python 3 (IIRC keys() is a set rather than a list):

data.keys()[0], tuple):
TypeError: 'dict_keys' object does not support indexing

You need to add some tests to demonstrate this behaviour for both Series and DataFrame.

I guess for consistency, this should also be the case if you are passing a list of tuples to index argument...? Are there any other cases where this would also make sense?

@adgaudio

Hey @hayd,

Thanks for your response. I was waiting for a response before continuing with this pr, hence the general incompleteness of the pr.

I think your idea to put this in the Index constructor is great, because then we don't need duplicate code in the Series or DataFrames. However, I've been trying to make this change, and it's actually quite complicated. First, I'm getting multiple seg faults (one of which I fixed and this fix might resolve other bugs in pandas). Second, because the Index constructor converts a list of tuples into a MultiIndex, it breaks a few tests.

It's fairly complicated to resolve, especially since I don't know an efficient way for figuring out causes for seg faults. Hopefully, I'll update this pr later this weekend with some progress.

@jtratner

@adgaudio if you're experiencing segfaults, you might want to check that you aren't doing the following (with Cython code):

  1. passing incorrectly typed objects to the code (Cython functions need to have exactly the correct type)
  2. passing objects of the wrong length (i.e., with code with the boundscheck false decorator)

For example, if you pass a tuple to a function that expects type list or a defaultdict to a function expecting dict, it could fail (though usually it'll explicitly tell you).

@hayd

@jtratner I've seen some segfaulting with the MultiIndexes before (example was Series of a mi #4187, now NotImplemented), I didn't look into the why it was happening...

@jtratner

@hayd if I had to hazard a guess, it's probably because MultiIndex is basically a fake ndarray, as opposed to Index which really is an ndarray. So you can pass a MultiIndex to Cython and it will try to do things with it, but all of them will fail, since the internal datastructure does't really have anything to do with ndarray.

@adgaudio

Thanks for that tips @jtratner and @hayd! I just figured out some interesting details about GH #4187 that could let us propose a workaround fix to that one... When the Series constructor is called, it calls _sanitize_array, which eventually calls infer_dtype and causes the seg fault. ...So, if you make the cpython call to "infer_dtype(...)" (see inference.pyx) with a MultiIndex, you can cause the seg fault.


If we're interested in avoiding workarounds, I think the core question we need to ask is just what @jtratner basically asked: Is it possible to make a MultiIndex compatible with Cython?

If we can't make the MultiIndex (or pa.Array) cython compatible, it seems pretty clear that we will have a million little work arounds for these damn MultiIndexes. The work-arounds I'd propose are to add an "if isinstance(X, MultiIndex): data = data.values" to _sanitize_array and a variation of that to _homogenize() (in frame.py)

@jreback

Its faulting somewhere else....this is safe to call (as it accepts an object type)

In [1]: pd.lib.infer_dtype(np.array([1,2,3]))
Out[1]: 'integer'

In [2]: pd.lib.infer_dtype(np.array([1,2,3],dtype=object))
Out[2]: 'integer'

In [3]: pd.lib.infer_dtype(np.array([1,2,3.0],dtype=object))
Out[3]: 'mixed-integer-float'

In [4]: pd.lib.infer_dtype(np.array(['a','b','c'],dtype=object))
Out[4]: 'string'

@jreback

@adgaudio show the traceback where it faults (the whole one)

@adgaudio

Sorry, I realized that I mis-spoke and updated my comment a few mins ago. This is the error I get. No traceback:

In [13]: a = pandas.MultiIndex.from_tuples(zip(range(10), range(10)))

In [14]: pandas.lib.infer_dtype(a)
zsh: segmentation fault (core dumped)  ipython

-- OR --

In [2]: pandas.lib.fast_multiget({1: 2}, a)
zsh: segmentation fault (core dumped)  ipython

@cpcloud

I'm getting this segfault as well. I don't even get an error message :(

@jreback

Do this.....the problem is an index/mi is a 'special' that overrides some ndarray methods....
I don't think this will break anything else.

This is called in _possibly_convert_datetime, so that will do nothing (which is correct)

diff --git a/pandas/src/inference.pyx b/pandas/src/inference.pyx
index e0bbc1a..a37de47 100644
--- a/pandas/src/inference.pyx
+++ b/pandas/src/inference.pyx
@@ -41,6 +41,9 @@ def infer_dtype(object _values):
             _values = list(_values)
         values = list_to_object_array(_values)

+    if hasattr(values,'values'):
+        values = values.values
+
     val_kind = values.dtype.type
     if val_kind in _TYPE_MAP:
         return _TYPE_MAP[val_kind]

@jreback

also...pls check perf, e.g. test_perf.sh for any change you make....it is important that Series construction doesn't change the common cases too much

as an aside...I don't think doing an isinstance(value, Index) in _sanitize_array is a big deal (after all they ARE (currrently) ndarrays)

@jtratner

@adgaudio @jreback yeah, it's tricky. MultiIndex is only nominally an ndarray, because all of its features are actually reimplemented in Python so if Cython does any optimizations for ndarray, it'll mess up around this.

Btw, for cython, better to do this:

values = getattr(values, 'values', values)

than the hasattr check, because it can optimize it better.

@cpcloud

good 2 know about hasattr vs. getattr.

@adgaudio

Hmm - I think the topic of this pr has strayed a little from its original goal (which is primarily my fault!). The original goal was to make pandas.Index(list_of_tuples) return a MultiIndex. Having spent several hours in the past couple days looking at pandas internals, it seems pretty clear to me now that the Index should not return a MultiIndex. For instance, methods like MultiIndex._tuple_index rely on the fact that the Index won't return a MI... And as a further argument, if an Index constructor was allowed to return a MultiIndex, then shouldn't a DataFrame be able to return a Panel? That doesn't seem like functionality pandas would support.

On the other hand, we discussed some useful things that are probably worth implementing:

Is everyone okay if I close this PR and maybe make a new one to address the seg faults? In sum, I changed my mind and don't think an Index should return a MultiIndex.

@jtratner

@adgaudio if you want that makes sense. That said, I don't think it's unreasonable to have a dict input to constructor return something with a MultiIndex.

@adgaudio

Ok - I can certainly do that! In that case, I may just try to merge seg fault stuff into here. Will be interesting to see whether there's a significant perf hit with the changes this brings.

... will be in prog here for a few days

@adgaudio

Ok...this is ready for review. Sorry for the delay

In summary, MultiIndexes are automatically created for index and columns when passing a dict to Series and DataFrames.

I also ran a perf test and will be posting those results in a follow-up comment. I described() the last column of numbers, which represent the ratio of master to "my changes rebased against master". If I understand this correctly, this feature makes pandas slower by 1.005 times on average.

count 171.000000
mean 1.004673
std 0.076243
min 0.779600
25% 0.977800
50% 0.993800
75% 1.026200
max 1.498600

ratios

@adgaudio

Invoked with :
--ncalls: 1
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_reindex_both_axes                      |  15.2452 |  32.3210 |   0.4717 |
series_value_counts_strings                  |   4.0641 |   5.2130 |   0.7796 |
frame_add                                    |   3.9530 |   4.9031 |   0.8062 |
timeseries_asof_single                       |   0.0291 |   0.0360 |   0.8079 |
indexing_dataframe_boolean                   |  49.4530 |  57.8468 |   0.8549 |
indexing_dataframe_boolean_rows              |   0.2789 |   0.3099 |   0.9000 |
join_dataframe_index_single_key_bigger       |   5.2481 |   5.7790 |   0.9081 |
write_csv_standard                           |  30.0810 |  33.0341 |   0.9106 |
stats_rolling_mean                           |   1.0490 |   1.1470 |   0.9146 |
timeseries_add_irregular                     |  15.7468 |  17.2162 |   0.9147 |
reindex_frame_level_reindex                  |   0.5591 |   0.6101 |   0.9164 |
reshape_pivot_time_series                    | 129.9939 | 141.0310 |   0.9217 |
mask_bools                                   |  24.4670 |  26.4800 |   0.9240 |
frame_to_string_floats                       |  30.3071 |  32.7630 |   0.9250 |
frame_repr_wide                              |   0.7708 |   0.8330 |   0.9253 |
series_string_vector_slice                   | 146.1499 | 156.5220 |   0.9337 |
frame_getitem_single_column                  |  17.7062 |  18.9250 |   0.9356 |
stats_rank_average                           |  23.2301 |  24.7140 |   0.9400 |
append_frame_single_mixed                    |   0.5651 |   0.5999 |   0.9420 |
sort_level_one                               |   3.7041 |   3.9291 |   0.9427 |
mask_floats                                  |  12.7940 |  13.5541 |   0.9439 |
melt_dataframe                               |   1.6730 |   1.7691 |   0.9457 |
groupby_multi_series_op                      |  12.2011 |  12.8889 |   0.9466 |
stat_ops_frame_mean_int_axis_1               |   5.3780 |   5.6581 |   0.9505 |
frame_fancy_lookup                           |   2.7161 |   2.8570 |   0.9507 |
reshape_unstack_simple                       |   2.7251 |   2.8622 |   0.9521 |
frame_insert_500_columns_end                 |  79.2389 |  83.1900 |   0.9525 |
frame_drop_duplicates_na                     |  13.4761 |  14.1251 |   0.9541 |
frame_drop_duplicates                        |  13.3901 |  14.0190 |   0.9551 |
index_datetime_intersection                  |   8.4031 |   8.7969 |   0.9552 |
stat_ops_frame_sum_int_axis_0                |   4.8449 |   5.0640 |   0.9567 |
frame_to_csv2                                |  81.8870 |  85.3350 |   0.9596 |
stats_rank2d_axis1_average                   |  10.9470 |  11.3881 |   0.9613 |
stat_ops_frame_sum_float_axis_1              |   4.5249 |   4.6859 |   0.9657 |
frame_fillna_inplace                         |  10.3819 |  10.7038 |   0.9699 |
indexing_panel_subset                        |   0.3691 |   0.3800 |   0.9711 |
frame_to_csv                                 |  91.5911 |  94.3000 |   0.9713 |
frame_ctor_list_of_dict                      |  64.0192 |  65.8691 |   0.9719 |
merge_2intkey_sort                           |  32.4190 |  33.3450 |   0.9722 |
frame_multi_and                              |  69.8371 |  71.7850 |   0.9729 |
stat_ops_frame_sum_int_axis_1                |   4.9090 |   5.0452 |   0.9730 |
frame_reindex_both_axes_ix                   |  32.0160 |  32.8770 |   0.9738 |
series_drop_duplicates_int                   |   0.5660 |   0.5791 |   0.9774 |
timeseries_asof                              |   6.8550 |   7.0112 |   0.9777 |
join_dataframe_index_multi                   |  16.8700 |  17.2520 |   0.9779 |
join_dataframe_index_single_key_bigger_sort  |  13.3920 |  13.6938 |   0.9780 |
frame_loc_dups                               |   0.6258 |   0.6399 |   0.9780 |
frame_constructor_ndarray                    |   0.0429 |   0.0439 |   0.9783 |
reindex_fillna_backfill                      |   0.5271 |   0.5388 |   0.9783 |
series_align_left_monotonic                  |  10.3209 |  10.5450 |   0.9787 |
groupby_multi_cython                         |  13.3641 |  13.6521 |   0.9789 |
index_int64_union                            |  64.3780 |  65.7530 |   0.9791 |
stat_ops_frame_sum_float_axis_0              |   4.4549 |   4.5469 |   0.9798 |
dti_reset_index_tz                           |  10.0758 |  10.2811 |   0.9800 |
read_csv_vb                                  |  16.2349 |  16.5591 |   0.9804 |
stat_ops_series_std                          |   0.8879 |   0.9048 |   0.9813 |
timeseries_asof_nan                          |   6.3412 |   6.4590 |   0.9818 |
stat_ops_frame_mean_int_axis_0               |   4.8530 |   4.9391 |   0.9826 |
frame_to_csv_mixed                           | 191.1800 | 194.5429 |   0.9827 |
stat_ops_frame_mean_float_axis_0             |   4.4539 |   4.5290 |   0.9834 |
groupby_multi_different_numpy_functions      |  10.8140 |  10.9761 |   0.9852 |
frame_reindex_upcast                         |  10.7739 |  10.9279 |   0.9859 |
groupby_last                                 |   3.2680 |   3.3140 |   0.9861 |
stat_ops_level_series_sum                    |   2.3448 |   2.3770 |   0.9865 |
series_value_counts_int64                    |   1.8809 |   1.9040 |   0.9879 |
groupby_multi_python                         | 104.6109 | 105.8180 |   0.9886 |
merge_2intkey_nosort                         |  14.5009 |  14.6639 |   0.9889 |
frame_reindex_axis1                          | 454.3140 | 459.0740 |   0.9896 |
match_strings                                |   0.2739 |   0.2768 |   0.9897 |
groupby_series_simple_cython                 |   4.0159 |   4.0562 |   0.9901 |
series_align_int64_index                     |  23.7849 |  24.0159 |   0.9904 |
read_csv_thou_vb                             |  14.1041 |  14.2410 |   0.9904 |
frame_fancy_lookup_all                       |  12.5110 |  12.6290 |   0.9907 |
frame_mult                                   |   3.9458 |   3.9830 |   0.9907 |
lib_fast_zip                                 |   7.6091 |   7.6802 |   0.9907 |
stats_rank_average_int                       |  18.6820 |  18.8482 |   0.9912 |
groupby_pivot_table                          |  16.5930 |  16.7401 |   0.9912 |
timeseries_timestamp_downsample_mean         |   3.5810 |   3.6111 |   0.9917 |
frame_reindex_axis0                          |  77.9679 |  78.6190 |   0.9917 |
stats_corr_spearman                          |  70.6170 |  71.1949 |   0.9919 |
timeseries_infer_freq                        |   6.5250 |   6.5749 |   0.9924 |
timeseries_period_downsample_mean            |   5.3890 |   5.4300 |   0.9924 |
stat_ops_level_frame_sum                     |   2.9190 |   2.9399 |   0.9929 |
join_dataframe_integer_2key                  |   4.4181 |   4.4479 |   0.9933 |
series_align_irregular_string                |  50.0710 |  50.4081 |   0.9933 |
reindex_frame_level_align                    |   0.5832 |   0.5870 |   0.9935 |
stat_ops_level_frame_sum_multiple            |   7.2091 |   7.2539 |   0.9938 |
groupby_multi_size                           |  24.5860 |  24.7180 |   0.9947 |
stats_rank2d_axis0_average                   |  18.5640 |  18.6539 |   0.9952 |
sparse_series_to_frame                       | 117.1200 | 117.5661 |   0.9962 |
unstack_sparse_keyspace                      |   1.3909 |   1.3959 |   0.9964 |
groupby_first                                |   3.0730 |   3.0839 |   0.9964 |
frame_ctor_nested_dict                       |  57.8749 |  58.0280 |   0.9974 |
reindex_fillna_backfill_float32              |   0.4182 |   0.4191 |   0.9977 |
reshape_stack_simple                         |   1.4410 |   1.4429 |   0.9987 |
join_dataframe_integer_key                   |   1.4930 |   1.4949 |   0.9987 |
replace_replacena                            |   3.6631 |   3.6659 |   0.9992 |
groupby_frame_median                         |   5.4619 |   5.4641 |   0.9996 |
read_csv_standard                            |   9.2361 |   9.2380 |   0.9998 |
groupby_last_float32                         |   3.2320 |   3.2320 |   1.0000 |
series_constructor_ndarray                   |   0.0160 |   0.0160 |   1.0000 |
frame_fillna_many_columns_pad                |  12.1920 |  12.1920 |   1.0000 |
timeseries_large_lookup_value                |   0.0231 |   0.0231 |   1.0000 |
reindex_daterange_backfill                   |   2.4390 |   2.4371 |   1.0008 |
timeseries_sort_index                        |  18.2889 |  18.2478 |   1.0022 |
reindex_daterange_pad                        |   2.4340 |   2.4259 |   1.0033 |
frame_sort_index_by_columns                  |  31.2161 |  31.1110 |   1.0034 |
read_csv_comment2                            |  17.5040 |  17.4401 |   1.0037 |
groupby_indices                              |   6.3572 |   6.3312 |   1.0041 |
frame_boolean_row_select                     |   0.2131 |   0.2120 |   1.0056 |
timeseries_to_datetime_iso8601               |   3.8910 |   3.8612 |   1.0077 |
replace_fillna                               |   3.6750 |   3.6449 |   1.0082 |
lib_fast_zip_fillna                          |  10.5841 |  10.4971 |   1.0083 |
groupby_multi_different_functions            |  10.8781 |  10.7880 |   1.0084 |
datetimeindex_normalize                      |   2.2500 |   2.2280 |   1.0098 |
stat_ops_level_series_sum_multiple           |   6.6872 |   6.6178 |   1.0105 |
index_datetime_union                         |   8.6920 |   8.6010 |   1.0106 |
frame_get_dtype_counts                       |   0.0880 |   0.0870 |   1.0110 |
frame_iteritems                              |  25.3410 |  25.0630 |   1.0111 |
reindex_fillna_pad                           |   0.5260 |   0.5200 |   1.0115 |
frame_insert_100_columns_begin               |  15.4629 |  15.2860 |   1.0116 |
panel_from_dict_two_different_indexes        |  39.9530 |  39.4778 |   1.0120 |
groupby_first_float32                        |   3.0582 |   3.0160 |   1.0140 |
frame_getitem_single_column2                 |  18.0881 |  17.8120 |   1.0155 |
stat_ops_frame_mean_float_axis_1             |   5.0459 |   4.9639 |   1.0165 |
indexing_dataframe_boolean_rows_object       |   0.4680 |   0.4601 |   1.0171 |
groupby_sum_booleans                         |   0.7811 |   0.7651 |   1.0209 |
frame_drop_dup_inplace                       |   2.2190 |   2.1710 |   1.0221 |
read_parse_dates_iso8601                     |   1.1170 |   1.0900 |   1.0247 |
groupby_simple_compress_timing               |  29.3281 |  28.5370 |   1.0277 |
sparse_frame_constructor                     |   9.3942 |   9.1388 |   1.0279 |
sort_level_zero                              |   3.7801 |   3.6709 |   1.0297 |
panel_from_dict_all_different_indexes        |  53.7281 |  52.1729 |   1.0298 |
groupby_frame_singlekey_integer              |   2.1300 |   2.0669 |   1.0306 |
reindex_multiindex                           |   0.8409 |   0.8130 |   1.0343 |
append_frame_single_homogenous               |   0.2060 |   0.1991 |   1.0347 |
reindex_fillna_pad_float32                   |   0.4320 |   0.4170 |   1.0360 |
dataframe_reindex                            |   0.3462 |   0.3340 |   1.0364 |
groupby_transform                            | 117.1820 | 112.8821 |   1.0381 |
frame_get_numeric_data                       |   0.0780 |   0.0751 |   1.0381 |
groupby_apply_dict_return                    |  24.9550 |  23.9921 |   1.0401 |
index_int64_intersection                     |  23.8118 |  22.8329 |   1.0429 |
concat_series_axis1                          |  65.4199 |  62.6309 |   1.0445 |
read_table_multiple_date_baseline            |  75.4662 |  72.0320 |   1.0477 |
panel_from_dict_equiv_indexes                |  24.9131 |  23.7088 |   1.0508 |
timeseries_1min_5min_mean                    |   0.4830 |   0.4590 |   1.0525 |
dti_reset_index                              |   0.1781 |   0.1690 |   1.0536 |
panel_from_dict_same_index                   |  24.9889 |  23.6831 |   1.0551 |
datetimeindex_unique                         |   0.0939 |   0.0889 |   1.0563 |
read_table_multiple_date                     | 163.5280 | 153.4901 |   1.0654 |
frame_drop_dup_na_inplace                    |   2.2259 |   2.0850 |   1.0676 |
datetimeindex_add_offset                     |   0.1731 |   0.1619 |   1.0692 |
frame_iloc_dups                              |   0.2170 |   0.2019 |   1.0744 |
frame_iteritems_cached                       |   0.4621 |   0.4299 |   1.0749 |
concat_small_frames                          |  12.3281 |  11.4250 |   1.0790 |
frame_xs_row                                 |   0.0370 |   0.0341 |   1.0839 |
series_drop_duplicates_string                |   0.4370 |   0.4032 |   1.0840 |
frame_reindex_columns                        |   0.3030 |   0.2789 |   1.0863 |
join_dataframe_index_single_key_small        |   5.3852 |   4.9469 |   1.0886 |
frame_repr_tall                              |   2.2731 |   2.0730 |   1.0965 |
frame_ctor_nested_dict_int64                 |  82.3250 |  75.0730 |   1.0966 |
groupby_frame_cython_many_columns            |   3.0849 |   2.8129 |   1.0967 |
frame_xs_col                                 |   0.0231 |   0.0210 |   1.1023 |
timeseries_1min_5min_ohlc                    |   0.5291 |   0.4780 |   1.1067 |
timeseries_to_datetime_YYYYMMDD              |   7.9660 |   7.1719 |   1.1107 |
ctor_index_array_string                      |   0.0169 |   0.0150 |   1.1270 |
timeseries_slice_minutely                    |   0.0441 |   0.0381 |   1.1562 |
timeseries_timestamp_tzinfo_cons             |   0.0129 |   0.0110 |   1.1739 |
groupby_frame_apply_overhead                 |   8.7209 |   7.2041 |   1.2106 |
groupby_frame_apply                          |  42.1200 |  34.2922 |   1.2283 |
series_ctor_from_dict                        |   2.9459 |   2.1560 |   1.3664 |
period_setitem                               | 156.1499 | 104.1942 |   1.4986 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [da81e40] : ENH: Series and DataFrame constructor autodetect when index/columns
should be MultiIndex

ENH: prevents some seg faults in calls to cython funcs
Base   [d702de0] : Merge pull request #4783 from jtratner/wrap-compressed-fileobjects-in-py3

BUG: Fix input bytes conversion in Py3 to return str


<class 'pandas.core.frame.DataFrame'>
Index: 172 entries, frame_reindex_both_axes to period_setitem
Data columns (total 8 columns):
count    172  non-null values
mean     172  non-null values
std      172  non-null values
min      172  non-null values
25%      172  non-null values
50%      172  non-null values
75%      172  non-null values
max      172  non-null values
dtypes: float64(8)

@jreback

can you add releaset notes? does it close all 3 of those listed issues? have tests from all 3? pls add the issues it closes to the very top of the PR, with something like closes #issue

@jreback

can you perf tests the bottom 2 (period_setitem and series_constructor) with a higher number of ncalls (so it will avg them)

@jtratner

If it turns out to be slow, might make sense to add a keyword argument to disable/enable this processing. (similar to tupleize_cols for csv, etc)

cpcloud

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you need to keep the list constructor here because dict.values() is a view in Python 3, unless extract_index does the conversion for you

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_index just iterates over the values and appends its elements to some list. So isn't the list(...) call unnecessary?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's passing on the Python 3 builds then yes :)

@hayd

It's not so crazy that passing in tuples to index could lead to a mi... (You can also get a DatetimeIndex when creating an Index and it's very different from the Panel vs DataFrame!*)

In [1]: s = pd.Series(index=[[1, 3], [3, 4]])

In [2]: s.index
Out[2]:
MultiIndex
[(1, 3), (3, 4)]

In [3]: isinstance(s.index, pd.Index)
Out[3]: True

In [4]: pd.Index([[1, 3], [3, 4]])  # inconsistent / weird
Out[4]: Index([[1, 3], [3, 4]], dtype=object)

I think this should be consistent however you construct... imo this is a good opportunity to correct the above behaviour, where you get a different index depending! Should be the same via Index, Series(index=) or passing dict... atm it's not

+1 for adding some keyword to disable (?) this (like tuplize_col..)

@adgaudio

@hayd,

I think it would be a great idea to have the Index constructure also return a MultiIndex, but only if a kwarg like "infer_multiindex=True" was passed to the constructor. By default, the "infer_multiindex=False" because if it was True, it would cause seg faults and break several tests. I might be able to tackle this in this PR. If not, perhaps another pr...

Logically, something also seems funky about letting an Index return a MultiIndex. For instance, should a DataFrame be able to return a Panel? I think you hinted at this in your comment.

@adgaudio

@jtratner - thanks for your comments.

Here are the perf results with ncalls=5. I also just rebased against master. The average ratio is now actually a tiny bit worse (was 1.004 and now is 1.01), but the basic distribution of ratios looks more or less like the picture I posted above.

Are these performance hits alright for this PR, or should I add a kwarg to the DataFrame and Series constructors that disables it by default?

count 190.000000
mean 1.010531
std 0.041385
min 0.877500
25% 0.995000
50% 1.003550
75% 1.018125
max 1.231000

[... truncated ...]
frame_ctor_list_of_dict                      |  68.6394 |  64.7336 |   1.0603 |
merge_2intkey_sort                           |  34.2390 |  32.1878 |   1.0637 |
frame_fancy_lookup_all                       |  13.8032 |  12.9736 |   1.0639 |
frame_sort_index_by_columns                  |  33.8332 |  31.7710 |   1.0649 |
frame_ctor_nested_dict_int64                 |  81.2192 |  75.8838 |   1.0703 |
series_value_counts_strings                  |   4.4824 |   4.1206 |   1.0878 |
stats_rank_average                           |  23.8794 |  21.7868 |   1.0961 |
merge_2intkey_nosort                         |  15.9592 |  14.2020 |   1.1237 |
join_dataframe_index_multi                   |  19.3036 |  16.8800 |   1.1436 |
mask_bools                                   |  13.2502 |  11.4928 |   1.1529 |
groupby_pivot_table                          |  20.3042 |  16.7432 |   1.2127 |
groupby_multi_different_functions            |  13.4090 |  11.0168 |   1.2171 |
groupby_multi_different_numpy_functions      |  13.3692 |  10.8602 |   1.2310 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [8e15111] : rel notes  ### FYI - I squashed this, and it represents my latest commit.
Base   [275a42c] : Merge pull request #4965 from westurner/patch-1

DOC: remote_data.rst: Fama/French punctuation

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take this out; identical is public but will confuse users; they don't need to know about it

@adgaudio

ready for review/merge again. just rebased master.

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take this space out

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done - btw why are there 3 spaces in these sections instead of 4? Is this some kind of documentation standard?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not a standard. It's just that it doesn't matter, and there is no real standard. In reStructuredText, you will often see that it is indented like the start of the directive name (here ipython), so then you have 3 spaces (but even in their documentation they are not consequent with that). But I personally like to use 4 spaces as this is much more convenient to type with tab.

@jreback

@adgaudio can you rebase ....i'll take another look

@adgaudio

@adgaudio

@adgaudio

@jreback

@jreback

@adgaudio side note, this is actually a dict of tuples, right? (you refer to it as a list of tuples) (I guess its technically a list of tuples too), or are the docs I added not clear?

@jorisvandenbossche

@jreback

@adgaudio

This is great! Thank you for merging this work :)

@jreback Thanks for the partial fix to documentation. Also, you are correct that the PR works for both dicts of tuples and lists of tuples.

@jorisvandenbossche I should be able to get to these soon in another PR. I guess I was wrong to assume passing tests meant the documentation wouldn't have errors. Why do Travis builds still pass if I introduce documentation errors? Can/should we change that behavior?

@jorisvandenbossche

@adgaudio I think the docs are OK now. Yes, indeed, getting green light from travis does not say anything about the doc build, only about the test suite that is passing. The docs are built on travis so you can see the log of that, but it does not notify when there were warnings (but we were thinking to maybe try to change this, see #3800)

Labels