ENH: _dir_additions returns also the first level of a MultiIndex by BibMartin · Pull Request #16326 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

BibMartin

Assuming a DataFrame with MultiIndex columns like this:

df foo bar
foo bar 0 0.32 0.15 0.45 1 0.62 0.73 0.36 2 0.21 0.68 0.10 3 0.05 0.36 0.90

one can access to df['foo'] and df['bar'] with the shortcuts df.foo and df.bar,
but one don't benefit from autocompletion because 'foo' and 'bar' are not listed in dir(df).

'bar' in dir(df) False

This PR extends df._dir_additions so that the first level of a MultiIndex is listed in dir(df).

@codecov

@codecov

@jreback

I am +0 on this. This add some needless complexity, though from a user POV might be nice. Would need some additional tests to consider.

@BibMartin

@jreback

Would need some additional tests to consider.

Sure. I plan to put tests if the principle is accepted. Any idea is welcome.
Btw, I've not seen whether there are tests about dir behavior.

@jreback

Btw, I've not seen whether there are tests about dir behavior.

bash-3.2$ grep -R 'dir(' pandas/tests/

@BibMartin

Added test_tab_completion for DataFrames, inpired from Series.

@jreback

@jreback

this is ok in principle. can you rebase / update

@jreback

@BibMartin

Sorry, I've been long to rebase.

Do I need to add a whatsneww entry ? In which version file ?

jreback

@@ -196,7 +196,11 @@ def __unicode__(self):
def _dir_additions(self):
""" add the string-like attributes from the info_axis """
additions = set([c for c in self._info_axis
if isinstance(c, string_types) and isidentifier(c)])
if isinstance(c, string_types) and isidentifier(c)] +
[c[0] for c in self._info_axis

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather both of these to (private) methods on Index and overriden on a MultiIndex; this become much simpler

e.g.

def _to_identifiers(self):
    return [c for c in self if isinstance(c, string_types) and isidentifier(c)]

with some doc-strings & tests

then this method becomes really simple.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I understand the code better, I realize that another simple implementation could be:

additions = set([c for c in self._info_axis.get_level_values(0) if isinstance(c, string_types) and isidentifier(c)])

Please tell me which method would be best.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this looks reasonable. pls add a comment on what is going one.

@BibMartin

Changes are done, and rebased. Btw, I've added a test on Series auto-completion.

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a note in 0.22, other enhancements.

@@ -234,6 +234,39 @@ def test_tab_completion(self):
assert 'str' not in dir(s)
assert 'dt' in dir(s) # as it is a datetime categorical
def test_index_tab_completion(self):
# dir contains string-like values of the Index.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can parametrize this with index

assert (not isinstance(x, string_types) or
not isidentifier(x) or x in dir_s)
# dir contains string-like values of the MultiIndex first level.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add these in the parameterization above

@jreback

one additional thing I think we need. If you have a very large index, _dir_additions actually takes quite a bit of time (this is not exclusive of this change).

So what I would do is if the index is say < 100, use the currently _dir_addition, otherwise return an empty list! (its essentially too big to use tab completion for anyhow). can you make this change and add an asv for this (could be a separate PR as well)

@jreback

@jreback

you can actually use .unique(level=0) to make this efficient

BibMartin pushed a commit to BibMartin/pandas that referenced this pull request

Dec 5, 2017

@BibMartin

I think that it's ready to go ; please tell me if I shall squash the commits.

jreback

@@ -77,6 +77,8 @@ Other Enhancements
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`)
- :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`)
- Improved wording of ``ValueError`` raised in :func:`read_csv` when the ``usecols`` argument cannot match all columns. (:issue:`17301`)
- :func:`NDFrame._dir_additions` (tab completion) also returns identifiers in the first level of a :func:`MultiIndex`. (:issue:`16326`)
- :func:`NDFrame._dir_additions` (tab completion) limits to 100 values, for better performance. (:issue:`18587`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move 2nd to performance. don't use NDFrame. just say Series/DataFrame tab completion

def test_tab_completion(self):
# DataFrame whose columns are identifiers shall have them in __dir__.
df = pd.DataFrame([list('abcd'), list('efgh')], columns=list('ABCD'))
assert 'A' in dir(df)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually you can assert all of the items are in the dir() (use a loop)

columns=pd.MultiIndex.from_tuples(list(zip('ABCD', 'EFGH'))))
assert 'A' in dir(df)
assert isinstance(df.__getitem__('A'), pd.DataFrame)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert EFGH are NOT in the result (in a loop) and assert ABCD there

@@ -10,7 +10,7 @@
from pandas import Index, Series, DataFrame, date_range
from pandas.core.indexes.datetimes import Timestamp
from pandas.compat import range
from pandas.compat import range, lzip, isidentifier, string_types

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test that has 101 columns and assert first 100 there and last 1 is not.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do thsi

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put this in the other test (See here and there)
Please tell me if you prefer that I put it in a separate test.

""" add the string-like attributes from the info_axis.
If info_axis is a MultiIndex, it's first level values are used.
"""
additions = set(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an asv that does dir() on a Series/DataFrame with say 10000 elements (we might already have one of these)

BibMartin pushed a commit to BibMartin/pandas that referenced this pull request

Dec 6, 2017

BibMartin pushed a commit to BibMartin/pandas that referenced this pull request

Dec 6, 2017

Martin Journois added 2 commits

December 8, 2017 19:53

BibMartin pushed a commit to BibMartin/pandas that referenced this pull request

Dec 8, 2017

@BibMartin

Martin Journois added 3 commits

December 8, 2017 20:13

@jreback

@jreback

jreback

@jreback

@BibMartin

Thanks to you @jreback : it's been a great occasion to learn for me.

2 participants

@BibMartin @jreback