HTML (and text) reprs for large dataframes. by takluyver · Pull Request #5550 · pandas-dev/pandas

As discussed in #4886, the HTML representation of DataFrames currently starts off as a table, but switches to the condensed info view if the table exceeds a certain size (by default, more than 60 rows or 20 columns). I've seen this confusing users, who think that they suddenly have a completely different kind of object, and don't understand why.

With these changes, the HTML repr always displays the table, but truncates it when it exceeds a certain size. It reuses the same options, display.max_rows and display.max_columns.
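A minimal sketch of the truncating behaviour described above, assuming a recent pandas release in which display.max_rows and display.max_columns still control both reprs. The frame shape and the lowered limits are illustrative choices, not values from this PR.

```python
import numpy as np
import pandas as pd

# A frame larger than the limits set below (100 rows, 30 columns).
df = pd.DataFrame(np.arange(3000).reshape(100, 30))

# Temporarily lower the limits so the truncation is easy to see.
with pd.option_context("display.max_rows", 10, "display.max_columns", 6):
    text_repr = repr(df)          # plain-text repr, as shown in a terminal
    html_repr = df._repr_html_()  # HTML repr, as picked up by the notebook

# Both are still tables; the hidden rows and columns are marked with "...".
print(text_repr)
```

Calling df.info() remains the way to get the condensed summary on request.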

Before:

After:

a couple of issues: edge formatting on wide tables, and the empty dataframe
repr looks off.

Also, the PR affects only the html repr so (QT)console users will see different behavior than
ipnb users.

I've fixed the wide display issue - it wasn't truncating the extra row added for the row index names.

The empty dataframe repr is the same as what's displayed in my system installation of pandas - I agree that it's a bit odd, but I think that's a separate issue.

I agree, the behaviour of plain text reprs should be similar. Do you want me to tackle that in this PR, or separately?

this is related to #1889 as well

That'd be great, I think it fits here ok.

Not sure about putting this in 0.13, I've had bad luck with last minute changes to
display code. @jreback?

This was referenced

Nov 20, 2013

I think this is fine with a couple of minor issues:

need a mention in v0.13.0 and main docs
about this default and how to use .info() to get the existing summary view

the hard-coded values for max rows / columns should come purely from the options and not be hard-coded in the functions
can you change that?

I also mentioned it in the issue, but what do you think of showing first 30 ... last 30 rows/cols instead of first 60 ...? As is done for the Series html repr.
(It's also what is proposed in #1889)

Makes sense to me, but can be done in a subsequent PR if needs be.
ipnb is solidifying its browser<->kernel message passing, may soon be
time to revisit the grid view (with paging) from #2974, this time with
built-in functionality rather than an in-process web server (like Exhibitionist does).

I have:

Made the plain text reprs match the HTML reprs (truncating with ... beyond max_rows/max_columns). To get the info view, you need to call the info() method.
Removed the defaults for max_rows/max_cols from the methods to which they are passed - so calling to_string() or to_html() will by default show the entire DataFrame untruncated. Only the reprs automatically truncate.
Removed the max_info_rows option - this was only used when displaying the info view for a repr.
Documented the changes.
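The split between the automatic reprs and the explicit methods can be sketched as follows, assuming a recent pandas release. The 100 × 3 frame and the lowered max_rows are illustrative only.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])

with pd.option_context("display.max_rows", 10):
    short = repr(df)       # the repr is truncated by the display options
    full = df.to_string()  # an explicit call renders every row

print(len(short.splitlines()), "lines in the repr")
print(len(full.splitlines()), "lines from to_string()")

# The summary view is available on request through info().
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
print(summary)
```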

Better to deprecate max_info_rows rather than remove it; we may wish to move its
enforcement over to info(). There are examples of doing that in config_init.py. Actually,
just generally deprecate rather than remove, to reduce friction.
You removed some of the ugliest code I've ever written - good omen.
Can you have a look at #1610 ("Console-width detection should be interactive sessions only") and
see if that raises any issues with the changes? (regressions)

max_info_rows is back, deprecated.

I've had a brief look at #1610 - I don't think this should cause a regression, because format.get_console_size() checks whether it's in an interactive session.

I was very wrong with my initial objections, this is just great.

It's impossible to set default values for max_rows and max_columns
that make things look good on both ipnb and qtconsole (which I usually use),
but that's a pre-existing issue. IPython scroller for tall output seems too large
to me, as well.

Regardless - tested this and liked it, +1 to merge.

@jreback, any more issues to address before the green button?

is it possible to have an option to do the existing behavior, but default to the new one?

maybe display.notebook_repr_html = 'info' ?

if it's easy I would add this to provide back compat, if not then ok (w/o going back to @y-p's admittedly 'ugliest' code)

in the terminal, with display.expand_frame_repr=False and display.width=0 so that auto-detection is used,
the truncation doesn't obey the width detection (since it depends on number of columns,
not terminal width). Not a blocker.

I think it should be easy enough to have an option to revert to the old behaviour (at least roughly - I'd rather not restore max_info_rows as well). I'll do one option for both the terminal and the notebook: display.large_repr = 'truncate' | 'info'.

Truncation to terminal width is harder, because that would have to propagate down into the actual formatting code, and no doubt deal with various corner cases.

Added the option in the form I described in my last message.
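A short sketch of switching between the two modes, assuming the option is exposed as display.large_repr as described in this thread. The frame and the lowered max_rows are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])

# 'info' restores the old behaviour: a frame that exceeds the limits
# is shown as the condensed summary instead of a table.
with pd.option_context("display.large_repr", "info", "display.max_rows", 10):
    info_style = repr(df)

# 'truncate' (the new default) keeps the table and elides the middle rows.
with pd.option_context("display.large_repr", "truncate", "display.max_rows", 10):
    truncated = repr(df)

print(info_style)
print(truncated)
```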

When truncating, having a footer with total row count would eliminate the need
to use df.info in many cases and so reduce the impact of the change on existing users.
(For example, after filtering a frame you're often interested in the size of the result).

Edit: a header is probably better, since in ipnb you may be forced to scroll down manually to
expose that part of the view.

I played around with some different options: showing it below the table looked more natural, and I opted to show it whether or not the table is truncated. The format is "61 rows × 26 columns". In the terminal, it shows up in [square brackets] to highlight that it's not part of the table.
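The dimensions line can also be requested explicitly. This is a sketch assuming a recent pandas release, where to_string() and to_html() accept show_dimensions; in the versions I am aware of, the plain-text form is written with a letter "x" and the HTML form with "×".

```python
import numpy as np
import pandas as pd

# Same shape as the example in the comment above.
df = pd.DataFrame(np.zeros((61, 26)))

text = df.to_string(show_dimensions=True)
html = df.to_html(show_dimensions=True)

# The last line of the text output is the bracketed dimensions footer.
print(text.splitlines()[-1])
```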

The failing test attempts to roundtrip a dataframe to and from the clipboard. It tests various ways of doing this, but one of them (passing excel=False 😕) will simply write str(df) to the clipboard. That would already not work for any dataframe large enough to get the info repr, but the test only uses a 5 × 3 frame.

Should we attempt to fix that, or simply remove the code path that writes str(df) to the clipboard?

The size issue is known: #5346, re confusion see #5070.

to_clipboard(excel=False) should probably use show_dimensions=False and use
to_string directly to avoid truncation.

I've made the clipboard use to_string() instead of str(), which should also fix #5346. We'll see what Travis says.
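Why to_string() is the safer source for a roundtrip can be sketched without touching the clipboard at all. This assumes a recent pandas release; parsing the text with read_csv and a whitespace separator stands in for reading it back from the clipboard, and the frame is illustrative.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])

with pd.option_context("display.max_rows", 10):
    via_str = str(df)               # truncated: contains a "..." row
    via_to_string = df.to_string()  # complete: ignores the display limits

# Parse the complete text back. The header has one field fewer than the
# data rows, so the first field of each row is taken as the index.
restored = pd.read_csv(io.StringIO(via_to_string), sep=r"\s+")

print(restored.shape)
```

The truncated str(df) output loses the elided rows, so no parser could recover the original frame from it.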

However, now I appear to have a merge conflict. What's the preferred strategy for pandas: rebase, merge into my branch, or let whoever merges the PR handle it?

you need to clear merge conflicts via rebasing

Rebased, squashing a couple of commits where I had undone some change.

Mercilessly squashing to 1 commit will make life easier imo...

@jreback perhaps we should add that to wiki?

sure feel free to update/expand wiki

I don't follow why squashing the whole PR to one commit would be useful. It seems to defeat the point of a DVCS.

OK, great. Here's a more prominent section in the release notes, including a little picture.

ghost pushed a commit that referenced this pull request

Nov 26, 2013

HTML reprs for large dataframes.

:-) Thanks everyone for the review and improvements.

@takluyver

docs on the web are built at 5pm est

pls review the changes and make sure they look right

thanks again

no that's right
but when you check it in you have to use -f
as git normally ignores it

Just check it in (there are a few other static images there). The folder is ignored because all the generated plots are stored there.

So, should we change the defaults for max_rows and max_columns?

The image is now PR #5594.

I might consider bumping the default max_columns down a bit, because I think in most real examples, 20 columns is very wide. Then again, when I open a blank spreadsheet, I see 20 columns, and I think it's more annoying to hide columns than to hide rows, so I'm not sure that it should change.

Has anyone had some performance issues with this on large DataFrames in the IPython notebook?
For a DataFrame with 1,536,532 rows and 22 columns, it ran for a minute before I interrupted the kernel.

It doesn't take long at all in terminal, and I don't use the qtconsole.

I don't mind, but I wanted people to be aware.

this should be ok on master (as it doesn't display all the rows), unless you have max_rows set to some big number

My display.max_columns is 20 and display.max_rows is 60.

That's why I was surprised it was taking longer on large frames.

I'm doing some timing right now to dig into it (I'll put up a notebook).

I guess it's a bit tricky to profile reprs. I'll come back to this later.

I can say that it's a lot quicker just on a random frame. My example had a MultiIndex.

confirmed, we fixed that bug for the Index case, but I missed the MultiIndex equivalent.
Will fix.
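One rough way to check for this kind of slowdown is to time the repr of a large MultiIndex frame directly. This is only a sketch: the frame size and index names are made up for illustration, and the timing will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# 100,000 rows under a two-level MultiIndex (hypothetical level names).
index = pd.MultiIndex.from_product(
    [range(1000), range(100)], names=["outer", "inner"]
)
values = np.random.default_rng(0).random((len(index), 4))
df = pd.DataFrame(values, index=index)

# Use the limits mentioned in the thread and time only the repr call.
with pd.option_context("display.max_rows", 60, "display.max_columns", 20):
    start = time.perf_counter()
    text = repr(df)
    elapsed = time.perf_counter() - start

print(f"repr of {len(df):,} rows took {elapsed:.3f} s")
```

If truncation happens before formatting, the time should stay roughly flat as the row count grows.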

good catch.

Once again, the wisdom of not merging things right before a release (and vice versa) shines through.

@ghost ghost mentioned this pull request

Dec 5, 2013

Should be fixed; added vbenches.

@ghost ghost mentioned this pull request

Dec 7, 2013

@ghost ghost mentioned this pull request

Jan 16, 2014

... I kinda like this phase of the release cycle:

#pandas new output of row and column numbers during every print is surprisingly nice cc @wesmckinn

— Chris (@cdubhland) January 17, 2014

This pull request was closed.