HTML (and text) reprs for large dataframes. by takluyver · Pull Request #5550 · pandas-dev/pandas
As discussed in #4886, the HTML representation of DataFrames currently starts off as a table, but switches to the condensed info view if the table exceeds a certain size (by default, more than 60 rows or 20 columns). I've seen this confusing users, who think that they suddenly have a completely different kind of object, and don't understand why.
With these changes, the HTML repr always displays the table, but truncates it when it exceeds a certain size. It reuses the same options, display.max_rows and display.max_columns.
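For illustration, a minimal sketch of the truncating repr, assuming a current pandas release in which `display.max_rows` and `display.max_columns` still carry these names. The frame and its column names are made up for the example, and the exact number of rows kept may also depend on other display options in newer versions.

```python
import numpy as np
import pandas as pd

# Hypothetical 100 x 3 frame, larger than the row limit set below.
df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])

# Temporarily lower the limits; the repr keeps the table layout and
# replaces the middle rows with a dotted row instead of switching views.
with pd.option_context("display.max_rows", 10, "display.max_columns", 20):
    truncated = repr(df)

print(truncated)
```

Using `option_context` keeps the change local, so the session-wide defaults are untouched afterwards.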
Before:
After:
A couple of issues: edge formatting on wide tables, and the empty dataframe repr looks off.
Also, the PR affects only the HTML repr, so (QT)console users will see different behavior than ipnb users.
I've fixed the wide display issue - it wasn't truncating the extra row added for the row index names.
The empty dataframe repr is the same as what's displayed in my system installation of pandas - I agree that it's a bit odd, but I think that's a separate issue.
I agree, the behaviour of plain text reprs should be similar. Do you want me to tackle that in this PR, or separately?
this is related to #1889 as well
That'd be great, I think it fits here ok.
Not sure about putting this in 0.13, I've had bad luck with last minute changes to display code. @jreback?
This was referenced
Nov 20, 2013
I think this is fine with a couple of minor issues:
- it needs a mention in v0.13.0 and the main docs about this default and how to use .info() to get the existing summary view
- the max rows / columns should come purely from the options and not be hard-coded as defaults in the functions
Can you change that?
I also mentioned it in the issue, but what do you think of showing first 30 ... last 30 rows/cols instead of first 60 ...? As is done for the Series html repr.
(It's also what is proposed in #1889)
Makes sense to me, but can be done in a subsequent PR if needs be.
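The "first 30 ... last 30" idea can be sketched outside the formatting code. This is only an illustration of the proposal, not how pandas implements it; `head_tail` is a hypothetical helper name.

```python
import pandas as pd


def head_tail(df: pd.DataFrame, max_rows: int = 60) -> pd.DataFrame:
    """Hypothetical helper: keep the first and last max_rows // 2 rows."""
    if len(df) <= max_rows:
        return df
    half = max_rows // 2
    return pd.concat([df.head(half), df.tail(half)])


df = pd.DataFrame({"x": range(100)})
preview = head_tail(df)

# Index labels on either side of the cut: rows 30..69 are dropped.
print(preview.index[[0, 29, 30, 59]].tolist())
```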
ipnb is solidifying its browser<->kernel message passing, may soon be time to revisit the grid view (with paging) from #2974, this time with built-in functionality rather than an in-process web server (like Exhibitionist does).
I have:
- Made the plain text reprs match the HTML reprs (truncating with ... beyond max_rows/max_columns). To get the info view, you need to call the info() method.
- Removed the defaults for max_rows/max_cols from the methods to which they are passed - so calling to_string() or to_html() will by default show the entire DataFrame untruncated. Only the reprs automatically truncate.
- Removed the max_info_rows option - this was only used when displaying the info view for a repr.
- Documented the changes.
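A small sketch of the split described in this list, assuming a current pandas release: the repr honours `display.max_rows`, `to_string()` without arguments renders every row, and the summary view is requested explicitly through `info()`. The frame is made up for the example.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100, 3)), columns=["a", "b", "c"])

with pd.option_context("display.max_rows", 10):
    short = repr(df)       # truncated, because reprs honour the option
    full = df.to_string()  # no max_rows passed, so every row is rendered

# The summary view is still available on request via info().
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()

print(len(short.splitlines()), len(full.splitlines()))
print(summary)
```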
- Better to deprecate max_info_rows rather than remove it, we may wish to move its enforcement over to info(). There are examples of doing that in config_init.py. Actually, just generally deprecate rather than remove to reduce friction.
- you removed some of the ugliest code I've ever written - good omen.
can you have a look at #1610 (Console-width detection should be interactive sessions only) and see if that raises any issues with the changes? (regressions)
max_info_rows is back, deprecated.
I've had a brief look at #1610 - I don't think this should cause a regression, because format.get_console_size() checks whether it's in an interactive session.
I was very wrong with my initial objections, this is just great.
It's impossible to set default values for max_rows and max_columns that make things look good on both ipnb and qtconsole (which I usually use), but that's a pre-existing issue. IPython's scroller for tall output seems too large to me, as well.
Regardless - tested this and liked it, +1 to merge.
@jreback, any more issues to address before the green button?
Is it possible to have an option to do the existing behavior, but default to the new? Maybe display.notebook_repr_html = 'info'?
If it's easy I would add this to provide back compat, if not then ok (w/o going back to @y-p's admitted 'ugliest' code).
In the terminal, with display.expand_frame_repr=False and display.width=0 so that auto-detection is used, the truncation doesn't obey the width detection (since it depends on number of columns, not terminal width). Not a blocker.
I think it should be easy enough to have an option to revert to the old behaviour (at least roughly - I'd rather not restore max_info_rows as well). I'll do one option for both the terminal and the notebook: display.large_repr = 'truncate' | 'info'.
Truncation to terminal width is harder, because that would have to propagate down into the actual formatting code, and no doubt deal with various corner cases.
Added the option in the form I described in my last message.
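A sketch of switching between the two behaviours with this option, assuming a current pandas release where `display.large_repr` accepts `'truncate'` and `'info'`. The frame is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100, 3)), columns=["a", "b", "c"])

with pd.option_context("display.max_rows", 10, "display.large_repr", "truncate"):
    as_table = repr(df)  # truncated table (the new default)

with pd.option_context("display.max_rows", 10, "display.large_repr", "info"):
    as_info = repr(df)   # summary view, roughly the old behaviour

print(as_info)
```

The option only matters once the frame exceeds the display limits; small frames are shown as a full table either way.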
When truncating, having a footer with total row count would eliminate the need to use df.info in many cases and so reduce the impact of the change on existing users.
(For example, after filtering a frame you're often interested in the size of the result).
Edit: a header is probably better, since in ipnb you may be forced to scroll down manually to expose that part of the view.
I played around with some different options: showing it below the table looked more natural, and I opted to show it whether or not the table is truncated. The format is "61 rows × 26 columns". In the terminal, it shows up in [square brackets] to highlight that it's not part of the table.
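To see the dimensions line outside a repr, a sketch assuming a current pandas release, where `to_string()` and `to_html()` accept `max_rows`, `max_cols` and `show_dimensions`. The exact multiplication sign may differ between the text and HTML output and between versions.

```python
import numpy as np
import pandas as pd

# Same shape as the example quoted above: 61 rows, 26 columns.
df = pd.DataFrame(np.zeros((61, 26)))

# Plain text: the dimensions are appended in square brackets.
text = df.to_string(max_rows=10, max_cols=6, show_dimensions=True)
print(text.splitlines()[-1])

# HTML: the same information follows the table markup.
html = df.to_html(max_rows=10, max_cols=6, show_dimensions=True)
print(html.splitlines()[-2:])
```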
The failing test attempts to roundtrip a dataframe to and from the clipboard. It tests various ways of doing this, but one of them (passing excel=False 😕) will simply write str(df) to the clipboard. That would already not work for any dataframe large enough to get the info repr, but the test only uses a 5 × 3 frame.
Should we attempt to fix that, or simply remove the code path that writes str(df) to the clipboard?
The size issue is known: #5346, re confusion see #5070.
to_clipboard(excel=False) should probably use show_dimensions=False and use to_string directly to avoid truncation.
I've made the clipboard use to_string() instead of str(), which should also fix #5346. We'll see what Travis says.
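Why the distinction matters for a roundtrip can be sketched without a real clipboard. This is an approximation, assuming a current pandas release: an in-memory buffer stands in for the clipboard, and the text is read back as whitespace-separated data.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])

with pd.option_context("display.max_rows", 10):
    via_str = str(df)               # truncated: most rows are missing
    via_to_string = df.to_string()  # complete text table

# Parse the complete text back as whitespace-separated data; the extra
# leading field on each data row is picked up as the index.
roundtrip = pd.read_csv(io.StringIO(via_to_string), sep=r"\s+")
print(roundtrip.shape)
```

The truncated `str(df)` text has no way to recover the dropped rows, which is why writing it to the clipboard cannot roundtrip once a frame exceeds the display limits.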
However, now I appear to have a merge conflict. What's the preferred strategy for pandas: rebase, merge into my branch, or let whoever merges the PR handle it?
you need to clear merge conflicts via rebasing
Rebased, squashing a couple of commits where I had undone some change.
Mercilessly squashing to 1 commit will make life easier imo...
@jreback perhaps we should add that to wiki?
sure feel free to update/expand wiki
I don't follow why squashing the whole PR to one commit would be useful. It seems to defeat the point of a DVCS.
OK, great. Here's a more prominent section in the release notes, including a little picture.
ghost pushed a commit that referenced this pull request
HTML reprs for large dataframes.
:-) Thanks everyone for the review and improvements.
docs on the web are built at 5pm est
pls review the changes and make sure they look right
thanks again
no that's right
but when you check in you have to use -f, as git normally ignores it
Just check it in (there are a few other static images there). The folder is ignored because all the generated plots are stored there.
So, should we change the defaults for max_rows and max_columns?
The image is now PR #5594.
I might consider bumping the default max_columns down a bit, because I think in most real examples, 20 columns is very wide. Then again, when I open a blank spreadsheet, I see 20 columns, and I think it's more annoying to hide columns than to hide rows, so I'm not sure that it should change.
Has anyone had some performance issues with this on large DataFrames in the IPython notebook?
For a DataFrame with 1,536,532 rows and 22 columns, it ran for a minute before I interrupted the kernel.
It doesn't take long at all in terminal, and I don't use the qtconsole.
I don't mind, but I wanted people to be aware.
this should be ok on master (as it doesn't display all the rows), unless you have max_rows set to some big number
My display.max_columns is 20 and display.max_rows is 60.
That's why I was surprised it was taking longer on large frames.
I'm doing some timing right now to dig into it (I'll put up a notebook).
I guess it's a bit tricky to profile reprs. I'll come back to this later.
I can say that it's a lot quicker just on a random frame. My example had a MultiIndex.
confirmed, we fixed that bug for the Index case, but I missed the MultiIndex equivalent.
Will fix.
good catch.
Once again, the wisdom of not merging things right before a release (and vice versa) shines through.
ghost mentioned this pull request
Should be fixed, add vbenches.
... I kinda like this phase of the release cycle:
#pandas new output of row and column numbers during every print is surprisingly nice cc @wesmckinn
— Chris (@cdubhland) January 17, 2014
This pull request was closed.