ENH: more flexible describe() + tests by bthyreau · Pull Request #8164 · pandas-dev/pandas
Conversation
This patch adds a return_type keyword argument to describe() to make it more
flexible to use on mixed-type dataframes. Users can now select among returning
numeric, categorical, or both, as well as 'auto' (the previous behaviour, default)
and 'same', which keeps the columns identical (useful, e.g., with groupby()).
pls show examples of the use case for this
Rationale.
When using a dataframe of mixed types, i.e. containing numeric values, strings, categoricals, etc., the current behaviour of describe() is a bit rough, as it summarizes over numerical columns only, or, if none exist, over categorical columns only. With this change, describe() gets more flexible in its return form, which considerably smoothed my interactive data-analysis sessions.
From the doc
return_type : string, default 'auto'
    The result format. Most relevant for mixed-type dataframes.
    One of 'numeric_only', 'categorical_only', 'both', 'same', 'auto'.
    'auto' returns the 'numeric_only' description, except on fully
    categorical frames, where it returns 'categorical_only'. 'both'
    returns the pair; 'same' returns a frame with the same columns as
    the original dataframe.
Example
Although real-life scenarios are more convincing, here are some small examples.
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'colA': ['foo', 'foo', 'bar'] * 10,
   ...:                    'colB': ['a', 'b', 'c', 'd', 'e'] * 6,
   ...:                    'colC': np.arange(30), 'colD': np.ones(30)})

In [4]: df.head()
Out[4]:
  colA colB  colC  colD
0  foo    a     0     1
1  foo    b     1     1
2  bar    c     2     1
3  foo    d     3     1
4  foo    e     4     1
The old behaviour picks columns based on the types of the dataframe. Not so nice.
In [6]: df.describe()
Out[6]:
            colC  colD
count  30.000000    30
mean   14.500000     1
std     8.803408     0
min     0.000000     1
25%     7.250000     1
50%    14.500000     1
75%    21.750000     1
max    29.000000     1
Using the new option, we can explicitly ask to describe either type:
In [8]: df.describe(return_type="categorical_only")
Out[8]:
       colA colB
count    30   30
unique    2    5
top     foo    d
freq     20    6

In [9]: df.describe(return_type="numeric_only")
Out[9]:
            colC  colD
count  30.000000    30
mean   14.500000     1
std     8.803408     0
min     0.000000     1
25%     7.250000     1
50%    14.500000     1
75%    21.750000     1
max    29.000000     1
using option "same" returns a df with similar-columns
In [11]: df.describe(return_type="same")
Out[11]:
       colA colB      colC colD
count    30   30        30   30
unique    2    5       NaN  NaN
top     foo    d       NaN  NaN
freq     20    6       NaN  NaN
mean    NaN  NaN      14.5    1
std     NaN  NaN  8.803408    0
min     NaN  NaN         0    1
25%     NaN  NaN      7.25    1
50%     NaN  NaN      14.5    1
75%     NaN  NaN     21.75    1
max     NaN  NaN        29    1
One of my favorite patterns, using groupby:
In [13]: out = df.groupby("colA").describe(return_type="same")
In [14]: out.unstack(0)
colB colC colD
colA bar foo bar foo bar foo
count 10 20 10 20 10 20
unique 5 5 NaN NaN NaN NaN
top d d NaN NaN NaN NaN
freq 2 4 NaN NaN NaN NaN
mean NaN NaN 15.5 14 1 1
std NaN NaN 9.082951 8.855566 0 0
min NaN NaN 2 0 1 1
25% NaN NaN 8.75 6.75 1 1
50% NaN NaN 15.5 14 1 1
75% NaN NaN 22.25 21.25 1 1
max NaN NaN 29 28 1 1
Is there a reason you think that the above approach is better than:
In [8]: df.select_dtypes(['object']).describe()
Out[8]:
colA colB
count 30 30
unique 2 5
top foo d
freq 20 6
In [9]: df.select_dtypes(['number']).describe()
Out[9]:
colC colD
count 30.000000 30
mean 14.500000 1
std 8.803408 0
min 0.000000 1
25% 7.250000 1
50% 14.500000 1
75% 21.750000 1
max 29.000000 1
(possibly adding include/exclude to .describe then doing a select_dtypes might be ok though)
- The current behaviour of describe() is unintuitive (unless the dataframe has a homogeneous type), as it returns a subset of columns, and that subset depends on the other included columns.
- describe(), as an easy, short, concise way to get an overview of the df values, is especially appreciable during interactive sessions. Using a manual approach, or "select_dtypes" pre-filterings, can of course eventually output the results, but it defeats some of the practicality of describe().
- Getting the result of describe() with the same columns as the original df is quite handy, e.g. you can re-use the same column-subset list. For example, compare:
In [60]: df.dtypes
Out[60]:
colA     object
colB     object
colC      int32
colD    float64
dtype: object
In [56]: model_col = ["colA","colB"]
In [57]: df.loc[:,model_col].describe().loc[:,model_col]
Out[57]:
       colA colB
count    30   30
unique    2    5
top     foo    d
freq     20    6
In [58]: model_col = ["colA","colB","colC"]
In [59]: df.loc[:,model_col].describe().loc[:,model_col]
Out[59]:
       colA colB       colC
count   NaN  NaN  30.000000
mean    NaN  NaN  14.500000
std     NaN  NaN   8.803408
min     NaN  NaN   0.000000
25%     NaN  NaN   7.250000
50%     NaN  NaN  14.500000
75%     NaN  NaN  21.750000
max     NaN  NaN  29.000000
Here we have lost the count(), unique(), first(), etc. of colA and colB as soon as we introduced colC in the model.
However, they are not lost anymore when using return_type="same":
In [61]: df.loc[:,model_col].describe(return_type="same").loc[:,model_col]
Out[61]:
       colA colB      colC
count    30   30        30
unique    2    5       NaN
top     foo    d       NaN
freq     20    6       NaN
mean    NaN  NaN      14.5
std     NaN  NaN  8.803408
min     NaN  NaN         0
25%     NaN  NaN      7.25
50%     NaN  NaN      14.5
75%     NaN  NaN     21.75
max     NaN  NaN        29
- It also works out-of-the-box with groupby() result objects. Very useful when interacting with data in an exploratory way: I can easily see/stack the summary results of my various grouping pivots. An alternative would need several lines of code, possibly requiring a small helper function (or an ugly lambda) every time (a sketch of such a helper is given just after this list). In the example before, even with select_dtypes(), I'm not even sure how to properly include a summary for colB (categorical) in the df.groupby("colA").describe() default output. In my proposal, it's just a matter of df.groupby("colA").describe(return_type="same"). Of course, it's more convincing with real-world large dataframes of mixed types (as used e.g. in psychology), where it's easy to mentally lose track of every column and its type.
- Most of the internal type-selecting logic was already implemented, so it was only a matter of exposing it.
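For reference, a minimal sketch of the kind of per-group helper alluded to above, reusing the example frame df; describe_mixed is a hypothetical name, not a pandas API:

import numpy as np
import pandas as pd

def describe_mixed(g):
    # Summarize the object columns and the numeric columns of a group
    # separately, then glue the two summaries side by side.
    obj = g.select_dtypes(include=['object']).describe()
    num = g.select_dtypes(include=[np.number]).describe()
    return pd.concat([obj, num], axis=1)

out = df.groupby("colA").apply(describe_mixed)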
ok, your idea of 'same' is ok, but the API is not consistent with the pandas style.
I would be ok with adding include, exclude keywords to .describe to filter (with a default of include='number', exclude=None). You could simply make it accept include='same' (and intercept that) to implement the functionality (and use .select_dtypes internally).
It's a well-constructed and general API by @cpcloud.
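For illustration, a minimal sketch of what the suggested API could look like when dispatching through select_dtypes; describe_filtered is a hypothetical name and this is not the implementation that was merged:

import numpy as np
import pandas as pd

def describe_filtered(df, include=None, exclude=None):
    # Pick the columns to summarize with select_dtypes, then describe
    # each remaining column and glue the results together.
    if include is None and exclude is None:
        # mimic the historical default: numeric columns when any exist
        subset = df.select_dtypes(include=[np.number])
        if subset.shape[1] == 0:
            subset = df
    else:
        subset = df.select_dtypes(include=include, exclude=exclude)
    pieces = [col.describe() for _, col in subset.items()]
    return pd.concat(pieces, axis=1, keys=subset.columns)

With the earlier example frame, describe_filtered(df, include=['object']) would summarize only the string columns, and include/exclude lists combine just like in select_dtypes.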
Ok, I implemented your suggested API, and it's indeed more flexible while retaining the usability. Great!
Now it's possible to specify the output form using include=/exclude= lists.
I overrode 'include' to add the 'all' (alias '*') keyword, which enforces no filtering, while the None/None (default) pair enforces the previous, type-dependent behaviour.
Some snippets below:
from pandas import Series
from pandas import DataFrame
import pandas.util.testing as tm
import numpy as np

df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
                'catB': ['a', 'b', 'c', 'd'] * 6,
                'numC': np.arange(24),
                'numD': np.arange(24.) + .5,
                'ts': tm.makeTimeSeries()[:24].index})
df.describe(include=["number","object"]) catA catB numC numD count 24 24 24.000000 24.000000 unique 2 4 NaN NaN top foo d NaN NaN freq 16 6 NaN NaN mean NaN NaN 11.500000 12.000000 std NaN NaN 7.071068 7.071068 min NaN NaN 0.000000 0.500000 25% NaN NaN 5.750000 6.250000 50% NaN NaN 11.500000 12.000000 75% NaN NaN 17.250000 17.750000 max NaN NaN 23.000000 23.500000 df.loc[:,:].describe() # as before numC numD count 24.000000 24.000000 mean 11.500000 12.000000 std 7.071068 7.071068 min 0.000000 0.500000 25% 5.750000 6.250000 50% 11.500000 12.000000 75% 17.250000 17.750000 max 23.000000 23.500000
df.loc[:,['catA','catB','ts']].describe()   # contains NaN, as before
       catA catB                   ts
count    24   24                   24
unique    2    4                   24
first   NaN  NaN  2000-01-03 00:00:00
last    NaN  NaN  2000-02-03 00:00:00
top     foo    d  2000-01-31 00:00:00
freq     16    6                    1
df.describe(include=["object"]) catA catB count 24 24 unique 2 4 top foo d freq 16 6 df.describe(include='*') catA catB numC numD ts count 24 24 24.000000 24.000000 24 unique 2 4 NaN NaN 24 top foo d NaN NaN 2000-01-31 00:00:00 freq 16 6 NaN NaN 1 first NaN NaN NaN NaN 2000-01-03 00:00:00 last NaN NaN NaN NaN 2000-02-03 00:00:00 mean NaN NaN 11.500000 12.000000 NaN std NaN NaN 7.071068 7.071068 NaN min NaN NaN 0.000000 0.500000 NaN 25% NaN NaN 5.750000 6.250000 NaN 50% NaN NaN 11.500000 12.000000 NaN 75% NaN NaN 17.250000 17.750000 NaN max NaN NaN 23.000000 23.500000 NaN
df.loc[:,['catA','catB']].describe(include='*')
       catA catB
count    24   24
unique    2    4
top     foo    d
freq     16    6

df.describe(include='*', exclude='XXX')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/generic.py", line 3681, in describe
    raise ValueError("exclude must be None when include is '%s'" % include)
ValueError: exclude must be None when include is '*'
df.groupby("catA").describe(include='*') # my favorite catB numC numD ts catA
bar count 8 8.000000 8.000000 8 unique 4 NaN NaN 8 top d NaN NaN 2000-01-31 00:00:00 freq 2 NaN NaN 1 first NaN NaN NaN 2000-01-05 00:00:00 last NaN NaN NaN 2000-02-03 00:00:00 mean NaN 12.500000 13.000000 NaN std NaN 7.348469 7.348469 NaN min NaN 2.000000 2.500000 NaN 25% NaN 7.250000 7.750000 NaN 50% NaN 12.500000 13.000000 NaN 75% NaN 17.750000 18.250000 NaN max NaN 23.000000 23.500000 NaN foo count 16 16.000000 16.000000 16 unique 4 NaN NaN 16 top d NaN NaN 2000-01-25 00:00:00 freq 4 NaN NaN 1 first NaN NaN NaN 2000-01-03 00:00:00 last NaN NaN NaN 2000-02-02 00:00:00 mean NaN 11.000000 11.500000 NaN std NaN 7.118052 7.118052 NaN min NaN 0.000000 0.500000 NaN 25% NaN 5.500000 6.000000 NaN 50% NaN 11.000000 11.500000 NaN 75% NaN 16.500000 17.000000 NaN max NaN 22.000000 22.500000 NaN df.groupby("catA").describe(include=["object", "datetime", "number"], exclude=["float"]) catB numC ts catA
bar count 8 8.000000 8 unique 4 NaN 8 top d NaN 2000-01-31 00:00:00 freq 2 NaN 1 first NaN NaN 2000-01-05 00:00:00 last NaN NaN 2000-02-03 00:00:00 mean NaN 12.500000 NaN std NaN 7.348469 NaN min NaN 2.000000 NaN 25% NaN 7.250000 NaN 50% NaN 12.500000 NaN 75% NaN 17.750000 NaN max NaN 23.000000 NaN foo count 16 16.000000 16 unique 4 NaN 16 top d NaN 2000-01-25 00:00:00 freq 4 NaN 1 first NaN NaN 2000-01-03 00:00:00 last NaN NaN 2000-02-02 00:00:00 mean NaN 11.000000 NaN std NaN 7.118052 NaN min NaN 0.000000 NaN 25% NaN 5.500000 NaN 50% NaN 11.000000 NaN 75% NaN 16.500000 NaN max NaN 22.000000 NaN
Some minor design decision points:
- Meaningless input raises exceptions (mostly through select_dtypes), except through groupby(), which catches them and falls back to returning 'all'.
- The index of the returned results (count/mean/50%/unique/first...) still varies depending on the dataframe content (which kind of contradicts my initial complaint about columns, but I never saw that as a problem when it concerned rows). Its order is also not stable.
- I thought it might be convenient to move the '*' dtype-filtering trick directly into select_dtypes(). Would it be useful?
for x in fself.columns]
# merge individual outputs, preserving index order as possible
names = []
ldesc_indexes = sorted([x.index for x in ldesc], key=len)
hmm concat does this by default (preserve order and concat other axes)
so no need for all this section (you might need to tweak some options to concat)
Unless I missed something, I believe by default only the column order is preserved; the index order is lexsorted, which makes the output a bit unfriendly. Without it:
pd.concat(ldesc, keys=fself.columns, axis=1)
Out[38]:
        catA catB       numC       numD                   ts
25%      NaN  NaN   5.750000   6.250000                  NaN
50%      NaN  NaN  11.500000  12.000000                  NaN
75%      NaN  NaN  17.250000  17.750000                  NaN
count     24   24  24.000000  24.000000                   24
first    NaN  NaN        NaN        NaN  2000-01-03 00:00:00
freq      16    6        NaN        NaN                    1
last     NaN  NaN        NaN        NaN  2000-02-03 00:00:00
max      NaN  NaN  23.000000  23.500000                  NaN
mean     NaN  NaN  11.500000  12.000000                  NaN
min      NaN  NaN   0.000000   0.500000                  NaN
std      NaN  NaN   7.071068   7.071068                  NaN
top      foo    d        NaN        NaN  2000-01-31 00:00:00
unique     2    4        NaN        NaN                   24
It interleaves the different types' output. Though the output index order of describe() is not supposed to be guaranteed, I found the ordered form greatly convenient for interactive sessions.
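A minimal, self-contained sketch of the same point, with statistic values abbreviated from the example above:

import pandas as pd

# Two per-column summaries with different statistic indexes.
cat_desc = pd.Series([24, 2, 'foo', 16],
                     index=['count', 'unique', 'top', 'freq'], name='catA')
num_desc = pd.Series([24.0, 11.5, 7.07, 0.0],
                     index=['count', 'mean', 'std', 'min'], name='numC')

# Depending on options/version, concat may hand back the union of the
# two indexes in sorted order rather than "categorical block first".
raw = pd.concat([cat_desc, num_desc], axis=1)

# Rebuilding the row order explicitly restores the friendlier layout.
names = ['count', 'unique', 'top', 'freq', 'mean', 'std', 'min']
print(raw.reindex(names))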
you don't need to use keys, that's the problem; just construct the sub-frames and concat glues them together in the same order
Sorry, I'm confused.
Here, the ldesc list is already in the column order, ready to be concat'ed; I use keys= only as a way to set back the column names that the Series.describe()s lost.
The index (row) order has to be ordered "by blocks" according to a logic which cannot trivially be inferred from the list (i.e. categorical rows if any, then numeric if any); even if I could somehow avoid the lexsort and use the inputs' orders (which I failed to do), it would order the blocks differently depending on the first passed column's dtype, which is less friendly for the user. So I believe it's necessary to pre-create the index order ("names" here).
I admit I could use join_axes=Index(names) instead of .loc[names], though.
Introducing this kind of coupling hardly seems worth it for the "inconvenience" of having to call a single method.
Also, how are 'all' and '*' in select_dtypes different from just not calling the method?
Ok, following the previous comment, I refrained from touching select_dtypes.
I left the row-creation logic as it was, since I believed it was necessary, as discussed before.
I'm quite happy with the current code. In addition to docstrings, I also added a brief overview in the main doc.
[I also have a potential short changelog entry, but I guess it's rude to commit it (e.g. in v0.15.txt) before knowing if you plan to merge this at all :)]
bthyreau changed the title from "more flexible describe() + tests" to "ENH: more flexible describe() + tests"
.. ipython:: python

   frame = DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
   frame.describe(include=['object'])
show frame.describe(), e.g. the default case
Ok, I updated the main doc and docstring following your request.
Thanks
As for the rationale of those loops: they are necessary to compute the order of the row axis (the statistics list). The describe() function must output results that are immediately practical for users; without the loop, as you showed, the percentiles are not surrounded by min/max, count ends up in the middle, etc., due to the default lexsorting logic of Index operations.
In detail, in the snippet below:
- The first loop is basically the apply (except the concatenation is delayed, so I can still have access to each series' length)
- The second loop is the index-sorting logic
That's why I gave up using apply in this case. I also experimented with other ways, such as the various Index manipulation functions or pre-computing the row keys, etc., but it didn't improve much.
Note also that, as a side effect, the whole function itself seems to be slightly faster than the bare, logic-less apply.
def test1(fself, percentile_width=None, percentiles=[]):
    ldesc = []
    for name, col in fself.iteritems():
        s = col.describe(percentile_width=percentile_width,
                         percentiles=percentiles)
        s.name = name
        ldesc.append(s)
    # set a convenient order for rows
    names = []
    ldesc_indexes = sorted([x.index for x in ldesc], key=len)
    for idxnames in ldesc_indexes:
        for name in idxnames:
            if name not in names:
                names.append(name)
    d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
    return d
In [84]: %timeit test1(df, percentiles=[.42])
100 loops, best of 3: 5.4 ms per loop

In [85]: %timeit df.apply(lambda x: x.describe(percentile_width=None, percentiles=[.42]))
100 loops, best of 3: 6.59 ms per loop
Same pattern on a wider (24, 500)-shaped df: 458 ms vs 499 ms
# set a convenient order for rows
names = []
ldesc_indexes = sorted([x.index for x in ldesc], key=len)
for idxnames in ldesc_indexes:
at the very least this should be a list comprehension then
Well, it's only style, but if you want the list comprehension back, then fine; while at it, to make some actual improvement, I changed the behaviour on Series so that the index name gets filled at creation time. See commit.
pls squash to a single commit
fself = self.select_dtypes(include=include, exclude=exclude)
ldesc = [col.describe(percentile_width=percentile_width,
                      percentiles=percentiles) for _, col in fself.iteritems()]
# set a convenient order for rows
blank line between blocks
Ok, I squashed all commits into one, which updates the code and the main doc.
What I haven't committed yet is the changelog/release-notes text, which I have ready. Should I commit it somewhere?
Thanks
@@ -490,6 +490,23 @@ number of unique values and most frequently occurring values:
   s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
   s.describe()

Note that on a mixed-type DataFrame object, `describe` will restrict the summary to
include only numerical columns or, if none are, only categorical columns:
pls add a similar note to v0.15.0.txt (and include this PR number as a reference). Put it in the API section. Include a reference to this doc section there.
frame.describe(include=['object'])
frame.describe(include=['number'])
frame.describe(include='all')
Can you refer to select_dtypes here? (as it is in that function that the include/exclude arguments are explained in more detail)
@bthyreau added some more comments (sorry it took a while to look at).
Further I have another remark: is the '*' option needed? Is just using 'all' not enough? (it's only two characters more). I don't think * is used anywhere in pandas at the moment, and it would just be two aliases for the same thing.
@cpcloud: are you ok with including this? as you objected in the first place?
- Thanks for your feedback. Following your comments about the condition path, I finally decided to refactor that whole function. I was initially reluctant to change it, as it changes some of the (unspecified) order of fields in the output, but it makes the code quite a bit smaller and much clearer now.
The only user-visible change is that the order of the output rows is slightly changed (in a more consistent way, IMHO), and I had to alter an existing unit test which assumed an order. - Also, I wanted to confirm the behaviour using the new pd.Categorical data type, but I think select_dtypes() can't (explicitly) filter on it (yet?). Am I right?
select_dtypes handles categorical (pass in 'category' as the dtype)
let's drop using * and just use all to include all dtypes
@jreback ok thanks. Ok to drop "*" if you think it's inconsistent with the rest of pandas.
I'll fix the branch ASAP (it fails due to a minor discrepancy regarding the status of "bool", plus I want to assess categories) and push it again soon.
- Ok, removed '*' as you requested.
[Of course, if it were only up to me, "all" would be the default describe() behaviour (since I can't figure out the motivation for silently dropping random columns), so in real-world use I will have to alias it anyway; a sketch of such an alias is below.] I hope I made the doc clear enough to warn about this pitfall.
Thanks
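A tiny sketch of the aliasing mentioned above, as a hypothetical convenience for interactive sessions (not part of pandas):

import pandas as pd

def describe_all(df, **kwargs):
    # make "describe everything" a one-word call
    return df.describe(include='all', **kwargs)

pd.DataFrame.describe_all = describe_all  # monkey-patch for convenience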
@bthyreau well
describe is most useful for only numeric columns and that is the default
so it provides backwards compat
it does not drop random columns rather it by default selects numeric
(which is how most operations work FYI)
@@ -3658,6 +3658,16 @@ def abs(self):
    The percentiles to include in the output. Should all
    be in the interval [0, 1]. By default `percentiles` is
    [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
include, exclude : list-like, 'all', or None (default)
    Specify the form of the returned result. Either:
You have to leave an empty line after this line so that the list will be rendered as a list (in the online HTML docstring pages). The other option is to remove the line and just have the list items; then there does not need to be an empty line.
Ok, refactored a bit to avoid the recomputation of parameters due to recursion. Thanks for pointing it out; as a bonus, the code paths are shorter and easier to follow!
I also updated the docstring following your suggestion about more explicit mixed-type cases.
Thanks!
@@ -3751,42 +3767,45 @@ def describe_categorical_1d(data):
    elif issubclass(data.dtype.type, np.datetime64):
        asint = data.dropna().values.view('i8')
        names += ['first', 'last', 'top', 'freq']
I suspect this does not work with Timedelta, but it's easy to fix; do something like:

if com.needs_i8_conversion(data):
    boxer = com.i8_boxer(data)
    asint = data.dropna().asi8
    names = ......

then just use boxer rather than lib.Timestamp
Thanks for pointing that out.
Hmm... although your snippet works well, Timedelta seems closer to a numeric type than an object one: select_dtypes() filters it with np.number, and users would like to get min/max/mean values, not first/last. So instead, I moved it to the describe_numeric code path. Since it's new in 0.15, I think it's ok to do so even in the "default" describe call. However, std() doesn't make a lot of sense, so an alternative would be to create another specific index for timedelta.
i.e. the current code behaves like:
In [4]: df = DataFrame({'A_cat': Categorical(['foo', 'foo', 'bar'] * 8),
   ...:                 'B_obj': ['a', 'b', 'c', 'd'] * 6,
   ...:                 'C_int': np.arange(24, dtype='int64'),
   ...:                 'D_ts': tm.makeTimeSeries(24).index,
   ...:                 'E_tdelta': to_timedelta(np.arange(24) % 20, "D")})
In [6]: df.describe(include="all") Out[6]: A_cat B_obj C_int D_ts E_tdelta count 24 24 24.000000 24 24 unique 2 4 NaN 24 NaN top foo d NaN 2000-01-31 00:00:00 NaN freq 16 6 NaN 1 NaN first NaN NaN NaN 2000-01-03 00:00:00 NaN last NaN NaN NaN 2000-02-03 00:00:00 NaN mean NaN NaN 11.500000 NaN 8 days 04:00:00 std NaN NaN 7.071068 NaN 5.354416e+14 min NaN NaN 0.000000 NaN 0 days 00:00:00 25% NaN NaN 5.750000 NaN 2 days 18:00:00 50% NaN NaN 11.500000 NaN 7 days 12:00:00 75% NaN NaN 17.250000 NaN 13 days 06:00:00 max NaN NaN 23.000000 NaN 19 days 00:00:00
In [7]: df.describe()
Out[7]:
           C_int         E_tdelta
count  24.000000               24
mean   11.500000  8 days 04:00:00
std     7.071068     5.354416e+14
min     0.000000  0 days 00:00:00
25%     5.750000  2 days 18:00:00
50%    11.500000  7 days 12:00:00
75%    17.250000 13 days 06:00:00
max    23.000000 19 days 00:00:00
this is fine (I am going to fix the std of the Timedelta right now actually). go ahead and make those slight changes (below) and we'll get this in.
On 2nd thought, even though std IS possible, it's too tricky for right now (the problem is that var is not allowed because it CAN overflow and not be representable by a Timedelta, so std needs special handling; punting for now). Putting a wrapper around these ops to catch a TypeError and turn it into a NaN is the best soln I think.
Hmm; in the example above, the float64 value (5.354416e+14) for std() is actually correct, and converts correctly with to_timedelta(df.std()) to 6 days 04:44:01.629993; [the var(), OTOH, throws an OverflowError, but describe() doesn't use var()].
I'm not sure why the returned std value is not a timedelta64 but a float64; it looks like a problem with the std() call. Did you mean I should still catch that in describe() and recover? (I may have misunderstood, since I'm not familiar with that part.)
I fixed your other comment - Thanks!
@bthyreau small change.
Some .std() calls on a Timedelta will work (except for the returned inference). But that's the problem: some will also overflow. This stems from std = sqrt(var), and var can be a number that is really big.
In [7]: max_int = np.iinfo(np.int64).max
In [8]: max_int
Out[8]: 9223372036854775807
In [9]: pd.Timedelta(max_int)
Out[9]: Timedelta('106751 days 23:47:16.854775')
In [10]: big_float = 3e19
In [12]: pd.Timedelta(big_float)
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
OverflowError: Python int too large to convert to C long
In [13]: pd.Timedelta(np.sqrt(big_float))
Out[13]: Timedelta('0 days 00:00:05.477225')
Going to create an issue to fix this, but don't have time right now.
So just wrap the .std() as I show above to catch this (it will raise TypeError because it doesn't allow std/var ATM).
The complication is that std is allowed, but not var (but std CALLS var). So this needs to be done in a non-hacky way.
see here: #8471
lmk when you make that change and push.
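A minimal sketch of the wrapper being suggested, with hypothetical names (the real change would live inside the describe code path):

import numpy as np

def safe_stat(series, stat):
    # Try the reduction; fall back to NaN when the dtype does not
    # support it yet (e.g. .std() raising TypeError on timedelta data).
    try:
        return getattr(series, stat)()
    except TypeError:
        return np.nan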
@bthyreau if u can address this soon would be gr8
if not lmk - going to do a rc prob tonight and this goes in
I think #8476 will allow this to merge cleanly. so hold off
@bthyreau ok I think if u rebase this should work
ok great. Rebasing and pushing now
This enhances describe()'s output via new include/exclude list arguments, letting the user specify the dtypes to be summarized in the output. This provides a simple way to overcome the automatic type filtering done by default; it's also convenient with groupby(). Also includes documentation and changelog entries.
side issue:
I think we may need a rounding option or something to make some of the default
display better looking. This comes up most often in timedeltas, but maybe other dtypes too.
This example is from your tests.
In [23]: df['td'] = df['ts'].diff()
In [24]: df.describe(include='all')
Out[24]:
catA catB numC numD ts td
count 24 24 24.000000 24.000000 24 23
unique 2 4 NaN NaN 24 NaN
top foo d NaN NaN 2000-01-31 00:00:00 NaN
freq 16 6 NaN NaN 1 NaN
first NaN NaN NaN NaN 2000-01-03 00:00:00 NaN
last NaN NaN NaN NaN 2000-02-03 00:00:00 NaN
mean NaN NaN 11.500000 12.000000 NaN 1 days 08:20:52.173913
std NaN NaN 7.071068 7.071068 NaN 0 days 18:36:09.225425
min NaN NaN 0.000000 0.500000 NaN 1 days 00:00:00
25% NaN NaN 5.750000 6.250000 NaN 1 days 00:00:00
50% NaN NaN 11.500000 12.000000 NaN 1 days 00:00:00
75% NaN NaN 17.250000 17.750000 NaN 1 days 00:00:00
max NaN NaN 23.000000 23.500000 NaN 3 days 00:00:00
In [25]: df['td'].describe(include='all')
Out[25]:
count 23
mean 1 days 08:20:52.173913
std 0 days 18:36:09.225425
min 1 days 00:00:00
25% 1 days 00:00:00
50% 1 days 00:00:00
75% 1 days 00:00:00
max 3 days 00:00:00
Name: td, dtype: object
In [26]: df['td'].describe(include='all')['mean']
Out[26]: Timedelta('1 days 08:20:52.173913')
You can 'fix' this by rounding (and you can check Timedelta(...).resolution to make sure that you are not cutting things off, e.g. Timedelta('1min 1s').resolution -> 's').
In [27]: df['td'].describe(include='all')['mean'].round('s')
Out[27]: Timedelta('1 days 08:20:52')
so prob need to have a wrapper for various functions (e.g. mean/std) to do this (for numeric-like) with an option.
@bthyreau If you think this is worthwhile, pls create a new issue.
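If someone picks this up, such a wrapper could look roughly like the following sketch (rounded_stat is a hypothetical helper, not a pandas API):

import pandas as pd

def rounded_stat(series, stat, freq='s'):
    # Compute the statistic and, for timedelta-like results, round it
    # to the given resolution so the displayed value is easier to read.
    result = getattr(series, stat)()
    if isinstance(result, pd.Timedelta):
        return result.round(freq)
    return result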