Pairwise versions for rolling_cov, ewmcov and expanding_cov by snth · Pull Request #4950 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation32 Commits5 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
I added the functions rolling_cov_pairwise(), expanding_cov_pairwise() and ewmcov_pairwise(). I also modified the behaviour of rolling_corr_pairwise(), expanding_corr_pairwise() and ewmcorr_pairwise() slightly so that now these can be computed between different DataFrames and the resulting matrices at each time slice will be rectangular rather than square.
I think this is what was asked for in the original discussion that gave rise to issue #1853.
Fixes #1853
It looks like these have nearly the same signature as the existing
functions. Would it make more sense to add a keyword argumentpairwise=False
to the public api and then dispatch to the appropriate
functions based on that?
I agree with you. I provided these implementations under the *_pairwise names to fit in with the current API as described on the Computational Tools page.
While the behaviour of rolling_cov
and rolling_corr
for the Series/Series
and DataFrame/Series
seems perfectly fine, the DataFrame/DataFrame
behaviour doesn't make much sense to me. I can't think of any scenarios where I would want to match on column names and then describe the result as a rolling_cov
rather than a rolling_var
. For me a covariance is calculation between two different Series so the name matching doesn't make much sense. The tabular layout of the *_pairwise functions makes it clear which two inputs were used to calculate the covariance or correlation.
However I clearly don't know where people are using this so I rather provided these pairwise implementations under different function names in line with the current rollin_corr_pairwise
API and let you core devs decide how to integrate it into the API.
Ok, I've been thinking more about this and think that rather than taking guidance from my own biased preferences, a better approach is to try and minimise cognitive impedance in the API along the lines of the "Don't make me think" principle.
Now DataFrame
already has cov
and corr
methods which compute pairwise covariances and correlations respectively. Presumably the raison d'etre of the rolling_*
and expanding_*
functions is a convenience to the user (to avoid having to loop over the index manually) as well as to take advantage of any performance improvements that might exist in an on-line algorithm over the naive loop implementation. As such it would seem that the rolling_cov
and rolling_corr
functions should preserve the API and behaviour of cov
and corr
respectively while adding an additional dimension to the result introduced by the rolling operation. Therefore {rolling|expanding}_cov
and {rolling|expanding}_corr
should produce Panels
so that the invariant expanding_cov(df, ...).ix[-1] == df.cov()
is maintained.
To deal with the case where the non-pairwise version is desired, I just learned that for DataFrame
there exists a corrwith
method which covers this case. Therefore how about renaming the existing rolling_corr
function to rolling_corrwith
and introducing covwith
as well as {rolling_|expanding_|ewm}covwith
. Then the *_pairwise
methods I provided could just become the standard API without the _pairwise
suffix.
@snth It could be hard sell to change the existing functions in the way you describe (since they'd produce different behavior with current defaults). if you couldn't change the existing API (i.e., pd.rolling_corr(...)
with the current defaults needs to return the same thing as it did in previous versions), are there still ways that you think are simple to make these changes?
It looks like there already is a rolling_corr_pairwise
function in the pandas namespace.
That said, why couldn't we change the signature such that if you only passed one object, it computes the rolling pairwise correlation and otherwise computes the correlation between the two objects. Shouldn't be that hard to implement. The only quirk would be that you'd have to either change the signature to: def rolling_corr(arg1, arg2=None, window=None, ...)
. And then if arg2 is None, do the pairwise correlation. It would mean that you'd always have to use keyword arguments for rolling pairwise functions, but I don't think that's such a big deal.
@jtratner I like your approach of overloading the signature. I can probably do this inside _flex_binary_moment()
and then it should become available to all these functions we're discussing here. However I noticed that ewmcov()
is currently not using _flex_binary_moment
so I'll see if I can change that first.
@jtratner I've tried to incorporate the API changes as you suggested. I still need to update the documentation but I ran out of time. I'll look at that tomorrow.
@jtratner I've updated the documentation for the API changes. Along the way I unified the 3 doc template. Hopefully everything unaffected function docstrings are still intact. I did a couple of spot checks and everything looked ok to me. Would be good to have some more eyeballs on this.
I noticed that the center keyword argument isn't documented but appears in the function signatures. Should we add this?
The only other thing I'd like to add is to make these functions part of the DataFrame API as I'd quite like to be able to say things like df.rolling_cov(window=5)
. How do you feel about that?
@snth can you rebase this....looks pretty good
@snth I just merged a PR (#6362) that also touches the docstrings in this file, which will make it more difficult to rebase (there will be some merge conflicts). If you need any help to resolve them, just let it know!
I tried a rebase of this onto on of the 0.13 release candidates some time ago and there were quite a few conflicts already then. I'm a little busy at the moment so I'll only be able to look at this next week at the earliest.
@snth when you have a chance
@jreback I finally got around to doing that rebase. I rebased off of master as of this morning for which the whole test suite passed on my machine. Post-rebase tests look all ok as well but let's see what Travis says.
@jorisvandenbossche I tried to incorporate your docstring changes but as I had changed the template slightly I could have missed copying some changes across. If you wouldn't mind having a look to check that everything looks ok.
this look great!
- can you add a 1-liner release note referncing the issue inImprovements and put in 0.14.0.txt as well.
- if you want to put a nice example in the docs / v0.14.0 would be great
- can you squash down
One remark: due to your (usefull!) refactoring of the docstrings, now the two notes I added (on "by default, result is set to right edge of window ..." and on "the freq keyword is used ...") are added to all functions, also to the expanding_ functions. However, there the center
keyword argument is not sensible (although it is in the function signature, it is also not in the docstring). So the note on that should also only be shown in the rolling_ docstrings.
A quick note about the additional behaviour I introduced.
Basically I took the pairwise behaviour from rolling_corr_pairwise() and moved it into _flex_binary_moment() in order to make it available to rolling_cov(), ewmcov(), expanding_cov(), ...
I am no statistician and cannot promise that this is statistically sound. In particular if there are missing values in the data then a different number of datapoints will be used in the calculation of different entries of the covariance matrices and these covariance matrices are not guaranteed to be positive semi-definite.
I just want to point that out as often the availability of something is taken as an implicit guarantee that it is correct, especially by novice users who are not trained in the field. In this case the user is responsible for ensuring that the results are suitable for his or her use-case.
- can you add a blurb to the docstring about this (and in the doc in the moments section)?
- can you add a release note / v0.14.0.txt
- squash down pls
thanks
otherwise looks good
I'm busy putting in release note. Should be done tomorrow.
I added a release note and some example usage for pairwise=True
to the docs.
I also marked rolling_corr_pairwise()
and expanding_corr_pairwise()
as deprecated since these simply call rolling_corr()
and expanding_corr()
respectively. If you don't want to deprecate these then I will simply omit that commit when I squash everything down.
I think you need to rebase (it can't be automatically merged)
correls[df.index[-50]] |
---|
.. note:: |
This was previously available through ``rolling_corr_pairwise`` which is |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put a prior to version 0.14
minor comments, pls rebase and I think good to go
The deprecation is fine (I marked a reminder to remove in #6581 at some point in the future)
looks good...ping when green
jreback added a commit that referenced this pull request
Pairwise versions for rolling_cov, ewmcov and expanding_cov
You're right, the mention of the 2 *_corr_pairwise() functions should have been deleted. Thanks for sorting it out. Looks good to me.
This was referenced
Apr 3, 2014
jreback added a commit to jreback/pandas that referenced this pull request