Pairwise versions for rolling_cov, ewmcov and expanding_cov by snth · Pull Request #4950 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation32 Commits5 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

I added the functions rolling_cov_pairwise(), expanding_cov_pairwise() and ewmcov_pairwise(). I also modified the behaviour of rolling_corr_pairwise(), expanding_corr_pairwise() and ewmcorr_pairwise() slightly so that now these can be computed between different DataFrames and the resulting matrices at each time slice will be rectangular rather than square.

I think this is what was asked for in the original discussion that gave rise to issue #1853.

Fixes #1853

It looks like these have nearly the same signature as the existing
functions. Would it make more sense to add a keyword argument
pairwise=False to the public api and then dispatch to the appropriate
functions based on that?

I agree with you. I provided these implementations under the *_pairwise names to fit in with the current API as described on the Computational Tools page.

While the behaviour of rolling_cov and rolling_corr for the Series/Series and DataFrame/Series seems perfectly fine, the DataFrame/DataFrame behaviour doesn't make much sense to me. I can't think of any scenarios where I would want to match on column names and then describe the result as a rolling_cov rather than a rolling_var. For me a covariance is calculation between two different Series so the name matching doesn't make much sense. The tabular layout of the *_pairwise functions makes it clear which two inputs were used to calculate the covariance or correlation.

However I clearly don't know where people are using this so I rather provided these pairwise implementations under different function names in line with the current rollin_corr_pairwise API and let you core devs decide how to integrate it into the API.

Ok, I've been thinking more about this and think that rather than taking guidance from my own biased preferences, a better approach is to try and minimise cognitive impedance in the API along the lines of the "Don't make me think" principle.

Now DataFrame already has cov and corr methods which compute pairwise covariances and correlations respectively. Presumably the raison d'etre of the rolling_* and expanding_* functions is a convenience to the user (to avoid having to loop over the index manually) as well as to take advantage of any performance improvements that might exist in an on-line algorithm over the naive loop implementation. As such it would seem that the rolling_cov and rolling_corr functions should preserve the API and behaviour of cov and corr respectively while adding an additional dimension to the result introduced by the rolling operation. Therefore {rolling|expanding}_cov and {rolling|expanding}_corr should produce Panels so that the invariant expanding_cov(df, ...).ix[-1] == df.cov() is maintained.

To deal with the case where the non-pairwise version is desired, I just learned that for DataFrame there exists a corrwith method which covers this case. Therefore how about renaming the existing rolling_corr function to rolling_corrwith and introducing covwith as well as {rolling_|expanding_|ewm}covwith. Then the *_pairwise methods I provided could just become the standard API without the _pairwise suffix.

@snth It could be hard sell to change the existing functions in the way you describe (since they'd produce different behavior with current defaults). if you couldn't change the existing API (i.e., pd.rolling_corr(...) with the current defaults needs to return the same thing as it did in previous versions), are there still ways that you think are simple to make these changes?

It looks like there already is a rolling_corr_pairwise function in the pandas namespace.

That said, why couldn't we change the signature such that if you only passed one object, it computes the rolling pairwise correlation and otherwise computes the correlation between the two objects. Shouldn't be that hard to implement. The only quirk would be that you'd have to either change the signature to: def rolling_corr(arg1, arg2=None, window=None, ...). And then if arg2 is None, do the pairwise correlation. It would mean that you'd always have to use keyword arguments for rolling pairwise functions, but I don't think that's such a big deal.

@jtratner I like your approach of overloading the signature. I can probably do this inside _flex_binary_moment() and then it should become available to all these functions we're discussing here. However I noticed that ewmcov() is currently not using _flex_binary_moment so I'll see if I can change that first.

@jtratner I've tried to incorporate the API changes as you suggested. I still need to update the documentation but I ran out of time. I'll look at that tomorrow.

@jtratner I've updated the documentation for the API changes. Along the way I unified the 3 doc template. Hopefully everything unaffected function docstrings are still intact. I did a couple of spot checks and everything looked ok to me. Would be good to have some more eyeballs on this.

I noticed that the center keyword argument isn't documented but appears in the function signatures. Should we add this?

The only other thing I'd like to add is to make these functions part of the DataFrame API as I'd quite like to be able to say things like df.rolling_cov(window=5). How do you feel about that?

@snth can you rebase this....looks pretty good

@snth I just merged a PR (#6362) that also touches the docstrings in this file, which will make it more difficult to rebase (there will be some merge conflicts). If you need any help to resolve them, just let it know!

I tried a rebase of this onto on of the 0.13 release candidates some time ago and there were quite a few conflicts already then. I'm a little busy at the moment so I'll only be able to look at this next week at the earliest.

@snth when you have a chance

@jreback I finally got around to doing that rebase. I rebased off of master as of this morning for which the whole test suite passed on my machine. Post-rebase tests look all ok as well but let's see what Travis says.

@jorisvandenbossche I tried to incorporate your docstring changes but as I had changed the template slightly I could have missed copying some changes across. If you wouldn't mind having a look to check that everything looks ok.

this look great!

can you add a 1-liner release note referncing the issue inImprovements and put in 0.14.0.txt as well.
if you want to put a nice example in the docs / v0.14.0 would be great
can you squash down

One remark: due to your (usefull!) refactoring of the docstrings, now the two notes I added (on "by default, result is set to right edge of window ..." and on "the freq keyword is used ...") are added to all functions, also to the expanding_ functions. However, there the center keyword argument is not sensible (although it is in the function signature, it is also not in the docstring). So the note on that should also only be shown in the rolling_ docstrings.

A quick note about the additional behaviour I introduced.

Basically I took the pairwise behaviour from rolling_corr_pairwise() and moved it into _flex_binary_moment() in order to make it available to rolling_cov(), ewmcov(), expanding_cov(), ...

I am no statistician and cannot promise that this is statistically sound. In particular if there are missing values in the data then a different number of datapoints will be used in the calculation of different entries of the covariance matrices and these covariance matrices are not guaranteed to be positive semi-definite.

I just want to point that out as often the availability of something is taken as an implicit guarantee that it is correct, especially by novice users who are not trained in the field. In this case the user is responsible for ensuring that the results are suitable for his or her use-case.

can you add a blurb to the docstring about this (and in the doc in the moments section)?
can you add a release note / v0.14.0.txt
squash down pls

thanks

otherwise looks good

I'm busy putting in release note. Should be done tomorrow.

I added a release note and some example usage for pairwise=True to the docs.

I also marked rolling_corr_pairwise() and expanding_corr_pairwise() as deprecated since these simply call rolling_corr() and expanding_corr() respectively. If you don't want to deprecate these then I will simply omit that commit when I squash everything down.

I think you need to rebase (it can't be automatically merged)

jreback