ENH: correlation function accepts method being a callable by shadiakiki1986 · Pull Request #22684 · pandas-dev/pandas

In addition to the listed strings for the `method` argument, accept a callable for generic correlation calculations.

- closes #xxxx (no issue opened for PR)
- tests added / passed
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry (not sure if I should do this or the person merging this PR)
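
As a rough illustration of the feature described above, the sketch below passes a callable to `DataFrame.corr` and `Series.corr`. The `cosine_similarity` helper and the sample data are hypothetical and chosen only for illustration; the sketch assumes a pandas version that includes this change (0.24.0 or later).

```python
import numpy as np
import pandas as pd


def cosine_similarity(a, b):
    """Hypothetical pairwise measure: cosine of the angle between two 1d arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.5], "z": [3.0, 1.0, 0.5]}
)

# The callable is applied to pairs of columns; the result is a square frame.
matrix = df.corr(method=cosine_similarity)

# Series.corr accepts the same callable and returns a single number.
pair = df["x"].corr(df["y"], method=cosine_similarity)
```

Any function of two 1d arrays that returns a number can be used the same way.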

I like the idea. Does this close an issue?

We'll need a release note in 0.24.0.txt

@@ -14,7 +14,7 @@ lxml
matplotlib
nbsphinx
numexpr
-openpyxl=2.5.5
+openpyxl==2.5.5

This file is autogenerated. #22689

OK, should I roll this edit back?

Yes, you should.

done

.. ipython:: python

# histogram intersection
histogram_intersection = lambda a, b: np.minimum( np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum()

Make this pep8 compliant

fixed in new commit
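
For reference, one possible PEP 8 compliant form of the histogram intersection is sketched below. It replaces the one-line lambda with a named function; this is an illustration and not necessarily what the follow-up commit contains. The intermediate names `a_norm` and `b_norm` are assumptions.

```python
import numpy as np


def histogram_intersection(a, b):
    """Sum of element-wise minima after normalizing each input to sum to 1."""
    a_norm = np.true_divide(a, a.sum())
    b_norm = np.true_divide(b, b.sum())
    return np.minimum(a_norm, b_norm).sum()


value = histogram_intersection(np.array([1, 0, 2, 1]), np.array([2, 3, 0, 1]))
```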

@@ -6652,10 +6652,12 @@ def corr(self, method='pearson', min_periods=1):

Parameters
----------
-method : {'pearson', 'kendall', 'spearman'}
+method : {'pearson', 'kendall', 'spearman', callable}

I think it'll be `}, or callable`, with the callable on the outside of the options.

moved it to outside of the options

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two numpy 1d-array

"callable expecting two 1d ndarrays and returning a float" (does the callable get ndarrays or series?)

ndarrays
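
A quick way to check this is sketched below: a hypothetical `recording_corr` callable stores the types of its inputs when called through `DataFrame.corr` and `Series.corr`. The helper name and the sample data are assumptions, and the sketch assumes a pandas version with callable support.

```python
import numpy as np
import pandas as pd

seen_types = []


def recording_corr(a, b):
    """Hypothetical callable that records the types of its two inputs."""
    seen_types.append((type(a), type(b)))
    return 0.0


df = pd.DataFrame({"dogs": [1, 0, 2, 1], "cats": [2, 3, 0, 1]})

# Both entry points hand the callable plain arrays, not Series objects.
df.corr(method=recording_corr)
df["dogs"].corr(df["cats"], method=recording_corr)
```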

@@ -789,6 +789,41 @@ def test_corr_invalid_method(self):
with tm.assert_raises_regex(ValueError, msg):
s1.corr(s2, method="____")

def test_corr_callable_method(self):

Hmm, I care less about testing this exact way of computing the correlation, and more about ensuring that the method is dispatched to.

Would it be possible to define a very simple "correlation" function that just returns something like the index of the columns? So the correlation of the nth and mth columns would be something like (n + m). Not sure if that's possible.

ok will re-write the test tomorrow

The test now includes a simpler correlation function. It is not possible to identify the nth/mth column as in your example, because the correlation function knows nothing about the dataframe as a whole; it only sees each series on its own. The correlation function I chose is simple: 1 if the two inputs are exactly equal, else 0.
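
A minimal sketch of such a dispatch check follows. It is not the test added in this PR; the frame contents are made up so that two columns are identical, which a Pearson correlation would not report as 0 and 1 in this pattern.

```python
import numpy as np
import pandas as pd


def my_corr(a, b):
    """Return 1.0 if the two arrays are exactly equal, 0.0 otherwise."""
    return 1.0 if (a == b).all() else 0.0


# Columns "a" and "c" are identical; "b" differs from both.
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1], "c": [1, 2, 3]})

result = df.corr(method=my_corr)
expected = pd.DataFrame(
    [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]],
    index=df.columns,
    columns=df.columns,
)

# Passes only if the callable, not a built-in method, produced the values.
pd.testing.assert_frame_equal(result, expected)
```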

This doesn't fix any issue, but I added a note in the 0.24.0 release notes

@@ -17,6 +17,10 @@ New features

- ``ExcelWriter`` now accepts ``mode`` as a keyword argument, enabling append to existing workbooks when using the ``openpyxl`` engine (:issue:`3441`)


- :meth:`DataFrame.corr` and :meth:`Series.corr` now accept a callable for generic calculation methods of correlation, e.g. histogram intersection

Use your PR as the issue number. Also, no new line above this sentence.

done

cc @jreback if you have a chance.

This provides a nice, generic alternative to #21925.

small comments.

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarrays

Can you add an example in the Examples section?

done

@@ -764,6 +764,9 @@ def nancorr(a, b, method='pearson', min_periods=None):


def get_corr_func(method):
if callable(method):
return method

if method in ['kendall', 'spearman']:

elif

done
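
The dispatch idea in this hunk can be sketched in isolation as follows. `get_corr_func_sketch` is a simplified, hypothetical stand-in and not pandas' actual `get_corr_func`: it returns a callable unchanged and otherwise handles only `'pearson'`, via `np.corrcoef`.

```python
import numpy as np


def get_corr_func_sketch(method):
    """Simplified stand-in for the dispatch idea, not pandas' implementation."""
    if callable(method):
        # A user-supplied callable is returned unchanged.
        return method
    elif method == "pearson":
        return lambda a, b: np.corrcoef(a, b)[0, 1]
    raise ValueError(f"unknown method {method!r}")


def always_half(a, b):
    """Hypothetical callable used to show the pass-through branch."""
    return 0.5


x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

custom = get_corr_func_sketch(always_half)(x, y)
pearson = get_corr_func_sketch("pearson")(x, y)
```

Checking `callable(method)` first means the string comparisons are never evaluated for a function argument.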

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarray

I am not sure how to doc-string this signature here

Should I just leave it as is?

Probably fine as is. The type would be Callable[[ndarray, ndarray], float], but I'm not sure how familiar people are with typing yet.
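
For readers less familiar with `typing`, the annotation mentioned above could look like the sketch below. The alias `CorrFunc` and the helpers `exact_match` and `apply_corr` are hypothetical names used only for illustration; none of them is part of pandas.

```python
from typing import Callable

import numpy as np

# Hypothetical alias: a callable taking two 1d ndarrays and returning a float.
CorrFunc = Callable[[np.ndarray, np.ndarray], float]


def exact_match(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1.0 if the arrays are exactly equal, 0.0 otherwise."""
    return 1.0 if (a == b).all() else 0.0


def apply_corr(func: CorrFunc, a: np.ndarray, b: np.ndarray) -> float:
    """Call a correlation-like function on two arrays."""
    return func(a, b)


score = apply_corr(exact_match, np.array([1, 2]), np.array([1, 2]))
```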


>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
... columns=['dogs', 'cats'])
>>> df.corr(method = histogram_intersection)

This is failing our doctests. Is there an issue with the output?

Also, pep8: no spaces around the =

fixed

>>> s1 = pd.Series([1, 0, 2, 1])
>>> s2 = pd.Series([2, 3, 0, 1])
>>> s1.corr(s2, method = histogram_intersection)
0.416667

Will need to round this, or write it out at full precision.

fixed
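
One way to make such an example stable is to round the result explicitly, as sketched below. This is an illustration under the assumption of a pandas version with callable support, and it is not necessarily how the docstring was changed; the names `raw` and `rounded` are assumptions.

```python
import numpy as np
import pandas as pd


def histogram_intersection(a, b):
    """Sum of element-wise minima after normalizing each input to sum to 1."""
    return np.minimum(
        np.true_divide(a, a.sum()), np.true_divide(b, b.sum())
    ).sum()


s1 = pd.Series([1, 0, 2, 1])
s2 = pd.Series([2, 3, 0, 1])

raw = s1.corr(s2, method=histogram_intersection)

# Rounding avoids depending on the full floating-point representation.
rounded = round(raw, 6)
```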

# simple correlation example
# returns 1 if exact equality, 0 otherwise
my_corr = lambda a, b: 1. if (a == b).all() else 0.

Can you use `result =` and `expected =` here, rather than `expected_1` and such? It's much easier to follow.

done

other than the listed strings for the method argument, accept a callable for generic correlation calculations

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request

Oct 1, 2018

Labels

Algos

Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff

Enhancement