Issue 28956: return list of modes for a multimodal distribution instead of raising a StatisticsError

Created on 2016-12-13 04:21 by sria91, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (15)

msg283071 - (view)

Author: Srikanth Anantharam (sria91) *

Date: 2016-12-13 04:21

Return the minimum of the modes for a multimodal distribution instead of raising a StatisticsError.

msg283085 - (view)

Author: Wolfgang Maier (wolma) *

Date: 2016-12-13 08:50

What's the justification for this proposed change? Isn't it better to report the fact that there isn't an unambiguous result instead of returning a rather arbitrary one?

msg283089 - (view)

Author: Srikanth Anantharam (sria91) *

Date: 2016-12-13 09:35

A better choice would be to return a tuple of values (sliced from the table). And let the user decide which one to use.

Hope that's justifiable...

msg283090 - (view)

Author: Steven D'Aprano (steven.daprano) * (Python committer)

Date: 2016-12-13 09:54

On Tue, Dec 13, 2016 at 09:35:22AM +0000, Srikanth Anantharam wrote:

> A better choice would be to return a tuple of values (sliced from the table). And let the user decide which one to use.

The current mode() function is designed for a very basic use-case, where you have an obvious single mode from discrete data.

The problem with dealing with multiple modes is that it's not easy to tell the difference between a genuinely multi-modal sample and one which just happens to have a few samples with the same value:

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

Assuming the sampling is fair, 8 is clearly the mode; but is it bimodal with 4 the second mode? Or perhaps even four modes, 8, 4, 7 and 9?

I have plans for introducing a binning function to collect data into bins and run statistics on the bins. That might be a better way to deal with multi-modal samples: if you bin the data (for discrete data, use a bin size of 1) and then look at the frequencies, you can decide how many modes there are.
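To make the "look at the frequencies" step concrete, here is a minimal sketch using `collections.Counter` on the sample above. It only counts occurrences per distinct value (a bin size of 1) and prints a crude text histogram; it is not the binning function mentioned above, which does not appear in this thread.

```python
from collections import Counter

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# For discrete data, a bin size of 1 is just a count per distinct value.
frequencies = Counter(data)

# Crude text histogram so the peaks can be judged by eye.
for value in sorted(frequencies):
    print(f"{value}: {'#' * frequencies[value]}")

# The four most frequent values, highest count first.
ranked = frequencies.most_common(4)
print(ranked)
```

The histogram shows one tall bar at 8 and smaller bumps at 4, 7 and 9; whether those bumps count as modes is exactly the judgement call described above.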

Thanks for the suggestion.

msg283091 - (view)

Author: Srikanth Anantharam (sria91) *

Date: 2016-12-13 10:08

Please see the updated pull request PR 50, with the changes.

msg283092 - (view)

Author: Srikanth Anantharam (sria91) *

Date: 2016-12-13 10:17

@steven:

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9] is clearly unimodal with mode 8

data would have been bimodal if 4 repeated exactly the same (7) number of times as 8, like this: data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

in which case the new patch in PR 50 would return a tuple (4, 8)

msg283154 - (view)

Author: Steven D'Aprano (steven.daprano) * (Python committer)

Date: 2016-12-14 01:55

On Tue, Dec 13, 2016 at 10:08:10AM +0000, Srikanth Anantharam wrote:

> Please see the updated pull request PR 50, with the changes.

I'm rejecting that pull request. As I said, mode() intentionally returns only the single, unique mode. I may add a more advanced API or a second function for dealing with multi-modal samples, but even if I do, your suggestion wouldn't be sufficient. Your pull request merely returns the entire list of unique values:

return tuple(value for value, frequency in table)

with no way for the caller to tell which values might be a mode and which are not.
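The effect of that expression can be sketched as follows. This assumes `table` is a list of `(value, frequency)` pairs ordered by descending frequency, as `Counter.most_common()` produces; the actual construction of `table` inside the statistics module is not shown in this thread. The filtered variant is only an illustration of what restricting the result to strict ties would look like, not code from the pull request.

```python
from collections import Counter

# The strictly tied sample from earlier in the thread: 4 and 8 each occur 7 times.
data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# Assumed shape of `table`: (value, frequency) pairs, most frequent first.
table = Counter(data).most_common()

# The expression quoted above: every unique value, mode or not.
everything = tuple(value for value, frequency in table)

# Illustrative alternative: keep only the values tied for the highest count.
max_frequency = table[0][1]
only_tied = tuple(value for value, frequency in table if frequency == max_frequency)

print(len(everything))  # one entry per distinct value
print(only_tied)
```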

(By the way, even if this function behaviour was acceptable, which I stress it is not, this would not be sufficient for me to accept as a patch. You should preferably update the documentation and the tests as well. At the very least, you should update the function's docstring to explain the changed return value.)

I'm sorry that I have to reject this; I am interested in having better support for multiple modes. I'm not closing this issue just yet. If you are interested in continuing the discussion, what would be VERY valuable for me would be for you or some other volunteer to do some research into numerical techniques for objectively determining the number and value of modes, rather than just plotting a graph and subjectively deciding whether a value is a peak or not.

Thanks for your interest.

msg283155 - (view)

Author: Steven D'Aprano (steven.daprano) * (Python committer)

Date: 2016-12-14 02:00

On Tue, Dec 13, 2016 at 10:17:21AM +0000, Srikanth Anantharam wrote:

> @steven:
>
> data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9] is clearly unimodal with mode 8
>
> data would have been bimodal if 4 repeated exactly the same (7) number of times as 8, like this: data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

Bimodal distributions do not require both modes to be exactly the same height. And certainly when you have a sample from a bimodal distribution, you should not expect exactly the same frequency for the two modes. Just from random sampling error you will expect one or the other to have a larger frequency.

You shouldn't take my example too literally. With such a small sample of discrete values, it becomes a (hard) matter of personal judgement. The point I was attempting to make was that identifying sample modes outside of the simplest unimodal case is tricky and requires much thought.
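As a rough illustration of why "tied for the highest count" and "is a peak" are different questions, here is a naive sketch that reports every value whose count is strictly higher than the counts of both neighbouring integers. The helper name `local_peaks` and the neighbour rule are assumptions made for this example only; it is one of many possible definitions, it is sensitive to sampling noise, and it is not one of the objective numerical techniques asked for earlier in the thread.

```python
from collections import Counter

def local_peaks(data):
    """Return integer values whose count exceeds both neighbours' counts.

    Hypothetical helper for illustration: neighbours are value - 1 and
    value + 1, and a value that never occurs has a count of 0.
    """
    counts = Counter(data)
    return [
        value
        for value in sorted(counts)
        if counts[value] > counts[value - 1] and counts[value] > counts[value + 1]
    ]

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
print(local_peaks(data))
```

On this sample the heuristic flags both 4 and 8, even though 8 occurs more than twice as often as 4.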

msg283453 - (view)

Author: Terry J. Reedy (terry.reedy) * (Python committer)

Date: 2016-12-17 00:22

Srikanth, when you reply by email, please remove the quotation of the previous message. On the web page, it is just noise. The only exception should be when you reply to a specific sentence and need to quote that sentence for context.

In my particular experience, mode() is usually reserved for crudely describing unordered categorical data, where the concept of 'minimum' does not apply.

Mode is useful for determining the winner of a vote (or other decision process), but in general, it is not a substitute for a more comprehensive look at a dataset.

Problems with possibly returning a tuple of data items instead of a data item include:

  1. The user then has to be prepared to handle a tuple instead of a data item. It would be better then to always return a tuple, even for 1 item.

  2. Data items can be tuples, making a tuple return ambiguous. Example use case: planar points with int coordinates.

>>> mode(((0,0), (0,0), (0,1)))
(0, 0)

So, while StatisticsError is a nuisance, so are the apparent alternatives. I think we should leave mode alone and close this.
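The ambiguity in point 2 can be shown in a few lines. The `tied_scalars` value below is a hand-written stand-in for what a hypothetical "tuple of tied modes" return might look like; no such return exists in the function under discussion.

```python
from statistics import mode

# Planar points with int coordinates: the single mode is itself a tuple.
points = [(0, 1), (0, 1), (2, 2)]
single_mode = mode(points)

# Hypothetical result for scalar data in which 0 and 1 are tied,
# if ties were reported as a tuple of modes.
tied_scalars = (0, 1)

# The caller cannot tell the two cases apart from the value alone.
print(single_mode == tied_scalars)
```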

msg312303 - (view)

Author: Steven D'Aprano (steven.daprano) * (Python committer)

Date: 2018-02-18 10:08

What makes the minimum mode better than the maximum?

msg312305 - (view)

Author: Srikanth Anantharam (sria91) *

Date: 2018-02-18 10:13

Please review the new PR with tests. I'll update the documentation if the PR is acceptable.

msg337620 - (view)

Author: Henry Chen (scotchka) *

Date: 2019-03-10 16:42

The problem remains that the function can return a number or a list for input that is a list of numbers. This means the user will need to handle both possibilities every time, which is a heavy burden for such a simple function.

SciPy's mode function does return the minimum mode when there is a tie, which as far as I can tell is an arbitrary choice. But in that context, since the input is almost always numerical, a minimum is at least well defined, which is not true for an input with a mix of types.

For the general use case, the current behavior - raising an exception - in case of tie conveys the most information.
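A short sketch of the mixed-type point: in Python 3, `min()` has no answer when the tied candidates are not mutually comparable. The `tied` list is a made-up example of two values assumed to be tied for most common.

```python
# Two values assumed (for illustration) to be tied for most common.
tied = ["red", 3]

try:
    smallest = min(tied)
except TypeError as exc:
    # str and int cannot be ordered against each other in Python 3.
    smallest = None
    print(f"no minimum: {exc}")
```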

msg337622 - (view)

Author: Henry Chen (scotchka) *

Date: 2019-03-10 17:10

Yes, the mode function could ALWAYS return a list, but that breaks backward compatibility, as does the currently proposed change.

msg337625 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2019-03-10 18:04

See the competing proposal and PR at https://bugs.python.org/issue35892 and https://github.com/python/cpython/pull/12089

msg337656 - (view)

Author: Steven D'Aprano (steven.daprano) * (Python committer)

Date: 2019-03-11 11:01

I'm closing this issue in favour of Raymond's #35892. Thank you to everyone; even if your PRs didn't get used, I appreciate your efforts.
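For readers following up: the sketch below assumes Python 3.8 or later, where, to the best of my knowledge, the work tracked in #35892 landed as a `mode()` that returns the first mode encountered and a separate `multimode()` that always returns a list. It will not behave this way on older interpreters.

```python
# Assumes Python 3.8+; multimode() does not exist in earlier versions.
from statistics import mode, multimode

unimodal = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
bimodal = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# multimode() always returns a list, so callers handle one return type.
print(multimode(unimodal))
print(multimode(bimodal))

# mode() no longer raises on a tie; it returns the first mode encountered.
print(mode(bimodal))
```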

History

Date | User | Action | Args
2022-04-11 14:58:40 | admin | set | github: 73142
2019-03-11 11:01:24 | steven.daprano | set | status: open -> closed; resolution: rejected; messages: +; stage: patch review -> resolved
2019-03-10 18:04:21 | rhettinger | set | nosy: + rhettinger; messages: +
2019-03-10 17:10:13 | scotchka | set | messages: +
2019-03-10 16:42:03 | scotchka | set | nosy: + scotchka; messages: +
2018-02-18 10:13:00 | sria91 | set | messages: +; title: return minimum of modes for a multimodal distribution instead of raising a StatisticsError -> return list of modes for a multimodal distribution instead of raising a StatisticsError
2018-02-18 10:08:48 | steven.daprano | set | messages: +
2018-02-18 10:05:23 | sria91 | set | keywords: + patch; stage: patch review; pull_requests: + pull_request5512
2016-12-17 00:22:30 | terry.reedy | set | nosy: + terry.reedy; messages: +
2016-12-14 02:00:57 | steven.daprano | set | messages: +
2016-12-14 01:55:37 | steven.daprano | set | messages: +
2016-12-13 10:17:21 | sria91 | set | messages: +
2016-12-13 10:08:10 | sria91 | set | messages: +
2016-12-13 09:56:46 | sria91 | set | pull_requests: + pull_request4
2016-12-13 09:54:04 | steven.daprano | set | messages: +
2016-12-13 09:35:22 | sria91 | set | messages: +
2016-12-13 08:50:12 | wolma | set | nosy: + steven.daprano, wolma; messages: +; versions: + Python 3.7, - Python 3.5
2016-12-13 04:21:41 | sria91 | create |