msg198938
Author: Adam Davison (Adam.Davison)
Date: 2013-10-04 10:07
If you pass an array containing NaN to collections.Counter, rather than counting the number of NaNs it outputs "nan: 1" n times into the dictionary. I appreciate that using this on an array of floats is a bit of an unusual case, but I don't think this is the expected behaviour based on the documentation.

To reproduce, try e.g.:

a = [1, 1, 1, 2, 'nan', 'nan', 'nan']
collections.Counter(map(float, a))

Based on the documentation I expected to see:

{1.0: 3, 2.0: 1, nan: 3}

But it actually returns:

{1.0: 3, 2.0: 1, nan: 1, nan: 1, nan: 1}

Presumably this relates to the fact that nan != nan. I'm not 100% sure if this is a bug or maybe just something that should be mentioned in the documentation... Certainly it's not what I wanted it to do :)

Thanks,
Adam
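[For illustration, a minimal session not part of the original report: each float('nan') call creates a distinct object, so the mapped list holds three different NaN objects, none equal to any other.]

>>> a = [1, 1, 1, 2, 'nan', 'nan', 'nan']
>>> floats = list(map(float, a))
>>> floats[4] is floats[5]
False
>>> floats[4] == floats[5]
False

Since dict lookup falls back to == when identities differ, each NaN ends up under its own key.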
|
|
msg198939
Author: Mark Dickinson (mark.dickinson)
Date: 2013-10-04 10:51
> Presumably this relates to the fact that nan != nan.

Yep. What you're seeing is pretty much expected behaviour, and it matches how NaNs behave with respect to containment in other Python contexts:

>>> x = float('nan')
>>> y = float('nan')
>>> s = {x}
>>> x in s
True
>>> y in s
False

There's a much-discussed compromise between object model sanity and respect for IEEE 754 here. You can find the discussions on the mailing lists, but the summary is that this isn't going to change in a hurry.

One way you can work around this is to make sure you only have a single NaN object (possibly referenced multiple times) in your list. Then you get the behaviour that you're looking for:

>>> nan = float('nan')
>>> a = [1, 1, 2, nan, nan, nan]
>>> collections.Counter(a)
Counter({nan: 3, 1: 2, 2: 1})

By the way, when you say 'array of floats', do you mean a NumPy ndarray, a standard library array.array object, or a plain Python list? The example you show is a list containing a mixture of ints and strings.

I suggest closing this as 'wont fix'. Raymond?
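[For context, a rough model of the identity-before-equality rule behind the set behaviour above; this sketch is not code from the thread, just an approximation of CPython's documented membership semantics.]

def contains(container, key):
    # Rough model of CPython membership testing: a candidate entry
    # matches if it IS the key or compares equal to it. Identity is
    # checked first, which is why an object always "contains itself"
    # even when, like NaN, it is not equal to itself.
    return any(entry is key or entry == key for entry in container)

x = float('nan')
y = float('nan')
print(contains([x], x))  # True  -- identity match
print(contains([x], y))  # False -- different object, and nan != nan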
|
|
msg198940
Author: Adam Davison (Adam.Davison)
Date: 2013-10-04 10:58
Thanks for the quick response. I'm really using a pandas Series, which as far as I understand is effectively a numpy array behind the scenes; the example I pasted was just to illustrate the behaviour. So the NaNs are being produced elsewhere, and I don't really have control over that step.

It seems like perhaps collections.Counter should handle NaNs as a special case. But I can appreciate the counter-arguments too.

Thanks,
Adam
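[One workaround sketch, using a hypothetical helper that is not from this thread: collapse every NaN onto one canonical object before counting, so Counter sees a single key.]

import math
from collections import Counter

_NAN = float('nan')  # one shared NaN object used as the canonical key

def count_collapsing_nans(values):
    # Replace every NaN (a float that is not equal to itself) with the
    # single canonical object, then count as usual.
    return Counter(_NAN if isinstance(v, float) and math.isnan(v) else v
                   for v in values)

>>> count_collapsing_nans(map(float, [1, 1, 1, 2, 'nan', 'nan', 'nan']))
Counter({1.0: 3, nan: 3, 2.0: 1})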
|
|
msg198947
Author: Mark Dickinson (mark.dickinson)
Date: 2013-10-04 13:19
> perhaps collections.Counter should handle nans as a special case

I don't think that would be a good idea: I'd rather that collections.Counter didn't special-case NaNs in any way, but instead treated NaNs following the same (admittedly somewhat awkward) rules that all the other Python collections do---namely, for NaNs, containment effectively works by object identity.

> I'm really using a pandas Series

Okay, that makes sense. It's a bit unfortunate that NumPy creates a new NaN object every time you read a NaN value out of an array, so that you get, e.g.:

>>> import numpy as np
>>> my_list = [1.2, 2.3, np.nan, np.nan]
>>> my_list[2] is my_list[3]
True
>>> my_array = np.array(my_list)
>>> my_array[2] is my_array[3]
False

Or even:

>>> my_array[2] is my_array[2]
False

I guess you're stuck with using Pandas functionality like `dropna` and `isnull` to deal with missing and non-missing values separately.
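[A sketch of that Pandas approach, assuming a Series named s; dropna and isnull are the calls named above, and value_counts is standard pandas, though treat exact behaviour as version-dependent.]

import pandas as pd

s = pd.Series([1.0, 1.0, 1.0, 2.0, float('nan'), float('nan'), float('nan')])

non_nan_counts = s.dropna().value_counts()  # counts of the real values
nan_count = s.isnull().sum()                # number of missing entries

print(non_nan_counts)  # 1.0 -> 3, 2.0 -> 1
print(nan_count)       # 3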
|
|
msg198980
Author: Terry J. Reedy (terry.reedy)
Date: 2013-10-05 03:09
This is the sort of issue that makes me think that there should be a single NaN object (perhaps an instance of a NaN(float) subclass with a special __eq__ method ;-). Pending that, I agree with closing as "won't fix".
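[A sketch of that idea, hypothetical and deliberately non-IEEE-754-compliant: a float subclass whose instances compare equal to any NaN and share one hash, so containers fold them together.]

import math
from collections import Counter

class NaN(float):
    def __new__(cls):
        return super().__new__(cls, 'nan')
    def __eq__(self, other):
        # Deliberately violates IEEE 754: equal to any NaN.
        return isinstance(other, float) and math.isnan(other)
    def __ne__(self, other):
        return not self.__eq__(other)
    def __hash__(self):
        return hash('nan')  # fixed hash so equal instances share a bucket

print(Counter([NaN(), NaN(), NaN()]))  # Counter({nan: 3})

Mixing these with plain float NaNs in one container would still be inconsistent, since ordinary NaN objects hash differently from each other.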
|
|