Resampler.nunique counting data more than once · Issue #13453 · pandas-dev/pandas

xref: additional example in #13795

Pandas Resampler.nunique appears to be putting the same data in multiple bins:

import pandas as pd

# Create a series with a datetime index
index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)

# Since all elements are unique, count and nunique should give the same result
count = s.resample('M').count()
nunique = s.resample('M').nunique()

In pandas 0.18.0 and 0.18.1 the two do not give the same result, although they should:

In [3]: count
Out[3]:
2000-01-31    744
2000-02-29    337
2000-03-31      0
2000-04-30    384
2000-05-31    337
Freq: M, dtype: int64

In [4]: nunique
Out[4]:
2000-01-31    337
2000-02-29    744
2000-03-31      0
2000-04-30    744
2000-05-31    337
Freq: M, dtype: int64
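
To make the mismatch concrete, the per-bin numbers can be summed by hand. The snippet below only copies the values printed in Out[3] and Out[4] above. The list names are introduced here for illustration.

```python
# Values copied verbatim from Out[3] (count) and Out[4] (nunique) above
count_bins = [744, 337, 0, 384, 337]
nunique_bins = [337, 744, 0, 744, 337]

total_count = sum(count_bins)      # matches len(s)
total_nunique = sum(nunique_bins)  # larger than len(s)

print(total_count, total_nunique, total_nunique - total_count)
# prints: 1802 2162 360
```

The January and February values also appear swapped between the two outputs, and the April bin reports 744 distinct values where count sees only 384 rows.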

In pandas 0.17.0 and 0.17.1 (adjusting to the old-style resample syntax), the nunique call fails with "ValueError: Wrong number of items passed 4, placement implies 5", raised somewhere in the depths of internals.py. If I go back to 0.16.2, I do get the same result for count and nunique.

I'm not sure what's going on here. Since the nunique results sum to more than the length of the series, it appears some data is being counted more than once.
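
For comparison, a reference result can be computed without going through the resampler at all, by grouping on the calendar month of each timestamp. This is a minimal sketch, assuming the same series as in the example above. The names `by_month`, `ref_count` and `ref_nunique` are introduced here for illustration, and this is a cross-check rather than a fix or a statement about how the resampler works internally.

```python
import pandas as pd

# Same series as in the report: two hourly ranges with a gap in between
index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)

# Group on the month of each timestamp instead of resampling.
# Unlike resample, groupby only returns months that actually contain data,
# so the empty March bin does not appear here.
by_month = s.groupby(s.index.to_period('M'))
ref_count = by_month.count()
ref_nunique = by_month.nunique()

# All values are unique, so the two must agree bin by bin
# and both must sum to the length of the series.
assert (ref_count == ref_nunique).all()
assert ref_nunique.sum() == len(s)
print(ref_count)
```

If the grouped result agrees with `count` but not with `Resampler.nunique`, that would point at how the resampler bins the data rather than at `nunique` itself.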


In [19]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None