Resampler.nunique counting data more than once · Issue #13453 · pandas-dev/pandas
xref addtl example in #13795
Pandas Resampler.nunique appears to be putting the same data in multiple bins:
import pandas as pd
Create a series with a datetime index
index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)
Since all elements are unique, count and nunique should give the same result
count = s.resample('M').count()
nunique = s.resample('M').nunique()
In pandas 0.18.1 and 0.18.0 these give different results, when they should be identical:
In [3]: count
Out[3]:
2000-01-31 744
2000-02-29 337
2000-03-31 0
2000-04-30 384
2000-05-31 337
Freq: M, dtype: int64
In [4]: nunique
Out[4]:
2000-01-31 337
2000-02-29 744
2000-03-31 0
2000-04-30 744
2000-05-31 337
Freq: M, dtype: int64
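The two outputs above can be checked against the length of the series with plain arithmetic. This is a minimal sketch that uses only the numbers reported in this issue; the helper name `hourly_points` is made up for illustration and is not part of pandas.

```python
from datetime import datetime

def hourly_points(start, end):
    """Number of hourly timestamps from start to end, both inclusive."""
    return int((end - start).total_seconds() // 3600) + 1

# Length of the series built from the two hourly ranges in the report.
series_len = (
    hourly_points(datetime(2000, 1, 1), datetime(2000, 2, 15))
    + hourly_points(datetime(2000, 4, 15), datetime(2000, 5, 15))
)

# Per-bin values copied from Out[3] and Out[4] above.
reported_count = {"2000-01-31": 744, "2000-02-29": 337, "2000-03-31": 0,
                  "2000-04-30": 384, "2000-05-31": 337}
reported_nunique = {"2000-01-31": 337, "2000-02-29": 744, "2000-03-31": 0,
                    "2000-04-30": 744, "2000-05-31": 337}

count_total = sum(reported_count.values())
nunique_total = sum(reported_nunique.values())

print("series length:", series_len)
print("count total:  ", count_total)
print("nunique total:", nunique_total)
print("excess in nunique:", nunique_total - series_len)

# Bins where the two reported results disagree.
for label in reported_count:
    if reported_count[label] != reported_nunique[label]:
        print(label, reported_count[label], reported_nunique[label])
```

The count total matches the series length, while the nunique total does not, and the disagreement is limited to the January, February and April bins.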
In pandas 0.17.0 and 0.17.1 (adjusting to old style resample syntax), the nunique one fails due to a "ValueError: Wrong number of items passed 4, placement implies 5" somewhere in the depths of internals.py. If I go back to 0.16.2, I do get the same result for each.
I'm not sure what's going on here. Since the nunique results sum to more than the length of the series, it appears data is being counted more than once.
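One way to narrow this down is to compare the resampled results against a reference that does not go through the resampler at all. The sketch below is a diagnostic, not a fix: it assumes that grouping by `s.index.to_period('M')` is a trustworthy reference, and it passes `pd.offsets.MonthEnd()` instead of the `'M'` string only so that the same code runs across pandas versions where the month-end alias differs. The variable names `reference`, `aligned` and `suspect_bins` are illustrative.

```python
import pandas as pd

# Same data as in the report: two hourly ranges with a gap in between.
index = pd.date_range('2000-01-01', '2000-02-15', freq='h')
index2 = pd.date_range('2000-04-15', '2000-05-15', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)

# Month-end bins, given as an offset object rather than a string alias.
month_end = pd.offsets.MonthEnd()
count = s.resample(month_end).count()
nunique = s.resample(month_end).nunique()

# Reference: unique values per calendar month, computed with a plain groupby.
reference = s.groupby(s.index.to_period('M')).nunique()

# Every value is unique, so each of these totals should equal len(s).
print("len(s):         ", len(s))
print("count total:    ", count.sum())
print("nunique total:  ", nunique.sum())
print("reference total:", reference.sum())

# Put the resampled results on the same monthly labels as the reference;
# the empty March bin has no reference group, so treat it as 0.
nunique_by_month = nunique.copy()
nunique_by_month.index = nunique.index.to_period('M')
aligned = reference.reindex(nunique_by_month.index, fill_value=0)

# Bins where Resampler.nunique disagrees with the groupby reference.
suspect_bins = nunique_by_month[nunique_by_month != aligned]
print(suspect_bins)
```

On a version affected by this issue, `suspect_bins` should list the bins that are counted incorrectly; on a version where the two methods agree it is empty.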
In [19]: pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None