Reading with read_stata in chunks messes up categories · Issue #31544 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

In [2]: df = pd.DataFrame({'col{}'.format(k) : pd.Categorical(['a_label'] + ['another_label']*500) for k in range(2)})

In [3]: df.dtypes
Out[3]:
col0    category
col1    category
dtype: object

In [4]: df.dtypes[0]
Out[4]: CategoricalDtype(categories=['a_label', 'another_label'], ordered=False)

In [5]: df.to_stata('/tmp/stata_test.dta', write_index=False)

In [6]: pd.read_stata('/tmp/stata_test.dta').dtypes
Out[6]:
col0    category
col1    category
dtype: object

... that's good

In [7]: reader = pd.read_stata('/tmp/stata_test.dta', chunksize=100)

In [8]: reader.value_labels()
Out[8]: {'col0': {0: 'a_label', 1: 'another_label'}, 'col1': {0: 'a_label', 1: 'another_label'}}

... still all good

In [9]: out_chunks = [chunk for chunk in reader]

In [10]: out_chunks[1].dtypes[0]
Out[10]: CategoricalDtype(categories=['another_label'], ordered=True)

Ooops... where's the other label gone?

In [11]: reader.close()

In [12]: all_together = pd.concat(out_chunks)

In [13]: all_together.dtypes[0]
Out[13]: dtype('O')

Ouch!

Problem description

My data has categories, but they are lost merely because I read it in chunks. I noticed this while reading a large dataset in chunks because I needed only a subset of its columns: ironically, reading in chunks is precisely what made memory usage explode when I concatenated the chunks again.
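
As a minimal sketch of why this hurts (it uses small in-memory Series rather than a Stata file, and all variable names are illustrative): concatenating categoricals whose categories differ gives up the categorical dtype, while concatenating categoricals that share one dtype keeps it, and a non-categorical column of repeated strings takes far more memory than its categorical equivalent.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Two "chunks": the second one only ever saw one of the two labels.
full = pd.Series(pd.Categorical(["a_label", "another_label"]))
partial = pd.Series(pd.Categorical(["another_label", "another_label"]))

# Categories differ, so the result is no longer categorical.
mismatched = pd.concat([full, partial], ignore_index=True)
print(mismatched.dtype)

# Casting both pieces to one shared dtype first keeps the result categorical.
shared = CategoricalDtype(["a_label", "another_label"])
matched = pd.concat(
    [full.astype(shared), partial.astype(shared)], ignore_index=True
)
print(matched.dtype)

# Memory cost of the same data stored as category versus object.
values = ["a_label"] + ["another_label"] * 500
as_category = pd.Series(pd.Categorical(values))
as_object = as_category.astype(object)
category_bytes = as_category.memory_usage(deep=True)
object_bytes = as_object.memory_usage(deep=True)
print(category_bytes, object_bytes)
```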

And by the way, Out[8] shows that pandas is aware of the actual categories even before iterating, so this is the information that should be used to recreate them consistently, and all chunks should have exactly the same (as in `is`) categorical dtype.
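
Until that happens, a possible workaround is to build the dtypes from `value_labels()` myself and cast every chunk to them. This is only a sketch: `read_stata_chunks_consistent` is a hypothetical helper, not a pandas function, and it assumes that each label set is named after its column (as in Out[8]), that every value in those columns has a label, and that the labels within a set are unique.

```python
import os
import tempfile

import pandas as pd
from pandas.api.types import CategoricalDtype


def read_stata_chunks_consistent(path, chunksize):
    """Read a .dta file in chunks, giving every chunk the same categorical dtypes.

    Hypothetical helper, not part of pandas.
    """
    chunks = []
    with pd.read_stata(path, chunksize=chunksize) as reader:
        # The complete label sets are available before any chunk is read.
        labels = reader.value_labels()
        # Assumption: label sets are keyed by column name and cover all values.
        dtypes = {
            name: CategoricalDtype(
                [mapping[code] for code in sorted(mapping)], ordered=True
            )
            for name, mapping in labels.items()
        }
        for chunk in reader:
            for name, dtype in dtypes.items():
                if name in chunk.columns:
                    # Values already in the chunk are kept; only the set of
                    # categories is widened to the full one.
                    chunk[name] = chunk[name].astype(dtype)
            chunks.append(chunk)
    return chunks


# Same data as in the session above, written to a temporary file.
df = pd.DataFrame(
    {f"col{k}": pd.Categorical(["a_label"] + ["another_label"] * 500) for k in range(2)}
)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stata_test.dta")
    df.to_stata(path, write_index=False)
    out_chunks = read_stata_chunks_consistent(path, chunksize=100)

all_together = pd.concat(out_chunks, ignore_index=True)
print(out_chunks[1]["col0"].dtype)
print(all_together.dtypes)
```

The dtypes are created with `ordered=True` to match what the chunks in Out[10] report; the chunks end up with equal dtypes, though not necessarily the very same object.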

Expected Output

Out[10] should feature both categories, and Out[13] should still be a categorical.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-6-amd64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8

pandas : 1.1.0.dev0+276.g2495068ad
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 18.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : 4.6.3
hypothesis : 3.71.11
sphinx : 1.8.4
blosc : 1.7.0
feather : None
xlsxwriter : 0.9.3
lxml.etree : 4.3.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.7 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.2
matplotlib : 3.0.2
numexpr : 2.6.9
odfpy : None
openpyxl : 2.4.9
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 4.6.3
pyxlsb : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.18
tables : 3.4.4
tabulate : 0.8.3
xarray : 0.11.3
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : 0.9.3
numba : 0.45.0