BUG: pandas 1.1.0 MemoryError using .astype("string") which worked using pandas 1.0.5 · Issue #35499 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
I tried to pinpoint the specific row which causes the error. The column has HTML data like this:
'''<div class="comment" dir="auto"><p dir="auto">Request <a href="/agent/tickets/test" target="_blank" rel="ticket">#test</a> "RE: ** M ..." Last comment in request </a>:</p> <p dir="auto"></p> <p dir="auto">Thank you</p> <p dir="auto">Stuff.</p> <p dir="auto">We will keep you posted .</p> <p dir="auto">Regards,</p> <p dir="auto">Name</p></div>'''
# This fails (memory error below):
df['event_html_body'].astype("string")

# Filtering the dataframe to only convert rows which are not null for this field works:
x = df[~df.event_html_body.isnull()][['event_html_body']]
x['event_html_body'].astype("string")

# Filling NAs with another value fails:
df['event_html_body'].fillna('-').astype("string")
Problem description
I have code which has been converting columns to the "string" dtype, and this worked up until pandas 1.1.0.
For example, I tried to process a file which I successfully processed in April. It works when I use .astype(str), but it fails when I use .astype("string"), even though this worked in pandas 1.0.5.
The column does not need to be the new "string" type, but I wanted to raise this issue anyways.
Rows: 201368
Empty/null rows for the column in question: 189014 / 201368
So this column is quite sparse, and as I mentioned above, if I filter out the nulls and then call .astype("string"), it runs fine. I am not sure why this worked before (same server, 64 GB of RAM); this file was previously processed as "string" before the update.
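A possible workaround, based on the observation that converting only the non-null rows succeeds, is sketched below. The small DataFrame is a hypothetical stand-in for the real data (only the column name event_html_body comes from this report), and the sketch assumes the index is unique so that reindex can put the rows back in place. It has not been checked against the original file.

```python
import pandas as pd

# Hypothetical stand-in for the real data: a sparse column of HTML bodies.
df = pd.DataFrame(
    {
        "event_html_body": [
            None,
            "<div>" + "x" * 50 + "</div>",
            None,
            "<p>Thank you</p>",
        ]
    }
)

col = df["event_html_body"]

# Convert only the non-null rows, then restore the original row order.
# Rows that were dropped come back as missing values in the "string" dtype.
# Assumption: col.index has no duplicate labels.
converted = col.dropna().astype("string").reindex(col.index)

df["event_html_body"] = converted
print(df["event_html_body"].dtype)
print(df["event_html_body"].isna().sum())
```

This keeps the column in the "string" dtype while converting far fewer rows at once; whether that is enough headroom depends on the longest value in the column.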
Error:
MemoryError Traceback (most recent call last)
<ipython-input-38-939f88862e64> in <module>
----> 1 df['event_html_body'].astype("string")
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
5535 else:
5536 # else, only a single dtype is given
-> 5537 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
5538 return self._constructor(new_data).__finalize__(self, method="astype")
5539
/opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
565 self, dtype, copy: bool = False, errors: str = "raise"
566 ) -> "BlockManager":
--> 567 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
568
569 def convert(
/opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
394 applied = b.apply(f, **kwargs)
395 else:
--> 396 applied = getattr(b, f)(**kwargs)
397 result_blocks = _extend_blocks(applied, result_blocks)
398
/opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
588 vals1d = values.ravel()
589 try:
--> 590 values = astype_nansafe(vals1d, dtype, copy=True)
591 except (ValueError, TypeError):
592 # e.g. astype_nansafe can fail on object-dtype of strings
/opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
911 # dispatch on extension dtype if needed
912 if is_extension_array_dtype(dtype):
--> 913 return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
914
915 if not isinstance(dtype, np.dtype):
/opt/conda/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _from_sequence(cls, scalars, dtype, copy)
215
216 # convert to str, then to object to avoid dtype like '<U3', then insert na_value
--> 217 result = np.asarray(result, dtype=str)
218 result = np.asarray(result, dtype="object")
219 if has_nans:
/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
MemoryError: Unable to allocate array with shape (201368,) and data type <U131880
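A rough size estimate may explain why the full column fails while the filtered one works. The traceback shows np.asarray(result, dtype=str), which builds a fixed-width unicode array: every element is padded to the width of the longest string, and NumPy stores 4 bytes per character. The sketch below first shows that behavior on a tiny made-up array, then applies the same arithmetic to the numbers from this report (shape 201368, dtype <U131880, 189014 null rows). It assumes the filtered conversion goes through the same code path with the same maximum string length.

```python
import numpy as np

# Small-scale illustration (made-up values): one long string sets the
# width of every element in a fixed-width unicode array.
values = np.array(["x" * 1000, "short", "nan"], dtype=object)
fixed = np.asarray(values, dtype=str)
print(fixed.dtype, fixed.itemsize)  # width 1000, 4 bytes per character

# Numbers taken from the report and the error message.
rows_total = 201368
rows_null = 189014
max_chars = 131880

# Bytes per element for a unicode array of that width.
itemsize = np.dtype(f"<U{max_chars}").itemsize

# Temporary array size when converting every row (including nulls or
# fill values, which are padded to the same width).
full_bytes = rows_total * itemsize

# Temporary array size when only the non-null rows are converted.
non_null_bytes = (rows_total - rows_null) * itemsize

print(f"all rows:      {full_bytes / 1e9:.1f} GB")
print(f"non-null rows: {non_null_bytes / 1e9:.1f} GB")
```

Under these assumptions the full column needs roughly 106 GB for the intermediate array, which exceeds the 64 GB on the server, while the 12354 non-null rows need roughly 6.5 GB. That would also be consistent with fillna('-') failing, since it keeps all 201368 rows.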
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0