BUG: pandas 1.1.0 MemoryError using .astype("string") which worked using pandas 1.0.5 · Issue #35499 · pandas-dev/pandas


Code Sample, a copy-pastable example

I tried to pinpoint the specific row which causes the error. The column has HTML data like this:

'''<div class="comment" dir="auto"><p dir="auto">Request <a href="/agent/tickets/test" target="_blank" rel="ticket">#test</a> "RE: ** M ..." Last comment in request </a>:</p> <p dir="auto"></p> <p dir="auto">Thank you</p> <p dir="auto">Stuff.</p> <p dir="auto">We will keep you posted .</p> <p dir="auto">Regards,</p> <p dir="auto">Name</p></div>'''

# This fails (memory error below):
df['event_html_body'].astype("string")

# Filtering the dataframe to convert only the rows which are not null for this field works:
x = df[~df.event_html_body.isnull()][['event_html_body']]
x['event_html_body'].astype("string")

# Filling NAs with another value fails:
df['event_html_body'].fillna('-').astype("string")

Problem description

I have code which has been converting columns to the "string" dtype, and this worked up until pandas 1.1.0.

For example, I tried to process a file which I successfully processed in April. It works when I use .astype(str), but it fails when I use .astype("string"), even though this worked in pandas 1.0.5.

The column does not need to be the new "string" type, but I wanted to raise this issue anyway.

Rows: 201368
Empty/null rows for the column in question: 189014 / 201368

So this column is quite sparse, and as I mentioned above, if I filter out the nulls and then call .astype("string"), it runs fine. I am not sure why this worked before (same server, 64 GB of RAM); this file was previously processed as "string" before the update.
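
For anyone hitting the same thing, here is a minimal sketch of the filter-then-convert workaround described above, wrapped so that the null rows come back as <NA>. The helper name to_string_dtype_sparse and the demo Series are made up for illustration, and it assumes the index has no duplicate labels (reindex needs that). It still builds a padded intermediate array on pandas 1.1.0, only over the non-null rows, so it reduces the allocation rather than removing it.

```python
import pandas as pd


def to_string_dtype_sparse(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast only the non-null values to "string",
    then restore the dropped rows as missing values."""
    # Convert just the rows that actually hold text.
    converted = series.dropna().astype("string")
    # reindex puts the dropped labels back; for the "string" dtype the
    # newly added positions are filled with pd.NA.
    return converted.reindex(series.index)


# Small stand-in for the sparse HTML column from the report.
demo = pd.Series(
    ["<div>Thank you</div>", None, None, "<p>Regards</p>"],
    name="event_html_body",
)
result = to_string_dtype_sparse(demo)
print(result.dtype)
print(int(result.isna().sum()), "missing values kept")
```

On the real data this would be called as to_string_dtype_sparse(df['event_html_body']).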

Error:


MemoryError                               Traceback (most recent call last)
<ipython-input-38-939f88862e64> in <module>
----> 1 df['event_html_body'].astype("string")

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5535         else:
   5536             # else, only a single dtype is given
-> 5537             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
   5538             return self._constructor(new_data).__finalize__(self, method="astype")
   5539 

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    565         self, dtype, copy: bool = False, errors: str = "raise"
    566     ) -> "BlockManager":
--> 567         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    568 
    569     def convert(

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
    394                 applied = b.apply(f, **kwargs)
    395             else:
--> 396                 applied = getattr(b, f)(**kwargs)
    397             result_blocks = _extend_blocks(applied, result_blocks)
    398 

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    588             vals1d = values.ravel()
    589             try:
--> 590                 values = astype_nansafe(vals1d, dtype, copy=True)
    591             except (ValueError, TypeError):
    592                 # e.g. astype_nansafe can fail on object-dtype of strings

/opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    911     # dispatch on extension dtype if needed
    912     if is_extension_array_dtype(dtype):
--> 913         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
    914 
    915     if not isinstance(dtype, np.dtype):

/opt/conda/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _from_sequence(cls, scalars, dtype, copy)
    215 
    216         # convert to str, then to object to avoid dtype like '<U3', then insert na_value
--> 217         result = np.asarray(result, dtype=str)
    218         result = np.asarray(result, dtype="object")
    219         if has_nans:

/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

MemoryError: Unable to allocate array with shape (201368,) and data type <U131880
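
To put the error message in perspective, here is a rough calculation of how large the requested array is. The shape and dtype come straight from the MemoryError above; the only outside fact used is that NumPy's fixed-width "U" dtype stores 4 bytes per character and pads every element to the longest string. This is a back-of-the-envelope estimate, not a measurement.

```python
# Figures taken from the MemoryError message above.
n_rows = 201_368       # shape (201368,)
max_chars = 131_880    # dtype <U131880: every element is padded to this width
bytes_per_char = 4     # NumPy "U" strings use 4-byte (UCS-4) code points

total_bytes = n_rows * max_chars * bytes_per_char
print(f"{total_bytes:,} bytes")
print(f"{total_bytes / 1e9:.1f} GB")

# The server in this report has 64 GB of RAM.
server_ram_bytes = 64 * 10**9
print("fits in RAM:", total_bytes <= server_ram_bytes)
```

So a single very long HTML body forces every one of the 201368 rows, including the 189014 empty ones, to reserve 131880 characters, which is why the allocation exceeds 64 GB.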

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0