ENH: Allow different dtype in pandas.Series.str.get_dummies

Feature Type

Problem Description

pandas.Series.str.get_dummies currently always returns columns of dtype numpy.int64. It would be nice if a different dtype could be specified.

Feature Description

Add a new parameter to str.get_dummies (for example, dtype) that sets the dtype of the returned indicator columns.
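To make the request concrete, here is a minimal sketch of the behavior such a parameter could have, written as a standalone helper rather than a change to pandas itself. The function name `get_dummies_with_dtype` and its `dtype` argument are hypothetical and only illustrate the idea; this is not how pandas implements `str.get_dummies` internally.

```python
import numpy as np
import pandas as pd


def get_dummies_with_dtype(series, sep="|", dtype=np.int64):
    """Hypothetical helper: like Series.str.get_dummies, but with a chosen dtype."""
    # Split each string into its tags; missing values contribute no tags.
    tag_lists = [
        [] if pd.isna(value) else [t for t in str(value).split(sep) if t]
        for value in series
    ]
    # Sorted unique tags become the output columns.
    columns = sorted({tag for tags in tag_lists for tag in tags})
    position = {tag: i for i, tag in enumerate(columns)}
    # Allocate the indicator matrix directly in the requested dtype.
    dummies = np.zeros((len(series), len(columns)), dtype=dtype)
    for row, tags in enumerate(tag_lists):
        for tag in tags:
            dummies[row, position[tag]] = 1
    return pd.DataFrame(dummies, index=series.index, columns=columns)


s = pd.Series(["a|b", "a", None, "b|c"])
small = get_dummies_with_dtype(s, dtype=np.uint8)
print(small.dtypes)
```

With `dtype=np.uint8` each cell takes 1 byte instead of 8, which is the saving this request is after. The pure-Python loop is for illustration only and would be slow on large inputs.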

Alternative Solutions

N/A

Additional Context

pandas.Series.str.get_dummies is the easiest way in pandas to do multi-hot encoding, so it would be great if it supported more data types. The int64 dtype it currently uses can easily cause out-of-memory (OOM) errors on large inputs. I ran into exactly this problem, which is what prompted this request:

Traceback (most recent call last):
  File "D:\CodeSpace\comp9727-assn2\preprocessing.py", line 13, in <module>
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 101, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 1919, in get_dummies
    result, name = self._data.array._str_get_dummies(sep)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\object_array.py", line 369, in _str_get_dummies
    dummies = np.empty((len(arr), len(tags2)), dtype=np.int64)
numpy.core._exceptions.MemoryError: Unable to allocate 25.8 GiB for an array with shape (231637, 14942) and data type int64
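
For scale, the allocation size reported in the traceback can be reproduced with simple arithmetic, and the same calculation shows roughly what a 1-byte dtype would need for the same shape. This is only a back-of-the-envelope sketch; `dense_size_gib` is a hypothetical helper name, and the shape is taken from the error message above.

```python
import numpy as np

# Shape reported in the MemoryError above.
rows, cols = 231637, 14942


def dense_size_gib(dtype):
    """Size in GiB of a dense (rows, cols) array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize / 2**30


for dtype in (np.int64, np.int8, np.bool_):
    print(f"{np.dtype(dtype).name}: {dense_size_gib(dtype):.1f} GiB")
```

The int64 case comes out at 25.8 GiB, matching the traceback, while int8 or bool would need about 3.2 GiB for the same array.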