ENH: add to_records() option to output NumPy string dtypes, not objects · Issue #18146 · pandas-dev/pandas

DataFrame.to_records() outputs string columns with the object dtype, which can be inefficient (e.g. for short, similar-length strings, or when storing with np.save(), which has to pickle object arrays). I wrote the following function to work around this:

def to_records_plain(df):
    """Return a NumPy recarray like df.to_records() but with strings stored as bytes, not objects.
    This gives more compact storage and does not require pickling objects when saving to disk.
    Assumes all object arrays in df are strings.

    >>> df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.9], 'c': ['x', 'yyy']})
    >>> to_records_plain(df)
    rec.array([(0, 1,  0.5, b'x'), (1, 2,  0.9, b'yyy')], 
              dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<f8'), ('c', 'S3')])
    """
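    records = df.to_records()
    # dtype.descr is a list of (field name, dtype string) pairs; edit it in place.
    descr = records.dtype.descr
    for ii, (name, dtype) in enumerate(descr):
        if dtype == '|O':
            # Size the bytes field to the longest string in the column.
            # int() guards against a float result when the column contains NaN.
            length = int(df[name].str.len().max())
            descr[ii] = (name, 'S{}'.format(length))

    # Cast every field to its new dtype; object fields become fixed-width bytes.
    return records.astype(descr)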

I suggest exposing something like this as an option in DataFrame.to_records(). An option to convert to Unicode ('U') would be good too, since NumPy's 'S' is effectively bytes in Python 3.
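
To make the suggestion concrete, here is a rough sketch of what such an option might do. The function name to_records_fixed and its kind parameter are made up for illustration and are not part of pandas. It assumes every object column contains only Python str values, and with kind='S' it assumes they are ASCII. Unlike the function above, it measures string lengths on the record field itself, so an object-dtype index is covered as well.

```python
import io

import numpy as np
import pandas as pd


def to_records_fixed(df, kind="U"):
    """Hypothetical helper: df.to_records() with object fields as fixed-width strings.

    kind="U" produces NumPy Unicode fields, kind="S" produces bytes fields.
    Assumes every object field holds only Python str values.
    """
    if kind not in ("U", "S"):
        raise ValueError("kind must be 'U' or 'S'")
    records = df.to_records()
    descr = records.dtype.descr
    for ii, (name, dtype) in enumerate(descr):
        if dtype == "|O":
            # Measure on the record field, which also covers a string index.
            longest = max((len(v) for v in records[name]), default=1)
            # Use a width of at least 1 to avoid a zero-width field.
            descr[ii] = (name, "{}{}".format(kind, max(longest, 1)))
    return records.astype(descr)


df = pd.DataFrame({"a": [1, 2], "b": [0.5, 0.9], "c": ["x", "yyy"]})
rec = to_records_fixed(df, kind="U")

# With no object fields left, the array can be saved with pickling disabled.
buf = io.BytesIO()
np.save(buf, rec, allow_pickle=False)
buf.seek(0)
loaded = np.load(buf, allow_pickle=False)
```

The round trip through np.save() with allow_pickle=False is the main point: that call raises an error for arrays that still contain object fields, so succeeding here shows the converted records are plain NumPy data.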