Unicode : change df.to_string() and friends to always return unicode objects · Pull Request #2224 · pandas-dev/pandas
closes #2225
Note: Although all the tests pass with minor fixes, this PR has an above-average chance of
breaking things for people who have relied on broken behaviour thus far.
`df.tidy_repr` combines several strings to produce a result. When one component is unicode and the other is a non-ascii bytestring, it tries to convert the latter back to a unicode string using the 'ascii' codec and fails.
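
A minimal sketch of that failure mode, not taken from the pandas code base. It is written for Python 3, where mixing text and bytes is rejected outright, so the ascii decode that Python 2 performs implicitly is spelled out by hand. The helper name `implicit_join` and the sample strings are assumptions made for illustration.

```python
# Two components of a repr: one text, one non-ASCII bytestring.
text_part = u"caf\xe9"                    # unicode object
byte_part = u"caf\xe9".encode("utf-8")    # bytestring containing non-ASCII bytes


def implicit_join(u, b):
    """Hypothetical helper: mimic Python 2's implicit bytes -> unicode
    coercion, which always uses the 'ascii' codec."""
    return u + b.decode("ascii")


try:
    implicit_join(text_part, byte_part)
    failed = False
except UnicodeDecodeError:
    # The ascii codec cannot decode the bytes 0xc3 0xa9.
    failed = True

# Decoding explicitly with the codec the bytes were written in works.
joined = text_part + byte_part.decode("utf-8")
```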
I suggest that `_get_repr` -> `to_string` should always return unicode, as implemented by this PR, and that the `force_unicode` argument be deprecated everywhere.
The `force_unicode` argument in `to_string` conflates two things:
- which codec to use to decode the string (which can only be a hopeful guess)
- whether to return a unicode() object or a str() object
The first is no longer necessary, since `pprint_thing` already resorts to the same hack of using utf-8 (with errors='replace') as a fallback.
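
The fallback can be sketched roughly as follows. This is an illustration of the technique only, not the actual `pprint_thing` implementation; the helper name `to_text` is hypothetical, and the code targets Python 3, where `str` is the text type that Python 2 calls `unicode`.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: coerce any value to text.

    Bytes are decoded with a guessed codec; undecodable bytes become
    U+FFFD instead of raising UnicodeDecodeError.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    if isinstance(value, str):
        return value
    return str(value)


good = to_text(b"caf\xc3\xa9")   # valid utf-8 decodes cleanly
bad = to_text(b"\xff")           # invalid utf-8 is replaced, not fatal
num = to_text(3)                 # non-string values are stringified
```

The guess can still be wrong for bytes written in another codec, but the result is always text and the call never raises on bad input.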
I believe making the latter optional is wrong, precisely because it brings about situations like the test case above. `to_string`, like all internal functions, should utilize unicode objects whenever feasible.
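
As a rough illustration of that design choice (again a sketch under assumptions, not pandas code): a formatter that normalises every component to text before joining, always returns text, and leaves encoding to the caller at the output boundary. The name `render_row` and the sample cells are hypothetical; the code targets Python 3.

```python
def render_row(cells, sep=u" | "):
    """Hypothetical formatter: always returns text, never bytes."""
    parts = []
    for cell in cells:
        if isinstance(cell, bytes):
            # Hopeful guess at the codec, with a non-fatal fallback.
            cell = cell.decode("utf-8", errors="replace")
        elif not isinstance(cell, str):
            cell = str(cell)
        parts.append(cell)
    return sep.join(parts)


row = render_row([u"na\xefve", u"caf\xe9".encode("utf-8"), 42])

# Encode once, where the text leaves the program (file, terminal, socket).
encoded = row.encode("utf-8")
```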