Unicode: change df.to_string() and friends to always return unicode objects · Pull Request #2224 · pandas-dev/pandas

closes #2225

Note: Although all the tests pass with minor fixes, this PR has an above-average chance of
breaking things for people who have relied on broken behaviour thus far.

df.tidy_repr combines several strings to produce a result. When one component is unicode
and the other is a non-ascii bytestring, it tries to convert the latter to a unicode string
using the 'ascii' codec and fails.
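
The failure described above can be reproduced in isolation. The following is a minimal sketch, not pandas code: `combine_like_python2` is a hypothetical helper that spells out the implicit coercion Python 2 performs when a unicode object is combined with a bytestring, and the sample strings are made up for illustration.

```python
# Hypothetical illustration; none of these names come from pandas.
text_part = u"caf\u00e9"                  # a unicode component
byte_part = u"caf\u00e9".encode("utf-8")  # a non-ascii bytestring component


def combine_like_python2(text, raw):
    # Python 2 does this decode implicitly when evaluating `text + raw`.
    return text + raw.decode("ascii")


try:
    combine_like_python2(text_part, byte_part)
    failed = False
except UnicodeDecodeError:
    # The 'ascii' codec cannot decode the bytes that encode the accented "e".
    failed = True
```

With pure-ascii bytes the same call succeeds, which is presumably why the problem only shows up once non-ascii data is involved.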

I suggest that _get_repr -> to_string should always return unicode, as implemented by this PR,
and that the force_unicode argument be deprecated everywhere.

The force_unicode argument in to_string conflates two things:

The first is no longer necessary, since pprint_thing already resorts to the same hack
of using utf-8 (with errors='replace') as a fallback.
I believe making the latter optional is wrong, precisely because it brings about situations
like the test case above.
to_string, like all internal functions, should use unicode objects whenever feasible.
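
As a rough sketch of that principle (not the actual implementation in this PR), every component can be normalized to unicode before joining, using utf-8 with errors='replace' as the fallback for bytestrings. The names `as_unicode` and `join_components` are hypothetical and do not exist in pandas.

```python
# Hypothetical helpers; they only illustrate the "unicode internally" idea.
def as_unicode(value, encoding="utf-8"):
    # Decode bytestrings, replacing undecodable bytes instead of raising.
    if isinstance(value, bytes):
        return value.decode(encoding, "replace")
    return value


def join_components(parts):
    # Every part is unicode by the time it is joined, so no implicit
    # 'ascii' decoding can happen.
    return u"".join(as_unicode(p) for p in parts)


result = join_components([u"caf\u00e9 ", u"caf\u00e9".encode("utf-8"), b" \xff"])
```

The stray `\xff` byte is not valid utf-8, so it comes back as the replacement character U+FFFD rather than raising an exception. The trade-off is that undecodable input gets silently altered, which is acceptable for a repr but may not be for data that has to round-trip.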