BUG: to_clipboard text truncated for Python 3 on Windows for UTF-16 text by david-liu-brattle-1 · Pull Request #25040 · pandas-dev/pandas
- closes #xxxx
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
For Windows users whose Python strings behave as UCS-4 (primarily Python 3, where len() counts code points), tables copied to the clipboard are missing data from the end whenever the DataFrame contains Unicode characters that have a 4-byte representation in UTF-16 (i.e. in the U+010000 to U+10FFFF range). The bug can be reproduced with:
import pandas
obj = pandas.DataFrame([u'\U0001f44d\U0001f44d', u'12345'])
obj.to_clipboard()
where the clipboard text results in
One character is chopped from the end of the clipboard string for each 4-byte unicode character copied.
or more to the point:
pandas.io.clipboard.clipboard_set(u'\U0001f44d 12345')
produces
The cause of this issue is that len(u'\U0001f44d') == 1 when Python uses UCS-4. Pandas allocates 2 bytes per Python character in the clipboard buffer, but this character consumes 4 bytes in UTF-16, displacing another character at the end of the string to be copied. In UCS-2 (most Python 2 builds), len(u'\U0001f44d') == 2, so 4 bytes are allocated and consumed by the character.
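The size mismatch can be illustrated without touching the real clipboard. The following is only a sketch, assuming Python 3 (where len() counts code points); the helper names utf16_code_units and simulate_old_copy are hypothetical and are not part of pandas. It models copying just len(text) * 2 bytes of the UTF-16 data:

```python
# Hypothetical helpers that model the size arithmetic only;
# nothing here calls the Windows clipboard API.
THUMBS_UP = u"\U0001f44d"  # U+1F44D lies outside the Basic Multilingual Plane


def utf16_code_units(text):
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def simulate_old_copy(text):
    """Keep only len(text) * 2 bytes of the UTF-16 data, then decode them."""
    encoded = text.encode("utf-16-le")
    truncated = encoded[: len(text) * 2]
    # errors="ignore" covers the case where the cut splits a surrogate pair.
    return truncated.decode("utf-16-le", errors="ignore")


text = THUMBS_UP + u" 12345"
print(len(text))                       # 7 code points on Python 3
print(utf16_code_units(text))          # 8 UTF-16 code units
print(ascii(simulate_old_copy(text)))  # '\U0001f44d 1234'
```

In this model, each character above U+FFFF costs one extra code unit that the len(text) * 2 estimate does not account for, so one trailing character is lost per such character.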
My proposed change (affecting only Windows clipboard operations) first converts the text to UTF-16 little-endian, because that is the format Windows uses, then measures the length of the resulting byte string, rather than using Python's len(text) * 2, to determine how many bytes should be allocated to the clipboard buffer.
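A minimal sketch of that sizing approach follows. It is not the code in this PR: a ctypes string buffer stands in for the memory the real Windows code would allocate, and the function names build_utf16_clipboard_buffer and read_back are hypothetical.

```python
import ctypes


def build_utf16_clipboard_buffer(text):
    """Return a NUL-terminated UTF-16-LE buffer sized from the encoded bytes.

    Hypothetical stand-in: the ctypes buffer plays the role of the
    clipboard memory; no Windows API is called.
    """
    encoded = text.encode("utf-16-le")
    size = len(encoded) + 2  # two extra bytes for the UTF-16 NUL terminator
    buf = ctypes.create_string_buffer(size)  # zero-initialised
    ctypes.memmove(buf, encoded, len(encoded))
    return buf


def read_back(buf):
    """Decode the buffer contents, dropping the trailing terminator."""
    return buf.raw[:-2].decode("utf-16-le")


text = u"\U0001f44d 12345"
buf = build_utf16_clipboard_buffer(text)
print(ctypes.sizeof(buf))      # 18 bytes: 16 of text plus 2 for the terminator
print(read_back(buf) == text)  # True
```

Because the size comes from the encoded byte string, the result is the same whether len() counts code points or UTF-16 code units.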
I've tested this change in Python 3.6 and 2.7 on Windows 7 x64. I don't expect it to cause issues on other versions of Windows, but I would appreciate it if anyone on an older version of Windows could double-check.