BUG: to_clipboard text truncated for Python 3 on Windows for UTF-16 text by david-liu-brattle-1 · Pull Request #25040 · pandas-dev/pandas

For Windows users whose Python is compiled with UCS-4 (primarily Python 3), tables copied to the clipboard are missing data from the end whenever the DataFrame contains Unicode characters that have a 4-byte representation in UTF-16 (i.e. in the U+010000 to U+10FFFF range). The bug can be reproduced as follows:

import pandas
obj = pandas.DataFrame([u'\U0001f44d\U0001f44d', u'12345'])
obj.to_clipboard()

where the clipboard text results in

One character is chopped from the end of the clipboard string for each 4-byte unicode character copied.

or more to the point:

pandas.io.clipboard.clipboard_set(u'\U0001f44d 12345')

produces

The cause is that len(u'\U0001f44d')==1 when Python is built with UCS-4, and pandas allocates 2 bytes per Python character in the clipboard buffer. The character actually consumes 4 bytes in UTF-16, which displaces another character at the end of the string being copied. In UCS-2 builds (most Python 2 builds), len(u'\U0001f44d')==2, so 4 bytes are allocated and consumed by the character.
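The size mismatch can be checked without touching the clipboard. This is only a minimal sketch, assuming Python 3 (or a UCS-4 build of Python 2); the variable names are illustrative and are not taken from the pandas source.

```python
# Sample string from the report: one emoji, a space, five digits.
text = u'\U0001f44d 12345'

# With UCS-4 semantics, len() counts code points, so the emoji counts once.
code_points = len(text)

# UTF-16 stores the emoji as a surrogate pair: two 2-byte code units.
utf16_bytes = len(text.encode('utf-16-le'))

# Size that "2 bytes per Python character" would allocate.
naive_bytes = code_points * 2

# Number of 2-byte characters that no longer fit in the naive buffer.
shortfall_chars = (utf16_bytes - naive_bytes) // 2

print(code_points, naive_bytes, utf16_bytes, shortfall_chars)  # 7 14 16 1
```

The shortfall of one character per 4-byte character matches the truncation described above.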

My proposed change (affecting only Windows clipboard operations) first converts the text to UTF-16 little endian, because that is the format Windows uses, and then measures the length of the resulting byte string to determine how many bytes to allocate for the clipboard buffer, rather than using Python's len(text) * 2.
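The sizing idea can be sketched in pure Python, without any Windows API calls. The helper names below (naive_buffer_size, utf16_buffer_size, simulate_copy) are hypothetical and are not the functions in the patch; the sketch also assumes the buffer reserves one 2-byte null terminator and that Python behaves as a UCS-4 build.

```python
def naive_buffer_size(text):
    """Old approach: 2 bytes per Python character, plus a terminator."""
    return (len(text) + 1) * 2


def utf16_buffer_size(text):
    """Proposed approach: measure the UTF-16-LE encoding itself."""
    encoded = text.encode('utf-16-le')
    return len(encoded) + 2  # assumed 2-byte null terminator


def simulate_copy(text, size):
    """Mimic copying into a fixed-size buffer: bytes that do not fit are lost."""
    payload = text.encode('utf-16-le')[:size - 2]
    # errors='replace' guards against a cut that splits a surrogate pair.
    return payload.decode('utf-16-le', errors='replace')


text = u'\U0001f44d 12345'
truncated = simulate_copy(text, naive_buffer_size(text))
complete = simulate_copy(text, utf16_buffer_size(text))
print(naive_buffer_size(text), utf16_buffer_size(text))  # 16 18
print(complete == text)  # True
```

With the naive size, the simulated copy loses the final "5"; sizing from the encoded bytes keeps the whole string.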

I've tested this change in Python 3.6 and 2.7 on Windows 7 x64. I don't expect it to cause issues on other versions of Windows, but I would appreciate it if anyone on an older version of Windows could double-check.