BUG: long str unconvert fix by wabu · Pull Request #6166 · pandas-dev/pandas


Conversation


wabu

When storing long strings in HDF5, they get truncated to 64 bytes when converted back to strings. The fix computes the itemsize of the strings and uses it when converting back. It would be possible to use the attribute information of the table instead, but that would require changes in the calling code.

A test that failed without the patch is included.
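
A rough sketch of the idea (the helper name and structure here are illustrative, not the actual pandas code): measure the widest decoded string first and build the target dtype from that itemsize, so nothing gets cut off at 64 bytes.

```python
import numpy as np

def unconvert_string_array_sketch(data, encoding="utf-8"):
    # decode the fixed-width byte strings, then size the unicode dtype
    # from the longest decoded value instead of a default width
    data = np.asarray(data)
    decoded = [x.decode(encoding) for x in data.ravel()]
    itemsize = max(len(s) for s in decoded) if decoded else 1
    return np.array(decoded, dtype="U%d" % itemsize).reshape(data.shape)

raw = np.array([b"a" * 100], dtype="S100")
print(len(unconvert_string_array_sketch(raw)[0]))  # 100, not truncated
```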

@jreback

@jreback

needs a perf check

iirc this was an odd bottleneck

course it only matters if the data is encoded

@wabu

I did an update of numpy and it fixed the issue. If you think I should perf-check the patch for numpy <= 1.7.1, I can do that; otherwise you can close the PR.

@jreback

no...I like this PR!

I don't think it's actually an issue (the issue is in decoding strings, which is why I used this method in the first place; it's quite fast). True, you have to figure out the max length of the strings...

tell you what...why don't you put in a conditional based on numpy <= 1.7.1 and then do it your way, otherwise leave it as is? (use LooseVersion)
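
A hedged sketch of that guard (the variable and function names here are illustrative, not the final patch): force an explicit itemsize only on numpy older than 1.7.2 and leave the existing fast path untouched otherwise.

```python
from distutils.version import LooseVersion

import numpy as np

_numpy_needs_str_fix = LooseVersion(np.__version__) < LooseVersion("1.7.2")

def choose_str_dtype(itemsize):
    if _numpy_needs_str_fix:
        return "U%d" % itemsize  # explicit width avoids the 64-byte cut
    return str                   # newer numpy infers the width correctly

print(choose_str_dtype(100))
```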

@wabu

I did the perf test, nothing changed significantly, and also implemented the check for the numpy version

jreback

```diff
@@ -4157,8 +4158,8 @@ def _convert_string_array(data, encoding, itemsize=None):
     data = np.array(data, dtype="S%d" % itemsize)
     return data

 def _unconvert_string_array(data, nan_rep=None, encoding=None):
+    _numpy_needs_str_fix = LooseVersion(np.__version__) < '1.7.2'
```


name this something like _np_version_under_172 and put it near the top of the file
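
Roughly what that rename and placement would look like near the top of the module (a sketch; it assumes the flag keeps being computed via LooseVersion as in the diff above):

```python
from distutils.version import LooseVersion

import numpy as np

# module-level flag instead of recomputing it inside _unconvert_string_array
_np_version_under_172 = LooseVersion(np.__version__) < LooseVersion("1.7.2")
```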

@wabu

Did the changes and added release notes.

@jreback

This change is actually very tricky and subtle because of exactly how 3.3 / 2.7 interact with encodings....

still failing on 3.2 on 1.7.1.....

jreback@6c268de

I am not sure how to specify a dtype on 3.2 like 'S100'....maybe you can figure this out?
it's essentially a unicode dtype

@wabu

what about np.dtype((str, itemsize))? It results in '<U{size}' for Python 3 and '|S{size}' for Python 2, which seems to be ok?
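
A quick check of that suggestion (output shown for Python 3, where str is unicode; on Python 2 str is bytes, which is why the same call gives '|S100' there):

```python
import numpy as np

dt = np.dtype((str, 100))
print(dt.str)  # '<U100' on Python 3
```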

@jreback

do u have the ability to try this in py3.3 on windows?
that was the one throwing the error originally (weird but not on Linux)

u can use the just released 3.1 and try modifying the source
as compiling in windows is a bit tricky

if u want to try compiling on windows I can help though

@wabu

can't help with this, have no windows machine around.

@jreback

ok no problem
pls update with the encoding tests and your change and I'll give it a whirl in a couple of days

@jreback

I did a PR on your branch with some fixes to make this work on windows: wabu#1

@jreback

@wabu pls incorporate my changes when u can

@wabu

Prior to numpy 1.7.2, long strings got truncated when read back from an HDF5 file.
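
A hedged reproduction sketch of that behaviour (requires PyTables; the file name and value are illustrative): on numpy older than 1.7.2 the round-tripped string came back cut to 64 characters instead of 100.

```python
import pandas as pd

df = pd.DataFrame({"x": ["a" * 100]})
with pd.HDFStore("strings.h5") as store:
    store.append("df", df, encoding="utf-8")  # stored as fixed-width bytes
    out = store.select("df")

print(len(out["x"].iloc[0]))  # expected 100; the reported bug yielded 64
```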

@jreback

@wabu can you incorporate the changes and rebase?

@jreback

closing in favor of #6821
