Speed up max_len_string_array by cpcloud · Pull Request #10024 · pandas-dev/pandas

cpcloud

Before:

In [5]: x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * int(1e7), object)

In [2]: f = pd.lib.max_len_string_array

In [6]: %timeit f(x)
1 loops, best of 3: 501 ms per loop

After:

In [1]: x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * int(1e7), object)

In [2]: f = pd.lib.max_len_string_array

In [3]: %timeit f(x)
10 loops, best of 3: 68.4 ms per loop
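For readers unfamiliar with the function being timed, here is a minimal pure-Python sketch of what it is understood to compute: the length of the longest string in the array. The name `max_len_string_array_reference` is hypothetical, and skipping non-string elements is an assumption rather than something stated in this PR. The real implementation lives in Cython and is much faster.

```python
def max_len_string_array_reference(values):
    """Return the length of the longest str/bytes element in `values`.

    Hypothetical reference helper, not the pandas implementation.
    Assumption: elements that are not str or bytes are ignored.
    """
    longest = 0
    for value in values:
        if isinstance(value, (str, bytes)):
            longest = max(longest, len(value))
    return longest


# Same strings as the benchmark above, at a much smaller repeat count.
sample = ['abcd', 'abcde', 'abcdef', 'abcdefg'] * 3
result = max_len_string_array_reference(sample)
```

The sketch accepts any iterable, so an object-dtype NumPy array like `x` above should work as well as a plain list.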

@jreback

looks good. release note in perf section & squash. merge when ready.

shoyer

@@ -896,23 +904,32 @@ def clean_index_list(list obj):
return maybe_convert_objects(converted), 0
ctypedef fused pandas_string:

I was going to mention you could use a better name here :).

@jreback

my naive quick look does show some improvement in to_hdf when you have object dtypes, on the order of 10-15% (this function is only used in to_stata/to_hdf).

@cpcloud

@shoyer I didn't have a particular pandas use case for this. I'm going to start using it in some often called paths in odo and I wanted to see if I could squeeze out some more perf.

@cpcloud

I think @jreback had some ideas about using Cython memoryviews in some of the csv code, similar to how I use them here. IIRC he said there are quite a few places where we don't take full advantage of what Cython has to offer. For example, if you type a variable as just ndarray you incur the overhead of the fully general getitem C API, whereas if you type it as ndarray[type] the getitem syntax goes directly to the underlying raw pointer array.
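As a rough illustration of the "typed buffer" idea, the stdlib sketch below uses Python's built-in `memoryview` over an `array.array`. This is only an analogy: it is not Cython and does not show the speedup itself. It shows that a typed buffer carries a fixed element format and size and gives direct access to the underlying storage, which is the property that typed indexing relies on. All variable names are illustrative.

```python
from array import array

# A typed, contiguous buffer of C doubles (a stdlib stand-in for a typed ndarray).
values = array("d", [1.5, 2.5, 4.0])

# memoryview exposes the raw storage through the buffer protocol (PEP 3118).
view = memoryview(values)

element_format = view.format   # struct-style type code shared by every element
element_size = view.itemsize   # bytes per element
total_bytes = view.nbytes      # itemsize * number of elements

# Writing through the view mutates the underlying storage; no copy is made.
view[0] = 10.0
first_value = values[0]
```

Because every element has the same known size, an indexing operation on a typed buffer can be reduced to pointer arithmetic, whereas an untyped object has to go through the general item-lookup machinery.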

@cpcloud

ok squashed. merging on pass

cpcloud added a commit that referenced this pull request

Apr 30, 2015

@cpcloud

Speed up max_len_string_array

cpcloud deleted the fixlen-string-faster branch

April 30, 2015 15:09