Speed up max_len_string_array by cpcloud · Pull Request #10024 · pandas-dev/pandas
Before:
In [5]: x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * int(1e7), object)
In [2]: f = pd.lib.max_len_string_array
In [6]: %timeit f(x)
1 loops, best of 3: 501 ms per loop
After:
In [1]: x = np.array(['abcd', 'abcde', 'abcdef', 'abcdefg'] * int(1e7), object)
In [2]: f = pd.lib.max_len_string_array
In [3]: %timeit f(x)
10 loops, best of 3: 68.4 ms per loop
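For readers unfamiliar with the function being timed, here is a rough pure-Python sketch of what it is assumed to compute: the length of the longest string in an object array. The name `max_len_string_array_py` is hypothetical, and the handling of non-string elements (skipped here) is an assumption, not a statement about the Cython implementation in this PR.

```python
def max_len_string_array_py(values):
    """Return the length of the longest str/bytes element, or 0 if none.

    Hypothetical reference version; assumes non-string elements are ignored.
    """
    longest = 0
    for value in values:
        # Only string-like elements contribute to the result.
        if isinstance(value, (str, bytes)):
            length = len(value)
            if length > longest:
                longest = length
    return longest


# Same element pattern as the benchmark above, but much smaller.
sample = ['abcd', 'abcde', 'abcdef', 'abcdefg'] * 3
print(max_len_string_array_py(sample))  # prints 7
```

A loop like this pays Python-level overhead on every element, which is the kind of cost that typed Cython code is meant to avoid.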
looks good. release note in perf section & squash. merge when ready.
@@ -896,23 +904,32 @@ def clean_index_list(list obj):
return maybe_convert_objects(converted), 0
ctypedef fused pandas_string:
I was going to mention you could use a better name here :).
My naive quick look does show some improvements to to_hdf when you have object dtypes, on the order of 10-15% (it is only used in to_stata/to_hdf).
@shoyer I didn't have a particular pandas use case for this. I'm going to start using it in some often called paths in odo and I wanted to see if I could squeeze out some more perf.
I think @jreback had some ideas about using Cython memoryviews in some of the csv code, similar to how I use them here. IIRC he said there are quite a few places where we don't take full advantage of what Cython has to offer. For example, if you type a variable as just ndarray, you incur the overhead of the fully general getitem C API, whereas if you type it as ndarray[type], the getitem syntax goes directly to the underlying raw pointer array.
ok squashed. merging on pass
cpcloud added a commit that referenced this pull request
Speed up max_len_string_array
cpcloud deleted the fixlen-string-faster branch