[Python-Dev] str.count is slow (original) (raw)
Ben Cartwright bencvt at gmail.com
Tue Feb 28 00:50:28 CET 2006
- Previous message: [Python-Dev] Making ascii the default encoding
- Next message: [Python-Dev] str.count is slow
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
From comp.lang.python: chrisperkins99 at gmail.com wrote: It seems to me that str.count is awfully slow. Is there some reason for this? Evidence:
######## str.count time test ######## import string import time import array s = string.printable * int(1e5) # 10**7 character string a = array.array('c', s) u = unicode(s) RIGHTANSWER = s.count('a') def main(): print 'str: ', timecall(s.count, 'a') print 'array: ', timecall(a.count, 'a') print 'unicode:', timecall(u.count, 'a') def timecall(f, *a): start = time.clock() assert RIGHTANSWER == f(*a) return time.clock()-start if name == 'main': main() ###### end ######## On my machine, the output is: str: 0.29365715475 array: 0.448095498171 unicode: 0.0243757237303 If a unicode object can count characters so fast, why should an str object be ten times slower? Just curious, really - it's still fast enough for me (so far). This is with Python 2.4.1 on WinXP.
Chris Perkins
Your evidence points to some unoptimized code in the underlying C implementation of Python. As such, this should probably go to the python-dev list (http://mail.python.org/mailman/listinfo/python-dev).
The problem is that the C library function memcmp is slow, and str.count calls it frequently. See lines 2165+ in stringobject.c (inside function string_count):
r = 0;
while (i < m) {
if (!memcmp(s+i, sub, n)) {
r++;
i += n;
} else {
i++;
}
}
This could be optimized as:
r = 0;
while (i < m) {
if (s[i] == *sub && !memcmp(s+i, sub, n)) {
r++;
i += n;
} else {
i++;
}
}
This tactic typically avoids most (sometimes all) of the calls to memcmp. Other string search functions, including unicode.count, unicode.index, and str.index, use this tactic, which is why you see unicode.count performing better than str.count.
The above might be optimized further for cases such as yours, where a single character appears many times in the string:
r = 0;
if (n == 1) {
/* optimize for a single character */
while (i < m) {
if (s[i] == *sub)
r++;
i++;
}
} else {
while (i < m) {
if (s[i] == *sub && !memcmp(s+i, sub, n)) {
r++;
i += n;
} else {
i++;
}
}
}
Note that there might be some subtle reason why neither of these optimizations are done that I'm unaware of... in which case a comment in the C source would help. :-)
--Ben
- Previous message: [Python-Dev] Making ascii the default encoding
- Next message: [Python-Dev] str.count is slow
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]