[Python-3000] Making more effective use of slice objects in Py3k (original) (raw)

Guido van Rossum guido at python.org
Mon Aug 28 22:07:55 CEST 2006

Previous message: [Python-3000] Making more effective use of slice objects in Py3k
Next message: [Python-3000] Making more effective use of slice objects in Py3k
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Those are all microbenchmarks. It's easy to prove the superiority of an approach that way. But what about realistic applications? What if your views don't end up saving memory or time for an application, but still cost in terms of added complexity in all string operations?

Anyway, let me begin with your microbenchmark.

The first one pits a linear algorithm against a quadratic algorithm with the expected result.

The second one is more interesting; your version doesn't copy while the split() version copies, and that gives your version the expected speedup. I never doubted this.

But your code has a worst-case problem: if you take a single short view of a really long string and then drop the long string, the view keeps it around. Something like this:

rest = ... # your mailbox file results = [] for i in range(1000): x = rest + "." # Just to force a copy results.append(x.partition("\r\n.\r\n")[0]) Save the first message over and over

Now watch the memory growth with your version vs. with standard partition.

Now fix this in your code and re-run your benchmark.

Then I come with another worst-case scenario, etc.

Then I ask you to make it so that string views are 99.999% indistinguishable from strings -- they have all the same methods, are usable everywhere else, etc.

--Guido

On 8/28/06, Josiah Carlson <jcarlson at uci.edu> wrote:

"Guido van Rossum" <guido at python.org> wrote: > > Josiah (and other supporters of string views), > > You seem to be utterly convinced of the superior performance of your > proposal without having done any measurements. > > You appear to have a rather naive view on what makes code execute fast > or slow (e.g. you don't seem to appreciate the savings due to a string > object header and its data being consecutive in memory). > > Unless you have serious benchmark data (for realistic Python code) I > can't continue to participate in this discussion, where you have said > nothing new in many posts. Put up or shut up, eh? I have written a simple extension module using Pyrex (my manual C extension writing is awful). Here are some sample interactions showing that string views are indeed quite fast. In all of these examples, a naive implementation using only stringview.partition() was able to beat Python 2.5 str.partition, str.split, and re.finditer. Attached you will find the implementation of stringview I used, along with sufficient build scripts to get it working using Python 2.3 and Pyrex 0.9.3 . Aside from replacing int usage with Pyssizet for 2.5, and *nix users performing a dos2unix call, it should work without change with the most recent Python and Pyrex versions. - Josiah

Using 2.3 : >>> x = stringview(40000*' ') >>> if 1: ... t = time.time() ... while x: ... 1, 2, x = x.partition(' ') ... print time.time()-t ... 0.18700003624 >>> Compared with Python 2.5 beta 2 >>> x = 40000*' ' >>> if 1: ... t = time.time() ... while x: ... 1, 2, x = x.partition(' ') ... print time.time()-t ... 0.625 >>> But that's about as bad for Python 2.5 as it can get. What about something else? Like a mail file? In my 21.5 meg archive of py3k, which contains 3456 messages, I wanted to discover all messages. Python 2.3.5 (#62, Feb 8 2005, 16:23:02) [MSC v.1200 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> from stringview import * >>> rest = stringview(open('mail', 'rb').read()) >>> import time >>> if 1: ... x = [] ... t = time.time() ... while rest: ... cur, found, rest = rest.partition('\r\n.\r\n') ... x.append(cur) ... print time.time()-t, len(x) ... 0.0780000686646 3456 >>> What about Python 2.5 using split? That should be fast... Python 2.5b2 (r25b2:50512, Jul 11 2006, 10:16:14) [MSC v.1310 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> rest = open('mail', 'rb').read() >>> import time >>> if 1: ... t = time.time() ... x = rest.split('\r\n.\r\n') ... print time.time()-t, len(x) ... 0.109999895096 3457 >>> Hrm...what about using re? >>> import re >>> pat = re.compile('\r\n.\r\n') >>> rest = open('mail', 'rb').read() >>> import time >>> if 1: ... x = [] ... t = time.time() ... for i in pat.finditer(rest): ... x.append(i) ... print time.time()-t, len(x) ... 0.125 3456 >>> Even that's not as good as Python 2.3 + string views.

-- --Guido van Rossum (home page: http://www.python.org/~guido/)

Previous message: [Python-3000] Making more effective use of slice objects in Py3k
Next message: [Python-3000] Making more effective use of slice objects in Py3k
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-3000 mailing list