[Python-3000] Poll: Lazy Unicode Strings For Py3k
Larry Hastings larry at hastings.org
Wed Jan 31 11:40:52 CET 2007
I'd like to start a (hopefully final) round of discussion on the "lazy strings" series of patches. What follows is a summary of the current state of the patches, followed by five poll questions.
I apologize in advance for the length of this posting.
A NOTE BEFORE WE BEGIN
The "lazy strings" patches are actually two separate patches: "lazy concatentation" and "lazy slices". I did them together to save work, but I fear I have created some confusion by doing so. Please keep in mind that "lazy concatenation" and "lazy slices" are separate things, and should be independently considered for inclusion.
"LAZY CONCATENATION"
Lazy string concatenation changes the + operator when adding two strings. Instead of returning the result of the concatenation, it returns a placeholder (a PyUnicodeConcatenationObject) that represents the concatenation. The placeholder holds references to the two strings; the first time any caller asks for the string's value, the string is "rendered": memory is allocated for the string, its value is copied in from the referenced strings (recursively only where necessary), the references to the strings are freed, and the newly computed value is returned.
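To make the mechanism concrete, here is a minimal standalone sketch of the idea in plain C: ordinary char strings, none of the real PyUnicodeObject machinery, and no recursion over trees of concatenations the way the actual patch does. It only shows the render-on-first-access step.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Toy placeholder for a deferred concatenation: it holds the two
     * operands and renders (allocates + copies) only on first access. */
    typedef struct {
        const char *left;
        const char *right;
        char *rendered;      /* NULL until someone asks for the value */
    } lazy_concat;

    static lazy_concat lazy_cat(const char *a, const char *b)
    {
        lazy_concat c = { a, b, NULL };
        return c;
    }

    /* Rough analogue of PyUnicode_AsUnicode(): render on demand, and
     * return NULL if the allocation fails. */
    static const char *lazy_value(lazy_concat *c)
    {
        if (c->rendered == NULL) {
            size_t la = strlen(c->left), lb = strlen(c->right);
            c->rendered = malloc(la + lb + 1);
            if (c->rendered == NULL)
                return NULL;                             /* low-memory failure */
            memcpy(c->rendered, c->left, la);
            memcpy(c->rendered + la, c->right, lb + 1);  /* includes '\0' */
        }
        return c->rendered;
    }

    int main(void)
    {
        lazy_concat c = lazy_cat("spam", "eggs");   /* no copying here */
        const char *v = lazy_value(&c);             /* copying happens here */
        if (v != NULL)
            printf("%s\n", v);
        free(c.rendered);
        return 0;
    }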
Lazy concatenation changes the behavior of the Python C API in two subtle ways:
- All C API users asking for the value of a string must use the macro PyUnicode_AS_UNICODE() or the function PyUnicode_AsUnicode(). It is no longer permissible to directly access the object's "str" member.
- It is now possible for PyUnicode_AS_UNICODE() and PyUnicode_AsUnicode() to fail, returning NULL under low-memory conditions. All code calling either of these would have to be amended to check for NULL; few of these checks have been added so far. Josiah Carlson suggests there are 400+ such places in the Py3k source tree alone; a sketch of the required call pattern follows below.
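Here is roughly what that call pattern looks like from a C extension. count_xs() is a hypothetical function of mine, but PyUnicode_AsUnicode() is the real accessor, and the NULL check is the part most existing callers omit today:

    #include <Python.h>

    /* Hypothetical helper: count the occurrences of 'x' in a unicode
     * object.  Under the lazy-strings patches the accessor may have to
     * render the string, so it can fail under low memory. */
    static Py_ssize_t
    count_xs(PyObject *s)
    {
        Py_UNICODE *p;
        Py_ssize_t i, len, n = 0;

        p = PyUnicode_AsUnicode(s);     /* may render; may return NULL */
        if (p == NULL)
            return -1;                  /* propagate the error */

        len = PyUnicode_GetSize(s);
        for (i = 0; i < len; i++)
            if (p[i] == 'x')
                n++;
        return n;
    }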
The implementation also adds complexity underlying a PyUnicodeObject, though this is effectively hidden from anyone not peering into "Objects/unicodeobject.c".
With this patch in place, Unicode string concatenation is approximately 300% faster: the Pybench "ConcatUnicode" benchmark finishes in 1/4 the time compared to an unpatched Py3k. This makes + for string concatenation fast enough that the usual Python idiom of accumulating pieces in a list and calling "".join() on it is no longer necessary, making + the "one obvious way to do it".
"LAZY SLICES"
Lazy string slicing changes the slicing operator when acting upon a string and all string methods that return a slice of the string (split(), strip(), partition(), etc). Instead of returning the result of the slice, it returns a placeholder (a PyUnicodeSliceObject) that represents the slice. The placeholder holds a reference to the original string, and the desired slice's start and end offsets. (Lazy slice objects only support slices with a stride of 1; slices with a stride greater than 1 are computed directly as before.) The first time any caller asks for the value of the string, the slice is "rendered", and its value is returned.
This by itself would not speed anything up; it would actually be a net slowdown. The speedup is achieved by avoiding rendering wherever possible. Most Unicode string processing is done entirely within "Objects/unicodeobject.c", and it is relatively simple to change code within this file such that it operates on unrendered slices. With such changes in place, most slices are never rendered, thus obviating the need for allocating memory and memcpy()ing to store the slice's value.
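Here too a standalone sketch in plain C may help (this is not the real PyUnicodeSliceObject, just the shape of the idea): the slice is simply an (original, start, length) view, and an operation such as an equality test can work directly against the parent's buffer without ever rendering the slice.

    #include <stdbool.h>
    #include <string.h>

    /* Toy stride-1 slice placeholder: no buffer of its own, just a view
     * onto the original string plus start/length offsets. */
    typedef struct {
        const char *original;
        size_t start;
        size_t length;
    } lazy_slice;

    static lazy_slice slice_of(const char *s, size_t start, size_t stop)
    {
        lazy_slice sl = { s, start, stop - start };
        return sl;
    }

    /* An operation that never needs the slice rendered: compare it to
     * another string by looking straight into the parent's buffer. */
    static bool slice_equals(const lazy_slice *sl, const char *other)
    {
        return strlen(other) == sl->length &&
               memcmp(sl->original + sl->start, other, sl->length) == 0;
    }

    int main(void)
    {
        lazy_slice sl = slice_of("hello world", 6, 11);   /* "world", no copy */
        return slice_equals(&sl, "world") ? 0 : 1;
    }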
Lazy slices change the behavior of Python in three subtle ways. First, they add the same two changes in C API behavior that "lazy concatenation" does: requiring use of the accessor macro/function, and stipulating that these can now fail.
Lazy slices also add a third wrinkle, a change to Python's memory allocation behavior: slicing a string means the slice holds a reference to the original string. If the slice is long-lived, the original string becomes long-lived too. This can produce undesirable memory consumption when a program keeps long-lived small slices of short-lived big strings. In this patch's defense, I point out there are other scenarios where this behavior actually results in less overall memory use, mainly when taking multiple long-lived slices of a single string. Also, it is possible to tune your memory use of lazy slices by explicitly forcing a slice to render. Still, the worst-case behavior here is bad enough to give definite pause. (As We Shall Soon See.)
Also, as with lazy concatenation, lazy slices add complexity to PyUnicodeObject, though again this is all hidden within unicodeobject.c.
With this patch in place, Unicode string slicing is approximately 110% faster. The Pybench "UnicodeSlicing" benchmark finishes in less than 1/2 the time when compared to an unpatched Py3k.
GUIDO'S EXAMINATION
Guido examined a "rollup" patch that contained both lazy slices and lazy concatenation. His major criticism of the patch was the "third wrinkle" added by lazy slices, where long-lived small slices of short-lived big strings cause the big strings to become long-lived. Here's his test case:

    a = []
    while True:
        x = u"x"*1000000
        x = x[30:60]   # Short slice of long string
        a.append(x)
An unpatched Py3k can run all day doing this; add the "lazy slice" patch and it quickly throws a MemoryError (after 1035 iterations on my XP box).
Guido was unwilling to accept a patch with this behavior. He agreed (via a short email exchange) that more discussion on the Py3k list was a good idea.
"V2 LAZY SLICES"
After meditating under a waterfall on a distant mountaintop, I conceived of an alternate implementation of lazy slices that wouldn't have this memory behavior while still preserving much of the benefit of the original lazy slices patch. I call this approach "V2 lazy slices".
The implementation is, in a word, complicated. In V2 lazy slices, a lazy slice is still a placeholder, with a pointer back to the original string. However, it does not retain an official Python reference to the original string object; its ob_refcnt is unchanged. Instead, there is a sinister second internal reference, hidden from the outside world. The original string maintains a circularly-linked list of all slices from it. When the original string is dealloc()'d, it walks this list and forces each slice to render. As if that weren't enough, the code must handle some gnarly boundary cases in low-memory conditions. (A standalone sketch of this scheme follows the list of states below.) This approach necessitated a change to the object structure; no longer are lazy slices (and lazy concatenations) internal subtypes of PyUnicodeObject. Instead, a PyUnicodeObject has a pointer to a "PyUnusualUnicodeObject" structure which reflects the string's "unusual" state:
- I am a lazy concatenation.
- I am a ("V2") lazy slice.
- I am a conventional string, but lazy slices point to me.

Apart from this change in reference counting and storage, the V2 lazy slice patch is much the same as the original lazy slice patch.
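Here is that standalone sketch of the ownership scheme, again in plain C and greatly simplified: a plain singly-linked list instead of the patch's circular one, and none of the gnarly cases handled (slices freed before the parent, rendering failures under low memory). Slices register themselves with the parent without taking a reference, and freeing the parent forces every registered slice to render first.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct v2_slice;

    /* Parent string: owns its buffer plus a hidden list of un-rendered
     * slices that point back into it without any refcount involved. */
    typedef struct {
        char *buf;
        struct v2_slice *slices;   /* head of the hidden back-reference list */
    } parent_str;

    typedef struct v2_slice {
        parent_str *parent;        /* hidden pointer, not a real reference */
        size_t start, length;
        char *rendered;            /* NULL while still lazy */
        struct v2_slice *next;
    } v2_slice;

    static v2_slice *take_slice(parent_str *p, size_t start, size_t stop)
    {
        v2_slice *s = calloc(1, sizeof *s);
        if (s == NULL)
            return NULL;
        s->parent = p;
        s->start = start;
        s->length = stop - start;
        s->next = p->slices;       /* register with the parent */
        p->slices = s;
        return s;
    }

    static void render_slice(v2_slice *s)
    {
        if (s->rendered == NULL && s->parent != NULL) {
            s->rendered = malloc(s->length + 1);
            if (s->rendered != NULL) {
                memcpy(s->rendered, s->parent->buf + s->start, s->length);
                s->rendered[s->length] = '\0';
            }
            s->parent = NULL;      /* no longer depends on the parent */
        }
    }

    /* Freeing the parent walks the list and renders every surviving
     * slice, so none of them is left pointing at freed memory. */
    static void free_parent(parent_str *p)
    {
        v2_slice *s;
        for (s = p->slices; s != NULL; s = s->next)
            render_slice(s);
        free(p->buf);
        p->buf = NULL;
        p->slices = NULL;
    }

    int main(void)
    {
        const char *text = "a rather long original string";
        parent_str p = { malloc(strlen(text) + 1), NULL };
        strcpy(p.buf, text);
        v2_slice *s = take_slice(&p, 2, 8);   /* "rather", still lazy */
        free_parent(&p);                      /* forces s to render first */
        printf("%s\n", s->rendered);
        free(s->rendered);
        free(s);
        return 0;
    }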
As before, V2 lazy slices change the behavior of the Python C API in the same two ways that "lazy concatenation" does: requiring use of the accessor macro/function, and stipulating that these can now fail. It does not share the change in memory consumption added by the original "lazy slices" patch; Guido's test case above now runs for as many iterations as you're willing to sit around for. It does still hide all this complexity inside unicodeobject.c.
There is one more wrinkle to consider:

    def isNotFoo(fileHandle):
        a = unicode(fileHandle.readline())
        x = a.strip()
        return x != u"foo"

Here we allocate the string a, then x, a V2 lazy slice of a. When the function exits, it pops the frame and the locals are freed. But! "a" is freed before "x", which means the original Unicode object is deallocated, which means x is forced to render, even though it's about to be freed itself. So this rendering is a waste of time.
I asked on Python-Dev about this. The consensus I got was:
- In CPython locals are freed in the order they were added to the dictionary, and
- this behavior is not a defined part of Python and should not be relied on.
So I did what any loser hacker would do: I reversed the order. Now local variables are freed in the reverse order of their being added to the locals dictionary. Obviously this does not guarantee that all local slices will be freed before their local original string is. But I suspect that it's better for V2 lazy slices than the other order, and I've been assured that nobody can rely on the order here, so it should be a safe (cheap) hack. This change certainly doesn't seem to break anything.
LOCAL FREELIST RETOOLING
While working on the V2 lazy slice patch, I realized that the local freelist code in unicodeobject.c could use some renovation. To make a long story slightly less long, I revamped it to have four freelists: one for PyUnicodeObjects themselves, one for very short string buffers, and two for the specialized objects I use with lazy evaluation. Even if we don't keep any of the rest of my patches, I suspect this would be of interest. I speculate that it will make Python slightly faster and use slightly less memory, but I admit I have not tried it in isolation.
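For anyone who hasn't looked at this corner of unicodeobject.c: a freelist just keeps a bounded number of freed objects chained together so the next allocation can reuse one instead of going back to the allocator. The sketch below is the generic pattern, not the actual code from the patch:

    #include <stdlib.h>

    #define FREELIST_MAX 80          /* arbitrary cap for this sketch */

    /* Trivial freelist for fixed-size nodes: freed nodes are chained
     * together and handed back out before malloc() is consulted again. */
    typedef struct node {
        struct node *next;           /* link used only while on the freelist */
        /* ... the real payload would live here ... */
    } node;

    static node *freelist = NULL;
    static int freelist_len = 0;

    static node *node_alloc(void)
    {
        if (freelist != NULL) {
            node *n = freelist;      /* reuse a previously freed node */
            freelist = n->next;
            freelist_len--;
            return n;
        }
        return malloc(sizeof(node));
    }

    static void node_free(node *n)
    {
        if (freelist_len < FREELIST_MAX) {
            n->next = freelist;      /* park it for later reuse */
            freelist = n;
            freelist_len++;
        } else {
            free(n);                 /* freelist is full; really release it */
        }
    }

    int main(void)
    {
        node *a = node_alloc();      /* comes from malloc() */
        node_free(a);                /* parked on the freelist */
        node *b = node_alloc();      /* same block, reused without malloc() */
        node_free(b);
        return 0;
    }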
THE NET RESULT
I have a Py3k with "V2 lazy slices", "lazy concatenation" (retooled to mate better with the V2 lazy slices), and the "local freelist retooling", all applied at once. Preliminary benchmarks from one run of Pybench, 10 rounds at "warp 10", on my main XP box:

    "ConcatUnicode": 249.1% faster
    "CreateUnicodeWithConcat": 72.0% faster
    "UnicodeSlicing": 90.2% faster
    Overall: 3.3% faster

I would be remiss if I didn't point out: these tests are really just microbenchmarks. "UnicodeSlicing" creates 49 slices of the same string, then throws them away, neither examining the slices' values nor storing them in a local variable. The effect of these patches on real-world code is as yet untested, as far as I know.
I haven't published this rollup patch because it flunks the regression tests right now.
THE POLLS
Finally, the polls.
#1: Should Py3k incorporate the "lazy concatenation" patch?
#2: Should Py3k incorporate the original "lazy slices" patch?
#3: Should Py3k incorporate the "V2 lazy slices" patch?
#4: Assuming you are +0 or better on #3, is it okay to change the frame object such that locals are released in the reverse order of their being added to the dictionary?
#5: Should Py3k incorporate the "local freelist retooling" patch?
My votes are as follows:
#1: +1
#2: -1
#3: +0.5
#4: +1
#5: +1
The reason I'm not fully behind "V2 lazy slices" is that they're just so gosh-darned complicated. Yes, they're faster, but at the cost of a lot of possible programmer puzzlement. My feeling is that we Python core developers are willing to shoulder the brunt of a certain amount of complexity to make the lives of Python users happier, but this only goes so far. Really, I only implemented them because of a misguided, dogged relentlessness. But I don't claim to have my finger on the pulse of the Python core development community; perhaps you'll surprise me with a "damn the complexity, full speed ahead" attitude and express great interest.
If you'd prefer to see the "V2 lazy slices" patch before passing sentence, post a request to the list (or email me) and I'll make one happen.
I await your judgment,
/larry/