I think that the cached default encoded version of the unicode object should be limited in size. It's probably a bad idea to cache 100MB of data. For large strings and unicode objects, the user should do explicit caching if required.
I don't see a patch. And I think you cannot do this without compromising correctness, since _PyUnicode_AsDefaultEncodedString() returns the cached value without incrementing its refcount. (The only refcount that keeps it alive is the cache entry.)
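For reference, the relevant shape of the function today (simplified from the py3k sources, with error handling trimmed, so read it as a sketch rather than the exact code):

    PyObject *
    _PyUnicode_AsDefaultEncodedString(PyObject *unicode, const char *errors)
    {
        PyUnicodeObject *u = (PyUnicodeObject *)unicode;
        if (u->defenc == NULL)
            /* may fail and return NULL; defenc then stays NULL */
            u->defenc = PyUnicode_AsEncodedString(unicode, NULL, errors);
        return u->defenc;   /* borrowed: only the cache entry keeps it alive */
    }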
The default encoded version is generated lazily, and only from a couple of places (if I believe my grepping through the py3k sources). So we can:
* choose not to care, as the conversion looks rather rare
* incref the return value of _PyUnicode_AsDefaultEncodedString(), and convert the 20 or so places in which that function is used to properly decref the value when done (sketched below)
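Roughly, the second option would look like this (untested sketch; the call site is purely illustrative):

    PyObject *
    _PyUnicode_AsDefaultEncodedString(PyObject *unicode, const char *errors)
    {
        PyUnicodeObject *u = (PyUnicodeObject *)unicode;
        if (u->defenc == NULL) {
            PyObject *b = PyUnicode_AsEncodedString(unicode, NULL, errors);
            if (b == NULL)
                return NULL;
            u->defenc = b;            /* the cache keeps one reference */
        }
        Py_INCREF(u->defenc);         /* new: the caller owns the result */
        return u->defenc;
    }

    /* a converted call site: */
    PyObject *bytes = _PyUnicode_AsDefaultEncodedString(unicode, NULL);
    if (bytes == NULL)
        return -1;
    /* ... use bytes ... */
    Py_DECREF(bytes);                 /* new: release when done */

Once callers own their reference, evicting (or never caching) oversized entries as suggested above also becomes safe.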
> * choose not to care, as the conversion looks rather rare

Yes.

> * incref the return value of _PyUnicode_AsDefaultEncodedString(), and
>   convert the 20 or so places in which that function is used to
>   properly decref the value when done

No. I suspect you'll find it quite difficult to pick a place where to do the decref in some cases.
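E.g. any place that only keeps the char* around (hypothetical helper, not an actual call site):

    static const char *
    default_encoded_chars(PyObject *unicode)
    {
        PyObject *bytes = _PyUnicode_AsDefaultEncodedString(unicode, NULL);
        if (bytes == NULL)
            return NULL;
        /* With an owned reference we would have to Py_DECREF(bytes) before
           returning -- but that may free the object and invalidate the
           pointer we return, and the caller has no handle to release later.
           The borrowed-reference contract is what makes this pattern work. */
        return PyBytes_AS_STRING(bytes);
    }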
For Py3k you can get rid of the cached default encoded version of the Unicode object altogether: it was only needed to make the Unicode/string auto-coercion mechanism efficient in Python 2.x. In Py3k, you only do such conversions explicitly at the I/O boundaries, so caching the converted value is no longer necessary.
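That is, the conversion becomes an explicit operation whose result the caller owns, e.g. (write_utf8() is a hypothetical helper, just to illustrate the pattern):

    #include <stdio.h>
    #include "Python.h"

    static int
    write_utf8(FILE *fp, PyObject *unicode)
    {
        PyObject *bytes = PyUnicode_AsEncodedString(unicode, "utf-8",
                                                    "strict");
        if (bytes == NULL)
            return -1;
        size_t n = (size_t)PyBytes_GET_SIZE(bytes);
        int ok = (fwrite(PyBytes_AS_STRING(bytes), 1, n, fp) == n);
        Py_DECREF(bytes);   /* the caller owns the result and releases it */
        return ok ? 0 : -1;
    }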