[Python-Dev] RFD: how to build strings from lots of slices?

Fredrik Lundh <effbot@telia.com>
Sun, 27 Feb 2000 13:01:38 +0100


when hacking on SRE's substitution code, I stumbled upon a problem. to do a substitution, SRE needs to merge slices from the target string and from the substitution pattern.

here's a simple example:

re.sub(
    "(perl|tcl|java)",
    "python (not \\1)",
    "perl rules"
)

contains a "substitution pattern" consisting of three parts:

"python (not " (a slice from the substitution string)
group 1 (a slice from the target string)
")" (a slice from the substitution string)

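the decomposition above can be checked from Python. this is only a sketch using the standard re module; the variable names and the hard-coded template offsets are mine, not anything SRE does internally:

```python
import re

target = "perl rules"
template = "python (not \\1)"   # 15 characters; "\1" sits at offsets 12..13

m = re.search("(perl|tcl|java)", target)

# the three slices that make up the replacement text
parts = [
    template[0:12],                # "python (not " from the substitution string
    target[m.start(1):m.end(1)],   # group 1, from the target string
    template[14:15],               # ")" from the substitution string
]
replacement = "".join(parts)

# full result: untouched prefix + replacement + untouched suffix
result = target[:m.start()] + replacement + target[m.end():]
```

each entry in parts is a freshly allocated string, which is the overhead a slice-based representation would try to avoid.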
PCRE implements this by doing the slicing (thus creating three new strings), and then doing a "join" by hand into a PyString buffer.

this isn't very efficient, and it also doesn't work for unicode strings.

in other words, this needs to be fixed. but how?

...

here's one proposal, off the top of my head:

  1. introduce a PySliceListObject, which behaves like a simple sequence of strings, but stores them as slices. the type structure looks something like this:

    typedef struct {
        PyObject* string;
        int start;
        int end;
    } PySliceListItem;

    typedef struct {
        PyObject_VAR_HEAD
        PySliceListItem item[1];
    } PySliceListObject;

where start and end are normalized (0..len(string))

__len__ returns self->ob_size
__getitem__ calls PySequence_GetSlice()

PySliceListObjects are only used internally; they have no Python-level interface.

  2. tweak string.join and unicode.join to look for PySliceListObjects, and have special code that copies slices directly from the source strings.

(note that a slice list can still be used with any method that expects a sequence of strings, but at a cost)
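to make the two steps concrete, here is a rough Python model of the idea. SliceList and join_slices are hypothetical names used for illustration only (the proposal has no Python-level interface), and the model handles bytes strings with an empty separator only:

```python
class SliceList:
    """hypothetical Python model of the proposed PySliceListObject."""

    def __init__(self):
        self.items = []  # (string, start, end) triples

    def append(self, string, start, end):
        # the proposal assumes start/end are already normalized
        assert 0 <= start <= end <= len(string)
        self.items.append((string, start, end))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        # generic (slow) path: materializes a new string for the slice
        string, start, end = self.items[index]
        return string[start:end]


def join_slices(slice_list):
    """model of the special-cased join: size the result once, then
    copy every slice straight from its source string."""
    total = sum(end - start for _, start, end in slice_list.items)
    buf = bytearray(total)
    pos = 0
    for string, start, end in slice_list.items:
        n = end - start
        # memoryview slicing does not create an intermediate bytes object
        buf[pos:pos + n] = memoryview(string)[start:end]
        pos += n
    return bytes(buf)


template = b"python (not \\1)"
target = b"perl rules"

slices = SliceList()
slices.append(template, 0, 12)   # "python (not "
slices.append(target, 0, 4)      # group 1: "perl"
slices.append(template, 14, 15)  # ")"

fast = join_slices(slices)       # direct copy from the sources
slow = b"".join(slices)          # ordinary join, goes through __getitem__
```

both paths produce the same bytes; the difference is that the ordinary join builds one temporary string per slice before copying.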

...

given the above, the substitution engine can now create a slice list by combining slices from the match object and the substitution object, and hand the result off to the string implementation; e.g.:

sep = PySequence_GetSlice(subst_string, 0, 0);
result = PyObject_CallMethod(sep, "join", "O", slice_list);
Py_DECREF(sep);

(can anyone come up with something more elegant than the [0:0] slice?)
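for reference, the point of the [0:0] slice is that an empty slice has the same type as its source, so the join is dispatched to the matching string type. a small sketch in today's Python, with str and bytes standing in for the two string types; join_like is a hypothetical helper, not an existing API:

```python
def join_like(subst_string, pieces):
    # an empty slice keeps the type of its source string, so it can be
    # used as a type-matched empty separator
    sep = subst_string[0:0]
    return sep.join(pieces)


text_result = join_like("python (not \\1)", ["python (not ", "perl", ")"])
bytes_result = join_like(b"python (not \\1)", [b"python (not ", b"perl", b")"])
```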

comments? better ideas?