[Python-3000] bytes & Py_TPFLAGS_BASETYPE

Mathieu Fenniak mathieu.fenniak at gmail.com
Sun Sep 16 22:19:57 CEST 2007


On 16-Sep-07, at 12:38 PM, Guido van Rossum wrote:

> I'm not doubting that your subclass works well enough. The problem
> is that it must be robust in the light of any subclass, no matter
> how crazy.

I understand that, but I'm not sure what kind of problems can be
created by crazy subclasses. But my imagination of "crazy subclass"
is pretty limited.

> I'd have to understand more about your app to see whether
> subclassing truly makes sense.

I didn't want to flood the discussion with too many pointless
details, so here's the minimum that I think is relevant. The
project is pyPdf, a library for reading and writing PDF files. I've
been working on making the library support Unicode text strings
within PDF documents.

In a PDF file, a "string" can either be a text string, or a byte
string. A string is a text string if it starts with a UTF-16BE byte
order mark (BOM), or if it can be decoded using an encoding called
PDFDocEncoding (which is specified by the PDF reference, similar to
Latin-1 but different just to make life difficult). pyPdf needs to
be capable of reading and writing these string objects. Whether a
string is a byte or a text string, writing out the raw bytes is the
same process after the text has been encoded. This lends itself to a
common StringObject base class:

    class StringObject(PdfObject):
        # contains common behavior for both types of strings, such as
        # the ability to serialize out a byte array, encrypt/decrypt
        # strings for "secure" PDF files
        # also contains reading code that attempts to autodetect
        # whether the string is a byte or text string

    class ByteStringObject(bytes, StringObject):
        # adds the byte array storage, and passes self back to
        # StringObject for serialization output

    class TextStringObject(str, StringObject):
        # overrides the default output serialization to encode the
        # unicode string to match PDF's requirements,
        # passes the resulting byte array up for serialization.

(complete source code, if you're interested:
http://hg.pybrary.net/pyPdf-py3/file/fe0dc2014a1b/pyPdf/generic.py)
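
As a rough illustration of the outline above, here is a minimal sketch
that runs on current Python 3 releases, where bytes is immutable and
can be subclassed. The empty PdfObject stub, the method names
serialize and raw_bytes, and the choice to always write text as
UTF-16BE with a BOM are assumptions made for the example; they are not
taken from pyPdf, which also tries PDFDocEncoding.

```python
import codecs


class PdfObject:
    """Hypothetical stand-in for pyPdf's PdfObject base class."""


class StringObject(PdfObject):
    """Behavior shared by byte strings and text strings."""

    def raw_bytes(self) -> bytes:
        raise NotImplementedError

    def serialize(self) -> bytes:
        # Shared output path: emit a PDF literal string, escaping
        # backslashes and parentheses at the byte level.
        raw = self.raw_bytes()
        escaped = (
            raw.replace(b"\\", b"\\\\")
            .replace(b"(", b"\\(")
            .replace(b")", b"\\)")
        )
        return b"(" + escaped + b")"


class ByteStringObject(bytes, StringObject):
    def raw_bytes(self) -> bytes:
        # The bytes base type is the storage; hand it back unchanged.
        return bytes(self)


class TextStringObject(str, StringObject):
    def raw_bytes(self) -> bytes:
        # Assumption: always encode as UTF-16BE with a leading BOM.
        return codecs.BOM_UTF16_BE + self.encode("utf-16-be")


byte_result = ByteStringObject(b"a(b)").serialize()
text_result = TextStringObject("hi").serialize()
```

Only raw_bytes differs between the two subclasses; the escaping and
wrapping live once in StringObject, which is the point of the common
base class.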

Deriving from the bytes type provides storage, and also direct & easy
access to the byte array content. I think in this case using bytes
as a base type makes sense, at least as much as using str as a base
type. pyPdf derives from list and dict for different PDF object
types in a similar manner as well.
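
To make the storage and direct-access point concrete, the following
sketch pairs stripped-down versions of the two subclasses with a
reader that picks between them. The function name read_string is
hypothetical, and only the UTF-16BE BOM check is shown: the
PDFDocEncoding attempt is left out because Python ships no codec for
it, so anything without a BOM is treated as a byte string here.

```python
import codecs


class StringObject:
    """Hypothetical stand-in for the shared base class."""


class ByteStringObject(bytes, StringObject):
    pass


class TextStringObject(str, StringObject):
    pass


def read_string(raw: bytes) -> StringObject:
    # Text string: starts with the UTF-16BE byte order mark.
    bom = codecs.BOM_UTF16_BE
    if raw.startswith(bom):
        return TextStringObject(raw[len(bom):].decode("utf-16-be"))
    # Assumption: everything else is kept as a byte string.
    return ByteStringObject(raw)


data = read_string(b"\x00\x01binary")
text = read_string(codecs.BOM_UTF16_BE + "héllo".encode("utf-16-be"))

# The results can be used directly wherever bytes or str are expected.
print(data[:2], len(data))
print(text.upper())
```

Because the objects are real bytes and str instances, slicing, len()
and the string methods work without any wrapper code.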

Mathieu Fenniak


