[Python-Dev] Split unicodeobject.c into subfiles

Victor Stinner victor.stinner at gmail.com
Tue Oct 23 02:50:32 CEST 2012


Hi,

I forked the CPython repository to work on my "split unicodeobject.c" project: http://hg.python.org/sandbox/split-unicodeobject.c

The result is 10 files (including the existing unicodeobject.c):

   1176 Objects/unicodecharmap.c
   1678 Objects/unicodecodecs.c
   1362 Objects/unicodeformat.c
    253 Objects/unicodeimpl.h
    733 Objects/unicodelegacy.c
   1836 Objects/unicodenew.c
   2777 Objects/unicodeobject.c
   2421 Objects/unicodeoperators.c
   1235 Objects/unicodeoscodecs.c
   1288 Objects/unicodeutfcodecs.c
  14759 total

This is just a proposal (and a work in progress). Everything can be changed :-)

"unicodenew.c" is not a good name. The content of this file may be moved somewhere else.

Some files may be merged again if the separation is not justified.

I don't like the "unicode" prefix for filenames, I would prefer a new directory.

--

Shorter files are easier to review and maintain. The compilation is faster if only one file is modified.

The MBCS codec requires windows.h. The whole unicodeobject.c includes it just for this codec. With the split, only unicodeoscodecs.c includes this file.

The MBCS codec also needs a "winver" variable. This variable is defined between the BLOOM filter and the unicode_result_unchanged() function. How can you explain how these things are sorted? Where should I add a new function or variable? With the split, the variable is now defined very close to where it is used. You don't have to scroll 7000 lines to see where it is used.

If you would like to work on a specific function, you don't have to use the search function of your editor to skip thousands of lines. For example, the 18 functions and 2 types related to the charmap codec are now grouped into a single, short C file.
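To illustrate the windows.h and "winver" points, here is a minimal sketch of what an OS-specific file could look like after the split. It is not code from the sandbox repository: example_cached_acp and example_ansi_code_page() are hypothetical names, and GetACP() stands in for whatever the real MBCS codec needs from windows.h.

```c
/* Hypothetical stand-in for a file like unicodeoscodecs.c.
   All "example_" names are illustrative, not CPython identifiers. */
#include <assert.h>

#ifdef _WIN32
#  include <windows.h>   /* only this file pays for the include */

/* File-local state, defined right next to its single user. */
static unsigned int example_cached_acp = 0;
#endif

/* Return the ANSI code page on Windows, or -1 on platforms
   where no such OS codec exists. */
int
example_ansi_code_page(void)
{
#ifdef _WIN32
    if (example_cached_acp == 0) {
        example_cached_acp = GetACP();
        assert(example_cached_acp != 0);
    }
    return (int)example_cached_acp;
#else
    return -1;
#endif
}
```

The other files of the split never see windows.h, and a reader looking for the cached variable finds it a few lines above the function that uses it.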

It was already possible to extend and maintain unicodeobject.c (some people proved it!), but it should now be much simpler with shorter files.

Note: unicodeobject.c also includes the huge stringlib library (4000 lines), which is shared with the bytes type.

--

unicodeimpl.h: Private macros and prototypes of private functions. Many unicode_xxx() functions have been renamed to _PyUnicode_xxx() so that they can be reused in different files.

unicodenew.c: Functions to create a new Unicode string (PyUnicode_New), convert from/to UCS4 and wchar_t*, resize a string. The ugly part of PEP 393.

unicodeoperators.c: find, replace, compare, split, fill, etc.

unicodeobject.c: "str" type with all methods, _string module and unicodeiter type.

unicodeformat.c: PyUnicode_FromFormat() and PyUnicode_Format().

unicodecodecs.c: Text codecs for Python Unicode strings.

unicodecharmap.c: Character Mapping Codec.

unicodeoscodecs.c: Operating system codecs: MBCS codec, locale (FS) codec => FS encode/decode.

unicodeutfcodecs.c: UTF-7/8/16/32 codecs and ASCII decoder.

unicodelegacy.c: Legacy and deprecated Unicode API: Py_UNICODE type.
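The unicode_xxx() to _PyUnicode_xxx() renaming follows from how C linkage works: a static function is visible only in its own file, so a helper shared by several files has to lose "static" and gain a prototype in a private header. The sketch below shows the pattern in a single translation unit, with comments marking where each part would live. ExamplePrivate_CountAscii() and example_is_ascii() are hypothetical names, not functions from the sandbox repository.

```c
#include <assert.h>
#include <stddef.h>

/* --- would live in a private header such as unicodeimpl.h --- */
/* Prototype shared by every file of the split (hypothetical name;
   CPython would use a _PyUnicode_ prefix here). */
size_t ExamplePrivate_CountAscii(const char *s, size_t len);

/* --- would live in one .c file --- */
/* Formerly "static"; now has external linkage so other files can call it. */
size_t
ExamplePrivate_CountAscii(const char *s, size_t len)
{
    size_t i, n = 0;
    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((unsigned char)s[i] < 0x80)
            n++;
    }
    return n;
}

/* --- would live in another .c file that includes the header --- */
/* Return 1 if all bytes are ASCII, 0 otherwise. */
int
example_is_ascii(const char *s, size_t len)
{
    return ExamplePrivate_CountAscii(s, len) == len;
}
```

The prefix matters once the function has external linkage: the name is then visible to the linker across the whole library, so it needs to be namespaced to avoid clashes.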

Victor


