[Python-Dev] PEP 3147 ready for pronouncement and merging

Barry Warsaw barry at python.org
Sat Apr 17 01:00:30 CEST 2010


On Apr 15, 2010, at 08:01 PM, Guido van Rossum wrote:

>> Byte code files contain two 32-bit numbers followed by the marshaled
>
> big-endian

Done.

>> [2] code object.  The 32-bit numbers represent a magic number and a timestamp.  The magic number changes whenever Python changes the byte code format, e.g. by adding new byte codes to its virtual machine. This ensures that pyc files built for previous versions of the VM won't cause problems.  The timestamp is used to make sure that the pyc file is not older than the py file that was used to create it.  When
>
> is not older than -> matches (Obscure fact: the timestamp in the pyc file must match the source's mtime exactly.)

Done.
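
To make the "matches" wording concrete, here is a minimal sketch of the check as this exchange describes it: a cached file is only usable when its magic number equals the running interpreter's and its stored timestamp equals the source's mtime exactly. The function name, its arguments, and the sample values are hypothetical, and parsing the header out of a real pyc file is deliberately left out; this is not CPython's import code.

```python
def pyc_is_usable(pyc_magic, pyc_mtime, current_magic, source_mtime):
    """Return True when a cached header matches both interpreter and source.

    pyc_magic / pyc_mtime are the two values stored at the start of the pyc;
    current_magic is the interpreter's magic number and source_mtime is the
    modification time of the .py file (all hypothetical inputs).
    """
    if pyc_magic != current_magic:
        # Compiled for a different byte code format: fall back to the source.
        return False
    # Equality, not "not older than": a pyc stamped later than the source
    # is rejected just like one stamped earlier.
    return pyc_mtime == int(source_mtime)


magic = b"\x58\x0c\r\n"  # sample value only
same = pyc_is_usable(magic, 1271458830, magic, 1271458830)
newer_pyc = pyc_is_usable(magic, 1271458831, magic, 1271458830)
other_vm = pyc_is_usable(b"\xd1\xf2\r\n", 1271458830, magic, 1271458830)
print(same, newer_pyc, other_vm)
```

The second call is the interesting one: a "not older than" test would accept it, while an exact match does not.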

>> Rationale
>> =========
>>
>> Linux distributions such as Ubuntu [4] and Debian [5] provide more than one Python version at the same time to their users.  For example, Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1, with Python 2.6 being the default. This causes a conflict for Python source files installed by the system (including third party packages), because you cannot compile a
>
> I'd say only 3rd party packages right? (And code written by the distro, which from Python's POV is also 3rd party.) At least ought to clarify that the stdlib is unaffected by this conflict, because multiple versions of the stdlib are installed.

Yes, good point. Clarified.

>> single Python source file for more than one Python version at a time. Thus if your system wanted to install a /usr/share/python/foo.py, it could not create a /usr/share/python/foo.pyc file usable across all installed Python versions.
>
> Note that (due to the magic#) Python doesn't crash, it just falls back on the slower approach of compiling from source. Perhaps more important is that different Python versions (if the user has write permission) will fight over the pyc file and rewrite it each time the source is compiled. Worse, even though the magic# is initially written as zero and then rewritten with the correct value, concurrent processes running different Python versions can actually end up reading corrupt bytecode. (Alex Martelli diagnosed this at Google years ago.)

Good point; I've made this more clear.

>> Furthermore, in order to ease the burden on operating system packagers for these distributions, the distribution packages do not contain Python version numbers [6]; they are shared across all Python versions installed on the system.  Putting Python version numbers in the packages would be a maintenance nightmare, since all the packages - and their dependencies - would have to be updated every time a new Python release was added or removed from the distribution.  Because of the sheer number of packages available, this amount of work is infeasible.

>> C extensions can be source compatible across multiple versions of Python.  Compiled extension modules are usually not compatible though,
>
> Actually we typically make every effort to support backwards compatibility for compiled modules, and the module initialization API contains a version# check. This is a different version# than the import magic# and historically has changed much less frequently.

I've rewritten this paragraph a bit. It's not particularly relevant to this PEP. (I'll be looking at PEP 384 soon.)

>> and PEP 384 [7] has been proposed to address this by defining a stable ABI for extension modules.

>> Proposal
>> ========
>>
>> Python's import machinery is extended to write and search for byte code cache files in a single directory inside every Python package directory.  This directory will be called __pycache__. Further, pyc files will contain a magic string that differentiates the
>
> Clarify that the magic string is in the filename, not in the file contents.

Yep.

>> Python version they were compiled for.  This allows multiple byte compiled cache files to co-exist for a single Python source file.

>> This scheme has the added benefit of reducing the clutter in a Python package directory. When a Python source file is imported for the first time, a __pycache__ directory will be created in the package directory, if
>
> Is this still true? ISTR there was a lot of discussion about the auto-creation and possible security concerns.

It is still true. I think we determined it will usually not be an issue because the umask will not be altered, and because normal installation procedures typically involve byte compilation (and thus __pycache__ creation) during installation time via tools like compileall. This really is describing what happens when you run Python over pure Python source code for the first time, and it's no different from what happens now with the automatic creation of pyc files.

>> one does not already exist.  The pyc file for the imported source will be written to the __pycache__ directory, using the magic-tag
>
> By now the magic-tag format should have been defined (or a "see below" inserted).

Based on this and your following comment, I've moved the description of the magic tag format to here, and rewritten it to fit in context. The section discussing the hexadecimal representation is moved to the (rejected) "Alternatives" section.

>> Case 1: The first import
>> ------------------------
>>
>> When Python is asked to import module foo, it searches for a foo.py file (or foo package, but that's not important for this discussion) along its sys.path.  When Python locates the foo.py file it will look for a __pycache__ directory in the directory where it found the foo.py.  If the __pycache__ directory is missing, Python will create it.  Then it will parse and byte compile the foo.py file and save the byte code in __pycache__/foo.<magic>.pyc, where <magic> is defined by the Python implementation, but will be a human readable string such as cpython-32.
>
> (Aside: at first I read this as a description of the full algorithm. But there is a step missing -- the __pycache__/foo.<magic>.pyc file is searched and not found.)

I added a Case 0 for the "steady state" which should clarify this.
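
The file naming in Case 1 can be sketched in a few lines. This is only an illustration of the layout, assuming POSIX-style paths; the helper name `pyc_path` is hypothetical, and the magic tag is passed in by hand rather than taken from the interpreter.

```python
import posixpath


def pyc_path(source_path, magic_tag):
    """Map a source file to its cache file under the __pycache__ layout.

    Hypothetical helper: it only builds the path and creates nothing.
    """
    directory, filename = posixpath.split(source_path)
    module_name, _extension = posixpath.splitext(filename)
    cache_name = "{}.{}.pyc".format(module_name, magic_tag)
    return posixpath.join(directory, "__pycache__", cache_name)


first = pyc_path("/usr/share/python/foo.py", "cpython-32")
second = pyc_path("/usr/share/python/foo.py", "cpython-31")
print(first)   # /usr/share/python/__pycache__/foo.cpython-32.pyc
print(second)  # /usr/share/python/__pycache__/foo.cpython-31.pyc
```

Two interpreters with different tags get different file names in the same directory, which is the point of the scheme. Later CPython releases expose this mapping as `importlib.util.cache_from_source()`, which is the thing to call in real code.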

>> Magic identifiers
>> =================
>>
>> pyc files inside of the __pycache__ directories contain a magic identifier in their file names.  These are mnemonic tags for the actual magic numbers used by the importer.  For example, in Python 3.2, we could use the hexlified [10] magic number as a unique
>
> (Aside: when you search Wikipedia for "hexlify" it says "did you mean: heavily?" :-)

:) Emacs is where I first encountered this term, e.g. M-x hexlify-buffer. It got carried over to the binascii module. But in this case "hexadecimal representation of the binary magic number" is probably a better term to use.

>> identifier::
>>
>>     >>> from binascii import hexlify
>>     >>> from imp import get_magic
>>     >>> 'foo.{}.pyc'.format(hexlify(get_magic()).decode('ascii'))
>>     'foo.580c0d0a.pyc'
>>
>> This isn't particularly human friendly though.  Instead, this PEP
>
> This section reads a bit weird -- first it describes the solution we didn't pick. I'd move that to a "Alternatives Considered and Rejected" section or some such.

Agreed; see above.
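
For comparison, the two candidate file names can be printed side by side. This sketch assumes a later CPython release in which `importlib.util.MAGIC_NUMBER` and `sys.implementation.cache_tag` are available; neither name appears in this thread, and the variable names are only illustrative.

```python
import importlib.util
import sys

# Rejected alternative: the hexadecimal representation of the magic number.
hex_name = "foo.{}.pyc".format(importlib.util.MAGIC_NUMBER.hex())

# The proposal: a mnemonic tag naming the implementation and version.
tag_name = "foo.{}.pyc".format(sys.implementation.cache_tag)

print(hex_name)
print(tag_name)
```

The exact strings depend on the interpreter running the snippet, so no output is shown here.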

>> proposes a magic tag that uniquely defines .pyc files for the current version of Python.  Whenever the magic number is bumped, a new magic tag is defined which is unique among all versions and implementations of Python.  The actual contents of the magic tag is left up to the implementation, although it is recommended that the tag include the implementation name and a version shorthand.  In general, magic numbers never change between Python micro releases, but the convention can be extended to handle magic number changes between pre-release development versions.

>> For example, CPython 3.2 would have a magic tag of cpython-32 and write pyc files like this: foo.cpython-32.pyc.  When the -O flag is used, it would write foo.cpython-32.pyo.  For backports of this feature to Python 2, when the -U flag is used, a file such as foo.cpython-27u.pyc can be written.
>
> Does all of this match the implementation?

Yes. Well, except for the -U part, since I haven't backported this to Python 2... yet :).

>> Implementation strategy
>> =======================
>>
>> This feature is targeted for Python 3.2, solving the problem for those and all future versions.  It may be back-ported to Python 2.7.
>
> Is there time given that 2.7b1 was released?

See my previous response.

>> This PEP proposes the addition of an __cached__ attribute to modules, which will always point to the actual pyc file that was read or written.  When the environment variable $PYTHONDONTWRITEBYTECODE is set, or the -B option is given, or if the source lives on a read-only filesystem, then the __cached__ attribute will point to the location that the pyc file would have been written to if it didn't exist.  This location of course includes the __pycache__ subdirectory in its path.
>
> Hm. I wish there was a way to find out whether the bytecode (or whatever) actually was read from this file. __file__ in Python 2 supports this (though not in Python 3).

Do you have a use case for that? It might be interesting to know, but I can't think of a good way to infer that from __file__ and __cached__, or of a good way to expose that on module objects. Of course, it would be totally Python implementation dependent too.
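
The "would have been written to" behaviour can be observed without importing anything. The sketch below assumes a later CPython release where `importlib.util.spec_from_file_location()` and the `cached` attribute of module specs exist; neither is part of this thread, and `cached_location` and `demo_mod` are hypothetical names.

```python
import importlib.util
import os
import tempfile


def cached_location(source_path):
    """Return where byte code for source_path would be cached.

    Only a module spec is built; nothing is compiled or written.
    """
    spec = importlib.util.spec_from_file_location("demo_mod", source_path)
    return spec.cached


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "demo_mod.py")
    with open(source, "w") as handle:
        handle.write("x = 1\n")
    location = cached_location(source)
    # The location is reported even though no pyc file has been written.
    exists = os.path.exists(location)

print(location, exists)
```

This shows a path being reported while the file is absent, which is also why the path alone cannot tell you whether byte code was actually read from it.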

>> Backports
>> ---------
>>
>> For versions of Python earlier than 3.2 (and possibly 2.7), it is possible to backport this PEP.  However, in Python 3.2 (and possibly 2.7), this behavior will be turned on by default, and in fact, it will replace the old behavior.  Backports will need to support the old layout by default.  We suggest supporting PEP 3147 through the use of an environment variable called $PYTHONENABLECACHEDIR or the command line switch -Xenablecachedir to enable the feature.
>
> I would be okay if a distro decided to turn it on by default, as long as there was a way to opt out.

For Python 2.6, even for a distro-specific backport, I think I'd want to enable it only with a switch. It might be weird for example if Python 2.6 in Ubuntu 10.04 produced traditional pyc files, but Python 2.6 in Ubuntu 10.10 produced PEP 3147 file names. For a backport to Python 2.7 though (which e.g. would be new in Ubuntu 10.10), it might make sense to enable it by default.

Either way, we're really talking about the effects on user code only. We'll definitely enable it as part of the package installation tools.

Thanks again Guido. I think this hits all your feedback. Now to land the code!

-Barry