PEP 711: PyBI: a standard format for distributing Python Binaries (original) (raw)

Hey all, finally got around to posting this properly!

If anyone else is excited about making this real, I could very much use some help with two things:

Anyway: PEP text is available at PEP 711 – PyBI: a standard format for distributing Python Binaries | peps.python.org, or here it is inline so you can use Discourse quoting to comment on particular parts.


PEP 711 – PyBI: a standard format for distributing Python Binaries

Abstract

“Like wheels, but instead of a pre-built python package, it’s a pre-built python interpreter”

Motivation

End goal: Pypi.org has pre-built packages for all Python versions on all popular platforms, so automated tools can easily grab any of them and set it up. It becomes quick and easy to try Python prereleases, pin Python versions in CI, make a temporary environment to reproduce a bug report that only happens on a specific Python point release, etc.

First step (this PEP): define a standard packaging file format to hold pre-built Python interpreters, that reuses existing Python packaging standards as much as possible.

Examples

Example pybi builds are available at pybi.vorpus.org. They’re zip files, so you can unpack them and poke around inside if you want to get a feel for how they’re laid out.

You can also look at the tooling I used to create them.

Specification

Filename

Filename: {distribution}-{version}[-{build tag}]-{platform tag}.pybi

This matches the wheel file format defined in PEP 427, except dropping the {python tag} and {abi tag} and changing the extension from .whl to .pybi.

For example:

cpython-3.9.5-macosx_11_0_universal2.pybi

Just like for wheels, if a pybi supports multiple platforms, you can separate them by dots to make a “compressed tag set”:

cpython-3.9.5-macosx_11_0_x86_64.macosx_11_0_arm64.pybi

(Though in practice this probably won’t be used much, e.g. the above filename is more idiomatically written as cpython-3.9.5-macosx_11_0_universal2.pybi.)
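To make the filename layout concrete, here's a small sketch of parsing it with a regular expression. (The pattern and `PYBI_NAME` are illustrative, not part of the spec; as with wheels, build tags are assumed to start with a digit, and platform tags never contain a dash.)

```python
import re

# {distribution}-{version}[-{build tag}]-{platform tag}.pybi
PYBI_NAME = re.compile(
    r"^(?P<distribution>[^-]+)-(?P<version>[^-]+)"
    r"(?:-(?P<build>\d[^-]*))?"           # optional build tag, starts with a digit
    r"-(?P<platforms>[^-]+)\.pybi$"       # may be a dot-separated compressed tag set
)

m = PYBI_NAME.match("cpython-3.9.5-macosx_11_0_universal2.pybi")
```

A compressed tag set like `macosx_11_0_x86_64.macosx_11_0_arm64` comes out as the single `platforms` group, which a consumer would then split on `.`.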

File contents

A .pybi file is a zip file that can be unpacked directly into an arbitrary location and then used as a self-contained Python environment. There’s no .data directory or install scheme keys, because the Python environment knows which install scheme it’s using, so it can just put things in the right places to start with.

The “arbitrary location” part is important: the pybi can’t contain any hardcoded absolute paths. In particular, any preinstalled scripts MUST NOT embed absolute paths in their shebang lines.
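As a sketch of how a pybi builder might enforce the shebang rule, here's a check that flags scripts whose `#!` line points at an absolute path (`has_absolute_shebang` is a hypothetical helper, not part of the spec):

```python
def has_absolute_shebang(script: bytes) -> bool:
    # Relocatable pybis can't ship scripts with shebangs like
    # "#!/usr/bin/python3"; flag any "#!" line naming an absolute path.
    if not script.startswith(b"#!"):
        return False
    interpreter = script[2:].split(b"\n", 1)[0].strip()
    return interpreter.startswith(b"/")
```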

Similar to wheels’ <package>-<version>.dist-info directory, the pybi archive must contain a top-level directory named pybi-info/. (Rationale: calling it pybi-info instead of dist-info makes sure that tools don’t get confused about which kind of metadata they’re looking at; leaving off the {name}-{version} part is fine because only one pybi can be installed into a given directory.) The pybi-info/ directory contains at least the following files:

Pybi-Version: 1.0  
Generator: {name} {version}  
Tag: {platform tag}  
Tag: {another platform tag}  
Tag: {...and so on...}  
Build: 1   # optional  

There are also some new required keys, described below.

Pybi-specific core metadata

Here’s an example of the new METADATA fields, before we give the full details:

Pybi-Environment-Marker-Variables: {"implementation_name": "cpython", "implementation_version": "3.10.8", "os_name": "posix", "platform_machine": "x86_64", "platform_system": "Linux", "python_full_version": "3.10.8", "platform_python_implementation": "CPython", "python_version": "3.10", "sys_platform": "linux"}
Pybi-Paths: {"stdlib": "lib/python3.10", "platstdlib": "lib/python3.10", "purelib": "lib/python3.10/site-packages", "platlib": "lib/python3.10/site-packages", "include": "include/python3.10", "platinclude": "include/python3.10", "scripts": "bin", "data": "."}
Pybi-Wheel-Tag: cp310-cp310-PLATFORM
Pybi-Wheel-Tag: cp310-abi3-PLATFORM
Pybi-Wheel-Tag: cp310-none-PLATFORM
Pybi-Wheel-Tag: cp39-abi3-PLATFORM
Pybi-Wheel-Tag: cp38-abi3-PLATFORM
Pybi-Wheel-Tag: cp37-abi3-PLATFORM
Pybi-Wheel-Tag: cp36-abi3-PLATFORM
Pybi-Wheel-Tag: cp35-abi3-PLATFORM
Pybi-Wheel-Tag: cp34-abi3-PLATFORM
Pybi-Wheel-Tag: cp33-abi3-PLATFORM
Pybi-Wheel-Tag: cp32-abi3-PLATFORM
Pybi-Wheel-Tag: py310-none-PLATFORM
Pybi-Wheel-Tag: py3-none-PLATFORM
Pybi-Wheel-Tag: py39-none-PLATFORM
Pybi-Wheel-Tag: py38-none-PLATFORM
Pybi-Wheel-Tag: py37-none-PLATFORM
Pybi-Wheel-Tag: py36-none-PLATFORM
Pybi-Wheel-Tag: py35-none-PLATFORM
Pybi-Wheel-Tag: py34-none-PLATFORM
Pybi-Wheel-Tag: py33-none-PLATFORM
Pybi-Wheel-Tag: py32-none-PLATFORM
Pybi-Wheel-Tag: py31-none-PLATFORM
Pybi-Wheel-Tag: py30-none-PLATFORM
Pybi-Wheel-Tag: py310-none-any
Pybi-Wheel-Tag: py3-none-any
Pybi-Wheel-Tag: py39-none-any
Pybi-Wheel-Tag: py38-none-any
Pybi-Wheel-Tag: py37-none-any
Pybi-Wheel-Tag: py36-none-any
Pybi-Wheel-Tag: py35-none-any
Pybi-Wheel-Tag: py34-none-any
Pybi-Wheel-Tag: py33-none-any
Pybi-Wheel-Tag: py32-none-any
Pybi-Wheel-Tag: py31-none-any
Pybi-Wheel-Tag: py30-none-any
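As a quick illustration of how an installer might consume Pybi-Paths: since every value is relative, they can simply be joined onto wherever the pybi was unpacked. (`resolve_paths` is a hypothetical helper for this sketch, not part of the spec.)

```python
import posixpath

def resolve_paths(install_root: str, pybi_paths: dict) -> dict:
    # Join each relative install-scheme path onto the unpack location;
    # normpath collapses the "." used for the "data" key.
    return {
        key: posixpath.normpath(posixpath.join(install_root, rel))
        for key, rel in pybi_paths.items()
    }
```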

Specification:

To handle the PLATFORM placeholder, the installer needs to somehow understand that a manylinux_2_12_x86_64 pybi can use a manylinux_2_17_x86_64 wheel as long as those are both valid tags on the current machine, but a win32 pybi can’t use a win_amd64 wheel, even if those are both valid tags on the current machine.
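The substitution step itself is mechanical. Here's a sketch, assuming the installer has already computed which concrete platform tags are valid for both the pybi and the current machine (`expand_wheel_tags` is a hypothetical helper, not a specified API):

```python
def expand_wheel_tags(pybi_wheel_tags, platform_tags):
    # Replace the PLATFORM placeholder with each concrete platform tag,
    # preserving the preference order of both lists; tags without the
    # placeholder (e.g. "py3-none-any") pass through unchanged.
    expanded = []
    for tag in pybi_wheel_tags:
        if tag.endswith("-PLATFORM"):
            for plat in platform_tags:
                expanded.append(tag[: -len("PLATFORM")] + plat)
        else:
            expanded.append(tag)
    return expanded
```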

You can probably generate these metadata values by running this script on the built interpreter:

import packaging.markers
import packaging.tags
import sysconfig
import os.path
import json
import sys

marker_vars = packaging.markers.default_environment()
# Delete any keys that depend on the final installation
del marker_vars["platform_release"]
del marker_vars["platform_version"]
# Darwin binaries are often multi-arch, so play it safe and
# delete the architecture marker. (Better would be to only
# do this if the pybi actually is multi-arch.)
if marker_vars["sys_platform"] == "darwin":
    del marker_vars["platform_machine"]

# Copied and tweaked version of packaging.tags.sys_tags
tags = []
interp_name = packaging.tags.interpreter_name()
if interp_name == "cp":
    tags += list(packaging.tags.cpython_tags(platforms=["xyzzy"]))
else:
    tags += list(packaging.tags.generic_tags(platforms=["xyzzy"]))

tags += list(packaging.tags.compatible_tags(platforms=["xyzzy"]))

# Gross hack: packaging.tags normalizes platforms by lowercasing them,
# so we generate the tags with a unique string and then replace it
# with our special uppercase placeholder.
str_tags = [str(t).replace("xyzzy", "PLATFORM") for t in tags]

(base_path,) = sysconfig.get_config_vars("installed_base")
# For some reason, macOS framework builds report their
# installed_base as a directory deep inside the framework.
while "Python.framework" in base_path:
    base_path = os.path.dirname(base_path)
paths = {
    key: os.path.relpath(path, base_path).replace("\\", "/")
    for (key, path) in sysconfig.get_paths().items()
}

json.dump({"marker_vars": marker_vars, "tags": str_tags, "paths": paths}, sys.stdout)

This emits a JSON dict on stdout with one entry for each piece of pybi-specific metadata: the environment marker variables, the wheel tags (with the PLATFORM placeholder), and the install paths.

Currently, symlinks are used by default in all Unix Python installs (e.g., bin/python3 -> bin/python3.9). And furthermore, symlinks are required to store macOS framework builds in .pybi files. So, unlike wheel files, we absolutely have to support symlinks in .pybi files for them to be useful at all.

The de-facto standard for representing symlinks in zip files is the Info-Zip symlink extension, which works as follows:

- the symlink’s target path is stored as if it were the file contents
- the top 4 bits of the Unix permissions field are set to 0xa, i.e. mode & 0xf000 == 0xa000
- the Unix permissions field is, in turn, stored in the top 16 bits of the entry’s “external attributes” field

So if using Python’s zipfile module, you can check whether a ZipInfo represents a symlink by doing:

(zip_info.external_attr >> 16) & 0xf000 == 0xa000

Or if using Rust’s zip crate, the equivalent check is:

fn is_symlink(zip_file: &zip::ZipFile) -> bool {
    match zip_file.unix_mode() {
        Some(mode) => mode & 0xf000 == 0xa000,
        None => false,
    }
}

If you’re on Unix, your zip and unzip commands probably understand this format already.
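Going the other direction, here's a sketch of writing such an entry with Python's zipfile module: store the target as the member's contents, and set S_IFLNK in the Unix mode, i.e. the high 16 bits of external_attr. (`add_symlink` is an illustrative helper, not a standard API.)

```python
import io
import stat
import zipfile

def add_symlink(zf: zipfile.ZipFile, name: str, target: str) -> None:
    info = zipfile.ZipInfo(name)
    info.create_system = 3  # Unix, so external_attr is read as a mode field
    info.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(info, target)  # the "contents" are the link target

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    add_symlink(zf, "bin/python3", "python3.11")
```

Reopening the archive and applying the check above to the resulting ZipInfo reports it as a symlink.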

Normally, a RECORD file lists each file + its hash + its length:

my/favorite/file,sha256=...,12345

For symlinks, we instead write:

name/of/symlink,symlink=path/to/symlink/target,

That is: we use a special “hash function” called symlink, and then store the actual symlink target as the “hash value”. And the length is left empty.

Rationale: we’re already committed to the RECORD file containing a redundant check on everything in the main archive, so for symlinks we at least need to store some kind of hash, plus some kind of flag to indicate that this is a symlink. Given that symlink target strings are roughly the same size as a hash, we might as well store them directly. This also makes the symlink information easier to access for tools that don’t understand the Info-Zip symlink extension, and makes it possible to losslessly unpack and repack a Unix pybi on a Windows system, which someone might find handy at some point.
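A tool reading RECORD can distinguish the two forms by looking at the "hash function" name. Here's a hypothetical parser sketch (the sample hash and paths are made up for illustration):

```python
import csv
import io

RECORD_TEXT = """\
bin/python3.11,sha256=aGFzaGhhc2hoYXNo,12345
bin/python3,symlink=python3.11,
"""

def parse_record(text: str):
    entries = {}
    for path, hash_field, size in csv.reader(io.StringIO(text)):
        kind, _, value = hash_field.partition("=")
        if kind == "symlink":
            # The "hash value" is actually the link target; length is empty.
            entries[path] = ("symlink", value, None)
        else:
            entries[path] = (kind, value, int(size))
    return entries
```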

When a pybi creator stores a symlink, they MUST use both of the mechanisms defined above: storing it in the zip archive directly using the Info-Zip representation, and also recording it in the RECORD file.

Pybi consumers SHOULD validate that the symlinks in the archive and RECORD file are consistent with each other.

We also considered using only the RECORD file to store symlinks, but then the vanilla unzip tool wouldn’t be able to unpack them, and that would make it hard to install a pybi from a shell script.

Limitations

Symlinks enable a lot of potential messiness. To keep things under control, we impose the following restrictions:

Unpackers MUST verify that these rules are followed, because without them attackers could create evil symlinks like foo -> /etc/passwd or foo -> ../../../../../etc + foo/passwd -> ... and cause havoc.
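A sketch of the kind of check an unpacker might apply, assuming the rules require symlink targets to be relative and to resolve inside the unpack directory (`validate_symlink` is a hypothetical helper, not a specified API):

```python
import posixpath

def validate_symlink(link_path: str, target: str) -> None:
    # Reject absolute targets like "/etc/passwd".
    if posixpath.isabs(target):
        raise ValueError(f"absolute symlink target: {target!r}")
    # Resolve the target relative to the link's directory and make sure
    # it stays inside the archive root (represented here as "").
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(link_path), target)
    )
    if resolved == ".." or resolved.startswith("../"):
        raise ValueError(f"symlink escapes archive: {link_path!r} -> {target!r}")
```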

Non-normative comments

Why not just use conda?

This isn’t really in the scope of this PEP, but since conda is a popular way to distribute binary Python interpreters, it’s a natural question.

The simple answer is: conda is great! But, there are lots of python users who aren’t conda users, and they deserve nice things too. This PEP just gives them another option.

The deeper answer is: the maintainers who upload packages to PyPI are the backbone of the Python ecosystem. They’re the first audience for Python packaging tools. And one thing they want is to upload a package once, and have it be accessible across all the different ways Python is deployed: in Debian and Fedora and Homebrew and FreeBSD, in Conda environments, in big companies’ monorepos, in Nix, in Blender plugins, in RenPy games, …… you get the idea.

All of these environments have their own tooling and strategies for managing packages and dependencies. So what’s special about PyPI and wheels is that they’re designed to describe dependencies in a standard, abstract way, that all these downstream systems can consume and convert into their local conventions. That’s why package maintainers use Python-specific metadata and upload to PyPI: because it lets them address all of those systems simultaneously. Every time you build a Python package for conda, there’s an intermediate wheel that’s generated, because wheels are the common language that Python package build systems and conda can use to talk to each other.

But then, if you’re a maintainer releasing an sdist+wheels, then you naturally want to test what you’re releasing, which may depend on arbitrary PyPI packages and versions. So you need tools that build Python environments directly from PyPI, and conda is fundamentally not designed to do that. So conda and pip are both necessary for different cases, and this proposal happens to be targeting the pip side of that equation.

Sdists (or not)

It might be cool to have an “sdist” equivalent for pybis, i.e., some kind of format for a Python source release that’s structured-enough to let tools automatically fetch and build it into a pybi, for platforms where prebuilt pybis aren’t available. But, this isn’t necessary for the MVP and opens a can of worms, so let’s worry about it later.

What packages should be bundled inside a pybi?

Pybi builders have the power to pick and choose what exactly goes inside. For example, you could include some preinstalled packages in the pybi’s site-packages directory, or prune out bits of the stdlib that you don’t want. We can’t stop you! Though if you do preinstall packages, then it’s strongly recommended to also include the correct metadata (.dist-info etc.), so that it’s possible for pip or other tools to figure out what’s going on.

For my prototype “general purpose” pybis, here’s what I chose:

Backwards Compatibility

No backwards compatibility considerations.

Security Implications

No security implications, beyond the fact that anyone who takes it upon themselves to distribute binaries has to come up with a plan to manage their security (e.g., whether they roll a new build after an OpenSSL CVE drops). But collectively, we core Python folks are already maintaining binary builds for all major platforms (macOS + Windows through python.org, and Linux builds through the official manylinux image), so even if we do start releasing official CPython builds on PyPI it doesn’t really raise any new security issues.

How to Teach This

This isn’t targeted at end-users; their experience will simply be that e.g. their pyenv or tox invocation magically gets faster and more reliable (if those projects’ maintainers decide to take advantage of this PEP).

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.