[Python-ideas] Updated PEP 432: Simplifying the CPython update sequence
Nick Coghlan ncoghlan at gmail.com
Wed Jan 2 12:40:26 CET 2013
I've updated the PEP heavily based on the previous thread and miscellaneous comments in response to checkins.
Latest version is at http://www.python.org/dev/peps/pep-0432/ and inline below.
The biggest change in the new version is moving from a Python dictionary to a C struct as the storage for the full low level interpreter configuration as Antoine suggested. The individual settings are now either C integers for the various flag values (defaulting to -1 to indicate "figure this out"), or pointers to the appropriate specific Python type (defaulting to NULL to indicate "figure this out").
I'm happy enough with the design now that I think it's worth starting to implement it before I tinker with the PEP any further.
Cheers, Nick.
================================
PEP: 432
Title: Simplifying the CPython startup sequence
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan at gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2012
Python-Version: 3.4
Post-History: 28-Dec-2012, 2-Jan-2013
Abstract
This PEP proposes a mechanism for simplifying the startup sequence for CPython, making it easier to modify the initialization behaviour of the reference interpreter executable, as well as making it easier to control CPython's startup behaviour when creating an alternate executable or embedding it as a Python execution engine inside a larger application.
Note: TBC = To Be Confirmed, TBD = To Be Determined. The appropriate resolution for most of these should become clearer as the reference implementation is developed.
Proposal
This PEP proposes that CPython move to an explicit multi-phase initialization process, where a preliminary interpreter is put in place with limited OS interaction capabilities early in the startup sequence. This essential core remains in place while all of the configuration settings are determined, until a final configuration call takes those settings and finishes bootstrapping the interpreter immediately before locating and executing the main module.
In the new design, the interpreter will move through the following well-defined phases during the startup sequence:
- Pre-Initialization - no interpreter available
- Initialization - interpreter partially available
- Initialized - full interpreter available, __main__ related metadata incomplete
- Main Execution - optional state, __main__ related metadata populated, bytecode executing in the __main__ module namespace
As a concrete use case to help guide any design changes, and to solve a known
problem where the appropriate defaults for system utilities differ from those
for running user scripts, this PEP also proposes the creation and
distribution of a separate system Python (spython) executable which, by
default, ignores user site directories and environment variables, and does
not implicitly set sys.path[0] based on the current directory or the
script being executed.
To keep the implementation complexity under control, this PEP does not propose wholesale changes to the way the interpreter state is accessed at runtime, nor does it propose changes to the way subinterpreters are created after the main interpreter has already been initialized. Changing the order in which the existing initialization steps occur in order to make the startup sequence easier to maintain is already a substantial change, and attempting to make those other changes at the same time will make the change significantly more invasive and much harder to review. However, such proposals may be suitable topics for follow-on PEPs or patches - one key benefit of this PEP is decreasing the coupling between the internal storage model and the configuration interface, so such changes should be easier once this PEP has been implemented.
Background
Over time, CPython's initialization sequence has become progressively more complicated, offering more options, as well as performing more complex tasks (such as configuring the Unicode settings for OS interfaces in Python 3 as well as bootstrapping a pure Python implementation of the import system).
Much of this complexity is accessible only through the Py_Main and
Py_Initialize APIs, offering embedding applications little opportunity
for customisation. This creeping complexity also makes life difficult for
maintainers, as much of the configuration needs to take place prior to the
Py_Initialize call, meaning much of the Python C API cannot be used
safely.
A number of proposals are on the table for even more sophisticated
startup behaviour, such as better control over sys.path initialization
(easily adding additional directories on the command line in a cross-platform
fashion, as well as controlling the configuration of sys.path[0]), easier
configuration of utilities like coverage tracing when launching Python
subprocesses, and easier control of the encoding used for the standard IO
streams when embedding CPython in a larger application.
Rather than attempting to bolt such behaviour onto an already complicated system, this PEP proposes to instead simplify the status quo first, with the aim of making these further feature requests easier to implement.
Key Concerns
There are a couple of key concerns that any change to the startup sequence needs to take into account.
Maintainability
The current CPython startup sequence is difficult to understand, and even
more difficult to modify. It is not clear what state the interpreter is in
while much of the initialization code executes, leading to behaviour such
as lists, dictionaries and Unicode values being created prior to the call
to Py_Initialize when the -X or -W options are used [1_].
By moving to an explicitly multi-phase startup sequence, developers should only need to understand which features are not available in the core bootstrapping state, as the vast majority of the configuration process will now take place in that state.
By basing the new design on a combination of C structures and Python data types, it should also be easier to modify the system in the future to add new configuration options.
Performance
CPython is used heavily to run short scripts where the runtime is dominated by the interpreter initialization time. Any changes to the startup sequence should minimise their impact on the startup overhead.
Experience with the importlib migration suggests that the startup time is dominated by IO operations. However, to monitor the impact of any changes, a simple benchmark can be used to check how long it takes to start and then tear down the interpreter::
python3 -m timeit -s "from subprocess import call" "call(['./python', '-c', 'pass'])"
Current numbers on my system for 2.7, 3.2 and 3.3 (using the 3.3 subprocess and timeit modules to execute the check, all with non-debug builds)::
# Python 2.7
$ py33/python -m timeit -s "from subprocess import call" "call(['py27/python', '-c', 'pass'])"
100 loops, best of 3: 17.8 msec per loop
# Python 3.2
$ py33/python -m timeit -s "from subprocess import call" "call(['py32/python', '-c', 'pass'])"
10 loops, best of 3: 39 msec per loop
# Python 3.3
$ py33/python -m timeit -s "from subprocess import call" "call(['py33/python', '-c', 'pass'])"
10 loops, best of 3: 25.3 msec per loop
Improvements in the import system and the Unicode support already resulted in a more than 30% improvement in startup time in Python 3.3 relative to 3.2. Python 3.3 is still slightly slower to start than Python 2.7 due to the additional infrastructure that needs to be put in place to support the Unicode based text model.
This PEP is not expected to have any significant effect on the startup time, as it is aimed primarily at reordering the existing initialization sequence, without making substantial changes to the individual steps.
However, if this simple check suggests that the proposed changes to the initialization sequence may pose a performance problem, then a more sophisticated microbenchmark will be developed to assist in investigation.
Required Configuration Settings
A comprehensive configuration scheme requires that an embedding application be able to control the following aspects of the final interpreter state:
- Whether or not to use randomised hashes (and if used, potentially specify a specific random seed)
- The "Where is Python located?" elements in the sys module:
  - sys.executable
  - sys.base_exec_prefix
  - sys.base_prefix
  - sys.exec_prefix
  - sys.prefix
- The path searched for imports from the filesystem (and other path hooks):
  - sys.path
- The command line arguments seen by the interpreter:
  - sys.argv
- The filesystem encoding used by:
  - sys.getfsencoding
  - os.fsencode
  - os.fsdecode
- The IO encoding (if any) and the buffering used by:
  - sys.stdin
  - sys.stdout
  - sys.stderr
- The initial warning system state:
  - sys.warnoptions
- Arbitrary extended options (e.g. to automatically enable faulthandler):
  - sys._xoptions
- Whether or not to implicitly cache bytecode files:
  - sys.dont_write_bytecode
- Whether or not to enforce correct case in filenames on case-insensitive platforms
  - os.environ["PYTHONCASEOK"]
- The other settings exposed to Python code in sys.flags:
  - debug (Enable debugging output in the pgen parser)
  - inspect (Enter interactive interpreter after main terminates)
  - interactive (Treat stdin as a tty)
  - optimize (debug status, write .pyc or .pyo, strip doc strings)
  - no_user_site (don't add the user site directory to sys.path)
  - no_site (don't implicitly import site during startup)
  - ignore_environment (whether environment vars are used during config)
  - verbose (enable all sorts of random output)
  - bytes_warning (warnings/errors for implicit str/bytes interaction)
  - quiet (disable banner output even if verbose is also enabled or stdin is a tty and the interpreter is launched in interactive mode)
- Whether or not CPython's signal handlers should be installed
- What code (if any) should be executed as __main__:
  - Nothing (just create an empty module)
  - A filesystem path referring to a Python script (source or bytecode)
  - A filesystem path referring to a valid sys.path entry (typically a directory or zipfile)
  - A given string (equivalent to the "-c" option)
  - A module or package (equivalent to the "-m" option)
  - Standard input as a script (i.e. a non-interactive stream)
  - Standard input as an interactive interpreter session
<TBD: Did I miss anything?>
Note that this just covers settings that are currently configurable in some manner when using the main CPython executable. While this PEP aims to make adding additional configuration settings easier in the future, it deliberately avoids adding any new settings of its own.
The Status Quo
The current mechanisms for configuring the interpreter have accumulated in a fairly ad hoc fashion over the past 20+ years, leading to a rather inconsistent interface with varying levels of documentation.
(Note: some of the info below could probably be cleaned up and added to the C API documentation - it's all CPython specific, so it doesn't belong in the language reference)
Ignoring Environment Variables
The -E command line option allows all environment variables to be
ignored when initializing the Python interpreter. An embedding application
can enable this behaviour by setting Py_IgnoreEnvironmentFlag before
calling Py_Initialize().
In the CPython source code, the Py_GETENV macro implicitly checks this
flag, and always produces NULL if it is set.
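As a rough illustration of that behaviour, the check can be modelled in a few lines of plain C. This is only a sketch: ignore_environment_flag and getenv_unless_ignored are hypothetical stand-ins for Py_IgnoreEnvironmentFlag and the Py_GETENV macro, not the CPython definitions themselves.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the Py_IgnoreEnvironmentFlag global (-E). */
static int ignore_environment_flag = 0;

/* Hypothetical helper modelling the described Py_GETENV behaviour:
 * every variable is reported as unset while the flag is non-zero. */
static const char *
getenv_unless_ignored(const char *name)
{
    assert(name != NULL);
    return ignore_environment_flag ? NULL : getenv(name);
}
```

Because every lookup goes through a single helper, a setting that bypasses it (as suspected for PYTHONCASEOK in the TBD note) would not be affected by the flag.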
<TBD: I believe PYTHONCASEOK is checked regardless of this setting > <TBD: Does -E also ignore Windows registry keys? >
Randomised Hashing
The randomised hashing is controlled via the -R command line option (in
releases prior to 3.3), as well as the PYTHONHASHSEED environment
variable.
In Python 3.3, only the environment variable remains relevant. It can be used to disable randomised hashing (by using a seed value of 0) or else to force a specific seed value (e.g. for repeatability of testing, or to share hash values between processes).
However, embedding applications must use the Py_HashRandomizationFlag
to explicitly request hash randomisation (CPython sets it in Py_Main()
rather than in Py_Initialize()).
The new configuration API should make it straightforward for an
embedding application to reuse the PYTHONHASHSEED processing with
a text based configuration setting provided by other means (e.g. a
config file or separate environment variable).
Locating Python and the standard library
The location of the Python binary and the standard library is influenced by several elements. The algorithm used to perform the calculation is not documented anywhere other than in the source code [3_,4_]. Even that description is incomplete, as it failed to be updated for the virtual environment support added in Python 3.3 (detailed in PEP 405).
These calculations are affected by the following function calls (made
prior to calling Py_Initialize()) and environment variables:
- Py_SetProgramName()
- Py_SetPythonHome()
- PYTHONHOME
The filesystem is also inspected for pyvenv.cfg files (see PEP 405) or,
failing that, a lib/os.py (Windows) or lib/python$VERSION/os.py
file.
The build time settings for PREFIX and EXEC_PREFIX are also relevant, as are some registry settings on Windows. The hardcoded fallbacks are based on the layout of the CPython source tree and build output when working in a source checkout.
Configuring sys.path
An embedding application may call Py_SetPath() prior to
Py_Initialize() to completely override the calculation of
sys.path. It is not straightforward to only allow some of the
calculations, as modifying sys.path after initialization is
already complete means those modifications will not be in effect
when standard library modules are imported during the startup sequence.
If Py_SetPath() is not used prior to the first call to Py_GetPath()
(implicit in Py_Initialize()), then it builds on the location data
calculations above to calculate suitable path entries, along with
the PYTHONPATH environment variable.
<TBD: On Windows, there's also a bunch of stuff to do with the registry>
The site module, which is implicitly imported at startup (unless
disabled via the -S option) adds additional paths to this initial
set of paths, as described in its documentation [5_].
The -s command line option can be used to exclude the user site
directory from the list of directories added. Embedding applications
can control this by setting the Py_NoUserSiteDirectory global variable.
The following commands can be used to check the default path configurations for a given Python executable on a given system:
- ./python -c "import sys, pprint; pprint.pprint(sys.path)" - standard configuration
- ./python -s -c "import sys, pprint; pprint.pprint(sys.path)" - user site directory disabled
- ./python -S -c "import sys, pprint; pprint.pprint(sys.path)" - all site path modifications disabled
(Note: you can see similar information using -m site instead of -c,
but this is slightly misleading as it calls os.path.abspath on all of the
path entries, making relative path entries look absolute. Using the site
module also causes problems in the last case, as on Python versions prior to
3.3, explicitly importing site will carry out the path modifications -S
avoids, while on 3.3+ combining -m site with -S currently fails)
The calculation of sys.path[0] is comparatively straightforward:
- For an ordinary script (Python source or compiled bytecode), sys.path[0] will be the directory containing the script.
- For a valid sys.path entry (typically a zipfile or directory), sys.path[0] will be that path
- For an interactive session, running from stdin or when using the -c or -m switches, sys.path[0] will be the empty string, which the import system interprets as allowing imports from the current directory
Configuring sys.argv
Unlike most other settings discussed in this PEP, sys.argv is not
set implicitly by Py_Initialize(). Instead, it must be set via an
explicit call to PySys_SetArgv().
CPython calls this in Py_Main() after calling Py_Initialize(). The
calculation of sys.argv[1:] is straightforward: they're the command line
arguments passed after the script name or the argument to the -c or
-m options.
The calculation of sys.argv[0] is a little more complicated:
- For an ordinary script (source or bytecode), it will be the script name
- For a sys.path entry (typically a zipfile or directory) it will initially be the zipfile or directory name, but will later be changed by the runpy module to the full path to the imported __main__ module.
- For a module specified with the -m switch, it will initially be the string "-m", but will later be changed by the runpy module to the full path to the executed module.
- For a package specified with the -m switch, it will initially be the string "-m", but will later be changed by the runpy module to the full path to the executed __main__ submodule of the package.
- For a command executed with -c, it will be the string "-c"
- For explicitly requested input from stdin, it will be the string "-"
- Otherwise, it will be the empty string
Embedding applications must call PySys_SetArgv themselves. The CPython logic
for doing so is part of Py_Main() and is not exposed separately.
However, the runpy module does provide roughly equivalent logic in
runpy.run_module and runpy.run_path.
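The initial sys.argv[0] rules listed in this section can be summarised as a small lookup. The following is a minimal sketch, not CPython's code: the main_kind enum and initial_argv0 function are hypothetical names, and the later rewriting performed by runpy is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of how __main__ was requested. */
typedef enum {
    MAIN_SCRIPT,         /* ordinary script (source or bytecode) */
    MAIN_PATH_ENTRY,     /* zipfile or directory */
    MAIN_MODULE,         /* -m switch (module or package) */
    MAIN_COMMAND,        /* -c switch */
    MAIN_STDIN_EXPLICIT, /* "-" given on the command line */
    MAIN_NONE            /* anything else */
} main_kind;

/* Returns the *initial* value of sys.argv[0]; runpy may replace it later. */
static const char *
initial_argv0(main_kind kind, const char *target)
{
    switch (kind) {
    case MAIN_SCRIPT:
    case MAIN_PATH_ENTRY:
        assert(target != NULL);
        return target;
    case MAIN_MODULE:
        return "-m";
    case MAIN_COMMAND:
        return "-c";
    case MAIN_STDIN_EXPLICIT:
        return "-";
    default:
        return "";
    }
}
```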
Other configuration settings
TBD: Cover the initialization of the following in more detail:
- The initial warning system state:
  - sys.warnoptions
  - (-W option, PYTHONWARNINGS)
- Arbitrary extended options (e.g. to automatically enable faulthandler):
  - sys._xoptions
  - (-X option)
- The filesystem encoding used by:
  - sys.getfsencoding
  - os.fsencode
  - os.fsdecode
- The IO encoding and buffering used by:
  - sys.stdin
  - sys.stdout
  - sys.stderr
  - (-u option, PYTHONIOENCODING, PYTHONUNBUFFERED)
- Whether or not to implicitly cache bytecode files:
  - sys.dont_write_bytecode
  - (-B option, PYTHONDONTWRITEBYTECODE)
- Whether or not to enforce correct case in filenames on case-insensitive platforms
  - os.environ["PYTHONCASEOK"]
- The other settings exposed to Python code in sys.flags:
  - debug (Enable debugging output in the pgen parser)
  - inspect (Enter interactive interpreter after main terminates)
  - interactive (Treat stdin as a tty)
  - optimize (debug status, write .pyc or .pyo, strip doc strings)
  - no_user_site (don't add the user site directory to sys.path)
  - no_site (don't implicitly import site during startup)
  - ignore_environment (whether environment vars are used during config)
  - verbose (enable all sorts of random output)
  - bytes_warning (warnings/errors for implicit str/bytes interaction)
  - quiet (disable banner output even if verbose is also enabled or stdin is a tty and the interpreter is launched in interactive mode)
- Whether or not CPython's signal handlers should be installed
Much of the configuration of CPython is currently handled through C level global variables::
Py_BytesWarningFlag (-b)
Py_DebugFlag (-d option)
Py_InspectFlag (-i option, PYTHONINSPECT)
Py_InteractiveFlag (property of stdin, cannot be overridden)
Py_OptimizeFlag (-O option, PYTHONOPTIMIZE)
Py_DontWriteBytecodeFlag (-B option, PYTHONDONTWRITEBYTECODE)
Py_NoUserSiteDirectory (-s option, PYTHONNOUSERSITE)
Py_NoSiteFlag (-S option)
Py_UnbufferedStdioFlag (-u, PYTHONUNBUFFERED)
Py_VerboseFlag (-v option, PYTHONVERBOSE)

For the above variables, the conversion of command line options and
environment variables to C global variables is handled by Py_Main,
so each embedding application must set those appropriately in order to
change them from their defaults.
Some configuration can only be provided as OS level environment variables::
PYTHONSTARTUP
PYTHONCASEOK
PYTHONIOENCODING

The Py_InitializeEx() API also accepts a boolean flag to indicate
whether or not CPython's signal handlers should be installed.
Finally, some interactive behaviour (such as printing the introductory banner) is triggered only when standard input is reported as a terminal connection by the operating system.
TBD: Document how the "-x" option is handled (skips processing of the first comment line in the main script)
Also see detailed sequence of operations notes at [1_]
Design Details
(Note: details here are still very much in flux, but preliminary feedback is appreciated anyway)
The main theme of this proposal is to create the interpreter state for the main interpreter much earlier in the startup process. This will allow most of the CPython API to be used during the remainder of the initialization process, potentially simplifying a number of operations that currently need to rely on basic C functionality rather than being able to use the richer data structures provided by the CPython C API.
In the following, the term "embedding application" also covers the standard CPython command line application.
Interpreter Initialization Phases
Four distinct phases are proposed:
- Pre-Initialization:
  - no interpreter is available.
  - Py_IsInitializing() returns 0
  - Py_IsInitialized() returns 0
  - Py_IsRunningMain() returns 0
  - The embedding application determines the settings required to create the main interpreter and moves to the next phase by calling Py_BeginInitialization.
- Initialization:
  - the main interpreter is available, but only partially configured.
  - Py_IsInitializing() returns 1
  - Py_IsInitialized() returns 0
  - Py_IsRunningMain() returns 0
  - The embedding application determines and applies the settings required to complete the initialization process by calling Py_ReadConfiguration and Py_EndInitialization.
- Initialized:
  - the main interpreter is available and fully operational, but __main__ related metadata is incomplete.
  - Py_IsInitializing() returns 0
  - Py_IsInitialized() returns 1
  - Py_IsRunningMain() returns 0
  - Optionally, the embedding application may identify and begin executing code in the __main__ module namespace by calling Py_RunPathAsMain, Py_RunModuleAsMain or Py_RunStreamAsMain.
- Main Execution:
  - bytecode is being executed in the __main__ namespace
  - Py_IsInitializing() returns 0
  - Py_IsInitialized() returns 1
  - Py_IsRunningMain() returns 1
As indicated by the phase reporting functions, main module execution is an optional subphase of Initialized rather than a completely distinct phase.
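The phase reporting values above can be modelled as a tiny state machine. The sketch below is purely illustrative: the PEP does not say how the phase is stored, and the startup_phase enum and all sketch_* functions are hypothetical names rather than the proposed API.

```c
#include <assert.h>

/* Hypothetical phase tracker for illustration only. */
typedef enum {
    PHASE_PRE_INIT,
    PHASE_INITIALIZING,
    PHASE_INITIALIZED,
    PHASE_RUNNING_MAIN
} startup_phase;

static startup_phase current_phase = PHASE_PRE_INIT;

static int sketch_is_initializing(void)
{
    return current_phase == PHASE_INITIALIZING;
}

/* Main execution is a subphase of Initialized, so both states count. */
static int sketch_is_initialized(void)
{
    return current_phase == PHASE_INITIALIZED
        || current_phase == PHASE_RUNNING_MAIN;
}

static int sketch_is_running_main(void)
{
    return current_phase == PHASE_RUNNING_MAIN;
}

/* Each transition is only legal from the preceding phase; assert()
 * stands in for the fatal error a real implementation might raise. */
static void sketch_begin_initialization(void)
{
    assert(current_phase == PHASE_PRE_INIT);
    current_phase = PHASE_INITIALIZING;
}

static void sketch_end_initialization(void)
{
    assert(current_phase == PHASE_INITIALIZING);
    current_phase = PHASE_INITIALIZED;
}

static void sketch_run_main(void)
{
    assert(current_phase == PHASE_INITIALIZED);
    current_phase = PHASE_RUNNING_MAIN;
}
```

An embedding application with no concept of a main module would simply never make the last transition.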
All 4 phases will be used by the standard CPython interpreter and the
proposed System Python interpreter. Other embedding applications may
choose to skip the step of executing code in the __main__ namespace.
An embedding application may still continue to leave initialization almost
entirely under CPython's control by using the existing Py_Initialize
API. Alternatively, if an embedding application wants greater control
over CPython's initial state, it will be able to use the new, finer
grained API::
/* Phase 1: Pre-Initialization */
Py_CoreConfig core_config = Py_CoreConfig_INIT;
Py_Config config = Py_Config_INIT;
/* Easily control the core configuration */
core_config.ignore_environment = 1; /* Ignore environment variables */
core_config.use_hash_seed = 0; /* Full hash randomisation */
Py_BeginInitialization(&core_config);
/* Phase 2: Initialization */
/* Optionally preconfigure some settings here - they will then be
* used to derive other settings */
Py_ReadConfiguration(&config);
/* Can completely override derived settings here */
Py_EndInitialization(&config);
/* Phase 3: Initialized */
/* If an embedding application has no real concept of a main module
 * it can leave the interpreter in this state indefinitely.
 * Otherwise, it can launch __main__ via the Py_Run*AsMain functions.
 */

Pre-Initialization Phase
The pre-initialization phase is where an embedding application determines the settings which are absolutely required before the interpreter can be initialized at all. Currently, the only configuration settings in this category are those related to the randomised hash algorithm - the hash algorithms must be consistent for the lifetime of the process, and so they must be in place before the core interpreter is created.
The specific settings needed are a flag indicating whether or not to use a
specific seed value for the randomised hashes, and if so, the specific value
for the seed (a seed value of zero disables randomised hashing). In addition,
due to the possible use of PYTHONHASHSEED in configuring the hash
randomisation, the question of whether or not to consider environment
variables must also be addressed early.
The proposed API for this step in the startup sequence is::
void Py_BeginInitialization(const Py_CoreConfig *config);

Like Py_Initialize, this part of the new API treats initialization failures as fatal errors. While that's still not particularly embedding friendly, the operations in this step really shouldn't be failing, and changing them to return error codes instead of aborting would be an even larger task than the one already being proposed.
The new Py_CoreConfig struct holds the settings required for preliminary
configuration::
/* Note: if changing anything in Py_CoreConfig, also update
* Py_CoreConfig_INIT */
typedef struct {
int ignore_environment; /* -E switch */
int use_hash_seed; /* PYTHONHASHSEED */
unsigned long hash_seed; /* PYTHONHASHSEED */
} Py_CoreConfig;
#define Py_CoreConfig_INIT {0, -1, 0}

The core configuration settings pointer may be NULL, in which case the
default values are ignore_environment = 0 and use_hash_seed = -1.
The Py_CoreConfig_INIT macro is designed to allow easy initialization
of a struct instance with sensible defaults::
Py_CoreConfig core_config = Py_CoreConfig_INIT;

ignore_environment controls the processing of all Python related
environment variables. If the flag is zero, then environment variables are
processed normally. Otherwise, all Python-specific environment variables
are considered undefined (exceptions may be made for some OS specific
environment variables, such as those used on Mac OS X to communicate
between the App bundle and the main Python binary).
use_hash_seed controls the configuration of the randomised hash
algorithm. If it is zero, then randomised hashes with a random seed will
be used. If it is positive, then the value in hash_seed will be used
to seed the random number generator. If the hash_seed is zero in this
case, then the randomised hashing is disabled completely.
If use_hash_seed is negative (and ignore_environment is zero),
then CPython will inspect the PYTHONHASHSEED environment variable. If it
is not set, is set to the empty string, or to the value "random", then
randomised hashes with a random seed will be used. If it is set to the string
"0" the randomised hashing will be disabled. Otherwise, the hash seed is
expected to be a string representation of an integer in the range
[0; 4294967295].
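The decision rules for use_hash_seed described above can be written out as a single function. This is a hedged sketch rather than the proposed implementation: hash_mode, core_config_sketch and resolve_hash_mode are hypothetical names, env_seed stands for the raw PYTHONHASHSEED text (NULL if unset) and is assumed to be already validated, and treating a negative use_hash_seed with ignore_environment set as "random seed" is an assumption based on the variable being considered undefined in that case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { HASH_RANDOM_SEED, HASH_FIXED_SEED, HASH_DISABLED } hash_mode;

/* Hypothetical mirror of the three Py_CoreConfig fields. */
typedef struct {
    int ignore_environment;
    int use_hash_seed;
    unsigned long hash_seed;
} core_config_sketch;

static hash_mode
resolve_hash_mode(const core_config_sketch *cfg, const char *env_seed)
{
    assert(cfg != NULL);
    if (cfg->use_hash_seed == 0)
        return HASH_RANDOM_SEED;
    if (cfg->use_hash_seed > 0)
        return cfg->hash_seed == 0 ? HASH_DISABLED : HASH_FIXED_SEED;
    /* Negative: defer to the environment unless told to ignore it. */
    if (cfg->ignore_environment || env_seed == NULL
            || *env_seed == '\0' || strcmp(env_seed, "random") == 0)
        return HASH_RANDOM_SEED;
    /* Assumes env_seed is a valid decimal integer at this point. */
    return strtoul(env_seed, NULL, 10) == 0 ? HASH_DISABLED : HASH_FIXED_SEED;
}
```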
To make it easier for embedding applications to use the PYTHONHASHSEED
processing with a different data source, the following helper function
will be added to the C API::
int Py_ReadHashSeed(char *seed_text,
                    int *use_hash_seed,
                    unsigned long *hash_seed);

This function accepts a seed string in seed_text and converts it to
the appropriate flag and seed values. If seed_text is NULL,
the empty string or the value "random", both use_hash_seed and
hash_seed will be set to zero. Otherwise, use_hash_seed will be set to
1 and the seed text will be interpreted as an integer and reported as
hash_seed. On success the function will return zero. A non-zero return
value indicates an error (most likely in the conversion to an integer).
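To make the described contract concrete, here is one way the conversion could be written in portable C. It is only a sketch under stated assumptions: read_hash_seed_sketch is a hypothetical name, not the proposed Py_ReadHashSeed itself, and rejecting signs, whitespace and values above 4294967295 is one reading of the range given earlier.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Returns 0 on success, non-zero if seed_text is not a valid seed. */
static int
read_hash_seed_sketch(const char *seed_text,
                      int *use_hash_seed,
                      unsigned long *hash_seed)
{
    char *end;
    unsigned long value;

    assert(use_hash_seed != NULL && hash_seed != NULL);
    if (seed_text == NULL || *seed_text == '\0'
            || strcmp(seed_text, "random") == 0) {
        *use_hash_seed = 0;
        *hash_seed = 0;
        return 0;
    }
    /* strtoul() accepts leading whitespace and signs; reject them here. */
    if (seed_text[0] < '0' || seed_text[0] > '9')
        return -1;
    errno = 0;
    value = strtoul(seed_text, &end, 10);
    if (errno != 0 || *end != '\0' || value > 4294967295UL)
        return -1;
    *use_hash_seed = 1;
    *hash_seed = value;
    return 0;
}
```

An embedding application could feed this a value read from its own config file and copy the two outputs into the core configuration before starting initialization.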
The aim is to keep this initial level of configuration as small as possible
in order to keep the bootstrapping environment consistent across
different embedding applications. If we can create a valid interpreter state
without the setting, then the setting should go in the config struct passed
to Py_EndInitialization() rather than in the core configuration.
A new query API will allow code to determine if the interpreter is in the bootstrapping state between the creation of the interpreter state and the completion of the bulk of the initialization process::
int Py_IsInitializing();

Attempting to call Py_BeginInitialization() again when
Py_IsInitializing() or Py_IsInitialized() is true is a fatal error.
While in the initializing state, the interpreter should be fully functional except that:
- compilation is not allowed (as the parser and compiler are not yet configured properly)
- creation of subinterpreters is not allowed
- creation of additional thread states is not allowed
- The following attributes in the sys module are all either missing or None:
  - sys.path
  - sys.argv
  - sys.executable
  - sys.base_exec_prefix
  - sys.base_prefix
  - sys.exec_prefix
  - sys.prefix
  - sys.warnoptions
  - sys.flags
  - sys.dont_write_bytecode
  - sys.stdin
  - sys.stdout
- The filesystem encoding is not yet defined
- The IO encoding is not yet defined
- CPython signal handlers are not yet installed
- only builtin and frozen modules may be imported (due to above limitations)
- sys.stderr is set to a temporary IO object using unbuffered binary mode
- The warnings module is not yet initialized
- The __main__ module does not yet exist
<TBD: identify any other notable missing functionality>
The main things made available by this step will be the core Python datatypes, in particular dictionaries, lists and strings. This allows them to be used safely for all of the remaining configuration steps (unlike the status quo).
In addition, the current thread will possess a valid Python thread state, allowing any further configuration data to be stored on the interpreter object rather than in C process globals.
Any call to Py_BeginInitialization() must have a matching call to
Py_Finalize(). It is acceptable to skip calling Py_EndInitialization() in
between (e.g. if attempting to read the configuration settings fails).
Determining the remaining configuration settings
The next step in the initialization sequence is to determine the full settings needed to complete the process. No changes are made to the interpreter state at this point. The core API for this step is::
int Py_ReadConfiguration(Py_Config *config);

The config argument should be a pointer to a Py_Config struct. For any supported configuration setting already set in the struct, CPython will sanity check the supplied value, but otherwise accept it as correct.
Unlike Py_Initialize and Py_BeginInitialization, this call will raise
an exception and report an error return rather than exhibiting fatal errors
if a problem is found with the config data.
Any supported configuration setting which is not already set will be
populated appropriately. The default configuration can be overridden
entirely by setting the value before calling Py_ReadConfiguration. The
provided value will then also be used in calculating any settings derived
from that value.
Alternatively, settings may be overridden after the
Py_ReadConfiguration call (this can be useful if an embedding
application wants to adjust a setting rather than replace it completely,
such as removing sys.path[0]).
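The "only populate what is not already set" rule can be illustrated for a single integer flag. This is a minimal sketch under stated assumptions: SETTING_NOT_SET and fill_flag_from_env are hypothetical names, the -1 sentinel follows the convention described for the numeric settings, and treating any non-empty environment value as "enabled" is an assumption about how such flags are read.

```c
#include <assert.h>
#include <stdlib.h>

#define SETTING_NOT_SET (-1)

/* Populate *flag from the environment only if the caller left it unset.
 * A value supplied before the call always wins. */
static void
fill_flag_from_env(int *flag, int ignore_environment, const char *env_name)
{
    const char *value;

    assert(flag != NULL && env_name != NULL);
    if (*flag != SETTING_NOT_SET)
        return;
    value = ignore_environment ? NULL : getenv(env_name);
    *flag = (value != NULL && *value != '\0') ? 1 : 0;
}
```

Pointer-valued settings would follow the same pattern with NULL as the sentinel instead of -1.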
Supported configuration settings
The new Py_Config struct holds the settings required to complete the
interpreter configuration. All fields are either pointers to Python
data types (not set == NULL) or numeric flags (not set == -1)::
    /* Note: if changing anything in Py_Config, also update Py_Config_INIT */
    typedef struct {
        /* Argument processing */
        PyList *raw_argv;
        PyList *argv;
        PyList *warnoptions; /* -W switch, PYTHONWARNINGS */
        PyDict *xoptions; /* -X switch */
        /* Filesystem locations */
        PyUnicode *program_name;
        PyUnicode *executable;
        PyUnicode *prefix; /* PYTHONHOME */
        PyUnicode *exec_prefix; /* PYTHONHOME */
        PyUnicode *base_prefix; /* pyvenv.cfg */
        PyUnicode *base_exec_prefix; /* pyvenv.cfg */
        /* Site module */
        int no_site; /* -S switch */
        int no_user_site; /* -s switch, PYTHONNOUSERSITE */
        /* Import configuration */
        int dont_write_bytecode; /* -B switch, PYTHONDONTWRITEBYTECODE */
        int ignore_module_case; /* PYTHONCASEOK */
        PyList *import_path; /* PYTHONPATH (etc) */
        /* Standard streams */
        int use_unbuffered_io; /* -u switch, PYTHONUNBUFFEREDIO */
        PyUnicode *stdin_encoding; /* PYTHONIOENCODING */
        PyUnicode *stdin_errors; /* PYTHONIOENCODING */
        PyUnicode *stdout_encoding; /* PYTHONIOENCODING */
        PyUnicode *stdout_errors; /* PYTHONIOENCODING */
        PyUnicode *stderr_encoding; /* PYTHONIOENCODING */
        PyUnicode *stderr_errors; /* PYTHONIOENCODING */
        /* Filesystem access */
        PyUnicode *fs_encoding;
        /* Interactive interpreter */
        int stdin_is_interactive; /* Force interactive behaviour */
        int inspect_main; /* -i switch, PYTHONINSPECT */
        PyUnicode *startup_file; /* PYTHONSTARTUP */
        /* Debugging output */
        int debug_parser; /* -d switch, PYTHONDEBUG */
        int verbosity; /* -v switch */
        int suppress_banner; /* -q switch */
        /* Code generation */
        int bytes_warnings; /* -b switch */
        int optimize; /* -O switch */
        /* Signal handling */
        int install_sig_handlers;
    } Py_Config;
    /* Struct initialization is pretty ugly in C89. Avoiding this mess would
     * be the most attractive aspect of using a PyDict* instead... */
    #define _Py_ArgConfig_INIT NULL, NULL, NULL, NULL
    #define _Py_LocationConfig_INIT NULL, NULL, NULL, NULL, NULL, NULL
    #define _Py_SiteConfig_INIT -1, -1
    #define _Py_ImportConfig_INIT -1, -1, NULL
    #define _Py_StreamConfig_INIT -1, NULL, NULL, NULL, NULL, NULL, NULL
    #define _Py_FilesystemConfig_INIT NULL
    #define _Py_InteractiveConfig_INIT -1, -1, NULL
    #define _Py_DebuggingConfig_INIT -1, -1, -1
    #define _Py_CodeGenConfig_INIT -1, -1
    #define _Py_SignalConfig_INIT -1
    #define Py_Config_INIT {_Py_ArgConfig_INIT, _Py_LocationConfig_INIT, \
                            _Py_SiteConfig_INIT, _Py_ImportConfig_INIT, \
                            _Py_StreamConfig_INIT, _Py_FilesystemConfig_INIT, \
                            _Py_InteractiveConfig_INIT, \
                            _Py_DebuggingConfig_INIT, _Py_CodeGenConfig_INIT, \
                            _Py_SignalConfig_INIT}
<TBD: did I miss anything?>
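The per-group initializer pattern can be tried out on a much smaller struct. The following is only a sketch: MiniConfig, the MINI_* macros and mini_config_is_unset are hypothetical names, and plain C types stand in for the Python object pointers. It shows one macro per field group, listed in declaration order, combined into a single brace initializer.

```c
#include <stddef.h>

/* Hypothetical two-group subset of Py_Config, using plain C types. */
typedef struct {
    /* Filesystem locations */
    const char *prefix;
    const char *exec_prefix;
    /* Site module */
    int no_site;
    int no_user_site;
} MiniConfig;

/* One macro per field group, in the same order as the declarations. */
#define MINI_LOCATION_INIT NULL, NULL
#define MINI_SITE_INIT -1, -1

/* A macro spanning several lines needs a backslash on every continued line. */
#define MiniConfig_INIT {MINI_LOCATION_INIT, \
                         MINI_SITE_INIT}

/* Returns 1 while every field still holds its "figure this out" marker. */
int mini_config_is_unset(const MiniConfig *config)
{
    return config->prefix == NULL && config->exec_prefix == NULL
        && config->no_site == -1 && config->no_user_site == -1;
}
```

Because the initializer is positional, adding or reordering a field means touching the matching group macro as well, which is the maintenance cost the comment in the real definition complains about.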
Completing the interpreter initialization
The final step in the initialization process is to actually put the configuration settings into effect and finish bootstrapping the interpreter up to full operation::
    int Py_EndInitialization(const Py_Config *config);
Like Py_ReadConfiguration, this call will raise an exception and report an error return rather than exhibiting fatal errors if a problem is found with the config data.
All configuration settings are required - the configuration struct
should always be passed through Py_ReadConfiguration() to ensure it
is fully populated.
After a successful call, Py_IsInitializing() will be false, while
Py_IsInitialized() will become true. The caveats described above for the
interpreter during the initialization phase will no longer hold.
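The state change described here can be modelled as a small phase tracker. This is a sketch under stated assumptions, not the planned implementation: DemoPhase and the demo_* functions are hypothetical, and the config_ok flag stands in for validation of a fully populated configuration struct.

```c
/* Hypothetical phase tracker. How the real interpreter state records this
 * is not specified in the PEP. */
typedef enum {
    DEMO_UNINITIALIZED,   /* before initialization begins */
    DEMO_INITIALIZING,    /* core interpreter only */
    DEMO_INITIALIZED      /* after the final configuration call succeeds */
} DemoPhase;

static DemoPhase demo_phase = DEMO_UNINITIALIZED;

int demo_is_initializing(void) { return demo_phase == DEMO_INITIALIZING; }
int demo_is_initialized(void)  { return demo_phase == DEMO_INITIALIZED; }

/* Enters the initializing phase. Returns -1 if called out of order. */
int demo_begin_initialization(void)
{
    if (demo_phase != DEMO_UNINITIALIZED) {
        return -1;
    }
    demo_phase = DEMO_INITIALIZING;
    return 0;
}

/* Completes initialization. On failure the phase is left unchanged and an
 * error code is returned instead of aborting the process. */
int demo_end_initialization(int config_ok)
{
    if (demo_phase != DEMO_INITIALIZING || !config_ok) {
        return -1;
    }
    demo_phase = DEMO_INITIALIZED;
    return 0;
}
```

A failed final call leaves the tracker in the initializing phase, so the caller can correct the configuration and try again.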
However, some metadata related to the __main__ module may still be
incomplete:
- sys.argv[0] may not yet have its final value
  - it will be -m when executing a module or package with CPython
  - it will be the same as sys.path[0] rather than the location of the __main__ module when executing a valid sys.path entry (typically a zipfile or directory)
- the metadata in the __main__ module will still indicate it is a builtin module
Executing the main module
Initial thought is that hiding the various options behind a single API would make that API too complicated, so 3 separate APIs is more likely::
    Py_RunPathAsMain
    Py_RunModuleAsMain
    Py_RunStreamAsMain
Query API to indicate that sys.argv[0] is fully populated::
    Py_IsRunningMain()
Internal Storage of Configuration Data
The interpreter state will be updated to include details of the configuration
settings supplied during initialization by extending the interpreter state
object with an embedded copy of the Py_CoreConfig and Py_Config
structs.
For debugging purposes, the configuration settings will be exposed as
a sys._configuration simple namespace (similar to sys.flags and
sys.implementation). Field names will match those in the configuration
structs, except for hash_seed, which will be deliberately excluded.
These are snapshots of the initial configuration settings. They are not consulted by the interpreter during runtime.
Stable ABI
All of the APIs proposed in this PEP are excluded from the stable ABI, as embedding a Python interpreter involves a much higher degree of coupling than merely writing an extension.
Backwards Compatibility
Backwards compatibility will be preserved primarily by ensuring that Py_ReadConfiguration() interrogates all the previously defined configuration settings stored in global variables and environment variables, and that Py_EndInitialization() writes affected settings back to the relevant locations.
One acknowledged incompatibility is that some environment variables which are currently read lazily may instead be read once during interpreter initialization. As the PEP matures, these will be discussed in more detail on a case by case basis. The environment variables which are currently known to be looked up dynamically are:
- PYTHONCASEOK: writing to os.environ['PYTHONCASEOK'] will no longer dynamically alter the interpreter's handling of filename case differences on import (TBC)
- PYTHONINSPECT: os.environ['PYTHONINSPECT'] will still be checked after execution of the __main__ module terminates
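The difference between a lazy lookup and a value read once at initialization can be sketched as follows. Everything here is hypothetical: the environment is modelled by a single in-memory variable so the example stays portable, and the setting is treated as enabled whenever the variable is present.

```c
#include <stddef.h>

/* Stand-in for the PYTHONCASEOK entry in the process environment
 * (NULL == variable not set). Hypothetical, used instead of getenv()
 * so the sketch needs no platform-specific setenv(). */
static const char *demo_env_caseok = NULL;

/* Models a later write to os.environ['PYTHONCASEOK']. */
void demo_set_env_caseok(const char *value)
{
    demo_env_caseok = value;
}

/* Value captured during initialization (-1 == not read yet). */
static int demo_ignore_module_case = -1;

/* Eager style: read the variable once and remember the answer. */
void demo_read_caseok_once(void)
{
    demo_ignore_module_case = (demo_env_caseok != NULL);
}

/* Lazy style: consult the environment on every query. */
int demo_caseok_lazy(void)
{
    return demo_env_caseok != NULL;
}

/* Snapshot style: later environment changes are not observed. */
int demo_caseok_snapshot(void)
{
    return demo_ignore_module_case;
}
```

After demo_read_caseok_once() has run, a later demo_set_env_caseok("1") changes the lazy answer but not the snapshot, which is the behaviour change flagged for PYTHONCASEOK.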
The Py_Initialize() style of initialization will continue to be
supported. It will use (at least some elements of) the new API
internally, but will continue to exhibit the same behaviour as it
does today, ensuring that sys.argv is not populated until a subsequent
PySys_SetArgv call. All APIs that currently support being called
prior to Py_Initialize() will
continue to do so, and will also support being called prior to
Py_BeginInitialization().
To minimise unnecessary code churn, and to ensure the backwards compatibility is well tested, the main CPython executable may continue to use some elements of the old style initialization API. (very much TBC)
Open Questions
- Is Py_IsRunningMain() worth keeping?
- Should the answers to Py_IsInitialized() and Py_IsRunningMain() be exposed via the sys module?
- Is the Py_Config struct too unwieldy to be practical? Would a Python dictionary be a better choice?
- Would it be better to manage the flag variables in Py_Config as Python integers so the struct can be initialized with a simple memset(&config, 0, sizeof(*config))?
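The trade-off behind the last question can be sketched in plain C. PtrConfig, IntConfig and ptr_config_clear are hypothetical names, and const int pointers stand in for Python integer objects. With every field a pointer, zero-filling gives "not set" everywhere. With plain int flags, zero-filling gives 0, which cannot be told apart from a flag explicitly switched off.

```c
#include <string.h>

/* Hypothetical all-pointer variant: flags sit behind pointers (standing in
 * for Python integer objects), so NULL can mean "not set" for every field. */
typedef struct {
    const char *prefix;     /* NULL == not set */
    const int *no_site;     /* NULL == not set */
    const int *optimize;    /* NULL == not set */
} PtrConfig;

/* Marks every field as "not set" in one call. This assumes null pointers are
 * represented as all-bits-zero, which holds on mainstream platforms but is
 * not guaranteed by the C standard. */
void ptr_config_clear(PtrConfig *config)
{
    memset(config, 0, sizeof(*config));
}

/* Contrast: with int flags, zero-filling yields 0 rather than the -1 marker,
 * so an explicit initializer is still required. */
typedef struct {
    int no_site;            /* -1 == not set */
    int optimize;           /* -1 == not set */
} IntConfig;
```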
A System Python Executable
When executing system utilities with administrative access to a system, many of the default behaviours of CPython are undesirable, as they may allow untrusted code to execute with elevated privileges. The most problematic aspects are the fact that user site directories are enabled, environment variables are trusted and that the directory containing the executed file is placed at the beginning of the import path.
Currently, providing a separate executable with different default behaviour
would be prohibitively hard to maintain. One of the goals of this PEP is to
make it possible to replace much of the hard to maintain bootstrapping code
with more normal CPython code, as well as making it easier for a separate
application to make use of key components of Py_Main. Including this
change in the PEP is designed to help avoid acceptance of a design that
sounds good in theory but proves to be problematic in practice.
Cleanly supporting this kind of "alternate CLI" is the main reason for the proposed changes to better expose the core logic for deciding between the different execution modes supported by CPython:
- script execution
- directory/zipfile execution
- command execution ("-c" switch)
- module or package execution ("-m" switch)
- execution from stdin (non-interactive)
- interactive stdin
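As a rough illustration of the decision listed above, the choice of execution mode can be written as one function over already parsed options. This is a sketch only: DemoRunMode, DemoRunRequest and demo_select_mode are hypothetical, and the caller is assumed to have worked out whether the filename refers to a directory or zipfile. The relative order of the -c and -m checks is arbitrary, since in CPython either switch ends option processing and both cannot be in effect at once.

```c
#include <stddef.h>

typedef enum {
    DEMO_RUN_COMMAND,       /* "-c" switch */
    DEMO_RUN_MODULE,        /* "-m" switch */
    DEMO_RUN_SCRIPT,        /* script execution */
    DEMO_RUN_PATH_ENTRY,    /* directory/zipfile execution */
    DEMO_RUN_STDIN,         /* execution from stdin (non-interactive) */
    DEMO_RUN_INTERACTIVE    /* interactive stdin */
} DemoRunMode;

/* Hypothetical summary of the parsed command line. */
typedef struct {
    const char *command;            /* argument of -c, or NULL */
    const char *module;             /* argument of -m, or NULL */
    const char *filename;           /* script, directory or zipfile, or NULL */
    int filename_is_path_entry;     /* nonzero for a directory or zipfile */
    int stdin_is_interactive;       /* nonzero when stdin is a terminal */
} DemoRunRequest;

/* Picks the execution mode; explicit switches win over a filename, and
 * stdin is the fallback when nothing else was requested. */
DemoRunMode demo_select_mode(const DemoRunRequest *req)
{
    if (req->command != NULL) {
        return DEMO_RUN_COMMAND;
    }
    if (req->module != NULL) {
        return DEMO_RUN_MODULE;
    }
    if (req->filename != NULL) {
        return req->filename_is_path_entry ? DEMO_RUN_PATH_ENTRY
                                           : DEMO_RUN_SCRIPT;
    }
    return req->stdin_is_interactive ? DEMO_RUN_INTERACTIVE : DEMO_RUN_STDIN;
}
```

An alternate CLI could reuse a function of this shape while supplying its own, more restrictive, option parsing.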
Implementation
None as yet. Once I have a reasonably solid plan of attack, I intend to work on a reference implementation as a feature branch in my BitBucket sandbox [2_].
References
.. [1] CPython interpreter initialization notes (http://wiki.python.org/moin/CPythonInterpreterInitialization)
.. [2] BitBucket Sandbox (https://bitbucket.org/ncoghlan/cpython_sandbox)
.. [3] *nix getpath implementation (http://hg.python.org/cpython/file/default/Modules/getpath.c)
.. [4] Windows getpath implementation (http://hg.python.org/cpython/file/default/PC/getpathp.c)
.. [5] Site module documentation (http://docs.python.org/3/library/site.html)
Copyright
This document has been placed in the public domain.
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia