[Python-Dev] second draft of sandboxing design doc

Brett Cannon brett at python.org
Sat Jul 8 00:51:11 CEST 2006


OK, lots of revisions. The approach to handling 'file' I have left up in the air. Biggest change is switching to "unprotected" and "sandboxed" for terms when referring to the interpreter. Also added a Threat Model section to explain assumptions about the basics of the interpreter. Hopefully it is also more clear about the competing approaches for dealing with 'file'.

I am planning on starting work next week on implementation, but I will start with the least controversial and work my way up.

One thing that is open that I would like some feedback on immediately is whether people would rather pass in PyObjects or C level types to the API. The former makes implementing the Python wrapper much easier, but makes embedding a bit more heavy-handed. Do people have a preference, or think that the Python API will be used more often than the C one? I am leaning towards making the C API simpler by using C level types (const char *, etc.) and just dealing with the Python wrappings requiring more rejiggering between types.

Once again, I have a branch going (bcannon-sandboxing) where the work is going to be done and where this doc lives. I don't plan on doing another post of this doc until another major revision.


Restricted Execution for Python
#######################################

About This Document

This document is meant to lay out the general design for re-introducing a sandboxing model for Python. This document should provide one with enough information to understand the goals for sandboxing, what considerations were made for the design, and the actual design itself. Design decisions should be clear, explaining not only why they were chosen but also the possible drawbacks of taking a specific approach.

If any of the above is found not to be true, please email me at brett at python.org and let me know what problems you are having with the document.

XXX TO DO

Goal

A good sandboxing model provides enough protection to prevent malicious harm from coming to the system, and no more. Barriers should be minimized so as to allow most code that does not do anything that would be regarded as harmful to run unmodified. But the protections need to be thorough enough to prevent any unintended changes to the system or any unintended disclosure of information about it.

An important point to take into consideration when reading this document is to realize it is part of my (Brett Cannon's) Ph.D. dissertation. This means it is heavily geared toward sandboxing when the interpreter is working with Python code embedded in a web page as viewed in Firefox. While great strides have been taken to keep the design general enough so as to allow all previous uses of the 'rexec' module [#rexec]_ to be able to use the new design, it is not the focused goal. This means if a design decision must be made for the embedded use case compared to sandboxing Python code in a pure Python application, the former will win out over the latter.

Throughout this document, the term "resource" is used to represent anything that deserves possible protection. This ranges from things that have a physical representation (e.g., memory) to things that are more abstract and specific to the interpreter (e.g., sys.path).

When referring to the state of an interpreter, it is either "unprotected" or "sandboxed". An unprotected interpreter has no restrictions imposed upon any resource. A sandboxed interpreter has at least one resource, possibly more, with restrictions placed upon it to prevent unsafe code that is running within the interpreter from causing harm to the system.

.. contents::

Use Cases
/////////////////////////////

All use cases are based on how many sandboxed interpreters are running in a single process and whether an unprotected interpreter is also running. The use cases can be broken down into two categories: when the interpreter is embedded and only using sandboxed interpreters, and when pure Python code is running in an unprotected interpreter and uses sandboxed interpreters.

When the Interpreter Is Embedded

Single Sandboxed Interpreter

This use case is when an application embeds the interpreter and never has more than one interpreter running, and that one interpreter is sandboxed.

Multiple Sandboxed Interpreters

This use case is when multiple interpreters, all sandboxed at varying levels, need to be running within a single application. This is the key use case that this proposed design is targeted for.

Stand-Alone Python

This use case is when someone has written a Python program that wants to execute Python code in one or more sandboxed interpreters. This is the use case that 'rexec' attempted to fulfill.

Issues to Consider

Common to all use cases, resources that the interpreter requires to function at a level below user code cannot be exposed to a sandboxed interpreter. For instance, the interpreter might need to stat a file to see if it is possible to import. If the ability to stat a file is not allowed to a sandboxed interpreter, it should not be allowed to perform that action, regardless of whether the interpreter at a level below user code needs that ability.

When multiple interpreters are involved (sandboxed or not), an interpreter must not be able to gain access to resources available in other interpreters without explicit permission.

Resources to Protect
/////////////////////////////

It is important to make sure that the proper resources are protected from a sandboxed interpreter. If you don't, there is no point to sandboxing.

Filesystem

All facets of the filesystem must be protected. This means restricting reading and writing to the filesystem (e.g., files, directories, etc.). It should be allowed in controlled situations where allowing access to the filesystem is desirable, but that should be an explicit allowance.

There must also be protection to prevent revealing any information about the filesystem. Disclosing information on the filesystem could allow one to infer what OS the interpreter is running on, for instance.

Memory

Memory should be protected. It is a limited resource on the system that can have an impact on other running programs if it is exhausted. Being able to restrict the use of memory would help alleviate issues from denial-of-service (DoS) attacks on the system.

Networking

Networking is somewhat like the filesystem in terms of wanting similar protections. You do not want to let unsafe code make socket connections unhindered or accept them to do possibly nefarious things. You also want to prevent finding out information about the network you are connected to.

Interpreter

One must make sure that the interpreter is not harmed in any way by sandboxed code. This usually takes the form of crashing the program that the interpreter is embedded in or the unprotected interpreter that started the sandboxed interpreter. Executing hostile bytecode that might lead to undesirable effects is another possible issue.

There is also the issue of taking it over. One should not be able to gain escalated privileges in any way without explicit permission.

Types of Security
///////////////////////////////////////

As with most things, there are multiple approaches one can take to tackle a problem. Security is no exception. In general there seem to be two approaches to protecting resources.

Resource Hiding

By never giving code a chance to access a resource, you prevent it from being (ab)used. This is the idea behind resource hiding; you can't misuse something you don't have in the first place.

The most common implementation of resource hiding is capabilities. In this type of system a resource's reference acts as a ticket that represents the right to use the resource. Once code has a reference it is considered to have full use of the resource that the reference represents and no further security checks are directly performed (using delegates and other structured ways one can actually have a security check for each access of a resource, but this is not a default behaviour).

As an example, consider the 'file' type as a resource we want to protect. That would mean that we did not want a reference to the 'file' type to ever be accessible without explicit permission. If one wanted to provide read-only access to a temp file, you could have open() perform a check on the permissions of the current interpreter, and if it is allowed to, return a proxy object for the file that only allows reading from it. The 'file' instance for the proxy would need to be properly hidden so that the reference was not reachable from outside so that 'file' access could still be controlled.
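The read-only proxy described above could look roughly like the following. This is only a sketch in pure Python: ``make_read_only_proxy``, ``ReadOnlyProxy`` and the local ``SandboxError`` class are hypothetical names, and an ``io.StringIO`` object stands in for a real 'file' instance.

```python
import io


class SandboxError(Exception):
    """Stand-in for the exception raised on a denied operation."""


def make_read_only_proxy(real_file):
    """Return an object that exposes only read access to real_file."""

    class ReadOnlyProxy:
        # No attribute of the proxy refers to the wrapped file; the methods
        # reach it through the enclosing function's scope instead.
        def read(self, size=-1):
            return real_file.read(size)

        def readline(self):
            return real_file.readline()

        def write(self, data):
            raise SandboxError("file was opened read-only")

        def close(self):
            real_file.close()

    return ReadOnlyProxy()


proxy = make_read_only_proxy(io.StringIO("first line\nsecond line\n"))
first = proxy.readline()
rest = proxy.read()

try:
    proxy.write("nope")
    write_blocked = False
except SandboxError:
    write_blocked = True
```

Note that this does not truly hide the wrapped object: the closure cell remains reachable from Python code through the method's ``__closure__`` attribute, so a real proxy would have to keep the reference at the C level.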

Python, as it stands now, unfortunately does not work well for a pure capabilities system. Capabilities require the prohibition of certain abilities, such as "direct access to another's private state" [#paradigm regained]_. This obviously is not possible in Python since, at least at the Python level, there is no such thing as private state that is persistent (one could argue that local variables that are not cell variables for lexical scopes are private, but since they do not survive after a function call they are not usable for keeping persistent state). One can hide references at the C level by storing them in the struct for the instance of a type and not providing a function to access that attribute.

Python's introspection abilities also do not make implementing capabilities any easier. Consider how one could access 'file' even when it is deleted from ``__builtin__``. You can still get to the reference for 'file' through the sequence returned by object.__subclasses__().

Resource Crippling

Another approach to security is to not worry about controlling access to the reference of a resource. One can have a resource perform a security check every time someone tries to use a method on that resource. This pushes the security check to a lower level; from a reference level to the method level.

By performing the security check every time a resource's method is called the worry of a specific resource's reference leaking out to insecure code is alleviated. This does add extra overhead, though, by having to do so many security checks. It also does not handle the situation where an unexpected exposure of a type occurs that has not been properly crippled.

FreeBSD's jail system provides a protection scheme similar to this. Various system calls allow for basic usage, but knowing or having access to the system call is not enough to grant usage. Every call to a system call requires checking that the proper rights have been granted to the user in order to allow for the system call to perform its action.

An even better example in FreeBSD's jail system is its protection of sockets. One can only bind a single IP address to a jail. Any attempt to bind more addresses, or to use an address other than the one that was granted, is prevented. The check is performed at every call involving the one granted IP address.

Using 'file' as the example again, one could cripple the type so that instantiation is not possible for the type in Python. One could also provide a permission check on each call to an unsafe method and thus allow the type to be used in normal situations (such as type checking), but still feel safe that illegal operations are not performed. Regardless of which approach you take, you do not need to worry about a reference to the type being exposed unexpectedly since the reference is not the security check but the actual method calls.
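A per-method permission check of this kind might be sketched as follows. Everything here is illustrative: the ``interpreter`` dictionary stands in for state the real design would keep in the interpreter at the C level, and ``requires``, ``Journal`` and the ``"journal-write"`` right are hypothetical names.

```python
import functools


class SandboxError(Exception):
    """Stand-in for the exception raised when a check fails."""


# Hypothetical record of the current interpreter's state.
interpreter = {"sandboxed": False, "rights": set()}


def requires(right):
    """Cripple a method: check `right` on every call while sandboxed."""
    def decorator(method):
        @functools.wraps(method)
        def checked(*args, **kwargs):
            if interpreter["sandboxed"] and right not in interpreter["rights"]:
                raise SandboxError("operation not permitted")
            return method(*args, **kwargs)
        return checked
    return decorator


class Journal:
    """Toy resource: reading its size is safe, appending is not."""

    def __init__(self):
        self.entries = []

    def size(self):
        return len(self.entries)

    @requires("journal-write")
    def append(self, entry):
        self.entries.append(entry)


journal = Journal()
journal.append("written while unprotected")

interpreter["sandboxed"] = True
try:
    journal.append("written while sandboxed")
    denied = False
except SandboxError:
    denied = True

interpreter["rights"].add("journal-write")
journal.append("written with the right granted")
```

The type itself stays fully visible (``isinstance`` checks keep working); only the guarded method refuses to run without the right.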

Comparison of the Two Approaches

From the perspective of Python, the two approaches differ on what would be the most difficult thing to analyze from a security standpoint: all of the ways to gain access to various types from a sandboxed interpreter with no imports, or finding all of the types that can lead to possibly dangerous actions and thus need to be crippled.

Some Python developers, such as Armin Rigo, feel that truly hiding objects in Python is "quite hard" [#armin-hiding]_. This sentiment means that making a pure capabilities system in Python that is secure is not possible as people would continue to find new ways to get a hold of the reference to a protected resource.

Others feel that by not going the capabilities route we will be constantly chasing down new types that require crippling. The thinking is that if we cannot control the references for 'file', how are we to know what other types might become exposed later on and thus require more crippling?

It essentially comes down to what is harder to do: find all the ways to access the types in Python in a sandboxed interpreter with no imported modules, or to go through the Python code base and find all types that should be crippled?

The 'rexec' Module
///////////////////////////////////////

The 'rexec' module [#rexec]_ was the original attempt at providing a sandbox environment for Python code to run in. Its design was based on Safe-Tcl, which was essentially a capabilities system [#safe-tcl]_. Safe-Tcl allowed you to launch a separate interpreter where its global functions were specified at creation time. This prevented one from having any abilities that were not explicitly provided.

For 'rexec', the Safe-Tcl model was tweaked to better match Python's situation. An RExec object represented a sandboxed environment. Imports were checked against a whitelist of modules. You could also restrict the type of modules to import based on whether they were Python source, bytecode, or C extensions. Built-ins were allowed except for a blacklist of built-ins to not provide. One could restrict whether stdin, stdout, and stderr were provided or not on a per-RExec basis. Several other protections were provided; see documentation for the complete list.

The ultimate undoing of the 'rexec' module was how access to objects that in normal Python require no imports to reach was handled. Importing modules requires a direct action, and thus can be protected against directly in the import machinery. But for built-ins, they are accessible by default and require no direct action to access in normal Python; you just use their name since they are provided in all namespaces.

For instance, in a sandboxed interpreter, one only had to del __builtins__ to gain access to the full set of built-ins. Another way is through using the gc module: gc.get_referrers(''.__class__.__bases__[0])[6]['file']. While both of these could be fixed (the former was a bug in 'rexec' that was fixed and the latter could be handled by not allowing 'gc' to be imported), they are examples of things that do not require proactive actions on the part of the programmer in normal Python to gain access to a resource. This was an unfortunate side-effect of having all of that wonderful reflection in Python.

There is also the issue that 'rexec' was written in Python which provides its own problems based on reflection and the ability to modify the code at run-time without security protection.

Much has been learned since 'rexec' was written about how Python tends to be used and where security issues tend to appear. Essentially Python's dynamic nature does not lend itself very well to a security implementation that does not require a constant checking of permissions.

Threat Model
///////////////////////////////////////

Below is a list of what the security implementation assumes, along with the section of this document that addresses each part of the security model (if not already true in Python by default). The term "bare", when used in regard to an interpreter, means an interpreter that has not performed a single import of a module. Also, all comments refer to a sandboxed interpreter unless otherwise explicitly stated.

This list does not address specifics such as how 'file' will be protected or whether memory should be protected. This list is meant to make clear at a more basic level what the security model is assuming is true.

There are also some features that might be desirable, but are not being addressed by this security model.

The Proposed Approach
///////////////////////////////////////

In light of where 'rexec' succeeded and failed along with what is known about the two main approaches to security and how Python tends to operate, the following is a proposal on how to secure Python for sandboxing.

Implementation Details

Support for sandboxed interpreters will require a compilation flag. This allows the more common case of people not caring about protections to not take a performance hit. And even when Python is compiled for sandboxed interpreter restrictions, when the running interpreter is unprotected, there will be no accidental triggers of protections. This means that developers should be liberal with the security protections without worrying about there being issues for interpreters that do not need/want the protection.

At the Python level, the ``__sandboxed__`` built-in will be set based on whether the interpreter is sandboxed or not. This will be set for all interpreters, regardless of whether sandboxed interpreter support was compiled in or not.

For setting what is to be protected, the PyThreadState for the sandboxed interpreter must be passed in. This makes the protection very explicit and helps make sure you set protections for the exact interpreter you mean to. All functions that set protections begin with the prefix PySandbox_Set*(). These functions are meant to only work with sandboxed interpreters that have not been used yet to execute any Python code. The calls must be made by the code creating and handling the sandboxed interpreter before the sandboxed interpreter is used to execute any Python code.

The functions for checking for permissions are actually macros that take in at least an error return value for the function calling the macro. This allows the macro to return on behalf of the caller if the check fails and cause the SandboxError exception to be propagated automatically. This helps eliminate any coding errors from incorrectly checking a return value on a rights-checking function call. For the rare case where this functionality is disliked, just make the check in a utility function and check that function's return value (but this is strongly discouraged!).

Functions that check that an operation is allowed implicitly operate on the currently running interpreter as returned by PyInterpreter_Get() and are to be used by any code (the interpreter, extension modules, etc.) that needs to check for permission to execute. They have the common prefix of ``PySandbox_Allowed*()``.

API

Memory

Protection

A memory cap will be allowed.

Modification to pymalloc will be needed to properly keep track of the allocation and freeing of memory. Same goes for the macros around the system malloc/free system calls. This provides a platform-independent system for protection of memory instead of relying on the operating system to provide a service for capping memory usage of a process. It also allows the protection to be at the interpreter level instead of at the process level.
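The bookkeeping this implies can be sketched at the Python level. The real change would live in pymalloc and the C memory macros; the ``MemoryCap`` class and its method names below are hypothetical and only model the accounting, not an actual allocator.

```python
class SandboxError(Exception):
    """Stand-in for the exception raised when the cap is exceeded."""


class MemoryCap:
    """Track the bytes charged to one interpreter and enforce a cap."""

    def __init__(self, cap_bytes):
        self.cap_bytes = cap_bytes
        self.in_use = 0

    def charge(self, nbytes):
        """Would be called by the allocator before handing out nbytes."""
        if self.in_use + nbytes > self.cap_bytes:
            raise SandboxError("memory cap exceeded")
        self.in_use += nbytes

    def release(self, nbytes):
        """Would be called by the allocator when nbytes are freed."""
        self.in_use = max(0, self.in_use - nbytes)


cap = MemoryCap(cap_bytes=1024)
cap.charge(800)

try:
    cap.charge(300)          # 800 + 300 would exceed the 1024-byte cap
    over_cap_refused = False
except SandboxError:
    over_cap_refused = True

cap.release(500)             # freeing memory makes room again
cap.charge(300)
```

Because the counter belongs to one interpreter rather than to the process, two sandboxed interpreters in the same process could each be given their own ``MemoryCap``.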

Why

Protecting against excessive memory usage allows one to make sure that a DoS attack against the system's memory is prevented.

Possible Security Flaws

If code makes direct calls to malloc/free instead of using the proper PyMem_*() macros then the security check will be circumvented. But C code is supposed to use the proper macros or pymalloc and thus this issue is not with the security model but with code not following Python coding standards.

API

Reading/Writing Files

Protection

XXX

To open a file, one will have to use open(). This will make open() a factory function that controls reference access to the 'file' type in terms of creating new instances. When an attempted file opening fails (either because the path does not exist or because of security reasons), SandboxError will be raised. The same exception must be raised in both cases to prevent filesystem information being gleaned from the type of exception returned (i.e., returning IOError if a path does not exist tells the user something about that file path).
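The point about raising one exception for every kind of failure can be illustrated with a small factory. This is a sketch under assumptions: ``make_open`` and its explicit list of allowed paths are hypothetical (the document leaves the actual protection mechanism open), and the sketch ignores complications such as symbolic links.

```python
import os
import tempfile


class SandboxError(Exception):
    """Stand-in for the exception raised when opening fails."""


def make_open(allowed_paths):
    """Build an open() replacement limited to the given paths."""
    allowed = frozenset(os.path.abspath(p) for p in allowed_paths)

    def sandboxed_open(path):
        """Open path read-only, or raise SandboxError for any failure."""
        if os.path.abspath(path) not in allowed:
            raise SandboxError("cannot open file")
        try:
            return open(path, "r")
        except OSError:
            # Same exception and message as a denial, so the caller cannot
            # tell "does not exist" apart from "not permitted".
            raise SandboxError("cannot open file") from None

    return sandboxed_open


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.txt")
    missing = os.path.join(tmp, "missing.txt")   # allowed, but never created
    secret = os.path.join(tmp, "secret.txt")     # exists, but not allowed
    for p in (present, secret):
        with open(p, "w") as handle:
            handle.write("data")

    sandboxed_open = make_open([present, missing])

    with sandboxed_open(present) as handle:
        contents = handle.read()

    errors = []
    for p in (missing, secret):
        try:
            sandboxed_open(p)
        except SandboxError as exc:
            errors.append(str(exc))
```

Both failing calls produce an identical error, so sandboxed code learns nothing about whether ``secret.txt`` exists.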

What open() returns may not be an instance of 'file' but a proxy that provides the security measures needed. While this might break code that uses type checking to make sure a 'file' object is used, taking a duck typing approach would be better. This is not only more Pythonic but would also allow the code to use a StringIO instance.

It has been suggested to allow for a passed-in callback to be called when a specific path is to be opened. While this provides good flexibility in terms of allowing custom proxies with more fine-grained security (e.g., capping the amount of disk write), this has been deemed unneeded in the initial security model and thus is not being considered at this time.

Why

Allowing anyone to be able to arbitrarily read, write, or learn about the layout of your filesystem is extremely dangerous. It can lead to loss of data or data being exposed to people who should not have access.

Possible Security Flaws

XXX

API

Extension Module Importation

Protection

A whitelist of extension modules that may be imported must be provided. A default set is given for stdlib modules known to be safe.

A check in the import machinery will check that a specified module name is allowed based on the type of module (Python source, Python bytecode, or extension module). Python bytecode files are never directly imported because of the possibility of hostile bytecode being present. Python source is always considered safe based on the assumption that all resource harm is eventually done at the C level, thus Python source code directly cannot cause harm without help of C extension modules. Thus only C extension modules need to be checked against the whitelist.

The requested extension module name is checked in order to make sure that it is on the whitelist if it is a C extension module. If the name is not on the whitelist, a SandboxError exception is raised. Otherwise the import is allowed.
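The decision made by the import check can be summarized in a few lines. This is a sketch only: ``check_import`` and the three kind constants are hypothetical, the module names in the example whitelist are purely illustrative and are not the default list, and refusing bytecode outright is one assumed reading of "never directly imported".

```python
class SandboxError(Exception):
    """Stand-in for the exception raised when an import is refused."""


SOURCE, BYTECODE, EXTENSION = "source", "bytecode", "extension"


def check_import(name, kind, whitelist):
    """Raise SandboxError unless importing `name` of type `kind` is allowed."""
    if kind == SOURCE:
        return  # Python source is always considered safe
    if kind == BYTECODE:
        # Assumption: a bytecode-only module is simply refused.
        raise SandboxError("bytecode files may not be imported directly")
    if kind == EXTENSION and name in whitelist:
        return
    raise SandboxError("import of %r is not allowed" % name)


def is_allowed(name, kind, whitelist):
    """Convenience wrapper that turns the check into a boolean."""
    try:
        check_import(name, kind, whitelist)
    except SandboxError:
        return False
    return True


example_whitelist = frozenset({"math", "_example_safe"})

results = {
    "source module": is_allowed("anything", SOURCE, example_whitelist),
    "bytecode module": is_allowed("anything", BYTECODE, example_whitelist),
    "whitelisted extension": is_allowed("math", EXTENSION, example_whitelist),
    "other extension": is_allowed("_socket", EXTENSION, example_whitelist),
}
```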

Even if a Python source code module imports a C extension module in an unprotected interpreter it is not a problem since the Python source code module is reloaded in the sandboxed interpreter. When that Python source module is freshly imported the normal import check will be triggered to prevent the C extension module from becoming available to the sandboxed interpreter.

For the 'os' module, a special sandboxed version will be used if the proper C extension module providing the correct abilities is not allowed. This will default to '/' as the path separator and provide as many reasonable abilities as possible from a pure Python module.

The 'sys' module is specially addressed in `Changing the Behaviour of the Interpreter`_.

By default, the whitelisted modules are:

Why

Because C code is considered unsafe, its use should be regulated. By using a whitelist it allows one to explicitly decide that a C extension module is considered safe.

Possible Security Flaws

If a whitelisted C extension module imports a non-whitelisted C extension module and makes it an attribute of the whitelisted module there will be a breach in security. Luckily this is a rarity in extension modules.

There is also the issue of a C extension module calling the C API of a non-whitelisted C extension module.

Lastly, if a whitelisted C extension module is loaded in an unprotected interpreter and then loaded into a sandboxed interpreter then there are no checks during module initialization for possible security issues in the sandboxed interpreter that would have occurred had the sandboxed interpreter done the initial import.

All of these issues can be handled by never blindly whitelisting a C extension module. Added support for dealing with C extension modules comes in the form of `Extension Module Crippling`_.

API

Extension Module Crippling

Protection

By providing a C API for checking for allowed abilities, modules that have some useful functionality can do proper security checks for those functions that could provide insecure abilities while allowing safe code to be used (and thus not fully deny importation).

Why

Consider a module that provides a string processing ability. If that module provides a single convenience function that reads its input string from a file (with a specified path), the whole module should not be blocked from being used, just that convenience function. By whitelisting the module but having a security check on the one problem function, the user can still gain access to the safe functions. Even better, the unsafe function can be allowed if the security checks pass.
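The string-processing module described above might be guarded like this. The sketch is written in Python for brevity, whereas the real checks would be made in C; ``check_allowed`` is a hypothetical analogue of a ``PySandbox_Allowed*()`` check, and ``granted_rights``, ``normalize``, ``normalize_file`` and the ``"file-read"`` right are made-up names.

```python
import os
import tempfile


class SandboxError(Exception):
    """Stand-in for the exception raised when a check fails."""


# Rights granted to the (hypothetical) current interpreter.
granted_rights = set()


def check_allowed(right):
    """Hypothetical Python analogue of a PySandbox_Allowed*() check."""
    if right not in granted_rights:
        raise SandboxError("operation not permitted")


def normalize(text):
    """Safe: pure string processing, so no check is needed."""
    return " ".join(text.split()).lower()


def normalize_file(path):
    """Unsafe convenience wrapper: guarded because it reads from disk."""
    check_allowed("file-read")
    with open(path) as handle:
        return normalize(handle.read())


safe_result = normalize("  Hello   World ")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.txt")
    with open(path, "w") as handle:
        handle.write("Some   TEXT\n")

    try:
        normalize_file(path)
        blocked = False
    except SandboxError:
        blocked = True

    # Once the right is granted, the same function becomes usable.
    granted_rights.add("file-read")
    file_result = normalize_file(path)
```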

Possible Security Flaws

If a C extension module developer incorrectly implements the security checks for the unsafe functions it could lead to undesired abilities.

API

Use PySandbox_Allowed() to protect unsafe code from being executed.

Hostile Bytecode

Protection

XXX

Why

Without implementing a bytecode verification tool, there is no way of making sure that bytecode does not jump outside its bounds, thus possibly executing malicious code. It also presents the possibility of crashing the interpreter.

Possible Security Flaws

None known.

API

N/A

Changing the Behaviour of the Interpreter

Protection

Only a subset of the 'sys' module will be made available to sandboxed interpreters. Things to allow from the sys module:

Why

Filesystem information must be removed. Any settings that could possibly lead to a DoS attack (e.g., sys.setrecursionlimit()) or risk crashing the interpreter must also be removed.

Possible Security Flaws

Exposing something that could lead to future security problems (e.g., a way to crash the interpreter).

API

None.

Socket Usage

Protection

Allow sending and receiving data to/from specific IP addresses on specific ports.

open() is to be used as a factory function to open a network connection. If the connection is not possible (either because of an invalid address or security reasons), SandboxError is raised.

The call may not necessarily return a socket object; a proxy that handles security might be returned instead.
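A factory restricted to specific addresses and ports could be sketched as below. The names ``make_connector`` and ``sandboxed_connect`` are hypothetical, and the function that actually opens the connection is passed in so that the example runs without touching the network; in real use it could be ``socket.create_connection``.

```python
import ipaddress


class SandboxError(Exception):
    """Stand-in for the exception raised when a connection is refused."""


def make_connector(allowed_endpoints, connect):
    """Build a connect function limited to the given (ip, port) pairs."""
    allowed = frozenset(
        (ipaddress.ip_address(ip), port) for ip, port in allowed_endpoints
    )

    def sandboxed_connect(ip, port):
        try:
            endpoint = (ipaddress.ip_address(ip), port)
        except ValueError:
            # An invalid address gets the same error as a denied one.
            raise SandboxError("cannot open connection") from None
        if endpoint not in allowed:
            raise SandboxError("cannot open connection")
        return connect((ip, port))

    return sandboxed_connect


attempts = []


def fake_connect(address):
    """Test double that records the address instead of opening a socket."""
    attempts.append(address)
    return "connected to %s:%d" % address


connector = make_connector([("192.0.2.10", 80)], fake_connect)
ok = connector("192.0.2.10", 80)

refused = []
for ip, port in [("192.0.2.10", 22), ("198.51.100.7", 80), ("not-an-ip", 80)]:
    try:
        connector(ip, port)
    except SandboxError as exc:
        refused.append(str(exc))
```

Only IP addresses are accepted here, which sidesteps host name lookups entirely.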

XXX

Why

Allowing arbitrary sending of data over sockets can lead to DoS attacks on the network and other machines. Limiting accepting data prevents your machine from being attacked by accepting malicious network connections. It also allows you to know exactly where communication is going to and coming from.

Possible Security Flaws

Someone could influence the DNS server in use and thereby control which IP addresses result from a DNS lookup.

API

Network Information

Protection

Limit what information can be gleaned about the network the system is running on. This does not include restricting information on IP addresses and hosts that have been explicitly allowed for the sandboxed interpreter to communicate with.

XXX

Why

With enough information from the network several things could occur. One is that someone could possibly figure out where your machine is on the Internet. Another is that enough information about the network you are connected to could be used against it in an attack.

Possible Security Flaws

As long as usage is restricted to only what is needed to work with allowed addresses, there are no security issues to speak of.

API

Filesystem Information

Protection

Do not allow information about the filesystem layout from various parts of Python to be exposed. This means blocking exposure at the Python level to:

Why

Exposing information about the filesystem is not allowed. You can figure out what operating system one is on which can lead to vulnerabilities specific to that operating system being exploited.

Possible Security Flaws

Not finding every single place where a file path is exposed.

API

Stdin, Stdout, and Stderr

Protection

By default, sys.stdin, sys.stdout, and sys.stderr will be set to instances of StringIO. Explicit allowance of the process' stdin, stdout, and stderr is possible.

This will protect the 'print' statement, and the built-ins input() and raw_input().
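The effect of swapping in StringIO instances can be tried out in a current Python with the standard library alone. The helper name ``run_captured`` is hypothetical, and ``exec()`` is used here only to run some code with the replaced streams; it is not a sandbox.

```python
import io
import sys
from contextlib import redirect_stderr, redirect_stdout


def run_captured(source, stdin_text=""):
    """Run source with stdin, stdout and stderr replaced by StringIO objects."""
    out, err = io.StringIO(), io.StringIO()
    saved_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)
    try:
        with redirect_stdout(out), redirect_stderr(err):
            exec(source, {})
    finally:
        sys.stdin = saved_stdin  # always restore the real stdin
    return out.getvalue(), err.getvalue()


program = (
    "import sys\n"
    "name = input()\n"
    "print('hello', name)\n"
    "sys.stderr.write('warning')\n"
)

stdout_text, stderr_text = run_captured(program, stdin_text="world\n")
```

Nothing the program prints reaches the real terminal; the embedding code decides what, if anything, to do with the captured text.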

Why

Interference with stdin, stdout, or stderr should not be allowed unless desired. No one wants uncontrolled output sent to their screen.

Possible Security Flaws

Unless StringIO instances can be used maliciously, none to speak of.

API

Adding New Protections

.. note:: This feature has the lowest priority and thus will be the last feature implemented (if ever).

Protection

Allow for extensibility in the security model by being able to add new types of checks. This allows not only for Python to add new security protections in a backwards-compatible fashion, but to also have extension modules add their own as well.

An extension module can introduce a group for its various values to check, with a type being a specific value within a group. The "Python" group is specifically reserved for use by the Python core itself.
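One way to picture groups and types is as a small registry. This is a sketch with hypothetical names throughout (``ProtectionRegistry``, ``register``, ``is_registered``, the ``from_core`` flag, and the example group and type names); the document does not define the actual API for this feature.

```python
class ProtectionRegistry:
    """Map each group name to the set of protection types defined in it."""

    RESERVED_GROUP = "Python"

    def __init__(self):
        self._types = {self.RESERVED_GROUP: set()}

    def register(self, group, type_name, from_core=False):
        """Add a protection type; only the core may extend 'Python'."""
        if group == self.RESERVED_GROUP and not from_core:
            raise ValueError("the 'Python' group is reserved for the core")
        self._types.setdefault(group, set()).add(type_name)

    def is_registered(self, group, type_name):
        return type_name in self._types.get(group, set())


registry = ProtectionRegistry()
registry.register("Python", "memory", from_core=True)
registry.register("imaging", "decode-untrusted")   # made-up extension group

try:
    registry.register("Python", "sneaky")
    reserved_enforced = False
except ValueError:
    reserved_enforced = True
```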

Why

We are all human. There is the possibility that a need for a new type of protection for the interpreter will present itself and thus need support. By providing an extensible way to add new protections it helps to future-proof the system.

It also allows extension modules to present their own set of security protections. That way one extension module can use the protection scheme presented by another that it is dependent upon.

Possible Security Flaws

Poor definitions by extension module users of how their protections should be used would allow for possible exploitation.

API

Python API

__sandboxed__

A built-in that flags whether the interpreter currently running is sandboxed or not. It is set to a read-only 'bool' value, mimicking how ``__debug__`` works.

sandbox module

XXX

References
///////////////////////////////////////

.. [#rexec] The 'rexec' module (http://docs.python.org/lib/module-rexec.html)

.. [#safe-tcl] The Safe-Tcl Security Model (http://research.sun.com/technical-reports/1997/abstract-60.html)

.. [#ctypes] 'ctypes' module (http://docs.python.org/dev/lib/module-ctypes.html)

.. [#paradigm regained] "Paradigm Regained: Abstraction Mechanisms for Access Control" (http://erights.org/talks/asian03/paradigm-revised.pdf)

.. [#armin-hiding] [Python-Dev] what can we do to hide the 'file' type? (http://mail.python.org/pipermail/python-dev/2006-July/067076.html)


