[Python-Dev] Python-acceleration instructions on ARM
"Martin v. Löwis" martin at v.loewis.de
Wed Feb 11 08:27:34 CET 2009
- Previous message: [Python-Dev] Python-acceleration instructions on ARM
- Next message: [Python-Dev] Python-acceleration instructions on ARM
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> ARM is specifically claiming that these instructions can be used to accelerate Python interpretation.
Wow, really? Does one of the links below mention that?
I'm skeptical, though, that you can really produce speedups for CPython; ISTM that they added Python only as a front-end language for Parrot, and added Parrot only because it looks similar to the JVM and .NET (i.e. without actually testing whether you can gain performance).
From reading the paper, ISTM that you can expect speedups for your JIT-generated code. In ThumbEE, you have the following additional features:
- fast null pointer checks: any register-indirect addressing in ThumbEE mode checks whether the base register is NULL; if it is, a callback is invoked (which could then throw NullPointerException). This is irrelevant in Python, because we don't use NULL as the value for "no object" (we use the None object, Py_None).
- fast array bounds check: there is an instruction that checks whether 0 <= Rm < Rn and invokes a callback if the check fails; the callback would then throw ArrayIndexOutOfBoundsException. The JIT would emit this instruction just before every array access. In Python, you cannot easily JIT an array access into a direct machine instruction (you need to go through tp_as_sequence->sq_item), so the bounds check would likely be lost in the noise.
- fast switch instruction: there is an efficient way to dispatch among 256 different byte code operations, with an optional immediate parameter: it calls or jumps to one of 256 byte code handlers. This allows a straightforward JIT compiler that essentially compiles every byte code into such a switch instruction. That would work for Python as well, but would require rewriting ceval entirely.
- fast locals: efficient access to a local-variables array, for JIT generation of ldloc.i4 (in .NET; I'm not sure what the Java byte code for local variables is). This would work for Python as well, assuming there is a JIT compiler in the first place. R9 holds the fastlocals pointer (which is a good use of the register, since you cannot access it in Thumb mode anyway).
- fast instance variables: likewise, with R10 holding the "this" pointer. Not applicable to Python, since there is no byte code for instance variable access.
- efficient array indexing: they give shift-and-index back to Thumb mode for a shift by 2, allowing arrays with 4-byte elements to be indexed in a single instruction (rather than requiring a separate multiply-by-four). Again useful for JIT of array access instructions, but not applicable to Python, though it would be nice if the C compiler knew how to emit that.
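To make the switch, fast-locals and bounds-check points above concrete, here is a small software model in Python. It is only a sketch of the ideas, not ThumbEE code and not how ceval works: the opcode names and numbers, the 32-bit mask, and the IndexCheckError exception are all hypothetical. The 256-entry handler table stands in for the handler-branch (HB) dispatch, the fastlocals list stands in for the R9 base pointer, and check_index models the CHKA idea that a single unsigned comparison rejects both a negative index and an index past the end.

```python
MASK32 = 0xFFFFFFFF  # treat values as 32-bit registers, as on ARM

# Hypothetical opcode numbers for this toy stack machine.
LOAD_CONST, LOAD_LOCAL, STORE_LOCAL, ADD, INDEX, RETURN = range(6)


class IndexCheckError(Exception):
    """Raised where a ThumbEE IndexCheck handler would be invoked."""


def check_index(size, index):
    # Model of CHKA: a single *unsigned* comparison rejects both a negative
    # index (which wraps to a huge unsigned value) and index >= size.
    if (size & MASK32) <= (index & MASK32):
        raise IndexCheckError(index)


def run(code, consts, nlocals):
    """Interpret (opcode, arg) pairs and return the value given to RETURN."""
    fastlocals = [None] * nlocals  # plays the role of the R9 locals base
    stack = []

    def load_const(arg):
        stack.append(consts[arg])

    def load_local(arg):
        stack.append(fastlocals[arg])  # base + index, no name lookup

    def store_local(arg):
        fastlocals[arg] = stack.pop()

    def add(arg):
        right = stack.pop()
        stack.append(stack.pop() + right)

    def index(arg):
        i = stack.pop()
        seq = stack.pop()
        check_index(len(seq), i)  # bounds check emitted before the access
        stack.append(seq[i])

    def unknown(arg):
        raise ValueError("unknown opcode")

    # One slot per byte value, mirroring the 256 handlers HB can branch to.
    # (Rebuilt on every call only to keep the sketch self-contained.)
    table = [unknown] * 256
    table[LOAD_CONST] = load_const
    table[LOAD_LOCAL] = load_local
    table[STORE_LOCAL] = store_local
    table[ADD] = add
    table[INDEX] = index

    for op, arg in code:
        if op == RETURN:
            return stack.pop()
        table[op](arg)  # dispatch: one indexed call per byte code
    raise ValueError("code fell off the end without RETURN")
```

For example, the sequence LOAD_CONST 0, STORE_LOCAL 0, LOAD_LOCAL 0, LOAD_CONST 1, INDEX, LOAD_CONST 2, ADD, RETURN with consts [[10, 20, 30], 2, 5] returns 35. Replacing the index 2 with -1 raises IndexCheckError instead of silently reading the last element, as Python's own indexing would. Note that in CPython the INDEX step would go through tp_as_sequence->sq_item, so the comparison itself is a tiny fraction of the work.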
Regards, Martin