Issue 652586: New import hooks + Import from Zip files (original) (raw)

Created on 2002-12-12 10:34 by jvr, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (34)

msg41948 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 10:34

This patch implements two things:

The new set of hooks probably need a better document explaining them (perhaps a PEP). My motivations have been posted to python-dev.

Here's a brief description.

Three new objects are added to the sys module:

sys.path_hooks is a list of callable objects that take a string as their only argument. A hook will be called with a sys.path or pkg.path item. It should return an "importer" object (see below), or raise ImportError or return None if it can't deal with the path item. By default, sys.path_hooks only contains the zipimporter type, if the zipimport module is available.

sys.path_importer_cache is a dict that caches the results of sys.path_hooks to avoid repeated hook lookups.

sys.meta_path is a list of importer objects that are invoked before the builtin import mechanism kicks in. This allows overriding of builtin module and frozen module import, but the main feature is that it allows importer objects without a corresponding sys.path item (just like builtin and frozen modules).

Importer objects must conform to the following protocol:

i.find_module(fullname) -> None or an importer object i.load_module(fullname) -> the imported module (or raise ImportError)

The 'fullname' is always the fully qualified module name, ie. a dotted name for a submodule.

This patch adds one more feature: a sys.path item may itself be an importer object. This is convenient for experimentation, but using it may break third-party code that assumes sys.path contains only strings.

msg41949 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 10:37

Logged In: YES user_id=92689

btw. the attached file contains a patch for various files as well as a new file: zipimporter.c. Place the latter in the Modules/ directory.

msg41950 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2002-12-12 11:15

Logged In: YES user_id=21627

Some comments:

msg41951 - (view)

Author: Paul Moore (paul.moore) * (Python committer)

Date: 2002-12-12 11:23

Logged In: YES user_id=113328

I can provide details on how to patch the Windows build process - it's not hard. The question of whether to have zipimport as builtin or dynamic should be resolved first (the processes are different in the 2 cases).

It should be pointed out that if zipimport is builtin, it still relies on a dynamic module (zlib) for importing from compressed zips. I don't know whether this affects the decision, though...

msg41952 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 12:35

Logged In: YES user_id=92689

Martin:

Thank you very much for the quick reply!

msg41953 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 12:39

Logged In: YES user_id=92689

Hm, regarding gc again: zipimporter objects can only theoretically be involved in cycles, and only if people muck with the "files" attribute. As it is, the "files" dict only contains strings (keys) and tuples (values) which contain strings and ints. So is it really neccesary?

msg41954 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2002-12-12 13:16

Logged In: YES user_id=21627

recursive imports: having foo.zip as the first thing in sys.path, and having a compressed zlib.py in there was indeed the case I was thinking of (without actually trying).

GC: It is absolutely necessary. If this is not done now, somebody will spend days of research five years from now, because somebody else thought that invoking .update on this files attribute was a good idea. This could be reduced to C- level documentation if the dictionary was not exposed to Python.

builtin: I think it ought to be builtin. It's a small module, it does not rely on additional libraries, it is not maintained externally, and it reduces bootstrap dependencies to have it builtin. OTOH, zlib can't be builtin, as it relies on an additional library which may not always be present.

get_long: you are right; the 0xff does make the values positive, again. I somehow thought the result might still be negative.

msg41955 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 14:05

Logged In: YES user_id=92689

I've attached test_zipimport.py, a simple test script, to be placed in Lib/test/.

msg41956 - (view)

Author: Neal Norwitz (nnorwitz) * (Python committer)

Date: 2002-12-12 14:28

Logged In: YES user_id=33168

Just, in case you didn't know, you can do a cvs diff -N to include new/removed files in a patch. I think you need to do a cvs add/remove before -N works though.

msg41957 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 16:34

Logged In: YES user_id=92689

I didn't know, thanks! I actually did find -N but it "didn't work" ;-)

I've attached a new patch (slightly updated as well) as one file. Also included is Lib/test/output/test_zipimport.

msg41958 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 19:34

Logged In: YES user_id=92689

I've uploaded a fresh version, with GC support added. Tested to the extent that it doesn't crash ;-)

msg41959 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 21:59

Logged In: YES user_id=92689

I've uploaded a slightly updated version: the potential zlib recursion has been fixed (I was in fact able to trigger it); I've added a test for it, although it's tricky to do. Also cleaned up a few exception messages in zipimport.

msg41960 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-12 23:19

Logged In: YES user_id=92689

Yet another new version:

msg41961 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2002-12-13 17:52

Logged In: YES user_id=6380

Thanks -- I'm reviewing this now (if not too many distracting email arrives :-).

msg41962 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-13 19:29

Logged In: YES user_id=92689

I've attached a new version that fixes a problem that Thomas Heller noticed: package imports failed when the zip archive didn't contain a separate entry for the package directory.

msg41963 - (view)

Author: James C. Ahlstrom (ahlstromjc)

Date: 2002-12-13 19:55

Logged In: YES user_id=64929

The PEP 273 addition of the standard zip archive name /usr/local/lib/python23.zip (or Windows name) to sys.path to support import of the Python standard lib (including site.py) from a zip archive is missing. That was the point of my patch to site.py, getpath.c, getpathp.c, main.c, pythonrun.c, and sysmodule.c. I believe those patches can be used with this import.c if you want the feature.

msg41964 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2002-12-13 20:02

Logged In: YES user_id=6380

Just: it looks like your latest version requires that the entries for package directories are present. Is that really a good idea?

msg41965 - (view)

Author: Paul Moore (paul.moore) * (Python committer)

Date: 2002-12-13 21:20

Logged In: YES user_id=113328

I'd say it's pretty much guaranteed not to be a good idea. I am fairly certain that at least some Windows zip utilities fail to have a separate zip entry for the directory itself.

msg41966 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-13 22:44

Logged In: YES user_id=92689

The previous version requires it and failed if they weren't, the present version still requires it, but missing entries are added in the read_directory() function (around line 542). So it's guaranteed they're there if needed.

I could have solved it differently at import time, but I thought it be better to do slightly more work at zipdirectory-read time so we have to do less work later. I doubt it matters much in reality, this was just the simplest way to solve the problem...

(I'll update the test to reflect the solution of this problem.)

msg41967 - (view)

Author: Paul Moore (paul.moore) * (Python committer)

Date: 2002-12-14 19:39

Logged In: YES user_id=113328

I have some concerns about a design issue.

I've been playing about writing my own hooks in Python, and I have found that for most of my tests, the find_module method is simply not useful. I understand that it can be cheaper (sometimes a lot cheaper) to just check for the existence of a module than actually loading up the code, etc, etc. So in those cases, find_module() offers a shortcut if the module isn't found.

But if the module is found, and find_module has to do significant work, then all this work has to be repeated at load_module time. There's no mechanism for the result of find_module (say, a pointer to the module in a zip directory, or whatever) to be passed to load_module. So load_module has no alternative than to redo the work (short of the hook implementing some form of internal cache mechanism, which could be quite messy to do).

I don't know what the likely uses of hooks will be. So I've no idea if find_module will turn out in practice to be a cheap operation. But I can certainly imagine cases (the hypthetical "load a module from a URL" hook) where find_module could be far from cheap.

Is it possible to let find_module be optional? If the hook implements it, let it be used as a quick check for existence. If the hook doesn't implement it, just go for it - call load_module and be prepared for a "can't find it" response.

This probably means that more extensive changes to import.c will be needed. But once we expose the design, we're going to be stuck with backward compatibility issues forever, so it's important we get it right now, rather than later. (Although having find_module mandatory initially, but made optional later, is probably less problematic than most other forms of design change...)

msg41968 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-14 20:17

Logged In: YES user_id=92689

Paul, this is the reason why the find_module() method returns an object (and I guess now is a good time to explain this ; it also answers a question from Guido he asked in private mail). zipimporter returns self, but you don't have to do that: you can return anything with a load_module() method. So all state you gather during find_module() you can simply pack in a different object and return that. Something like this

class MyHook: ... def find_module(self, fullname): info = self._find_module(fullname) if info is None: return None return Loader(info)

class Loader: def init(self, info): self.info = info def load_module(self, fullname): ...load module using self.info...

A good showcase for this pattern would be a wrapper for the builtin import mechanism: it could stuff the (file, filename, (suffix, mode, type)) tuple that imp.find_module() returns in a loader object, whose load_module() method would simply call imp.load_module() with that stuff. (I think such a wrapper is needed in import.c if we want to properly expose the new hooks in the imp module.)

You can see the importer protocol as a further abstraction (and OO-ification) of the current imp.find_module()/imp.load_module() calls.

msg41969 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-14 20:46

Logged In: YES user_id=92689

Btw. it would be nicer if I indeed provided an "importer" wrapper for the builtin mechanism and simply placed it on and invoke it through sys.meta_path. The reason I haven't done that yet is that A) this would change import.c even more (although it would be better factored then) and B) it would have some impact on the performance of the builtin import mechanism, as all imports would then pass through yet another Python call. But I hope to play with this a bit later next week. It could also be a project for 2.4.

Note that this is exactly what Gordon does in iu.py, except he goes even further: he has separate importer objects on the metapath for builtin modules, frozen modules and file system imports. I think that's fantastic, but it would be more controversial as it might affect performance and might be harder to integrate with the current imp module.

msg41970 - (view)

Author: Paul Moore (paul.moore) * (Python committer)

Date: 2002-12-14 21:52

Logged In: YES user_id=113328

Thanks for the explanation of find_module. That sounds very useful.

On a separate note, why do you expose sys.path_importer_cache? I can't imagine any reason why the user would ever touch it manually (other than for interactive experimentation).

msg41971 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-14 22:12

Logged In: YES user_id=92689

If you install a hook on sys.path_hooks and want it to affect sys.path items that may already have been used you'd need to clear sys.path_importer_cache. Say if you install a directory caching hook after site.py has been imported. Tricky, but possibly not uncommon. But at the very least the test script needs it .

I think it's on a similar level as sys.modules: you normally don't need it, except when you're mucking with custom import hooks... (I know, that's not an entirely fair comparison, but still.)

msg41972 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-15 14:28

Logged In: YES user_id=92689

Uploaded a new version. Many changes, but most are fairly small. Addressed a fairly long list of comments from Guido.

General:

zipimport.c:

msg41973 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-16 18:41

Logged In: YES user_id=92689

Uploaded new patch; no code changes, but a new test script: test_importhooks.py (uses unittest). Rewrote test_zipimport.py to also use unittest. test_importhooks.py contains some examples of trivial importer objects as well as an importer wrapper for imp.find_module/imp.load_module calls. Also contains some notes about the role and constraints of pkg.path.

msg41974 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-17 22:28

Logged In: YES user_id=92689

Fresh patch. Added zipimporter.get_data(path) method: returns the (uncompressed) data for a file as a string.

Implemented Wiktor Sadowski's suggestion: a module loaded with zipimport will have an importer attribute, which is set to the zipimporter instance that did the load. Possible quirk: for a package init.py file, this is the zipimporter that handles the directory containing the package dir, so importer.get_data(filename) looks for filename in that directory, not the package directory. An importer of an actual submodule does look in the package directory itself.

(Btw. I did this without adding a new PyImport_ExecCodeModuleEx2 API.)

msg41975 - (view)

Author: Paul Moore (paul.moore) * (Python committer)

Date: 2002-12-18 09:26

Logged In: YES user_id=113328

Be aware that this doesn't solve all of the issues (although it does solve a large class of them). Some applications require a real filename (wxPython did until recently for image data) and some need to stat the file (Guido quoted Zope).

At the very least it should be made clear that this is an optional part of the interface. Application code should always be prepared for an importer not to support get_data (if only by saying it can't work from a data store managed by this sort of hook).

My preference is for keeping the importer interface minimal and clean. At the moment 2 functions (find_module and get_module) are all that is required - and these can be in separate classes. Which class should have the get_data method? Which class gets assigned to importer? Is it the hook's responsibility to set importer (yet another responsibility)?

I think it's time for a PEP. Not for zipimporter, but purely for the import hook interface. That would be the place to argue such issues as I raised in the paragraph above. We need to remain clear on what the API requires vs what zipimporter does, to give writers of new hooks a decent chance.

msg41976 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-18 09:47

Logged In: YES user_id=92689

I totally agree that get_data() should be an optional part of the protocol. I consider it an experimental addition right now.

Quick thoughts: importer should be set to the object with the load_module() method. Yes, load_module() is responsible for setting it. It has to be, as it is load_module() that actually runs the code, and import must be available to the running code.

PEP: Yes. I started on it last night, doing some work on it as we speak.

msg41977 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2002-12-18 13:31

Logged In: YES user_id=6380

OTOH I will certainly be able to work around the lacking stat()ability; but I need to access the data.

I'm occupied with urgent Zope Corp work this week, so I don't have time to review more of this code, alas. But I recommend that Just incorporate the changes that Jim Ahlstrom suggested (from Jim's patch) to allow the standard library to be a zip archive (adding a standard zip file on the path).

If after that you still have time and energy, I'd like the search for builtins and for frozen modules to be turned into meta-importers (or whatever you call them) that are placed on the meta-importers hook list. That would be a nice refactoring of import.c!

msg41978 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-18 14:54

Logged In: YES user_id=92689

Std zip entry on sys.path: agreed, I've been meaning to look at that for a while now...

Refactoring: would be nice (but needs a third item: PathImporter), but it's quite a bit of extra work and makes it harder (I think) to keep imp.find_module/load_module compatible. I will have a closer look, but I tend to think this is a better project for 2.4. I currently prefer working on a PEP, if only to increase the chance my work hasn't been a waste of time ;-)

Btw. zi.get_data() currently raises ZipImporterError if a file is not found. IOError is probably better. Will also consider list_dir() & list_modules(). (Hm, I really dislike those underscores there, but I also don't want to remove them from find/load_module...)

msg41979 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-19 21:29

Logged In: YES user_id=92689

I added the default sys.path zip file item for unix builds, as PEP 273 prescribes. I've not done the refactoring that the PEP 273 implementation does (I really don't see why), but this means I can't use Jim Ahlstrom's PC/getpathp.c patch as is. I don't develop on Windows, so I wouldn't be able to test. Paul, do you think you can provide a minimal patch for PC/getpathp.c that adds the right thing to the default path? No rush.

Other changes:

msg41980 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-23 22:31

Logged In: YES user_id=92689

Several changes/fixes.

general:

import.c:

zipimporter.c

msg41981 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2002-12-30 22:34

Logged In: YES user_id=92689

Checked in with several changes:

Commit details: Include/pythonrun.h, rev. 2.57 Lib/site.py, rev. 1.47 Lib/test/test_importhooks.py, rev. 1.1 Lib/test/test_zipimport.py, rev. 1.1 Modules/Setup.dist, rev. 1.35 Modules/getpath.c, rev. 1.45 Modules/zipimport.c, rev. 1.1 PC/config.c, rev. 1.37 PC/getpathp.c, rev. 1.29 PCbuild/pythoncore.dsp, rev. 1.40 Python/import.c, rev. 2.215 Python/importdl.h, rev. 2.19 Python/pythonrun.c, rev. 2.172