Reorganize test_util and make xfail marks precise by EliahKagan · Pull Request #1729 · gitpython-developers/GitPython (original) (raw)

(I've posted in the review thread above about the narrower and separate issue of why I use paths like tmppath / "testdir" instead of just tmppath.)

I'm curious if there are any areas where the test suite is written in a way that prevents parallelism that would otherwise be feasible and desirable. If so, I would be interested in trying to remedy that.

Some of the use of temporary files is done with insecure deprecated functions, which would best be replaced, as noted in gitpython-developers/smmap#41. That issue is in smmap, but GitPython and gitdb also contain such code. Although these are subject to a race condition, it is mostly an issue between the process that uses the insecure function and other processes running other code, rather than in the affected code performing multithreading or multiprocessing. The same temporary paths are not automatically reused, except by strange chance or deliberate arrangement; new pseudorandom names for temporary files and directories are still generated. The biggest reason it can be a problem is security-related.

The facility that was previously used in the specific test code changed in this PR was TemporaryDirectory from the Python standard library. Like pytest's tmp_path, this is free of that security weakness, and like all the techniques for getting usable temporary file and directory paths discussed here, it gives a new path and can be used in parallel (including, but not limited to, across threads of the same process).

I'm aware of two factors that limit GitPython itself in multithreaded use. The first is that it changes directory temporarily to do some operations, as reported in #1052. The second is that the fix I made for #1635 (in #1636, and improved in #1650 to fix #1646) creates a race condition on os.environ, as pointed out in #1650 (comment) (see also #1650 (comment)). So that the tests also change directory and patch os.environ doesn't seem like a problem.

The test suite does also do various monkey-patching that would cause problems if tests were run using multithreading, and this goes beyond any problem that I am aware of already existing in the code under test. It is possible to attempt to run tests using multithreading, and I think there are plugins for doing that, though I'm not familiar with the details of any of them. What I think is more common, and more useful for Python tests, is to run them using multiprocessing, such as with pytest-xdist. This retains isolation across tests, so it should work fine even when tests change directory, mutate os.environ, and monkey-patch GitPython and standard library facilities. Multiprocessing is also typically faster than multithreading in Python, except for code that is IO-bound or that delegates CPU-bound operations to native-code extension modules that release the GIL, though this situation will eventually improve.

If the same temporary directory/files are reused in a way that can clash, or if code is taking what amounts to a lock on rorepo's shared backing repository (currently the GitPython repository), etc., then this would get in the way of multiprocessing just as much as multithreading, thereby preventing test parallelism from being achieved through multiprocessing using pytest-xdist or other such tools. This is the sort of thing I'd be especially interested in becoming aware of and improving, though if there are reasons to specifically want multi_threaded_ test runners to work, then I am of course interested.