Initial Migration of Fuzz Tests & Integration Scripts From the OSS-Fuzz Project Repo by DaveLak · Pull Request #1901 · gitpython-developers/GitPython (original) (raw)

Ok, sorry for the delay. Below are my thoughts regarding fuzz_tree.py, but also, a little extra context on the OSS-Fuzz execution environment because it may be helpful for thinking about #1901 (comment).

About the OSS-Fuzz Environment

You both touched on it, but for the sake of shared understanding, the scripts and tests look like they do in part because of the OSS-Fuzz container environment implementation details.

The key points to understand are:

Build Stage Execution Environment

As per this page of the docs:

build.sh script environment
When your build.sh script is executed, the following locations are available within the image:

Location Env Variable Description
/out/ $OUT Directory to store build artifacts (fuzz targets, dictionaries, options files, seed corpus archives).
/src/ $SRC Directory to checkout source files.
/work/ $WORK Directory to store intermediate files.

Everything else is read-only.

Fuzzer Execution Stage Environment

As per this page of the docs:

File system
Everything except /tmp is read-only, including the directory that your fuzz target executable lives in.
/dev is also unavailable.

Regarding the Fuzz Tests

When I initially started on this, my primary objective was really only about fixing the build. I did try to clean them up a bit in the process and apply some best practices, but those changes were mostly incidental and I intentionally avoided any significant refactors.

After we decided to bring them here, my focus shifted to lowering the barrier to entry for contributors because of the nature of the project. In particular, I tried to distill the useful information from various place that I haven't seen documented in one place yet. One such tip is the direct execution section, but in hindsight I overlooked the environment assumptions in fuzz_tree.py.

On fuzz_tree.py Specifically

This test is less than ideal for several reasons.

  1. Writing to the /tmp dir

In addition to what has been mentioned, modifying global state and I/O operations (e.g., writing to the temp dir) are problematic for fuzzing in general.

Further, @EliahKagan's remark in #1901 (comment) about "implicating other things in /tmp" made me realize this may already be causing issues inside the OSS-Fuzz execution environment. A non-obvious implementation detail specific to Python projects in OSS-Fuzz that the compile_python_fuzzer function in the build.sh script uses Pyinstaller to bundle the fuzz harness and all related dependencies into a single file archive that gets transferred to the fuzzer execution stage, where it is unpacked into a subdirectory of /tmp for execution. Which, I believe means, the test execution is actually using the test executable as a fixture 🥴

I have yet to confirm this, but I wouldn't be surprised if that is the cause of the issue I described as a possible " circular dependency " in google/oss-fuzz#11763

  1. It's very slow

With an average execs per second of ~4 after the seed corpus loads on a short local run, it is far, far, below what is hoped for. In contrast, I'm seeing about 1432 exec/s on fuzz_config.py in a similar local run. fuzz_config.py creates a config fake in-memory but never writes it to disk.

I haven't dug in yet, but I think this is partly related to point 1 above and partly related to some of the implementation details of the legacy code in GitPython.

This is less of a concern/priority than the /tmp dir issue though because even if it's slow, it's still at least testing the API.

Thoughts on Next Steps

I'm fully in agreement that the /tmp issue and documentation related to it must be fixed before other updates are considered. I think it can be broken down into smaller problems in the following prioritized order:

  1. Address adverse potential that the local execution documentation exposes unaware contributors to.

Possible quick fix options:

  1. Address the /tmp writing implementation issues in fuzz_tree.py

I think the most robust solution would be some combination of both points above, but I don't know enough to comment on the potential issues that we may face goinf the TemporaryDirectory route inside OSS-Fuzz. The Dockerfile solution on the other hand is something I've already been toying with locally.

I also looked at options like MemoryFS from https://github.com/PyFilesystem/pyfilesystem2 because an in-memory solution would be ideal, but I think the git executable invocation may cause issues with that approach.

What do you both think?