Fix version_info cache invalidation, typing, parsing, and serialization by EliahKagan · Pull Request #1838 · gitpython-developers/GitPython (original) (raw)

Fixes #1829 - cache invalidation
Fixes #1830 - typing
Fixes #1833 - parsing
Fixes #1836 - serialization

Much of the design in the fixes for #1830, #1833, and #1836, as well as why I think it makes sense to include them here with #1829, is already largely covered in the discussion/comments in those three issues.

Design goals for #1829

For #1829, which I would regard to be the biggest of the changes here, and which is the change for which the most additional code is added to the test suite, I made sure to preserve three important properties of the code:

Refreshing is global but affects all Git instances. Changing the way refreshing itself fundamentally works is beyond the scope of this PR and would also be a breaking change that I think could only be made in a new major release, and there are good reasons for refreshing to be global. Allowing a per-instance overriding refresh might be useful, but would be much more complicated than the current system unless accompanied by other design changes that would likely be breaking.
Caching is performed for version_info attributes. This has been explicitly promised in the version_info docstring for a long time, with the intention that accessing version_info be understood as cheap, and it is likely that substantial code exists that assumes it is cheap (or that throttles parts of GitPython that use it and benefit from it being cheap).
Caching does not take place across Git instances, and accessing version_info on one Git instance never has any effect on what happens when version_info is accessed on another Git instance. Since the actual validity of cached version information depends on when the global operation of refreshing was last done, it may seem like the caching should be made global as well. But this can lead to some new problems, such that I suggest it either be put off until later or never done.

Why not cache globally?

To expand on why the that third property is something I've taken care to preserve, these are the three problems I anticipate will arise if the caching were to be made global, in increasing order of significance:

Although GitPython is not in general thread-safe (#1052), caching globally may still make it less capable of concurrency than before, and the interaction with multiprocessing would also have to be considered.
That the cache is per-instance is a long-standing behavior, which existing code may rely on. It is not explicitly stated in the property docstring, and I've taken care not to start stating it, in case it is to be changed. But I think it is how readers would have understood the docstring, since it is documenting an instance property and does not describe the cache as global or shared.
Not sharing the cache ensures that the code Git().version_info always makes a subprocess call to get fresh information. Besides being one of the ways this behavior may have been relied on already, changing it would also, in practice, lead to code that uses GitPython being written to refresh more often, which would have the opposite effect on speed and resource usage than is intended by caching.

The cache invalidation implementation

Rather than having the refresh operation reach out to all Git instances to invalidate their caches, I've added a _refresh_token class attribute that is set and reset along with GIT_PYTHON_GIT_EXECUTABLE during a refresh, and added a _version_info_token instance attribute (that, like _version_info, is excluded from pickle serialization).

When version_info is accessed, if _version_info_token is the same object as _refresh_token, then the cache is valid and the cached value in _version_info is returned. Otherwise, the cache is invalid, and git version is run to obtain the version; the version information is cached in _version_info, versioninfotoken is updated to refreshtoken, and the newly computed value is returned.

The _version_info_token instance attribute starts out as None to simplify debugging and unpickling. Its value is always either None or a value that has at some point been the value of the _refresh_token class attribute. The _refresh_token class attribute is never None, but always a direct instance of object, even before the first refresh, which ensures it never wrongly matches the initial None value of a _version_info_token instance attribute.

A successful refresh always causes _refresh_token to differ from any _version_info_token that exists at that time. This is important--and, in particular, these tokens must not be tied to the value of Git.GIT_PYTHON_GIT_EXECUTABLE--because refreshing must properly update for a change to which git implementation is accessed by the same command or even via the same path. The most common and important such case is probably when GIT_PYTHON_GIT_EXECUTABLE is the default value of "git", and git has been upgraded or a new version installed in a $PATH directory.

Dropping the `LazyMixin` base class

An aspect of this PR that I recommend be double-checked in review is my assumption that Git having LazyMixin as a base class can be considered a mere implementation detail of Git, and thus that it is non-breaking to remove inheritance from that base class.

I removed it because it was only being used to implement caching for the version_info property. It is well-suited to some forms of caching elsewhere in GitPython, but not well-suited to the changes needed in version_info for #1829 and #1836 (as discussed in #1836). Because I stopped using LazyMixin for this, it was no longer used for anything. Removing it simplified the code, though if necessary it could be brought back and just not used for anything. (The new version_info implementation would still work with LazyMixin present but unused.)

Other considerations

Keeping the `super().init` call

Git.__init__ contains this line:

This superficially appears related to the use of LazyMixin, but I believe that is not the case. LazyMixin, which is provided by the gitdb dependency, at least currently does not define an __init__ method. Furthermore, such a super() call was present even before LazyMixin was brought in as a base class in e583a9e to provide caching for the version_info property (which was also introduced at that time). At that time, Git had no base class besides object, yet this was still present.

Calling super().__init__() helps with cooperative multiple inheritance in some cases. It is sometimes done even in the absence of a specific need, though sometimes it's not the correct logic, if the next class in the MRO has an __init__ that is not callable with no arguments (or, rather, with only a self argument). Furthermore, you've given feedback that I have understood to mean that having Git support inheritance is not a design priority.

So I don't think this super().__init__() is present based on a general preference for it. Instead, looking back to when it was first introduced, this was in the first commit that had the Git class, 039bfe3. At that time, Git did have a base class, MethodMissingMixin.

It may be that it should either be removed or commented with an explanation. But I think it is sufficiently unconnected to LazyMixin (and version_info) that no change should be made to it in connection with the changes in this PR.

Effect of directly accessing `_version_info` has changed

When LazyMixin was used, reading from the nonpublic _version_info attribute behaved the same as reading from version_info. Code outside the Git class should not do that, but there is one place in GitPython's own tests where it did:

if rw_repo.git._version_info < (2, 29):

That had to be changed to use version_info instead. If necessary, a different backing attribute could be used, and _version_info could be made an undocumented alternative to version_info that issues a warning but otherwise behaves the same. However, it seems to me that this is not necessary; _version_info was never advertised as part of the public interface of Git, as far as I can tell.

`GIT_PYTHON_GIT_EXECUTABLE` as an instance attribute is not supported

If Git instances could have their own GIT_PYTHON_GIT_EXECUTABLE attributes, then they would be used instead of the same-named class attribute that is written in a refresh. This cannot happen on a direct Git instance, because Git is slotted and has no such slot (and cannot, since the descriptor would conflict with the class attribute). But it is possible to make a subclass of Git where GIT_PYTHON_GIT_EXECUTABLE is an instance attribute. This has never been supported and should not be done, and will likely work poorly in other ways, but it is possible for caching to "support" it by detecting that this is the case and just never caching version_info.

I have neither done this nor commented about it, and even this mention of it may be gratuitous. However, in general the effect of attempting to set what is assumed to be a class attribute on an instance should be considered. My reasoning for not dealing with this at all is that, even if a reasonable use case arises to derive from Git and shadow base-class class attributes as instance attributes (which seems unlikely), the onus is then on the author of the derived class to override base-class members like version_info accordingly.

Fix version_info cache invalidation, typing, parsing, and serialization by EliahKagan · Pull Request #1838 · gitpython-developers/GitPython (original) (raw)

Design goals for #1829

Why not cache globally?

The cache invalidation implementation

Dropping the LazyMixin base class

Other considerations

Keeping the super().__init__ call

Effect of directly accessing _version_info has changed

GIT_PYTHON_GIT_EXECUTABLE as an instance attribute is not supported

Dropping the `LazyMixin` base class

Keeping the `super().init` call

Effect of directly accessing `_version_info` has changed

`GIT_PYTHON_GIT_EXECUTABLE` as an instance attribute is not supported