Fix version_info cache invalidation, typing, parsing, and serialization by EliahKagan · Pull Request #1838 · gitpython-developers/GitPython (original) (raw)
Fixes #1829 - cache invalidation
Fixes #1830 - typing
Fixes #1833 - parsing
Fixes #1836 - serialization
Much of the design in the fixes for #1830, #1833, and #1836, as well as why I think it makes sense to include them here with #1829, is already largely covered in the discussion/comments in those three issues.
Design goals for #1829
For #1829, which I would regard to be the biggest of the changes here, and which is the change for which the most additional code is added to the test suite, I made sure to preserve three important properties of the code:
- Refreshing is global but affects all
Git
instances. Changing the way refreshing itself fundamentally works is beyond the scope of this PR and would also be a breaking change that I think could only be made in a new major release, and there are good reasons for refreshing to be global. Allowing a per-instance overriding refresh might be useful, but would be much more complicated than the current system unless accompanied by other design changes that would likely be breaking. - Caching is performed for
version_info
attributes. This has been explicitly promised in theversion_info
docstring for a long time, with the intention that accessingversion_info
be understood as cheap, and it is likely that substantial code exists that assumes it is cheap (or that throttles parts of GitPython that use it and benefit from it being cheap). - Caching does not take place across
Git
instances, and accessingversion_info
on oneGit
instance never has any effect on what happens whenversion_info
is accessed on anotherGit
instance. Since the actual validity of cached version information depends on when the global operation of refreshing was last done, it may seem like the caching should be made global as well. But this can lead to some new problems, such that I suggest it either be put off until later or never done.
Why not cache globally?
To expand on why the that third property is something I've taken care to preserve, these are the three problems I anticipate will arise if the caching were to be made global, in increasing order of significance:
- Although GitPython is not in general thread-safe (#1052), caching globally may still make it less capable of concurrency than before, and the interaction with multiprocessing would also have to be considered.
- That the cache is per-instance is a long-standing behavior, which existing code may rely on. It is not explicitly stated in the property docstring, and I've taken care not to start stating it, in case it is to be changed. But I think it is how readers would have understood the docstring, since it is documenting an instance property and does not describe the cache as global or shared.
- Not sharing the cache ensures that the code
Git().version_info
always makes a subprocess call to get fresh information. Besides being one of the ways this behavior may have been relied on already, changing it would also, in practice, lead to code that uses GitPython being written to refresh more often, which would have the opposite effect on speed and resource usage than is intended by caching.
The cache invalidation implementation
Rather than having the refresh operation reach out to all Git
instances to invalidate their caches, I've added a _refresh_token
class attribute that is set and reset along with GIT_PYTHON_GIT_EXECUTABLE
during a refresh, and added a _version_info_token
instance attribute (that, like _version_info
, is excluded from pickle serialization).
When version_info
is accessed, if _version_info_token
is the same object as _refresh_token
, then the cache is valid and the cached value in _version_info
is returned. Otherwise, the cache is invalid, and git version
is run to obtain the version; the version information is cached in _version_info
, versioninfotoken
is updated to refreshtoken
, and the newly computed value is returned.
The _version_info_token
instance attribute starts out as None
to simplify debugging and unpickling. Its value is always either None
or a value that has at some point been the value of the _refresh_token
class attribute. The _refresh_token
class attribute is never None
, but always a direct instance of object
, even before the first refresh, which ensures it never wrongly matches the initial None
value of a _version_info_token
instance attribute.
A successful refresh always causes _refresh_token
to differ from any _version_info_token
that exists at that time. This is important--and, in particular, these tokens must not be tied to the value of Git.GIT_PYTHON_GIT_EXECUTABLE
--because refreshing must properly update for a change to which git implementation is accessed by the same command or even via the same path. The most common and important such case is probably when GIT_PYTHON_GIT_EXECUTABLE
is the default value of "git"
, and git has been upgraded or a new version installed in a $PATH
directory.
Dropping the LazyMixin
base class
An aspect of this PR that I recommend be double-checked in review is my assumption that Git
having LazyMixin
as a base class can be considered a mere implementation detail of Git
, and thus that it is non-breaking to remove inheritance from that base class.
I removed it because it was only being used to implement caching for the version_info
property. It is well-suited to some forms of caching elsewhere in GitPython, but not well-suited to the changes needed in version_info
for #1829 and #1836 (as discussed in #1836). Because I stopped using LazyMixin
for this, it was no longer used for anything. Removing it simplified the code, though if necessary it could be brought back and just not used for anything. (The new version_info
implementation would still work with LazyMixin
present but unused.)
Other considerations
Keeping the super().__init__
call
Git.__init__
contains this line:
This superficially appears related to the use of LazyMixin
, but I believe that is not the case. LazyMixin
, which is provided by the gitdb dependency, at least currently does not define an __init__
method. Furthermore, such a super()
call was present even before LazyMixin
was brought in as a base class in e583a9e to provide caching for the version_info
property (which was also introduced at that time). At that time, Git
had no base class besides object
, yet this was still present.
Calling super().__init__()
helps with cooperative multiple inheritance in some cases. It is sometimes done even in the absence of a specific need, though sometimes it's not the correct logic, if the next class in the MRO has an __init__
that is not callable with no arguments (or, rather, with only a self
argument). Furthermore, you've given feedback that I have understood to mean that having Git
support inheritance is not a design priority.
So I don't think this super().__init__()
is present based on a general preference for it. Instead, looking back to when it was first introduced, this was in the first commit that had the Git
class, 039bfe3. At that time, Git
did have a base class, MethodMissingMixin.
It may be that it should either be removed or commented with an explanation. But I think it is sufficiently unconnected to LazyMixin
(and version_info
) that no change should be made to it in connection with the changes in this PR.
Effect of directly accessing _version_info
has changed
When LazyMixin
was used, reading from the nonpublic _version_info
attribute behaved the same as reading from version_info
. Code outside the Git
class should not do that, but there is one place in GitPython's own tests where it did:
if rw_repo.git._version_info < (2, 29): |
---|
That had to be changed to use version_info
instead. If necessary, a different backing attribute could be used, and _version_info
could be made an undocumented alternative to version_info
that issues a warning but otherwise behaves the same. However, it seems to me that this is not necessary; _version_info
was never advertised as part of the public interface of Git
, as far as I can tell.
GIT_PYTHON_GIT_EXECUTABLE
as an instance attribute is not supported
If Git
instances could have their own GIT_PYTHON_GIT_EXECUTABLE
attributes, then they would be used instead of the same-named class attribute that is written in a refresh. This cannot happen on a direct Git
instance, because Git
is slotted and has no such slot (and cannot, since the descriptor would conflict with the class attribute). But it is possible to make a subclass of Git
where GIT_PYTHON_GIT_EXECUTABLE
is an instance attribute. This has never been supported and should not be done, and will likely work poorly in other ways, but it is possible for caching to "support" it by detecting that this is the case and just never caching version_info
.
I have neither done this nor commented about it, and even this mention of it may be gratuitous. However, in general the effect of attempting to set what is assumed to be a class attribute on an instance should be considered. My reasoning for not dealing with this at all is that, even if a reasonable use case arises to derive from Git
and shadow base-class class attributes as instance attributes (which seems unlikely), the onus is then on the author of the derived class to override base-class members like version_info
accordingly.