Incremental compilation relies on hashes for soundness

Incremental compilation relies on hashes for soundness -- specifically, on SipHash-1-3 with an all-zero key, which is not a cryptographic hash function. Collisions in that hash can lead to undefined behavior (UB) and other ill effects. Is that a problem? Should we mitigate that risk? Or do we tell people they are expected to use non-incremental builds if they actually want to rely on the soundness of the result? Or is this not a bug at all and the current hashing scheme is "good enough"? After all, as @the8472 argues, if the programmer doesn't actively try to exploit the hard-coded all-zero key they are quite unlikely to hit a collision by pure chance. (That should follow from the fact that SipHash is a PRF, but I am no cryptographer.)
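To put a rough number on "quite unlikely by pure chance", here is a minimal sketch of the standard birthday-bound estimate. It assumes the hash behaves like a uniformly random function on non-adversarial inputs. The function name and the item counts and bit widths suggested below are illustrative assumptions, not figures taken from rustc.

```rust
/// Approximate probability that at least two of `items` distinct inputs
/// collide under an ideal `bits`-bit hash (birthday bound):
///     p ~= 1 - exp(-n(n-1) / 2^(bits+1))
/// Assumption: inputs are not chosen adversarially, so the hash can be
/// modelled as a uniformly random function.
fn collision_probability(items: f64, bits: i32) -> f64 {
    // Number of unordered pairs that could collide.
    let pairs = items * (items - 1.0) / 2.0;
    // Each pair collides with probability 2^-bits.
    let exponent = -pairs / 2f64.powi(bits);
    // exp_m1 keeps precision when the probability is tiny.
    -exponent.exp_m1()
}
```

For example, `collision_probability(1e9, 128)` models a billion hashed items with a 128-bit output, and `collision_probability(4294967296.0, 64)` shows how much worse a 64-bit output fares. Neither figure says anything about inputs crafted against a known key, which is the case this issue is worried about.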

Fixing this by using a cryptographic hash is likely to be very bad for performance. However, @michaelwoerister mentions that there could be cheaper mitigation techniques:

> Unless I'm overlooking something all incr. comp. specific hashes only need to be stable between successive incremental builds. So we can easily harden our use of SipHash there by generating random keys and then caching these in the incr. comp. cache.
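A minimal sketch of that idea, using only the standard library: generate a key once, store it next to the cached data, and reuse it on later builds so hashes stay comparable across sessions. Everything named here is hypothetical (`KEY_FILE`, `load_or_create_key`, `keyed_hash`, the on-disk layout) and is not how rustc's cache is actually organized. Note also that `std::hash::SipHasher` is SipHash-2-4 with a 64-bit output, so it only stands in for the hasher rustc uses, and `RandomState` is used as a convenient stand-in for reading the OS random number generator.

```rust
use std::collections::hash_map::RandomState;
use std::fs;
use std::hash::{BuildHasher, Hasher};
use std::io;
use std::path::Path;

// Hypothetical file name, chosen for this sketch only.
const KEY_FILE: &str = "hash-key.bin";

/// Stand-in randomness source: the standard library seeds `RandomState`
/// from the OS. A real implementation would query the OS RNG directly.
fn random_u64() -> u64 {
    RandomState::new().build_hasher().finish()
}

/// Load the 128-bit key stored in `cache_dir`, or create and persist a
/// fresh one if none exists (or the file is malformed). A fresh key means
/// any previously cached hashes can no longer be compared against.
fn load_or_create_key(cache_dir: &Path) -> io::Result<(u64, u64)> {
    let path = cache_dir.join(KEY_FILE);
    if let Ok(bytes) = fs::read(&path) {
        if bytes.len() == 16 {
            let (mut lo, mut hi) = ([0u8; 8], [0u8; 8]);
            lo.copy_from_slice(&bytes[..8]);
            hi.copy_from_slice(&bytes[8..]);
            return Ok((u64::from_le_bytes(lo), u64::from_le_bytes(hi)));
        }
    }
    let key = (random_u64(), random_u64());
    let mut bytes = Vec::with_capacity(16);
    bytes.extend_from_slice(&key.0.to_le_bytes());
    bytes.extend_from_slice(&key.1.to_le_bytes());
    fs::create_dir_all(cache_dir)?;
    fs::write(&path, &bytes)?;
    Ok(key)
}

/// Hash `data` under the given key. Without knowledge of the key, inputs
/// cannot be crafted offline to collide.
#[allow(deprecated)]
fn keyed_hash(key: (u64, u64), data: &[u8]) -> u64 {
    let mut hasher = std::hash::SipHasher::new_with_keys(key.0, key.1);
    hasher.write(data);
    hasher.finish()
}
```

Two successive calls to `load_or_create_key` on the same directory return the same key, so `keyed_hash` gives stable results across builds that share a cache, while a different cache directory gets an unrelated key. Whether a key that sits readable on disk counts as "unknown" to an attacker is a separate question this sketch does not answer.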

If we use a truly random key for this, we could rely on SipHash being a cryptographic PRF -- without knowing the key, it's supposed to be very hard to find collisions. However, that promise is typically made for SipHash-2-4, not the weakened variant SipHash-1-3 that rustc uses. There seems to be evidence that even the weaker function is "good", but I'll leave it to cryptographic experts to evaluate the evidence here. @briansmith, it would be good to get your take on this.

Nominating for t-lang discussion to see what their stance is on the soundness requirements for incremental compilation, and whether relying on a non-collision-resistant hash function for soundness is "good enough" -- or whether this should be considered an implementation decision, to be made by t-compiler. (Previously, t-lang ruled that TypeId should use a "full (non-truncated) cryptographic hash", but the tradeoffs are quite different here so it's not at all clear that the same decision would apply to incremental hashes.)

EDIT: I have moved the nomination to #129030 as that broader question directly affects this more specific question.