Use length prefix in default `Hasher::write_str`
by purplesyringa · Pull Request #134134 · rust-lang/rust
Conversation
Using a 0xFF trailer is only correct for bytewise hashes. A generic `Hasher` is not necessarily bytewise, so use a length prefix in the default implementation instead.
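For context, the before/after default bodies look roughly like this (a sketch, not the exact std source; `write_str` and `write_length_prefix` are unstable methods of `Hasher`):

```rust
// Sketch of the default method inside `trait Hasher`:
fn write_str(&mut self, s: &str) {
    // Old default: trail with 0xFF, a byte that never occurs in UTF-8, so
    // consecutive string writes stay prefix-free -- but only if the hasher
    // actually consumes its input byte by byte.
    //
    //     self.write(s.as_bytes());
    //     self.write_u8(0xff);

    // New default: a length prefix remains unambiguous even for hashers
    // that process each `write` call in whole words or blocks.
    self.write_length_prefix(s.len());
    self.write(s.as_bytes());
}
```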
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @saethlin (or someone else) some time within the next two weeks.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (`S-waiting-on-review` and `S-waiting-on-author`) stays updated, invoking these commands when appropriate:

- `@rustbot author`: the review is finished, PR author should check the comments and take action accordingly
- `@rustbot review`: the author is ready for a review, this PR will be queued again in the reviewer's queue
bors added a commit to rust-lang-ci/rust that referenced this pull request.

☀️ Try build successful - checks-actions
Build commit: 3eb8d2d (`3eb8d2dcada8271314d3de3ee7d4705585eb041d`)
Note that rustc-hash optionally overrides both `write_str` and `write_length_prefix`. The comments indicate this is intended to be enabled in rustc, but at a glance I don't see it being activated in the compiler Cargo.toml files. But even if it isn't, rustc-hash mixes in the length of slices passed to `write`, and hashing one u8 does the same work as hashing one usize, so I wouldn't expect any difference in raw hashing performance or in hash collisions.
I'm having trouble estimating the quality of `len` mixing in rustc-hash, but I see your point. My fault for assuming rustc uses hashbrown's default hash function. Does anyone know a better way to test the performance impact?
I don't know of any hasher where one `write_u8` vs one `write_usize` should make a difference (the perf run will validate this for rustc-hash, assuming the `nightly` feature really isn't activated). Hashbrown and hasher crates have their own benchmarks; maybe those are useful for testing that for more hashers.
The other experiment that would be useful is checking how easy or hard it is to trigger this theoretical problem. That is, pick some reasonably popular hasher that doesn't mix in the length and try to find a set of keys that'll lead to full hash collisions with the current `write_str` implementation. I believe that the classic example of padding with 0 bytes to the next usize-aligned boundary isn't a problem for foldhash and other hashers that handle odd-length strings by doing multiple overlapping reads, because a longer string will cause some of those reads to get different non-zero bytes.
Finished benchmarking commit (3eb8d2d): comparison URL.
Overall result: ❌✅ regressions and improvements - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression
Instruction count
This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
|  | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.7% | [0.7%, 0.7%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.4% | [-0.4%, -0.4%] | 1 |
| All ❌✅ (primary) | 0.7% | [0.7%, 0.7%] | 1 |
Max RSS (memory usage)
Results (primary -2.0%, secondary -2.4%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
|  | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 7.5% | [7.5%, 7.5%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -6.7% | [-9.8%, -3.6%] | 2 |
| Improvements ✅ (secondary) | -2.4% | [-2.4%, -2.4%] | 1 |
| All ❌✅ (primary) | -2.0% | [-9.8%, 7.5%] | 3 |
Cycles
Results (primary -2.3%, secondary 2.3%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
|  | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.3% | [2.3%, 2.3%] | 1 |
| Improvements ✅ (primary) | -2.3% | [-2.3%, -2.3%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -2.3% | [-2.3%, -2.3%] | 1 |
Binary size
Results (primary 0.0%, secondary 0.0%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
|  | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.2% | [0.0%, 0.5%] | 20 |
| Regressions ❌ (secondary) | 0.0% | [0.0%, 0.0%] | 4 |
| Improvements ✅ (primary) | -0.2% | [-0.7%, -0.0%] | 11 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [-0.7%, 0.5%] | 31 |
Bootstrap: 768.372s -> 769.733s (0.18%)
Artifact size: 331.03 MiB -> 331.05 MiB (0.00%)
IMO, the main problem here is that this issue introduces seed-independent collisions. This affects, say, DoS resistance.
To give another example, suppose I'd like to find a collision-free hash function for a given set of 1M elements. Typically, I can keep trying random seeds until I get one without collisions. But at the moment, this is not the case for the affected hashers.
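A sketch of that workflow, with std's `RandomState` standing in for any seedable `BuildHasher`; if the key set contains seed-independent collisions, the loop below never terminates:

```rust
use std::collections::HashSet;
use std::hash::{BuildHasher, Hash, RandomState};

/// Draw random seeds until one hashes every key to a distinct u64.
fn find_collision_free_state<T: Hash>(keys: &[T]) -> RandomState {
    loop {
        let state = RandomState::new();
        let mut seen = HashSet::with_capacity(keys.len());
        // `insert` returns false on a duplicate hash, i.e. a collision.
        if keys.iter().all(|k| seen.insert(state.hash_one(k))) {
            return state;
        }
    }
}
```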
In theory, this is a strong argument. In practice it's not clear to me if there's any existing hasher that only has seed-independent collisions today because of `write_str`. Even if there is one, any set of collisions based solely on padding with zeros to the next multiple of 4 or 8 bytes would be a very small set, so the significance is unclear.

There are a bunch of hashers that have seed-independent collisions among strings of the same length, and many hashers already mix the length into `write` calls. Foldhash and wyhash are the only ones I'm aware of that don't seem to fall into either category, but their behavior is more subtle than "just pad with zero bytes until the next convenient multiple". I expect one could find a small set of seed-independent collisions in either (with the current `Hasher::write_str` default) by doing something smarter than appending zeros to a string, but it's not obvious to me how you'd turn that into a set of, say, 1 million strings. Unless you break the hash so thoroughly that changing `Hasher::write_str` wouldn't save it either.
> In practice it's not clear to me if there's any existing hasher that only has seed-independent collisions today because of `write_str`.
`wyhash` and `rapidhash` both try to resist seed-independent collisions. `foldhash` uses a similar scheme that, as far as I'm aware, is not subject to any realistic attacks. Although `shash` explicitly advises against use in scenarios where DoS matters, it makes a solid attempt at avoiding this pitfall, too.
Generally speaking, all modern quality hash functions try to resist DoS as well as they can, as SMHasher tests this behavior, so this is hardly surprising.
Of course, many cryptographic or almost-cryptographic hash functions have Rust implementations that are only safe because they're bytewise. The moment anyone decides to optimize them in a seemingly innocuous way, the jig is up, and DoS resistance is broken.
> Foldhash and wyhash are the only ones I'm aware of that don't seem to fall into either category, but their behavior is more subtle than "just pad with zero bytes until the next convenient multiple".
For the core loops (pre-finalization):

- `foldhash`: `0123456789abcde` collides with `01234567789abcde`
- `wyhash`: `a` collides with `a\0`
- `rapidhash`: `0123456aaaabbbb` collides with `0123456aaaaabbbb`
These collisions don't typically matter, as the finalization mixes in the total length. But this doesn't protect against multiple writes whose lengths sum up to the same value. That case is broken only because of `write_str`.
> There are a bunch of hashers that have seed-independent collisions among strings of the same length, and many hashers already mix the length into `write` calls.
I'd like to make a somewhat tangential counterargument.
No existing hashes are tuned to Rust's needs in particular, leading to suboptimal performance when using anything other than strings as keys. This is very much a barren field: I can think of multiple optimizations tuned for the streaming design of `Hasher`, and as far as I'm aware, no crate in existence tries to perform them.

Any attempt to solve this (like what I'm currently working on) will inevitably raise the question: what does the `Hasher` guarantee? How do we know the tweaks don't make DoS easy, if the std implementation disagrees with the docs? I do believe these optimizations can be practical, and IMO, formalizing `Hasher` requirements and changing the `write_str` behavior is one step towards this goal.
An alternative sound formalization exists. We can specify that `write` calls don't have to be prefix-free, and explicitly require the `Hasher` implementation to mix in the length. In this case, emitting 0xFF in `write_str` is redundant, and `write_str` can be optimized to just `write(s.as_bytes())`. Similarly, hashing a slice of integers won't need to emit the length either.
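A minimal sketch of a hasher under that contract (the `LenMixingHasher` name and `mix` step are illustrative; only the stable `write` is shown, since `write_str` can't be overridden on stable):

```rust
use std::hash::Hasher;

/// Illustrative hasher that takes responsibility for length mixing itself,
/// so callers would need no 0xFF trailer or length prefix.
struct LenMixingHasher {
    state: u64,
}

impl Hasher for LenMixingHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Fold the chunk length in first: each `write` is self-delimiting,
        // which is exactly the guarantee the alternative contract demands.
        self.state = mix(self.state, bytes.len() as u64);
        for &b in bytes {
            self.state = mix(self.state, b as u64);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

// Toy mixing step for the sketch; a real hasher would use something stronger.
fn mix(acc: u64, v: u64) -> u64 {
    (acc ^ v).wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(23)
}
```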
Of course, this will break FNV and all other popular existing `Hasher`s, so we probably don't want that. But I'd argue that favoring the current inconsistent semantics isn't right either.
As the author of foldhash, my 2 cents is that I think it's always been a mistake to automatically mix in anything at all. This applies both to `write_str` and to hashing a complete `[u8]` object (which doesn't actually have a dedicated function and must be emulated with `write_length_prefix` + `write`). The hash author most likely knows best how to most efficiently prevent length-extension and concatenation attacks, but the interface fundamentally prevents them from doing so. This fact is why we intentionally just completely ignore `write_length_prefix` in rustc-hash...
I would also like to call attention to the fact that `write_str` is still unstable, and that means foldhash is slower than necessary for strings (as it is written in stable Rust and I would like to not have different outputs based on feature flags). And it is always slower than necessary for hashing `[T]` slices, as they have a separate `write_length_prefix` call that can't be avoided.
Thinking aloud:

I agree that `write_length_prefix` is an odd API to have, as the hasher cannot, in the general case, reliably interpret it in any way other than `write_usize`. Indeed, `Vec<Vec<Vec<...>>>` emits several `write_length_prefix` calls in a row, so you can't, say, just XOR the length in.
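To see those back-to-back calls concretely, here is a toy tracing hasher (illustrative; on stable, `write_length_prefix` defaults to forwarding to `write_usize`, so the prefixes show up there):

```rust
use std::hash::{Hash, Hasher};

/// Toy hasher that records the call sequence instead of hashing.
#[derive(Default)]
struct TraceHasher(Vec<String>);

impl Hasher for TraceHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.0.push(format!("write({bytes:?})"));
    }
    fn write_usize(&mut self, n: usize) {
        // write_length_prefix defaults to write_usize, so prefixes land here.
        self.0.push(format!("write_usize({n})"));
    }
    fn finish(&self) -> u64 {
        0
    }
}

fn main() {
    let mut h = TraceHasher::default();
    vec![vec![vec![1u8]]].hash(&mut h);
    // Three consecutive length calls (outer, middle, inner Vec), then the byte:
    // ["write_usize(1)", "write_usize(1)", "write_usize(1)", "write([1])"]
    println!("{:?}", h.0);
}
```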
I believe that the only possibility in which this extra avalanche step can be avoided is when the length and the array contents are available at once.

As such, it seems reasonable to remove `write_length_prefix` and switch all the users back to `write_usize`, but introduce a new method called `write_bytes` without the caller-side prefix-free requirement (similarly to `write_str`). The default implementation would be:
```rust
fn write_bytes(&mut self, bytes: &[u8]) {
    self.write_length_prefix(bytes.len());
    self.write(bytes);
}
```
The `[{integer}]` serialization methods would replace the current `write_length_prefix` + `write` with a single invocation of `write_bytes`. `Hasher` implementors would then be able to override `write_bytes` and `write_str` to get the behavior @orlp describes, while `write` could ignore the length.
The `Hash` docs would specify that if two objects `a != b` are hashed, the first differing calls into `Hasher` must be to the same method, with an extra requirement of prefix-freeness if the method is `write`. This would trivially hold for `Hash`es that only `write` fixed-length arrays. Variable-length arrays would be advised to be hashed with `write_bytes`.
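As a standalone approximation of that surface (the real change would add `write_bytes` to `core::hash::Hasher` itself; `HasherExt` is a hypothetical name, and `write_usize` stands in for the unstable `write_length_prefix`):

```rust
use std::hash::Hasher;

/// Hypothetical extension trait mirroring the proposed method.
trait HasherExt: Hasher {
    /// No prefix-free requirement on the caller; the default reproduces
    /// today's length-prefix-then-write sequence.
    fn write_bytes(&mut self, bytes: &[u8]) {
        self.write_usize(bytes.len());
        self.write(bytes);
    }
}
```

A hasher in the foldhash mold could then override `write_bytes` to fold the length and contents together in a single pass instead of treating the length as a separate integer write.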
This change would be backwards-compatible (nightly excluded, but deprecating `write_length_prefix` for a single release would work too), as the default `write_bytes` implementation is correct for all methods, and directly implementing `write_str` via `write_bytes` is correct too.
Does this sound reasonable or am I missing something?
> As the author of foldhash, my 2 cents is that I think it's always been a mistake to automatically mix in anything at all. This applies both to `write_str` and to hashing a complete `[u8]` object (which doesn't actually have a dedicated function and must be emulated with `write_length_prefix` + `write`). The hash author most likely knows best how to most efficiently prevent length-extension and concatenation attacks, but the interface fundamentally prevents them from doing so. This fact is why we intentionally just completely ignore `write_length_prefix` in rustc-hash...
Ignoring `write_length_prefix` is incorrect because not all slices will end up calling `write` when hashed: a slice of `NewType(u64)` where `Hash for NewType` calls `write_u64` will end up with this sequence of `Hasher` calls:

- `write_length_prefix(slice.len())`
- `write_u64(slice[0])`
- `write_u64(slice[1])`
- etc.
If `write_length_prefix` is ignored, then this will be vulnerable to a concatenation attack.
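A sketch of the shape being described (the type is from the comment above; the colliding pair at the end is an illustrative consequence, not a quoted example):

```rust
use std::hash::{Hash, Hasher};

struct NewType(u64);

impl Hash for NewType {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.0); // never goes through `write`
    }
}

// Hashing a pair (Vec<NewType>, Vec<NewType>) emits:
//   write_length_prefix(a.len()), write_u64(..)...,
//   write_length_prefix(b.len()), write_u64(..)...
// A hasher that ignores write_length_prefix sees the same u64 stream for
//   ([1, 2], [3])   and   ([1], [2, 3])
// so those two values collide for every seed: a concatenation attack.
```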
IMO `write` shouldn't blindly mix the length of the slice into the hash: the purpose of `write_length_prefix` is to inform the hasher that the length of a slice is actually relevant. However, @purplesyringa is correct in that the API as it stands forces the hasher to implement it as `write_usize`.
If `write_length_prefix` is not suitable, then we should have a better API which can inform the hasher of the length of a sequence of items.
I never meant to suggest that `write_length_prefix` should be removed entirely. It is 100% necessary for general user-defined collections. I simply wish it wasn't forcibly invoked for `[T]` slices of built-in types.