Use length prefix in default Hasher::write_str by purplesyringa · Pull Request #134134 · rust-lang/rust


Conversation


purplesyringa

@purplesyringa

Using a 0xFF trailer is only correct for bytewise hashes. A generic Hasher is not necessarily bytewise, so use a length prefix in the default implementation instead.
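To make the distinction concrete, here is a rough sketch (assumed, not the verbatim std source) of the two default `write_str` strategies being compared. Since `write_length_prefix` is unstable, `write_usize` stands in for it here:

```rust
use std::hash::Hasher;

// Old default: append a 0xFF trailer. 0xFF never occurs inside valid UTF-8,
// so consecutive strings stay prefix-free -- but only for bytewise hashers.
fn write_str_trailer<H: Hasher>(h: &mut H, s: &str) {
    h.write(s.as_bytes());
    h.write_u8(0xff);
}

// New default: emit the length first. Prefix-free regardless of how the
// hasher consumes its input, bytewise or not.
fn write_str_len_prefix<H: Hasher>(h: &mut H, s: &str) {
    h.write_usize(s.len()); // std would call write_length_prefix here
    h.write(s.as_bytes());
}
```

Both schemes keep, say, `("ab", "c")` and `("a", "bc")` apart for a bytewise hasher; the difference only shows up for hashers that don't consume their input byte by byte.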

@rustbot

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @saethlin (or someone else) some time within the next two weeks.

Please see the contribution instructions for more information. Namely, in order to minimize review-time lag, PR authors and assigned reviewers should keep the review labels (S-waiting-on-review and S-waiting-on-author) updated, invoking the rustbot commands when appropriate.

@saethlin


bors added a commit to rust-lang-ci/rust that referenced this pull request

Dec 10, 2024


@bors

☀️ Try build successful - checks-actions
Build commit: 3eb8d2d (3eb8d2dcada8271314d3de3ee7d4705585eb041d)


@hanna-kruppe

Note that rustc-hash optionally overrides both write_str and write_length_prefix. The comments indicate this is intended to be enabled in rustc, but at a glance I don't see it being activated in the compiler's Cargo.toml files. Even if it isn't, rustc-hash mixes the length of slices passed to write into the hash, and hashing one u8 does the same work as hashing one usize, so I wouldn't expect any difference in raw hashing performance or in hash collisions.

@purplesyringa

I'm having trouble estimating the quality of the length mixing in rustc-hash, but I see your point. My fault for assuming rustc uses hashbrown's default hash function. Does anyone know a better way to test the performance impact?

@hanna-kruppe

I don't know of any hasher where one write_u8 vs. one write_usize should make a difference (the perf run will validate this for rustc-hash, assuming the nightly feature really isn't activated). Hashbrown and the hasher crates have their own benchmarks; maybe those are useful for testing this for more hashers.

The other experiment that would be useful is checking how easy or hard it is to trigger this theoretical problem. That is, pick some reasonably popular hasher that doesn't mix in the length and try to find a set of keys that'll lead to full hash collisions with the current write_str implementation. I believe that the classic example of padding with 0 bytes to the next usize-aligned boundary isn't a problem for foldhash and other hashers that handle odd-length strings by doing multiple overlapping reads, because a longer string will cause some of those reads to get different non-zero bytes.

@rust-timer

Finished benchmarking commit (3eb8d2d): comparison URL.

Overall result: ❌✅ regressions and improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

| | mean | range | count |
|:---|---:|---:|---:|
| Regressions ❌ (primary) | 0.7% | [0.7%, 0.7%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.4% | [-0.4%, -0.4%] | 1 |
| All ❌✅ (primary) | 0.7% | [0.7%, 0.7%] | 1 |

Max RSS (memory usage)

Results (primary -2.0%, secondary -2.4%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|:---|---:|---:|---:|
| Regressions ❌ (primary) | 7.5% | [7.5%, 7.5%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -6.7% | [-9.8%, -3.6%] | 2 |
| Improvements ✅ (secondary) | -2.4% | [-2.4%, -2.4%] | 1 |
| All ❌✅ (primary) | -2.0% | [-9.8%, 7.5%] | 3 |

Cycles

Results (primary -2.3%, secondary 2.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|:---|---:|---:|---:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.3% | [2.3%, 2.3%] | 1 |
| Improvements ✅ (primary) | -2.3% | [-2.3%, -2.3%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -2.3% | [-2.3%, -2.3%] | 1 |

Binary size

Results (primary 0.0%, secondary 0.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|:---|---:|---:|---:|
| Regressions ❌ (primary) | 0.2% | [0.0%, 0.5%] | 20 |
| Regressions ❌ (secondary) | 0.0% | [0.0%, 0.0%] | 4 |
| Improvements ✅ (primary) | -0.2% | [-0.7%, -0.0%] | 11 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [-0.7%, 0.5%] | 31 |

Bootstrap: 768.372s -> 769.733s (0.18%)
Artifact size: 331.03 MiB -> 331.05 MiB (0.00%)

@purplesyringa

IMO, the main problem here is that this issue introduces seed-independent collisions. This affects, say, DoS resistance.

To give another example, suppose I'd like to find a collision-free hash function for a given set of 1M elements. Typically, I can keep trying random seeds until I get one without collisions. But at the moment, this is not the case for the affected hashers.
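The "retry random seeds until collision-free" strategy can be sketched as follows (a hypothetical helper, using std's `RandomState`). The loop only terminates if collisions are seed-dependent; a key set with seed-independent collisions defeats it no matter how many seeds are tried:

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashSet;
use std::hash::BuildHasher;

// Try up to `max_tries` fresh random seeds, returning the first hasher state
// under which no two keys share a full 64-bit hash.
fn find_collision_free_state(keys: &[&str], max_tries: usize) -> Option<RandomState> {
    for _ in 0..max_tries {
        let state = RandomState::new(); // fresh random seed
        let mut seen = HashSet::new();
        // insert() returns false on a duplicate hash, i.e. a collision
        if keys.iter().all(|k| seen.insert(state.hash_one(k))) {
            return Some(state);
        }
    }
    None
}
```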

@hanna-kruppe

In theory, this is a strong argument. In practice it's not clear to me if there's any existing hasher that only has seed-independent collisions today because of write_str. Even if there is one, any set of collisions based solely on padding with zeros until the next multiple of 4 or 8 bytes would be a very small set, so the significance is unclear.

There are a bunch of hashers that have seed-independent collisions among strings of the same length, and many hashers already mix the length into write calls. Foldhash and wyhash are the only ones I'm aware of that don't seem to fall into either category, but their behavior is more subtle than "just pad with zero bytes until the next convenient multiple". I expect one could find a small set of seed-independent collisions in either (with the current Hasher::write_str default) by doing something smarter than appending zeros to a string, but it's not obvious to me how you'd turn that into a set of, say, 1 million strings. Unless you break the hash so thoroughly that changing Hasher::write_str wouldn't save it either.

@purplesyringa

In practice it's not clear to me if there's any existing hasher that only has seed independent collisions today because of write_str.

wyhash and rapidhash both try to resist seed-independent collisions. foldhash uses a similar scheme that, as far as I'm aware, is not subject to any realistic attacks. Although shash explicitly advises against use in scenarios where DoS matters, it makes a solid attempt at avoiding this pitfall, too.

Generally speaking, all modern quality hash functions try to resist DoS as well as they can, as SMHasher tests this behavior, so this is hardly surprising.

Of course, many cryptographic or almost-cryptographic hash functions have Rust implementations that are only safe because they're bytewise. The moment anyone decides to optimize them in a seemingly innocuous way, the jig is up, and DoS resistance is broken.

Foldhash and wyhash are the only ones I'm aware of that don't seem to fall into either category, but their behavior is more subtle than "just pad with zero bytes until the next convenient multiple".

For the core loops (pre-finalization):

These collisions typically don't matter, as the finalization step mixes in the total length. But that doesn't protect against multiple writes whose lengths sum to the same value, and that case arises only because of write_str.

There are a bunch of hashers that have seed-independent collisions among strings of the same length, and many hashers already mix the length into write calls.

I'd like to make a somewhat tangential counterargument.

No existing hashes are tuned to Rust's needs in particular, leading to suboptimal performance when using anything other than strings as keys. This is very much a barren field: I can think of multiple optimizations tuned for the streaming design of Hasher, and as far as I'm aware, no crate in existence tries to perform them.

Any attempt to solve this (like what I'm currently working on) will inevitably raise the question: what does Hasher guarantee? How do we know the tweaks don't make DoS attacks easy if the std implementation disagrees with the docs? I do believe these optimizations can be practical, and IMO formalizing the Hasher requirements and changing the write_str behavior is one step towards this goal.

@purplesyringa

An alternative sound formalization exists.

We can specify that write calls don't have to be prefix-free, and explicitly request the Hasher implementation to mix in the length. In this case, emitting 0xFF in write_str is redundant, and write_str can be optimized to just write(s.as_bytes()). Similarly, hashing a slice of integers won't need to emit the length either.
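A toy hasher following this alternative contract might look like the sketch below (assumed for illustration only; LenMixer is not a real hash function). The hasher itself mixes each write's length, so write_str could forward `s.as_bytes()` with no trailer or prefix at all:

```rust
use std::hash::Hasher;

struct LenMixer {
    state: u64,
}

impl Hasher for LenMixer {
    fn write(&mut self, bytes: &[u8]) {
        // Mix the length of this write first, then the bytes. Because each
        // write call contributes its own length, ("ab", "c") and ("a", "bc")
        // reach different states even though the concatenated bytes match.
        self.state = self.state.rotate_left(7)
            ^ (bytes.len() as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        for &b in bytes {
            self.state = self.state.rotate_left(5) ^ u64::from(b);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}
```

Under this contract, the prefix-freeness burden moves from callers (Hash impls and std defaults) to the Hasher implementation itself.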

Of course, this will break FNV and all other popular existing Hashers, so we probably don't want that. But I'd argue that favoring the current inconsistent semantics isn't right either.

@orlp

As the author of foldhash, my 2 cents is that I think it's always been a mistake to automatically mix in anything at all. This applies to both write_str, as well as when hashing a complete [u8] object (which doesn't actually have a dedicated function and must be emulated with write_length_prefix + write). The hash author most likely knows best how to most efficiently prevent length-extension and concatenation attacks, but the interface fundamentally prevents them from doing so. This fact is why we intentionally just completely ignore write_length_prefix in rustc-hash...

I would also like to call attention to the fact that write_str is still unstable, and that means foldhash is slower than necessary for strings (as it is written in stable Rust and I would like to not have different outputs based on feature flags). And it is always slower than necessary for hashing [T] slices, as they have a separate write_length_prefix call that can't be avoided.

@purplesyringa

Thinking aloud:

I agree that write_length_prefix is an odd API to have, as the hasher cannot, in the general case, reliably interpret it in any way other than write_usize. Indeed, Vec<Vec<Vec<...>>> emits several write_length_prefix calls in a row, so you can't, say, just XOR the length in.

I believe the extra avalanche step can be avoided only when the length and the array contents are available at once.

As such, it seems reasonable to remove write_length_prefix and switch all the users back to write_usize, but introduce a new method called write_bytes without the caller-side prefix-free requirement (similarly to write_str). The default implementation would be:

```rust
fn write_bytes(&mut self, bytes: &[u8]) {
    self.write_length_prefix(bytes.len());
    self.write(bytes);
}
```

The [{integer}] serialization methods would replace the current write_length_prefix + write with a single invocation of write_bytes. Hasher implementors would then be able to override write_bytes and write_str to get the behavior @orlp describes, while write could ignore the length.

The Hash docs would specify that if two objects a != b are hashed, the first differing calls into Hasher must be to the same method, with an extra requirement of prefix-freeness if that method is write. This would trivially hold for Hash impls that only write fixed-length arrays. Variable-length arrays would be advised to be hashed with write_bytes.

This change would be backwards-compatible (nightly excluded, but deprecating write_length_prefix for a single release would work too), as the default write_bytes implementation is correct for all methods, and directly implementing write_str via write_bytes is correct too.

Does this sound reasonable or am I missing something?

@Amanieu

As the author of foldhash, my 2 cents is that I think it's always been a mistake to automatically mix in anything at all. This applies to both write_str, as well as when hashing a complete [u8] object (which doesn't actually have a dedicated function and must be emulated with write_length_prefix + write). The hash author most likely knows best how to most efficiently prevent length-extension and concatenation attacks, but the interface fundamentally prevents them from doing so. This fact is why we intentionally just completely ignore write_length_prefix in rustc-hash...

Ignoring write_length_prefix is incorrect because not all slices will end up calling write when hashed: a slice of NewType(u64), where Hash for NewType calls write_u64, will end up with this sequence of Hasher calls: write_length_prefix(len) followed by one write_u64 per element.

If write_length_prefix is ignored, then only the write_u64 calls remain and this will be vulnerable to a concatenation attack.
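The attack can be demonstrated with a toy FNV-style hasher (assumed for illustration). Driving it with the call sequence a length-ignoring hasher would effectively see makes `[[1], [2, 3]]` and `[[1, 2], [3]]` collide, while mixing the length back in separates them:

```rust
use std::hash::Hasher;

#[derive(Default)]
struct ToyFnv {
    state: u64,
}

impl Hasher for ToyFnv {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.state = self.state.wrapping_mul(0x0000_0100_0000_01B3) ^ u64::from(b);
        }
    }
    fn finish(&self) -> u64 {
        self.state
    }
}

// What a hasher that ignores write_length_prefix effectively sees: only the
// per-element write_u64 calls, with no slice boundaries.
fn hash_ignoring_len(groups: &[&[u64]]) -> u64 {
    let mut h = ToyFnv::default();
    for g in groups {
        for &x in *g {
            h.write_u64(x);
        }
    }
    h.finish()
}

// Same traversal, but with each slice's length mixed in first (write_usize
// stands in for the unstable write_length_prefix).
fn hash_with_len(groups: &[&[u64]]) -> u64 {
    let mut h = ToyFnv::default();
    for g in groups {
        h.write_usize(g.len());
        for &x in *g {
            h.write_u64(x);
        }
    }
    h.finish()
}
```

Both nestings flatten to the same write_u64(1), write_u64(2), write_u64(3) sequence when lengths are dropped, which is exactly the concatenation attack.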

IMO write shouldn't blindly mix the length of the slice into the hash: the purpose of write_length_prefix is to inform the hasher that the length of a slice is actually relevant. However, @purplesyringa is correct that the API as it stands forces the hasher to implement it as write_usize.

If write_length_prefix is not suitable then we should have a better API which can inform the hasher of the length of a sequence of items.

@orlp

I never meant to suggest that write_length_prefix should be removed entirely. It is 100% necessary for general user-defined collections. I simply wish it wasn't forcibly invoked for [T] slices of built-in types.

Labels

S-waiting-on-author

Status: This is awaiting some action (such as code changes or more information) from the author.