TypeId: use a (v0) mangled type to remain sound in the face of hash collisions. by eddyb · Pull Request #95845 · rust-lang/rust
This implements the strategy described in #77125 (comment), resulting in:

- collision resistance: `TypeId`s become as sound as v0 symbol names
  - that is, either can only be subverted by bypassing Cargo and linking together artifacts that share a crate identity (both name and disambiguator hash) but not the same sources
    - on some platforms we may be able to instruct the linker to error based on e.g. different data in a symbol whose name only depends on crate identity, but in which we place the SVH (which depends on the crate's whole source/AST/HIR)
    - however, improving symbol name soundness is not in the scope of this PR, only `TypeId` parity with the best tool we have for serializing Rust identities in the face of separate compilation (i.e. v0 symbol names)
- breaking code `transmute`-ing `TypeId`s to `u64`
  - you can see some examples in "Widen TypeId from 64 bits to 128." #75923 (comment)
  - this is why the `(u64, ptr)` layout was used in this PR (instead of a single pointer with the `u64` hash stored behind it) - it's strictly larger irrespective of target
- breaking code calling `TypeId::of` at compile-time (still unstable) then either `transmute`-ing it to an integer, or using it in a pattern (turns out we had a test for the latter)
  - effectively unblocks "Tracking Issue for `const fn type_id`" #77125
- (optionally) `TypeId` + `rustc_demangle` (which `std` already depends on) could offer access to the type name at runtime, which has been asked for before:
  - "Intrinsic for type_name_of_id to power a better impl Debug for TypeId?" #61533
  - "Expose type_name() method on Any." #68379
  - this would be equivalent to `type_info::name` from C++ RTTI (where `type_info` is obtained with the `typeid(T)` operator, which acts like our `TypeId::of::<T>()`)
  - while the type mangling string is not exposed in this PR, it could easily be later - but embedding `rustc_demangle` into `core` would be harder to do (and justify)
This implementation adds a new field to `TypeId`:

```rust
struct TypeId {
    hash: u64,
    mangling: &'static TypeManglingStr,
}
```

The `&TypeManglingStr` type is like `&str` but "thin", with the `usize` length being stored before the `str` data (an optimization to avoid making `TypeId` more than double in size).
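For intuition, here is a minimal sketch of how such a length-prefixed "thin" string could be laid out; `ThinStrHeader` and `as_str` are hypothetical names for illustration, not the PR's actual definitions:

```rust
use core::{slice, str};

// Hedged sketch: the length lives in memory right before the UTF-8 bytes,
// so a reference to the header is a single machine word, unlike `&str`
// (which carries its length alongside the pointer).
#[repr(C)]
struct ThinStrHeader {
    len: usize,
    // the `len` UTF-8 bytes are assumed to be placed immediately after
    // this header by whoever emits the static data
}

impl ThinStrHeader {
    /// Reconstruct a normal `&str` from the length stored in the header.
    ///
    /// Safety: the header must really be followed by `len` valid UTF-8 bytes.
    unsafe fn as_str(&self) -> &str {
        let data = (self as *const ThinStrHeader).add(1) as *const u8;
        str::from_utf8_unchecked(slice::from_raw_parts(data, self.len))
    }
}
```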
To avoid comparing the string contents of the `mangling` field every time, `a == b` has:

1. a "fast reject" path for `a.hash != b.hash`
   - i.e. `a` and `b` are definitely different if their `hash`es differ
2. a "fast accept" path for `ptr::eq(a.mangling, b.mangling)`
   - i.e. `a` and `b` are definitely the same if their `mangling`s are the same in memory
   - this is just one address comparison thanks to the use of the "thin" `&TypeManglingStr` type, which has its length stored behind the pointer
3. a fallback path: compare the actual `mangling` string contents

While 1. and 2. could come in either order (as `hash` and `mangling` equalities are intrinsically linked), for simplicity (and being able to rely on `#[derive]` a bit), the order used above is the order they are checked in, in this PR currently.

Ideally, the fallback path won't be hit outside of a hash collision, and this should be the case today if the same crate (or at least CGU) has caused the codegen of both `TypeId::of` calls which produced the two `TypeId`s in the first place (which will cause their `mangling` fields to point at the same place in read-only static memory).
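As a hedged illustration (not the PR's actual code), the three paths above roughly amount to the following, with `TypeManglingStr` replaced by a simple stand-in so the sketch is self-contained:

```rust
use core::ptr;

// Simplified stand-in for the PR's thin string type, just so this sketch
// compiles on its own (the real type stores its length behind the pointer).
struct TypeManglingStr {
    contents: &'static str,
}

struct TypeId {
    hash: u64,
    mangling: &'static TypeManglingStr,
}

impl PartialEq for TypeId {
    fn eq(&self, other: &Self) -> bool {
        if self.hash != other.hash {
            // 1. Fast reject: differing hashes mean definitely-different types.
            false
        } else if ptr::eq(self.mangling, other.mangling) {
            // 2. Fast accept: the same static mangling data means the same type.
            true
        } else {
            // 3. Fallback, ideally only hit on a hash collision or when the
            //    mangling data wasn't deduplicated across crates/CGUs.
            self.mangling.contents == other.mangling.contents
        }
    }
}
```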
However, to get global deduplication (across all crates/CGUs), we'd have to start using certain linker features (discussed in e.g. #75923), which I've left out of this PR.
As it stands, this implementation is comparable to the `type_info` comparison strategy libcxx calls NonUnique, which seems to be used at least on Windows (and its existence suggests we may never be able to guarantee deduplication, only do it on a best-effort basis).
Fixes #10389
cc @oli-obk @RalfJung @KodrAus @mystor
rustbot added the T-compiler label (Relevant to the compiler team, which will review and decide on the PR/issue).
Some changes occurred to `rustc_codegen_gcc`
cc @antoyo
Some changes occurred to the CTFE / Miri engine
cc @rust-lang/miri
r? @lcnr
(rust-highfive has picked a reviewer for you, use r? to override)
That sounds awesome! Will this fix #10389?
```
@@ -39,6 +39,87 @@ pub(crate) fn const_caller_location(
    ConstValue::Scalar(Scalar::from_pointer(loc_place.ptr.into_pointer_or_addr().unwrap(), &tcx))
}

pub(crate) fn const_type_id<'tcx>(
```
I think I'd prefer this helper to be in the `interpret` module, not `const_eval` -- it is needed for all interpreter instances after all, `const_eval` is meant for specifically the CTFE instance of the general interpreter infrastructure.
EDIT: It is needed also for regular codegen, but I think my point remains -- `const_eval` should be for things that are specific to CTFE. `interpret` should not usually import anything from `const_eval`.
You're also creating your own `ecx` here, and returning this as a `ConstValue` -- why all that?
I guess an alternative might be to treat this like `vtable_allocation`?
`type_name::alloc_type_name` also seems similar in spirit?
Oooh I see, I was modelling it after `const_caller_location` but that's only here because it's a query instead of an intrinsic. Should it have its own module under `interpret::intrinsics` like `caller_location` does?
Yeah, note in particular that the `caller_location` implementation in `interpret` does not call the `const_eval` code. It is not a nullary intrinsic so it takes slightly different paths. Though probably the `const_caller_location` query should be made more like the vtable query, which is not inside `const_eval` either.

> Should it have its own module under interpret::intrinsics like caller_location does?

Yeah I guess that would make sense.
It would almost make sense to move these queries for intrinsics that are implemented via interpreter stuff but also used for codegen into a separate module `rustc_const_eval::intrinsics`, i.e., outside `interpret`. @oli-obk any thoughts?
Looking at it, `caller_location` actually has an `impl<'mir, 'tcx: 'mir, M: Machine<'mir, 'tcx>> InterpCx<'mir, 'tcx, M>` block, so having that inside `interpret` makes sense. `type_name`, OTOH, doesn't really do anything with the interpreter; it is very much like the `vtable` code (but the two are in totally different locations, which we should probably fix). "Give me the vtable for type T" is not technically an intrinsic but not very different, either, is it?
I think cleaning this up is orthogonal.
Except for `mplace_field` we don't even need an `InterpCx` and can just create and intern the `Allocation`s directly. If we added more convenience operations for writing data of a specific struct to an `Allocation`, we can probably stop using `InterpCx` for `caller_location`, too.
Imo we should just merge this PR and then work on cleaning these up now that we have multiple intrinsics to clean up in a similar way.
I agree cleaning up the existing stuff is a separate task.
But we shouldn't have anything load-bearing inside the `const_eval` module, so I do think that this code should be moved somewhere else before landing. I don't care strongly which of the existing precedents it follows, but none of them has load-bearing code in `const_eval`, so let's not add yet another different style of implementing such an intrinsic. :)
(Even `caller_location`, which was mentioned as a template, has its load-bearing code in `interpret::intrinsics::caller_location`.)
this is still relevant, should move this function.
Some backstory on how this took so long: #77125 (comment) is from August last year, and in September I already had half of this implemented - but around late October I ended up with a dead NVMe in the desktop where the branch was on. Only a few days ago I accidentally found a copy of it on the build server I typically use (`git gc` would've eaten other things, but this was checked out in a workdir). I saved the branch and moved on, but today I mentioned it in the context of the `provide` APIs, and I looked at it again, and it had way more implemented than I thought, so I decided to finish it after all.
Will this fix #10389?
Good point, yeah, I believe it would (cc @workingjubilee) - I'll go add it to the description.
Heh, code in the wild seems to be relying on hasher behavior - at least I've been meaning to revert that to what it was before (for performance reasons) so we don't need to break anyone who does that (as suspicious as it is).
```rust
impl fmt::Debug for TypeId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // FIXME(eddyb) perhaps attempt demangling `mangling`.
        f.debug_struct("TypeId").finish_non_exhaustive()
```
This is a debug format regression. Previously it was possible to tell different `TypeId`s apart.
Hmm, weird, I don't remember doing this but I guess I noticed `TypeId` was infoleaking and reacted by doing this? Ideally we would print the mangling but I'm not sure we should commit to that without a more involved proposal.
I think showing `TypeId { <mangling> }` is what we want here.
I don't feel like this will add forward compat issues we should care about, and being able to differentiate `TypeId`s is quite important. There won't be too many people relying on the concrete `Debug` string returned, and if they do, I really don't mind breaking their code ^^
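A hedged sketch of that suggestion, with `TypeId` simplified to a stand-in (a real implementation would read the mangled name out of the thin `mangling` field, and possibly demangle it):

```rust
use std::fmt;

// Simplified stand-in so the sketch compiles on its own.
struct TypeId {
    mangling: &'static str,
}

impl fmt::Debug for TypeId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Renders as `TypeId { <mangling> }`, keeping different values
        // distinguishable in debug output again.
        write!(f, "TypeId {{ {} }}", self.mangling)
    }
}
```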
```rust
#[inline]
fn eq(&self, other: &Self) -> bool {
    // FIXME(eddyb) should `likely` be used around the `ptr::eq`?
    if ptr::eq(self, other) {
```
Note to self: this could be CTFE-friendly if it used `guaranteed_eq` instead.
Awaiting bors try build completion.
@rustbot label: +S-waiting-on-perf
Changing from `&'static [u8; N]` to `&'static TypeData` is trivial and shouldn't break any code that wouldn't be broken already by adding extra fields to `struct TypeData([u8; N])`, as both have the same in-memory representation and `TypeData` would be a private struct.
It would be an interesting project to try to structure `TypeData` such that it both contained the mangled type name and `strip` could remove the type name data, but that's a later project.
Putting all mangled type names in a specific section and then using `strip --remove-section=.rust_ty` or something like that has a small chance of working. It is likely to result in the dynamic linker giving an error as relocation targets couldn't be resolved. Maybe weak symbols would work? Those are not very portable though. I don't think it is possible to convince strip to remove specific sections that are loaded at runtime without requiring extra arguments.
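For reference, placing such data into a named section is straightforward on the Rust side; whether `strip` and the (dynamic) linker then cooperate is exactly the open question above. The section name `.rust_ty` is taken from that comment and is not something rustc emits today:

```rust
// Hedged sketch: a static placed into a dedicated section, which a
// hypothetical `strip --remove-section=.rust_ty` invocation could then drop.
// As noted above, relocations pointing into the removed section would likely
// make the (dynamic) linker unhappy, so this is not a complete solution.
#[link_section = ".rust_ty"]
static EXAMPLE_TYPE_NAME: [u8; 12] = *b"example-name";
```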
Ok, so it looks like everyone agreed on what this crypto hash could look like. I'll assume there is no more to discuss and there's an actionable path to move this PR forward. Please feel free to switch back when ready for a review, thanks.
I don't think it makes much sense to repurpose this PR for implementing the cryptographic hash variant, and I'd rather leave the comment history here be about the mangled type variant.
Also it isn't a strong priority for me wrt limited time (as per #95845 (comment) this was stalled for over half a year in the first place), but if anyone wants to repurpose the few relevant bits, feel free to get them from this PR.
A couple smaller things could also be done (not sure if I'll be able to help with either):

- manual `PartialEq` impl for `TypeId` (i.e. turning off "structural match")
  - this would disallow `match`-ing on `TypeId`s (forcing the use of `==` instead), so as to not box ourselves into a corner if address equality is ever needed in the future
- increasing `TypeId`'s size (and cratering that change)
  - should help with breaking misuses (just like comments on this PR describe)
  - it doesn't have to be actual new data, could just be a dead field
    - this is library-only, e.g. `TypeId(type_hash::<T>())` -> `TypeId(type_hash::<T>(), 0)` (see the sketch below)
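As a rough sketch of that last point (a dead field whose only job is to change the size), with `type_hash` as a placeholder for whatever currently produces the 64-bit value:

```rust
// Placeholder for the compiler-provided 64-bit value; the name `type_hash`
// follows the informal notation in the list above.
const fn type_hash<T: ?Sized + 'static>() -> u64 {
    0 // dummy body, only here so the sketch compiles
}

// The second field carries no data; it only makes `TypeId` larger so that
// `transmute`-to-`u64` misuses break loudly.
pub struct TypeId(u64, u64);

impl TypeId {
    pub const fn of<T: ?Sized + 'static>() -> TypeId {
        TypeId(type_hash::<T>(), 0)
    }
}
```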
In the end, I'm sorry for wasting people's time with an approach that I misjudged as having stronger support (from both the formal side and the precedent of C++ RTTI) than it actually did.
I think the lang question here has been addressed: we're happy with a full (non-truncated) cryptographic hash. The compiler already has other things that depend on non-colliding hashes.
The compiler's internal implementations of these can be fixed without having to consult users, however. The results of the TypeId hashing will be public-facing.
Many Rust PRs have been historically stalled and pushed back on for the concern of potentially breaking users, even in the absence of evidence of any users who will actually be broken, even against the weight of correctness. Even, again, for situations where there was no guarantee in the public interface, merely "someone may have come to rely on it". So I do not understand why this is so cheerfully a part of discussion.
And it feels like being able to write something in Coq that ignores that the TypeId hash may collide does not automatically make our code correct in the face of actual implementations. After all, people are talking about the implications of the contents of the binary, now, and worrying what that will imply and what people will come to depend on in those. By then, the Rust language has long since left the picture and can provide very little control in actuality because the artifact has been fashioned. And Rust is included in dynamic libraries that cannot simply compile all this away. If we're concerned about people looking at TypeId symbols, why not about looking at our TypeId hashes, also?
Meanwhile, the concern of large symbols is primarily evinced by recursive symbols, which are subject to compression techniques.
I am absolutely certain that in 2040, assuming the Rust project still stands by then, if the hash we use is broken, someone is going to open a PR for upgrading the hash and everyone is going to hem and haw about these things then. It feels doubtful to me that, in the absence of well-defined ways to upgrade `std`'s interfaces that bypass these concerns, we should be using something as vulnerable as a hash. Any hash.
@workingjubilee I think there's a huge difference between "hey the transmute stopped working" and "the value in use is different now". I can't imagine there being any concern about changing the values generated here -- they can already change today when the rustc details change how the current unsound hash is generated.
I'm not sure I understand the worry behind a hash algorithm being broken in the future. When a hash is broken, it doesn't suddenly start producing accidental collisions for small strings like type names, it just means that it's now computationally feasible to find a collision, if you try really hard. Would that even matter for this use case? If it was possible to calculate a 'malicious' typename (probably of several kilo- or megabytes) that would collide with another, would that change anything? It seems to me that if an attacker can insert a malicious type name into the code, they can do much more damage by inserting much simpler things into the code.
It is an attack vector (defense in depth, etc.) when using `TypeId` in a plugin system.
However, any case where this would be an actual attack vector does include trusting some property about the malicious actor, either because they're a transitive native dependency, or they're a runtime plugin and you're trusting them to give you a unique `TypeId` even if the plugin itself is sandboxed.
(Part of the reason this is the case is that it's not sufficient to get a `TypeId` collision to get an unsound API; you also need to be downcasting based on the colliding `TypeId`.)
So yes, I believe it is completely reasonable to restrict this to a safe/unsafe protection boundary against mistakes in safe code, rather than against any sort of deliberate attack vector.
(Perhaps a collision attack would make it easier to get a malicious crate through review... but that requires not just a hash collision, but a meaningful hash collision, which is much more difficult to find even with a broken hash.)
Taking that as an axiom, a reasonably-sized hash is certainly sufficient.
(But that also said, I think it is also reasonable to choose a hash size designed to be resilient against a hash collision in the entire ecosystem rather than just within a compilation. This prevents our malicious actor from just finding a preexisting hash collision and adding a dependency on it to create UB.)
> It feels doubtful to me that, in the absence of well-defined ways to upgrade `std`'s interfaces that bypass these concerns, we should be using something as vulnerable as a hash. Any hash.

This feels dangerously close to "if we can't totally fix it there's no point in making things any better", which really sucks. If we can easily improve the state of things and leave the door open for future improvements then we 100% should.
I would really hate for `TypeId` to bloat in size since it is being used heavily in highly-dynamic code and assumed to be small. Small structs in my own code often carry them around as-is. Can we have it remain a single word in some way? I understand there are rogue crates illegally transmuting `TypeId`s, but certainly there are better ways to deal with them that do not make the rest of us pay for it too.
Besides, those who are going to transmute illegally will likely do it again, since what they are really trying to accomplish is compile-time comparison of `TypeId`s, which will go away once we have some legal way to sate their needs.
A small TypeId is simply unsound, as we do have real-world examples of collisions. So that is not an option.
Also, the issue with rogue crates illegally transmuting `TypeId`s is that they assume 64-bit `TypeId`s. We have to increase the size of `TypeId` one way or another to avoid collisions. Rogue crates are making this harder. They are not the reason to change it in the first place.
`TypeId` could be just `&'static TypeIdInner`, and it would remain small. However, the reason for making it `(u64, &'static TypeIdInner)` is twofold:

1. Loudly break anyone using `transmute`, and
2. Have a fast reject and `hash` which do not do a pointer chase.

You can argue that the first isn't necessary, but the second is very desirable (a sketch of the `Hash` side follows below).
If you want to argue against `size_of::<TypeId>()` becoming as big as a `u128` (which is the maximum anyone is proposing), #99189 is probably now the place to do it.
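A hedged sketch of the second point: the inline `u64` lets `Hash` (and the fast reject in `eq`) avoid touching the pointed-to data entirely. Field names follow the discussion above, not any real implementation:

```rust
use std::hash::{Hash, Hasher};

// Simplified stand-in; `inner` represents the `&'static TypeIdInner` pointer.
#[allow(dead_code)]
struct TypeId {
    hash: u64,
    inner: &'static str,
}

impl Hash for TypeId {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // No pointer chase: only the precomputed 64-bit hash is fed to the
        // hasher; `inner` is never dereferenced here.
        self.hash.hash(state);
    }
}
```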
> Have a fast reject and `hash` which do not do a pointer chase.
Note that the fast accept also doesn't require indirection! With the right linker integration, we should be able to almost always hit one of these two:
1. `a.hash != b.hash` (definitely `!=`, "fast reject")
   - only inaccurate (`hash` equal, for unequal types) for hash collisions
2. `ptr::eq(a.inner, b.inner)` (definitely `==`, "fast accept")
   - only inaccurate (`inner` differing in address, for equal types) when we weren't able to get the (dynamic) linker to deduplicate the backing constant data (or like this PR, which doesn't do that at all)
Oh, and when being passed around, such two fields will e.g. go into two registers (as they make `TypeId` into a "scalar pair"), without pessimizing codegen a lot.
So for the most part, you would have double the size and double the comparisons - that shouldn't be as big of a deal as "an O(1) operation is now O(N) on average".
To be frank, I think `TypeId` being two words large is still fine. But looking at the discussion, I worry the same arguments might be used to enlarge it yet again to 256 bits in the future, which would certainly be too large to put inline.
Anyhow, I don't have a better idea to argue against this effectively. Maybe pointer-tagging some bits from the hash could help in some way to keep the struct small, but the number of bits available to be used makes it seem negligible.
Nobody is arguing to make `TypeId` any bigger than `(u64, &'static TypeIdInner)`. Rest assured that making `TypeId`'s inline size bigger again will meet significant resistance if it's ever proposed.