Incorrect handling of unpaired surrogates in JS strings · Issue #1348 · rustwasm/wasm-bindgen

Describe the Bug

It was brought to my attention in Pauan/rust-dominator#10 that JavaScript strings (and DOMString) allow for unpaired surrogates.

When a string is encoded with TextEncoder, those unpaired surrogates are converted into U+FFFD (the replacement character). According to the Unicode spec, this is correct behavior.

The issue is that because the unpaired surrogates are replaced, this is lossy, and that lossiness can cause serious issues.
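To make the lossiness concrete, here is a minimal sketch in plain Rust. It assumes the JS string is already available as a slice of UTF-16 code units, and it uses the standard library's `String::from_utf16_lossy` as a stand-in for the replacement behavior described above. The function names are hypothetical and are not part of wasm-bindgen.

```rust
/// Models the lossy JS-to-Rust conversion: every unpaired surrogate
/// becomes U+FFFD. Hypothetical helper, not a wasm-bindgen API.
fn js_to_rust_lossy(code_units: &[u16]) -> String {
    String::from_utf16_lossy(code_units)
}

/// Encodes a Rust string back into UTF-16 code units,
/// the way a JS string would store it.
fn rust_to_js(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Returns true if converting to a Rust `String` and back
/// preserves every original code unit.
fn round_trips(code_units: &[u16]) -> bool {
    rust_to_js(&js_to_rust_lossy(code_units)).as_slice() == code_units
}
```

A complete pair such as `[0xD83D, 0xDE00]` survives the round trip, while a lone `[0xD83D]` does not. A lone high surrogate and a lone low surrogate also collapse into the same Rust string, so the original value cannot be recovered afterwards.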

You can read the above dominator bug report for the nitty gritty details, but the summary is that with <input> fields (and probably other things), the browser will send two input events for a single character, one for each surrogate.

When the first event arrives, the surrogate is unpaired. Because the string is immediately sent to Rust, that unpaired surrogate is converted into the replacement character.

Then the second event arrives, and the surrogate is still unpaired (because the first half was replaced), so the second half also gets replaced with the replacement character.
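The two-event sequence can be simulated without a browser. The sketch below is a hypothetical model, not real DOM or wasm-bindgen code. It assumes that the application writes the converted string back into the field after every event, and that the browser delivers one UTF-16 code unit per event.

```rust
/// Simulates a text field whose value is round-tripped through a Rust
/// `String` after every input event. Hypothetical model only.
fn simulate_input_events(events: &[u16]) -> String {
    // UTF-16 code units currently held by the simulated <input> field.
    let mut field: Vec<u16> = Vec::new();
    for &unit in events {
        // The browser appends one code unit and fires an input event.
        field.push(unit);
        // The value crosses into Rust; unpaired surrogates become U+FFFD.
        let rust_value = String::from_utf16_lossy(&field);
        // Assumption: the app writes the converted value back to the field.
        field = rust_value.encode_utf16().collect();
    }
    String::from_utf16_lossy(&field)
}
```

Feeding in the two halves of U+1F600 (`0xD83D`, then `0xDE00`) leaves two replacement characters in the field rather than the intended character, while input made only of non-surrogate code units passes through unchanged.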

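This has a lot of very deep implications, including for international languages (e.g. Chinese), since any character outside the Basic Multilingual Plane is stored as a surrogate pair in a JS string.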

I did quite a bit of reading, and unfortunately I think the only real solution here is to always use JsString, and not convert into Rust String, because that is inherently lossy. Or if a conversion is done, it needs to do some checks to make sure that there aren't any unpaired surrogates.
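One possible shape for such a check, sketched in plain Rust with only the standard library. It assumes the code units of the JS string have already been read into a `&[u16]`; how they are obtained from a JsString is left out here. The helper names are hypothetical.

```rust
/// Returns true if the UTF-16 code units contain a surrogate
/// that is missing its partner.
fn has_unpaired_surrogate(code_units: &[u16]) -> bool {
    std::char::decode_utf16(code_units.iter().copied()).any(|r| r.is_err())
}

/// Converts only when the conversion is lossless. Returns `None` otherwise,
/// so the caller can keep the original JS value or wait for more input.
fn js_to_rust_checked(code_units: &[u16]) -> Option<String> {
    String::from_utf16(code_units).ok()
}
```

With a check like this, an event handler could skip the event that carries only the first half of a pair and convert once the second half has arrived, instead of replacing the half with U+FFFD.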