Remove duplicate impl of string unescape from parse_format by hkBst · Pull Request #137995 · rust-lang/rust (original) (raw)

This is a large (+447/−547) change with zero explanation. I need some context and motivation, please!

Ah, sorry about that. Let me try and fix that:

The idea for this comes from this code at the bottom of rustc_parse_format/lib.rs:

fn unescape_string(string: &str) -> Option<String> {
    let mut buf = String::new();
    let mut ok = true;
    unescape::unescape_unicode(string, unescape::Mode::Str, &mut |_, unescaped_char| {
        match unescaped_char {
            Ok(c) => buf.push(c),
            Err(_) => ok = false,
        }
    });

    ok.then_some(buf)
}

which does unescaping but throws away all span information from the original string (via the _ in &mut |_, unescaped_char|. This function is called in fn find_width_map_from_snippet:

    let Some(unescaped) = unescape_string(snippet) else {
        return InputStringKind::NotALiteral;
    };

which then does its own light version of string unescape to build a width map (if the unescaped string matches the input string), which is basically a list of expansions from the unescaped string back to the original string, which has to be traversed to determine the old position (this happens in fn remap_pos, fn to_span_index, fn to_span_width, and fn span). (Doing a Vec traversal for each original position that is needed is quadratic behavior as it is linear in both the length of the width map Vec and the number of such translations from new to old position.)

The new code does the unescaping in Parser::new, while collecting the position information into a Vec, and checking that the unescaped string matches the input string like so:

                // snippet is not a raw string
                if snippet.starts_with('"') {
                    // snippet looks like an ordinary string literal
                    // check whether it is the escaped version of input
                    let without_quotes = &snippet[1..snippet.len() - 1];
                    let (mut ok, mut vec) = (true, vec![]);
                    let mut chars = input.chars();
                    unescape::unescape_unicode(
                        without_quotes,
                        unescape::Mode::Str,
                        &mut |range, res| match res {
                            Ok(ch) if ok && chars.next().is_some_and(|c| ch == c) => {
                                vec.push((range, ch));
                            }
                            _ => {
                                ok = false;
                                vec = vec![];
                            }
                        },
                    );

Here we're feeling some pain from the callback-based nature of unescaping, which forces the collection of span info into a Vec (at least I could not see a good alternative). Basically, the Parser needs to know the position of each character in input (Peekable<Char<>> in the current code) and in the original string as typed (snippet) (width_map plus translation functions in the current code). This new code ultimately collects a Vec<(original span, char byte pos in input, char in input)> for the same purpose. It is thus probably using more memory.

Most of the other changes are because of this change of span info representation.

Also, from skimming it I think there might be multiple distinct changes in the single commit?

It is possible, but this code went through a lot of iterations, as I came to understand the exact constraints imposed by the ui tests, and this is basically the first version that passes all those ui tests.

There are a few minor things that come to mind:

I'm hoping this is enough to get the broad idea, such that you can start asking about specific bits of this change, but let me know if there is more you need or that I can do to clarify.