cssSelector doesn't handle combining characters correctly

@Test
void combiningCharactersInIdentifier()
{
    final String html = """
        <html>
        <head>
        <meta charset="utf-8">
        </head>
                    
        <body>
        <img class="e\u0301" src="/corner.jpg">
        </body>
                    
        </html>""";

    final Document document = Jsoup.parse(html);
    final Elements images = document.getElementsByTag("img");

    final Element img = images.get(0);
    final String cssSelector = img.cssSelector();

    assertEquals("html > body > img.e\u0301", cssSelector);
}

The example above uses a combining character (U+0301, combining acute accent) to create an é. Emoji make heavy use of multi-character sequences as well (👨‍👨‍👧‍👧 is made up of 11 Java chars, i.e. UTF-16 code units: \uD83D\uDC68\u200D\uD83D\uDC68\u200D\uD83D\uDC67\u200D\uD83D\uDC67).
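To make the counts concrete, here is a small sketch that compares UTF-16 code units with Unicode code points for the two strings mentioned above. The class and method names (CodeUnitDemo, codeUnits, codePoints) are made up for illustration and are not part of jsoup.

```java
// Hypothetical demo class; the names are illustrative only.
class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 (combining acute accent)
        String accented = "e\u0301";
        // man, ZWJ, man, ZWJ, girl, ZWJ, girl
        String family = "\uD83D\uDC68\u200D\uD83D\uDC68\u200D\uD83D\uDC67\u200D\uD83D\uDC67";

        // prints "2 2": two chars, two code points, one visible glyph
        System.out.println(codeUnits(accented) + " " + codePoints(accented));
        // prints "11 7": four surrogate pairs plus three zero-width joiners
        System.out.println(codeUnits(family) + " " + codePoints(family));
    }
}
```

Any code that walks such a string one char at a time sees half of a surrogate pair on each step, which is one way per-character processing can go wrong.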

I have seen emoji used as CSS class names in the wild, and I think the character-escaping code is doing the wrong thing when cssSelector() is called. It looks like it escapes every character individually, which breaks sequences that rely on combining characters.
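For comparison, here is a rough sketch of what escaping by code point rather than by char could look like. This is not jsoup's implementation; IdentEscapeSketch and escapeIdent are hypothetical names, and the rules are a simplified take on CSS identifier escaping (edge cases such as a leading hyphen followed by a digit, or a NUL character, are deliberately left out). The relevant part is that non-ASCII code points pass through untouched.

```java
// Hypothetical sketch, not jsoup code: escape a CSS identifier by code point.
class IdentEscapeSketch {
    static boolean isAsciiLetter(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
    }

    static boolean isAsciiDigit(int cp) {
        return cp >= '0' && cp <= '9';
    }

    static String escapeIdent(String ident) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < ident.length()) {
            // Read a whole code point so surrogate pairs stay together.
            int cp = ident.codePointAt(i);
            boolean first = (i == 0);
            if (cp >= 0x80 || cp == '-' || cp == '_'
                    || isAsciiLetter(cp) || (isAsciiDigit(cp) && !first)) {
                // Non-ASCII code points, including combining marks and
                // zero-width joiners, are emitted unescaped.
                out.appendCodePoint(cp);
            } else if (cp < 0x20 || cp == 0x7F || isAsciiDigit(cp)) {
                // Control characters and a leading digit: hex escape plus a space.
                out.append('\\').append(Integer.toHexString(cp)).append(' ');
            } else {
                // Remaining ASCII punctuation: backslash-escape the character.
                out.append('\\').appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Under these assumptions, escapeIdent("e\u0301") returns the input unchanged, so the selector in the test above would come out as html > body > img.e\u0301, while a class name such as a.b would become a\.b.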