rustdoc: check parsing diffs between pulldown-cmark 0.9.6 and 0.10 by notriddle · Pull Request #121659 · rust-lang/rust (original) (raw)
Roughly categorizing what I'm seeing here. The specific crates are in this Gist:
(P)roblems: docs that are broken by this change
P1: unintended strikethrough
pulldown-cmark/pulldown-cmark#648 allows abso~fricking~lutely to render as absofrickinglutely. It makes pulldown-cmark agree with cmark-gfm.
This change causes 39 docs to have strikethroughs where it looks like the author intended literal tildes.
P2: Unintended block quote
pulldown-cmark/pulldown-cmark#675 makes block quotes, introduced by >, consistent between paragraph interruption and starting after a blank line. It's pretty unambiguous that this is a bug fix.
Causes 24 crates to appear wrong.
P3: Footnote reference immediately after link or another footnote
pulldown-cmark/pulldown-cmark#773
The way that pulldown-cmark ignores the apparent footnote reference in [something][^foot] is the same way GFM does it, and it's the easiest way to parse this given everything else, but it's weird.
Causes 16 crates to appear wrong.
P4: block nested inside footnote definition without indenting
pulldown-cmark/pulldown-cmark#654
This is, basically, the expected outcome of GFM-compatible footnote parsing. The syntax now parses the same way GitHub does, which is not the way rustdoc used to.
Causes 6 crates to appear wrong.
P5: emphasis does not match what author expects
In a text like __imp_(_), commonmark rules say that the result should be _ imp(_), but I'm pretty sure the author intended those to be literal underscores. In the old parser, they were.
Causes 2 crates to appear wrong.
P6: Single | does not continue table
This isn't what GitHub does, and the new parser aligns closer with GitHub, but some authors have used this as a way to put dividers in their table:
| header | two |
|---|---|
| item | n |
| item | e |
| section | l |
| two | o |
Causes 6 crates to appear wrong.
P7: Table is required to have a valid header line
This is not a table according to GitHub, but the old parser would render it as one sometimes.
| first | second |
| third | fourth |
| fifth | sixth |
The header line also needs to have at least one hyphen in each cell, so this isn't allowed either.
| first ||
|--------||
| second ||
Causes 4 crates to appear wrong.
P8: unintended link definition
This is a link definition in the new parser, but not the old one.
Causes 1 crate to appear wrong.
(F)ixes: docs that are actually render better with the new parser than the old one
F1: [^x] in the old parser rendered a broken footnote link when no footnote was intended or defined
pulldown-cmark/pulldown-cmark#654
Usually, they're trying to write regex inverted character classes, but they come out as footnotes.
This fix also makes things consistent with GitHub, and the fact that it's not breaking anyone's docs makes me happy.
Fixes 39 crates
F2: writing 1. alone on a line in the middle of a paragraph shouldn't start a list
pulldown-cmark/pulldown-cmark#681
It used to do the wrong thing with this:
Test paragraph with a count of
1.
Fixes 6 crates
F3: tables interrupt paragraphs
pulldown-cmark/pulldown-cmark#653
This makes things consistent with GitHub, and fixes NN docs. Not likely to be a problem, because you kinda have to try really hard to make something look like a table.
Fixes 37 crates
F4: ASCII art misinterpreted as list
Since asterisks on their own no longer count as lists when they interrupt paragraphs, some things that were never intended as list markers stop being seen as such.
Fixes 1 crate
F5: footnote has nested, indented children or lazy continuation
This is the flip side of P4, where the author actually wanted it to be parsed the way github parses it.
Fixes 2 crates
F6: Link definition is seen in new parser where old parser did not
The old parser didn't properly recognize link reference definitions when they were right after code blocks.
Fixes 1 crates
F7: block structure and inline structure are separated better in new parser
Consider this example:
The > on the second line shouldn't go in there, and now it doesn't.'
Fixes 2 crates
F8: a table row followed by two spaces isn't a hard break
A very specific bug that only seems to show up when a paragraph is nested inside a list. The new parser fixes it, it only affects 1 crate, and the change makes the generated docs look better.
Fixes 2 crates
F9: footnote with <autolink> was misparsed as link def
The old parser thought this was a link definition:
[^0]: <https://example.com>
The new parser sees it as a footnote, which is also how github does it.
Fixes 1 crate
F10: indented or spaced link definition still counts
This document:
[first]: https://example.com
[second]: https://example.com
[first]
[second]
The old parser ignored the definition of [second], so it only saw one link, but the new one doesn't, so it sees two links. This also agrees with the reference implementation.
The new parser also trims spaces, so [x ] and [x] are the same thing.
Fixes 2 crates
F11: intra-word strikethrough
This is the flip side of P1, where the author apparently intended to write intra-word strikeout.
Fixes 1 crate
(Q)uestionable cases, where the docs were broken before and are still broken now
Q1: crate author wrote ASCII art, or some language that isn't CommonMark, in their doc comments
This happened on N crates, and produces terrible results in both parsers. I'm not bothering to categorize them by which change to pulldown-cmark causes them to render differently.
There are 30 crates that do this
Q2: links with mismatched parens go from being broken links to not being links at all
pulldown-cmark/pulldown-cmark#738
There are 4 crates that do this
Q3: incorrect trim in block doc comment
Block comments are supposed to be written like this:
/**
- first
- second
- third */ ^^ these two characters are trimmed
If you get this wrong, the text that you intended to be a paragraph gets turned into a long unordered list instead, because trim is computed by checking every line for a common prefix.
It shows up here because incorrectly trimmed block doc comments often have asterisks with no text after them, and pulldown-cmark/pulldown-cmark#681 changes it from a list to plain text.
There are 8 crates that do this
Q4: numbered list that starts with zero and has two ones
- do stuff
- do more stuff
- do more stuff
The 0 isn't a valid list item in commonmark, so that first line is a paragraph. The divergence between the new and old parser is on the second line, but I'm not sure what the intended result by the crate author was? I think it's supposed to be a list starting at zero, but it doesn't work in the old version or the new one.
There is 1 crate that does this
Q5: incorrect footnote definition markup
The parser used to return a dangling footnote reference, but the new one doesn't. Both saw the intended footnote definition as invalid.
There are 18 crates that do this
Q6: block quotes written at start of line
Try writing block quotes like this:
First paragraph
Second paragraph
I think used to work in GitHub-flavored markdown (it works in Pandoc-flavored markdown now), and seems to be what the author intended in the 1 crate that hits this case, but it doesn't work on GitHub now, and it never worked right in rustdoc (though the exact way it fails changed, which is what brought it to my attention).
There is 1 crate that does this
(S)purious cases where the lint fires and it shouldn't
S1: loose task list
The lint I wrote fires false positives for task lists. The event stream from the parser is different, thanks to pulldown-cmark/pulldown-cmark#558, but they're detected in the same cases.
There are 7 crates that hit this
S2: minor change in HTML blocks
The indentation is handled differently, but will usually generate the same rendered result.
There are 3 crates that hit this
S3: two spaces on the last line of a paragraph
Strictly speaking, this is a different parse result. It will produce slight differences in the spacing.
There is 1 crate that hits this
S4: spans are different when emphasis wraps code
This causes the lint to claim there's a problem when there isn't.
There is 1 crate that hits this