bpo-36338: urllib.urlparse rejects invalid IPv6 addresses by vstinner · Pull Request #16780 · python/cpython

The living URL Standard deliberately omits support for IPv6 scopes (zone IDs):

Support for <zone_id> is intentionally omitted.

That note links to https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2, a comment written by Ryan Sleevi on 2015-08-14:

Yes, we're especially not keen to support these in Chrome and have repeatedly decided not to. The platform-specific nature of <zone_id> makes it difficult to impossible to validate the well-formedness of the URL (see https://tools.ietf.org/html/rfc4007#section-11.2 , as referenced in 6874, to fully appreciate this special hell). Even if we could reliably parse these (from a URL spec standpoint), it then has to be handed 'somewhere', and that opens a new can of worms.

Even 6874 notes how unlikely it is to encounter these in practice

   Thus, URIs including a
   ZoneID are unlikely to be encountered in HTML documents.  However, if
   they do (for example, in a diagnostic script coded in HTML), it would
   be appropriate to treat them exactly as above.

Note that a 'dumb' parser may not be sufficient, as the Security Considerations of 6874 note:

   To limit this risk, implementations MUST NOT allow use of this format
   except for well-defined usages, such as sending to link-local
   addresses under prefix fe80::/10.  At the time of writing, this is
   the only well-defined usage known.

And also

   An HTTP client, proxy, or other intermediary MUST remove any ZoneID
   attached to an outgoing URI, as it has only local significance at the
   sending host.

This requires a transformative rewrite of any URLs going out the wire. That's pretty substantial. Anne, do you recall the bug talking about IP canonicalization (e.g. http://127.0.0.1 vs http://[::127.0.0.1] vs http://012345 and friends?) This is conceptually a similar issue - except it's explicitly required in the context of <zone_id> that the <zone_id> not be emitted.

There's also the issue that zone_id precludes/requires the use of APIs that user agents would otherwise prefer to avoid, in order to 'properly' handle the zone_id interpretation. For example, Chromium on some platforms uses a built in DNS resolver, and so our address lookup functions would need to define and support <zone_id>'s and map them to system concepts. In doing so, you could end up with weird situations where a URL works in Firefox but not Chrome, even though both 'hypothetically' supported <zone_id>'s, because FF may use an OS routine and Chrome may use a built-in routine and they diverge.

Overall, our internal consensus is that <zone_id>'s are bonkers on many grounds - the technical ambiguity (and RFC 6874 doesn't really resolve the ambiguity as much as it fully owns it and just says #YOLOSWAG) - and supporting them would add a lot of complexity for what is explicitly and admittedly a limited value use case.

The Firefox feature request https://bugzilla.mozilla.org/show_bug.cgi?id=700999 was also rejected on 2015-08-14, citing this same comment.

Currently, only Microsoft Edge supports IPv6 scopes; Firefox and Chromium don't.

I suggest following the example of Firefox, Chromium and the living URL Standard: don't support IPv6 scopes.

My current implementation doesn't implement RFC 6874, which specifies %25 as the separator between the IPv6 address and the scope. For example, address ::1 with scope eth0 should be written ::1%25eth0. This syntax is hard to read with numeric scopes, which are common: ::1 with scope 2 should be written ::1%252 :-(
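To make the readability problem concrete, here is a minimal sketch of the RFC 6874 encoding. The helpers `rfc6874_host` and `split_zone` are hypothetical names used only for illustration; they are not part of this PR or of urllib, and they do no validation of the address itself.

```python
from urllib.parse import quote, unquote


def rfc6874_host(address: str, zone: str) -> str:
    """Build a bracketed URL host from an IPv6 address and a zone ID.

    Hypothetical helper: "%25" is the percent-encoded form of the "%"
    delimiter that RFC 6874 places between the address and the zone.
    """
    return "[" + address + "%25" + quote(zone, safe="") + "]"


def split_zone(host: str):
    """Split a bracketed RFC 6874 host into (address, zone or None).

    Hypothetical helper: it only looks for the "%25" delimiter and does
    not check that the address part is a well-formed IPv6 address.
    """
    inner = host.strip("[]")
    address, sep, zone = inner.partition("%25")
    return address, (unquote(zone) if sep else None)


print(rfc6874_host("::1", "eth0"))  # [::1%25eth0]
print(rfc6874_host("::1", "2"))     # [::1%252]
print(split_zone("[::1%252]"))      # ('::1', '2')
print(unquote("::1%252"))           # ::1%2
```

With a numeric scope, the encoded form `%252` reads like a single three-digit escape, even though it decodes to the delimiter `%` followed by the scope `2`.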