Better URL encoding/decoding

Flurl.Url has always contained a few undocumented public static methods for URL encoding and decoding. In an effort to fix a couple of reported bugs and, more generally, to smooth over some known quirks in the .NET world, these methods have gotten an overhaul, a couple of renames, and a cohesive new story, so I'm now happy to "advertise" them. :)

What the RFC says

When dealing with URL encoding, characters fit into one of three categories:

- unreserved (legal in URLs): alphanumeric and -._~
- reserved (legal, but may have special meaning in URLs): :/?#[]@!$&'()*+,;=
- everything else (illegal in URLs, must be encoded)

One notable special case is the % character. When used as part of a %-encoding sequence (e.g. %20 to represent a space), it is legal in the URL. Otherwise, it must be encoded.

Another thing to note is that although the RFC says nothing about encoding space characters as +, the HTML spec does specify this for URL-encoded form data, and it is also a common practice in query strings.
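
As a rough illustration of the categories above, here is a minimal sketch that classifies a character and checks whether a % starts a valid %-encoded sequence. Classify and IsPercentEncodedAt are hypothetical helper names invented for this example; they are not part of Flurl or .NET.

```csharp
using System;

// Hypothetical helper: sort a character into one of the three categories above.
string Classify(char c)
{
    const string unreservedMarks = "-._~";
    const string reserved = ":/?#[]@!$&'()*+,;=";
    bool isAsciiAlphanumeric =
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');

    if (isAsciiAlphanumeric || unreservedMarks.IndexOf(c) >= 0) return "unreserved";
    if (reserved.IndexOf(c) >= 0) return "reserved";
    return "illegal"; // includes space and a bare %
}

// Hypothetical helper: true when position i starts a %-encoded sequence (% plus two hex digits).
bool IsPercentEncodedAt(string s, int i) =>
    i + 2 < s.Length && s[i] == '%' && Uri.IsHexDigit(s[i + 1]) && Uri.IsHexDigit(s[i + 2]);

Console.WriteLine(Classify('~'));                      // unreserved
Console.WriteLine(Classify('/'));                      // reserved
Console.WriteLine(Classify(' '));                      // illegal
Console.WriteLine(IsPercentEncodedAt("a%20b", 1));     // True
Console.WriteLine(IsPercentEncodedAt("100% sure", 3)); // False
```

Note that a lone % lands in the "illegal" bucket here; it only becomes legal when the second helper confirms it is followed by two hex digits.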

What .NET gives us

Uri.EscapeDataString is our best option for encoding both illegal and reserved characters, but it has the following shortcomings:
- It chokes with a UriFormatException at 65,520 characters, which is a realistic problem when using it to URL-encode form data.
- It has no option to encode space characters as +.
Uri.EscapeUriString is our best option for encoding illegal characters only. For example, with a string like "1 2/3 4", it'll encode the spaces for you but assumes you want to keep the / as a path separator. But it has one major quirk:
- It always encodes the % character, even if it's followed by 2 hex characters, which makes it a %-encoded sequence and perfectly legal in a URL.
Uri.UnescapeDataString is our best option for URL decoding, but it too has a shortcoming:
- It has no option to interpret + characters as spaces.
WebUtility.UrlEncode is our best option for...pretty much nothing.
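
To make the behaviors above concrete, here is a small sketch that runs the same kinds of inputs through each built-in method. The inputs are just illustrative, and the comments show what I'd expect on current .NET versions. The length limit on Uri.EscapeDataString is left out because it depends on the runtime.

```csharp
using System;
using System.Net;

#pragma warning disable SYSLIB0013 // Uri.EscapeUriString is flagged obsolete on .NET 6 and later

string input = "1 2/3 4";

// Encodes illegal AND reserved characters: the spaces and the / are all escaped.
string data = Uri.EscapeDataString(input);          // "1%202%2F3%204"

// Encodes illegal characters only: the / is kept as a path separator.
string uri = Uri.EscapeUriString(input);            // "1%202/3%204"

// The quirk: an already-encoded %20 gets encoded a second time.
string doubled = Uri.EscapeUriString("a%20b");      // "a%2520b"

// + is left alone rather than being read as a space.
string decoded = Uri.UnescapeDataString("a+b%20c"); // "a+b c"

// Space always becomes +, with no way to ask for %20 instead.
string web = WebUtility.UrlEncode("a b");           // "a+b"

Console.WriteLine($"{data} | {uri} | {doubled} | {decoded} | {web}");
```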

How Flurl improves on these

Flurl sets out to replace the methods above and correct their quirks with the following static methods:

- Url.Encode(string s, bool encodeSpaceAsPlus) encodes both illegal and reserved characters. It has no string size limit and gives you the option to encode space characters as +.
- Url.EncodeIllegalCharacters(string s, bool encodeSpaceAsPlus) encodes illegal characters only, and will not encode % if it is part of a %-hex-hex sequence, so there is no worry of already-encoded strings getting double-encoded.
- Url.Decode(string s, bool interpretPlusAsSpace) decodes any size string and gives you the option to decode + characters to spaces.
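
For a feel of how behavior like this could be built on top of the framework methods, here is a minimal sketch. It is not Flurl's actual source. The names EncodeSketch, EncodeIllegalCharactersSketch, and DecodeSketch are hypothetical, and the chunk size is an assumed value picked only to stay well under the limit mentioned earlier.

```csharp
using System;
using System.Text;

// Encodes illegal and reserved characters, working in chunks to sidestep the length limit.
string EncodeSketch(string s, bool encodeSpaceAsPlus = false)
{
    if (string.IsNullOrEmpty(s)) return s;
    const int chunkSize = 30000; // assumed value, comfortably below the limit
    var sb = new StringBuilder();
    for (int i = 0; i < s.Length; )
    {
        int len = Math.Min(chunkSize, s.Length - i);
        // Don't split a surrogate pair across two chunks.
        if (i + len < s.Length && char.IsHighSurrogate(s[i + len - 1])) len--;
        sb.Append(Uri.EscapeDataString(s.Substring(i, len)));
        i += len;
    }
    string encoded = sb.ToString();
    return encodeSpaceAsPlus ? encoded.Replace("%20", "+") : encoded;
}

// Encodes illegal characters only, leaving existing %-hex-hex sequences untouched.
string EncodeIllegalCharactersSketch(string s, bool encodeSpaceAsPlus = false)
{
    if (string.IsNullOrEmpty(s)) return s;
    const string legalMarks = "-._~:/?#[]@!$&'()*+,;="; // unreserved marks plus reserved
    var sb = new StringBuilder();
    for (int i = 0; i < s.Length; i++)
    {
        char c = s[i];
        bool alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
        bool escaped = c == '%' && i + 2 < s.Length
            && Uri.IsHexDigit(s[i + 1]) && Uri.IsHexDigit(s[i + 2]);
        if (alnum || legalMarks.IndexOf(c) >= 0 || escaped) sb.Append(c);
        else if (c == ' ' && encodeSpaceAsPlus) sb.Append('+');
        else
        {
            // Take both halves of a surrogate pair so the UTF-8 bytes come out right.
            int len = char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]) ? 2 : 1;
            foreach (byte b in Encoding.UTF8.GetBytes(s.Substring(i, len)))
                sb.Append('%').Append(b.ToString("X2"));
            i += len - 1;
        }
    }
    return sb.ToString();
}

// Decodes, optionally treating + as a space.
string DecodeSketch(string s, bool interpretPlusAsSpace)
{
    if (string.IsNullOrEmpty(s)) return s;
    // Swap + for space before unescaping, so an encoded plus (%2B) survives as a literal +.
    return Uri.UnescapeDataString(interpretPlusAsSpace ? s.Replace("+", " ") : s);
}

Console.WriteLine(EncodeSketch("1 2/3 4", true));               // 1+2%2F3+4
Console.WriteLine(EncodeIllegalCharactersSketch("a%20b c"));    // a%20b%20c
Console.WriteLine(DecodeSketch("a+b%2Bc", true));               // a b+c
```

The ordering in DecodeSketch matters: replacing + after unescaping would also turn a decoded %2B into a space.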

What's breaking?

As mentioned, Url has always had encoding/decoding methods, but with their new purpose in life, 2 have been renamed and, effectively, superseded:

- Url.EncodeQueryParamValue is superseded by Url.Encode.
- Url.DecodeQueryParamValue is superseded by Url.Decode.

Since these were mainly for internal use, I'm hopeful this won't cause problems for most people, but they were public methods, so I want to fully disclose this breaking change.