HttpUtility.UrlEncode considered harmful

March 17, 2008 asp.net .net

An interesting issue was recently raised against my REST framework. The following URI template was used to access a fictional customer:

/customer/{customerName}

When generating an email pointing to this customer, another part of the system was inserting the following URL:

http://example.org/customer/john+doe

At this point, my framework returned a 404, even though John Doe exists in our database. So what is happening here? When I set a breakpoint on my CustomerHandler.Get(string customerName) method, customerName still contained the plus sign. Why wasn't it converted to a space? Here's some PowerShell code to demonstrate. First, let's create a Uri object and see the result.

59> [System.Uri]"http://example.org/folder with space" | select absolutepath, absoluteuri,originalstring | fl

AbsolutePath : /folder%20with%20space
AbsoluteUri : http://example.org/folder%20with%20space
OriginalString : http://example.org/folder with space

As you can see, the space is encoded as %20. Now let's see what happens if I call HttpUtility.UrlEncode.

61> Add-Type -AssemblyName System.Web
62> [System.Web.HttpUtility]::UrlEncode("folder with space")
folder+with+space

Now the space has been replaced with a plus. Let's review the MSDN documentation for the UrlEncode method.

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. URL encoding converts characters that are not allowed in a URL into character-entity equivalents; URL decoding reverses the encoding. For example, when embedded in a block of text to be transmitted in a URL, the characters < and > are encoded as %3c and %3e.

Obviously the documentation doesn't really describe the behaviour we're seeing. So is URL encoding, within the scope of the HTTP protocol, supposed to produce a + or a %20? Who's right and who's wrong?

Let's travel together along the spec stack we use when dealing with HTML content, and find out which of Uri and UrlEncode is right.

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

This specification covers URLs in their generalized form, and that's the specification that defines percent encoding of URLs: any character that is not allowed, and any reserved character that is used as data rather than as a delimiter, has to be encoded as a percent sign followed by the two-digit hexadecimal representation of the octet.

This actually would imply that the correct encoding form of the URL should be http://example.org/customer/john%20doe.

It also specifies that the plus sign is a reserved character, one that can be used by a scheme to delimit stuff. Defining what that stuff is should be the scheme's responsibility. The scheme here is http, so we switch to the next spec in our stack.
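To make the "opaque plus" point concrete, here is a small sketch of what a URI-level decoder does with the link from the email. It only uses the Uri class; the variable names are illustrative and not taken from the framework.

```powershell
# Parse the link that ended up in the email.
$uri = [System.Uri]"http://example.org/customer/john+doe"
$uri.AbsolutePath                                # /customer/john+doe

# Take the last path segment and decode it with URI rules.
$name = $uri.Segments[-1]
[System.Uri]::UnescapeDataString($name)          # john+doe  (the plus is left alone)

# Only percent-encoded octets are decoded.
[System.Uri]::UnescapeDataString("john%20doe")   # john doe
```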

RFC 2616 - Hypertext Transfer Protocol -- HTTP/1.1

This specification, even though it was written earlier than the latest URI specification, does specify some scheme-specific information on how URLs work and what they represent. Going through it, HTTP defines some specific behavior for URIs, for example that http://example.org, http://example.org:80 and http://example.org/ are all equivalent.

The specification also mentions the equivalence of the percent encoding we've already seen. Still nothing about the plus sign. Which leads us to the third specification that's involved here.

W3C Recommendation: HTML 4.01

Note that I ignore the XHTML 1.0 specification, as it is mostly a reformulation of the HTML 4 specification in an XML format.

The HTML specification reminds the reader of URIs and how they work. There are two interesting bits in the specification. The first one, entitled "Non-ASCII characters in URI attribute values", defines once again the percent encoding scheme. Still no trace of that plus sign.

And then you discover the gem that is the application/x-www-form-urlencoded content type. In it, we find the usual URL encoding, with the addition of the space being encoded as a plus.

Interestingly enough, this format is only to be used when attaching content within a POST HTTP request, and has nothing to do with URL encoding, except for the similarity of writing key=value&key2=othervalue to encode named values.
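The mismatch between the two conventions is easy to reproduce. The sketch below assumes a PowerShell version that has Add-Type (2.0 or later) and that System.Web can be loaded in the session; $encoded is just an illustrative variable name.

```powershell
# Load System.Web so that HttpUtility is available.
Add-Type -AssemblyName System.Web

# Encode with form rules: the space becomes a plus.
$encoded = [System.Web.HttpUtility]::UrlEncode("john doe")
$encoded                                          # john+doe

# A decoder that follows the same form rules restores the space...
[System.Web.HttpUtility]::UrlDecode($encoded)     # john doe

# ...while a URI-level decoder leaves the plus sign untouched.
[System.Uri]::UnescapeDataString($encoded)        # john+doe
```

The value only round-trips when both ends agree on form encoding, which is not the case for a path segment handled with URI rules.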

Conclusion

So there you have it. The content type used by HTML to send form content is an HTML-specific format sent as the payload of a POST HTTP request. It has no bearing on, and no compatibility with, either HTTP or URLs. If you generate or consume URLs, the plus sign should be opaque.

In other words, stay well away from HttpUtility.UrlEncode.
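One possible replacement when building a link like the one in the email is Uri.EscapeDataString, which percent-encodes a single path segment. This is a sketch rather than a prescription; $customerName and $link are hypothetical stand-ins for whatever the email-generating code actually uses.

```powershell
# Hypothetical value coming from the database.
$customerName = "john doe"

# Percent-encode the segment before appending it to the base URL.
$link = "http://example.org/customer/" + [System.Uri]::EscapeDataString($customerName)
$link                                          # http://example.org/customer/john%20doe

# A name that really contains a plus sign stays distinguishable from a space.
[System.Uri]::EscapeDataString("john+doe")     # john%2Bdoe
```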