October 01, 2003
URL Canonicalization Testing
A test-case generator for decoding URLs
Michael J. Hunter
URLs can be encoded in many different ways, and if you don't process them correctly, you could introduce a security breach. In this article, we'll look at some common exploits, like path navigation injection, and then use a test-case generator that creates each form of encoding for a specific URL so that requests can be safely processed.
Test Cases Everywhere
If you have filled out web-based forms whose data was passed on to the web server via the URL, you have probably seen sequences of UTF-8 encoded characters such as "%20". The "%nn" encoding is simply the ASCII value of a character converted to hexadecimal and prefixed with a percent sign.
If you're writing code that takes a URL as a parameter, the URL may really be a file, and the file's extension may be used to determine what you should do with it (e.g., .asp files are handled differently than .txt files). If the name has been encoded, making the filename MyBigFile%2etxt, you won't be able to determine what to do. This typically results in one of two actions: Either an error is returned, or the file is passed on through the system (after all, it's not an executable, right?). Annoying the user is generally bad, but there is worse: If some function down the call chain (or an API the function calls) does decode the filename, you may find yourself running malicious executables.
As you likely know, a web site domain or server name is really just a textual alias for the sequence of four numbers separated by decimal points that make up the IP address (commonly called a "dotted quad"). What you might not know is that the sequence can also be represented in six other forms, and each of these forms can be in decimal, octal, hex, or any combination thereof. These different forms all resolve down to the same machine, so these variations may seem harmless. Once again, if you are deciding what privileges to allow a request based on a dotted-quad IP address and you encounter an IP address without any dots, you might treat it as a domain name. Older versions of Internet Explorer were designed so that if a domain name did not contain any dots, it was assumed to be on an intranet and thus was given higher privileges. This insecure behavior has since been corrected (see References).
These are both examples of a larger problem wherein URLs can be mangled in many different ways (decanonicalized forms, if you will). In this article, I will explain many of the most common ways to munge a URL and present a test-case generator that creates each form for a specific URL. After I finish scaring you, I will show how to obviate most of these problems. Converting a URL to its canonical form is actually fairly easy to do. It requires much more than simply converting all "%20" sequences to space characters, however, and most people don't get it right the first time.
User Authentication
One of the simplest ways to mangle a URL is to add user-authentication information. If access to a web page is restricted via a username and password, that login information can be passed as part of the URL: http://username:password@<URL>, where everything between the protocol and the '@' composes the login data. If access is not restricted, login information can still be provided but will be ignored. "Ignored" often translates to "do whatever you like," and such is the case here. So, "http://www.windevnet.com@www.hackme.com" appears to be taking you to the Windows Developer web site but, in fact, is loading data from hackme.com. To make matters worse, the '@' can be encoded, giving you "http://www.windevnet.com%40www.hackme.com", making the URL appear even more like the Windows Developer web page.
The test case generator creates a few user authentication variations of the base URL by inserting "username:password@", "blahblah@", and "www.evil.com%2fYou%2fAre%2fSo%2fHacked.htm@" between the URL's protocol and the URL's path (the forward slash has to be encoded for this to work when subpaths are included). It also partially and fully encodes the '@' to the requested level using the specified encoder; I'll cover encoders and encoding levels later.
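To make the risk concrete, here is a minimal sketch of how the host that is actually contacted can be picked out of such a URL. EffectiveHost is a hypothetical helper, not part of the article's code; it assumes the '@' has already been decoded and ignores ports and other details of real URL parsing.

```cpp
#include <cassert>
#include <string>

// Return the host portion of a simple "scheme://[userinfo@]host[/path]" URL.
// Hypothetical helper for illustration only: no port, IPv6, or escape handling.
std::string EffectiveHost(const std::string& url)
{
    const std::string::size_type schemeEnd = url.find("://");
    const std::string::size_type start =
        (schemeEnd == std::string::npos) ? 0 : schemeEnd + 3;

    // The authority component ends at the first '/', '?', or '#'.
    std::string::size_type end = url.find_first_of("/?#", start);
    if (end == std::string::npos)
        end = url.size();
    const std::string authority = url.substr(start, end - start);

    // Everything up to the last '@' is login data, not the host.
    const std::string::size_type at = authority.rfind('@');
    return (at == std::string::npos) ? authority : authority.substr(at + 1);
}
```

Given "http://www.windevnet.com@www.hackme.com", this returns "www.hackme.com", which is the machine the request really goes to.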
Navigation Injection
The next simplest (conceptually, at least) method for mangling URLs is navigation injection. The '.' and '..' sequences are typically used to pull in images and other supporting files (e.g., <img src="..\..\images\spacer.html"/>), and this pathname navigation works just as well in URLs: http://www.evil.com/../../../webserver/data/passwords.txt. If your web server doesn't guard against this, a hacker can easily gain access to private data or run commands on your web server: http://www.evil.com/../../../windows/system32/cmd.exe /c format c:. Similarly, if you are using the URL path to determine access privileges, an attacker (or simply an employee curious what the CEO's salary is) can gain access to data she should not have. While the need to conjure up the correct relative path to access specific folders may seem to lessen this threat, in reality, many administrators install web servers to their default location. Attackers can install web servers just as easily as we can, so it is not very hard for them to determine the exact path to use.
An infinite number of path injection cases are possible, but one of the goals of the test-case generator is to only generate valid cases. Thus, it restricts itself to inserting "this folder" navigation as well as "parent folder" navigation when the URL path contains multiple levels. CNavigationInjector (Listing 1) takes a path and passes it through a CUrl (a helper object that knows how to split a path or URL into its protocol, domain, and path) to isolate the path, then uses SelectInjectionPoint to decide where to insert the path navigation.
SelectInjectionPoint first identifies the location of each path separator, then picks one as follows:
Once we have the injection point, it's a simple matter for InjectForwardSlashDot to insert the "this folder" navigation. InjectForwardSlashDotDot has it slightly more complicated as it can only do so if a parent folder exists before the injection point. GetParentFolder determines whether this is so by searching the path backwards from the injection point for another path separator; if one is found, the two locations define the parent folder, which is used to generate the "..\parent_folder" string that is inserted.
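The two injections can be sketched as follows. This is not the code from Listing 1: InjectThisFolder and InjectParentFolder are hypothetical stand-ins that assume forward-slash separators and take the injection point as an index into the path.

```cpp
#include <cassert>
#include <string>

// Insert "this folder" navigation ("/.") at the separator at injectionPoint.
std::string InjectThisFolder(const std::string& path,
                             std::string::size_type injectionPoint)
{
    assert(injectionPoint < path.size() && path[injectionPoint] == '/');
    std::string result = path;
    result.insert(injectionPoint, "/.");
    return result;
}

// Insert "parent folder" navigation followed by the parent folder's name, so
// the path still resolves to the same resource. Returns the path unchanged
// when no separator (and thus no parent folder) precedes the injection point.
std::string InjectParentFolder(const std::string& path,
                               std::string::size_type injectionPoint)
{
    assert(injectionPoint < path.size() && path[injectionPoint] == '/');
    if (injectionPoint == 0)
        return path;

    // Search backwards from the injection point for the previous separator.
    const std::string::size_type previous = path.rfind('/', injectionPoint - 1);
    if (previous == std::string::npos)
        return path;

    // The text between the two separators is the parent folder's name.
    const std::string parent =
        path.substr(previous + 1, injectionPoint - previous - 1);
    std::string result = path;
    result.insert(injectionPoint, "/../" + parent);
    return result;
}
```

For the path "/wdm/default.html" with the injection point at the second slash (index 4), the first function yields "/wdm/./default.html" and the second yields "/wdm/../wdm/default.html".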
If you download and peruse the full code, you'll note that both forward and backward slash-path navigation variants are generated. Web browsers generally support both, so both need to be tested.
IP Address Encoding
Now we start getting into the truly interesting problems. Although they tend to require complicated mathematics, they also tend to be eminently computable, so we can let the test-case generator do all the nasty calculations, and we can concentrate on correctly handling the cases. I touched upon the first of these earlier. The most common form of an IP address is the familiar dotted decimal quad: www.windevnet.com translates to 66.35.216.85, for example. As the name implies, these numbers are in base 10. As is often the case when working with computers, though, the numbers can just as easily be given in base 8 (octal) or base 16 (hexadecimal, or hex). All that's required is prepending a '0' for octal or a '0x' for hex.
C++ makes converting between strings and numbers or converting a number between bases super easy. Decisions you've chewed over in the past (such as trying to guess the largest number you'll need to handle so you can size the buffer big enough to handle all possible values but not so big you're just wasting space) are completely eliminated when you take advantage of the stringstream class. In most cases, you can simply use Boost's lexical_cast, but as it doesn't provide a way to specify the base of a number, I borrowed its technique to roll my own converter. NumberToStringAsBase (see Listing 2) uses the std::stringstream object (a member of the stream I/O family that happens to use a string as its backing store) to read in a number and extract its string form. Not only does stringstream take care of the datatype conversion, but it also allows you to specify which base to use. StringAsBaseToNumber does the reverse: It converts a stringized number in a given base to the equivalent number.
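As a rough illustration of the stream-based technique (not a reproduction of Listing 2), the following sketch converts in both directions. The names ApplyBase, ToStringInBase, and FromStringInBase are assumptions made for this example, and only bases 8, 10, and 16 are handled because those are the ones iostreams support directly.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Select the numeric base a stream uses for integer I/O (8, 10, or 16 only).
void ApplyBase(std::ios_base& stream, int base)
{
    assert(base == 8 || base == 10 || base == 16);
    if (base == 8)
        stream.setf(std::ios_base::oct, std::ios_base::basefield);
    else if (base == 16)
        stream.setf(std::ios_base::hex, std::ios_base::basefield);
    else
        stream.setf(std::ios_base::dec, std::ios_base::basefield);
}

// Write the number into a string-backed stream and return its text form.
std::string ToStringInBase(unsigned long value, int base)
{
    std::ostringstream stream;
    ApplyBase(stream, base);
    stream << value;
    return stream.str();
}

// Read the number back out of its text form in the given base.
unsigned long FromStringInBase(const std::string& text, int base)
{
    std::istringstream stream(text);
    ApplyBase(stream, base);
    unsigned long value = 0;
    stream >> value;
    return value;
}
```

The stream sizes its own buffer, so there is nothing to guess: ToStringInBase(216, 16) gives "d8", and FromStringInBase("102", 8) gives 66. The '0' or '0x' prefix is left to the caller.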
Not only can the quads be encoded in different bases, but additional digits can be prepended to the octal and hex forms. Each quad is really an 8-bit value, but IP-processing functions often accept larger numbers and simply ignore the extra bits. To use Windows Developer's address again, 66.35.216.85 could also be written as 66.35.216.3157. Thus, not only do we need to generate variations where the various parts of the IP address are encoded as hex and octal, we need to generate variations containing random data prepended to each quad. Given our string-to-number converters and the SelectRandomIndex helper, this becomes as easy as generating a series of random numbers; see Listing 3.
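The arithmetic behind the 3157 example is just bit masking. The sketch below uses two hypothetical helpers, PadQuad and EffectiveQuad, and assumes the address parser in question really does discard everything above the low 8 bits, which is worth confirming for any parser you depend on.

```cpp
#include <cassert>

// Place arbitrary extra bits above an 8-bit quad value.
unsigned long PadQuad(unsigned long quad, unsigned long extraHighBits)
{
    assert(quad <= 0xFF);
    return (extraHighBits << 8) | quad;
}

// The value a parser that ignores the extra bits would actually use.
unsigned long EffectiveQuad(unsigned long paddedQuad)
{
    return paddedQuad & 0xFF;
}
```

Here PadQuad(85, 12) produces 3157 (0xC55), and EffectiveQuad(3157) recovers 85 (0x55).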
Finally, two or more of the quads can be collapsed to a single value. An IP address is really a single number split into pieces to make it easier to work with. Four small numbers are easier both on a human level (people usually find many small values easier to work with than a single large value; would you rather read off 66.35.216.85 or 1109645397 when recording a user's IP address?) and on a technical level (for example, a subnet mask of 255.255.0.0 lets you easily pick out every IP address from 66.35.0.0 to 66.35.255.255).
Collapsing two, three, or all four of the quads is valid (when all four are collapsed, the IP address is called "dotless"). Many web sites will give you a somewhat involved formula for determining the collapsed value. "How to Obscure Any URL" (http://www.pc-help.org/obscure.htm) explains it particularly well, but I find it much simpler to convert each quad to its hexadecimal form, squish the four hex values together, then convert the resulting hex value to the required base. Thus 66.35.216.85 becomes 42.23.D8.55, which becomes 4223D855, which becomes the 1109645397 I previously referenced.
Once again, our string-to-number converters make short work of this task; see Listing 4. The closest to complicated this comes is remembering to prepend a 0 to single-digit values to make the math work correctly.
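Squishing two-digit hex values together is the same as shifting each quad left by 8 bits, so the collapse can also be sketched numerically. CollapseQuads is a hypothetical helper, not the code from Listing 4; it collapses however many adjacent quads it is given.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Collapse one to four adjacent quads into a single number. Each quad occupies
// 8 bits, which is why single-digit hex values need a leading 0 when the same
// thing is done with strings.
std::uint32_t CollapseQuads(const std::vector<unsigned>& quads)
{
    assert(!quads.empty() && quads.size() <= 4);
    std::uint32_t collapsed = 0;
    for (unsigned quad : quads) {
        assert(quad <= 0xFF);
        collapsed = (collapsed << 8) | quad;
    }
    return collapsed;
}
```

Collapsing all of 66.35.216.85 gives 1109645397 (0x4223D855). Collapsing only the last two quads gives 55381, so 66.35.55381 is another spelling of the same address.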
The CIPEncoder class brings all this together. The constructor does the grunt work of converting the domain name to an IP address and splitting the address into its constituent quads, then it has GenerateEncodings (Listing 5 is a stripped-down version) generate a set of encodings using the provided encoder. GenerateEncodings converts each quad to hex and octal, then it generates each collapsed combination and also converts them to hex and octal. Finally a series of mix-and-match cases are generated where one or more quads are converted to a different base while one or more (possibly different) quads are collapsed; I used multiwise combinatorics to minimize the number of these cases while still ensuring fairly complete coverage across the myriad of possibilities. And, of course, every case has random digits prepended to one or more of the quads.
UTF-8 Character Encodings
Now we're finally ready to talk about character encodings. In the introduction, I mentioned the UTF-8 encoding; this is the most common, but UTF-16, UTF-32, and Unicode encodings also exist. Almost any character in a URL can be encoded: the slashes in the protocol, the dots in the IP address, the dots or any other character in the domain name, and any character in the path (including path separators). Further, encodings can themselves be encoded, so "file.txt" can become "file%2Etxt", which can become "file%252Etxt" or "file%25%32%65txt"; a further pass could take it on to "file%25252Etxt" or "file%2525%2532%2565txt". And, of course, any character can be encoded in each pass (not just a character that was encoded in the previous pass), so a single URL could contain characters that aren't encoded at all, characters that have been encoded once or twice, and characters that have been encoded 10 or more times.
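The repeated passes described above can be sketched with two small functions: one that encodes every character, and one that re-encodes only the percent signs left by an earlier pass. Both names are hypothetical, and the sketch handles single-byte characters only.

```cpp
#include <cassert>
#include <string>

// Percent-encode one byte as "%XX" using uppercase hex digits.
std::string PercentEncodeByte(unsigned char value)
{
    static const char kHexDigits[] = "0123456789ABCDEF";
    std::string encoded = "%";
    encoded += kHexDigits[value >> 4];    // high nibble
    encoded += kHexDigits[value & 0x0F];  // low nibble
    return encoded;
}

// Full pass: every character in the text is replaced by its "%XX" form.
std::string EncodeEveryCharacter(const std::string& text)
{
    std::string encoded;
    for (char c : text)
        encoded += PercentEncodeByte(static_cast<unsigned char>(c));
    return encoded;
}

// Partial pass: only '%' is re-encoded, so each existing escape gains a level.
std::string EncodePercentSigns(const std::string& text)
{
    std::string encoded;
    for (char c : text) {
        if (c == '%')
            encoded += "%25";
        else
            encoded += c;
    }
    return encoded;
}
```

Applying EncodePercentSigns to "file%2Etxt" twice gives "file%252Etxt" and then "file%25252Etxt", while EncodeEveryCharacter("%2E") gives "%25%32%45".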
As long as you're dealing with ASCII values, UTF-8 encoding is simple: Simply prepend a '%' to the hexadecimal numerical value of the character. Thus, a period, which has an ASCII value of 46, or 0x2E, is encoded as "%2E". Once you expand out into the wide world of Unicode, however, the math starts getting a little hairy. UTF-8 is a variable-width encoding that specifies the number of subsequent bytes that go with the lead byte by using special sequences of lead bits in the lead byte, a special "I'm a supporting byte" bit sequence in subsequent bytes, and spreading the bits composing the character's value across the remaining bits in each byte.
Even with examples and assistance from a colleague, Lawrence Landauer, it took me a while to understand how this works. Let's start simple: A period is 0x2E, which is 101110 in binary. Looking through Table 1 for the character range that contains our character, we see that it falls into the very first range, so we replace the 'x's in the Encoded Bytes cell with the binary representation of our character. This gives us 00101110, which converts back to hex as 0x2E. We already knew that would be the answer, but it gives us confidence our technique is correct. Now we can follow the exact same process for any other value: convert the value to binary, find the character range containing the value, replace the 'x's in the Encoded Bytes cell with the binary value, then convert back to hex. Thus, we can convert the Lira sign to UTF-8 like so:
1. Look up the character's value. (If you have access to Windows, charmap.exe is very helpful in this regard. If you have Microsoft Office, you'll also want to install the Arial Unicode MS font by checking the Office Shared Features | International Support | Universal Font option.) The Lira sign is 0x20A4.
2. Convert the value to binary. (Again, if you have access to Windows, calc.exe helps out here.) The Lira sign in binary is 10000010100100.
3. Look through Table 1 and find the character range containing the value, then move across and find the byte encoding. 0x20A4 falls between 0x800 and 0xFFFF, so we use 1110xxxx 10xxxxxx 10xxxxxx.
4. Moving right-to-left both in the byte-encoding template and in the binary value, replace each 'x' in the template with the corresponding bit from the binary value. Replace any remaining 'x's with 0. This gives us 11100010 10000010 10100100.
5. Convert the binary value to hex. This gives us 0xE2 0x82 0xA4.
If my explanation isn't helping, the book Writing Secure Code has a readable explanation of all this (see References); or if you're a glutton for punishment, RFC 2279 (http://www.ietf.org/rfc/rfc2279.txt) is the official definition.
We programmers are never ones to leave a good thing alone. While the UTF-8 specification makes it quite clear you should always use the shortest possible representation to encode a character, nothing in the bit layout stops you from using a longer (known as "overlong") encoding as well. If we repeat the Lira example but use the immediately following template (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx), we end up with a result of 0xF0 0x82 0x82 0xA4.
Lest you think you won't run into this in the wild, Microsoft Security Bulletin MS00-057 describes a bug in Microsoft Internet Information Server exploitable via overlong UTF-8 (which the bulletin coyly calls "a particular type of malformed URL") that allowed an attacker to bypass security and use path navigation injection to pass commands to the command shell.
As it happens, the test-case generator will create UTF-8 and overlong UTF-8 cases for you via the CUtf8Encoder class. ConvertToUtf8 (see Listing 6) goes through the same steps we did to convert a character's value to UTF-8. After finding the first character range into which the character fits, it spreads the character's value's bits across the empty bits of the template, then converts each byte to its equivalent string-ized hex value. It doesn't stop there, however, but goes on to create all valid overlong encodings as well. CUtf8Encoder's constructor allows you to specify whether to generate overlong UTF-8 and how overlong to go; ConvertToUtf8 uses this to determine which of the generated encodings to return.
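The template-filling procedure, including the overlong forms, can be sketched in a few lines. This is an illustration under my own naming, not the ConvertToUtf8 from Listing 6: ShortestUtf8Length, EncodeUtf8, and ToPercentString are hypothetical, and only the one- to four-byte templates are covered.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Number of bytes in the shortest (legal) UTF-8 form of a code point.
int ShortestUtf8Length(std::uint32_t codePoint)
{
    if (codePoint < 0x80) return 1;
    if (codePoint < 0x800) return 2;
    if (codePoint < 0x10000) return 3;
    return 4;
}

// Encode codePoint with exactly byteCount bytes. Asking for more bytes than
// the shortest form produces an overlong encoding.
std::vector<unsigned char> EncodeUtf8(std::uint32_t codePoint, int byteCount)
{
    static const unsigned char kLeadBits[] = {0x00, 0xC0, 0xE0, 0xF0};
    assert(byteCount >= ShortestUtf8Length(codePoint) && byteCount <= 4);

    std::vector<unsigned char> bytes(byteCount);
    // Fill the "10xxxxxx" supporting bytes right to left, six bits at a time.
    for (int i = byteCount - 1; i > 0; --i) {
        bytes[i] = static_cast<unsigned char>(0x80 | (codePoint & 0x3F));
        codePoint >>= 6;
    }
    // Whatever bits remain go into the lead byte; unused 'x's stay 0.
    bytes[0] = static_cast<unsigned char>(kLeadBits[byteCount - 1] | codePoint);
    return bytes;
}

// Render the bytes as a "%XX%XX..." string for use in a URL.
std::string ToPercentString(const std::vector<unsigned char>& bytes)
{
    static const char kHexDigits[] = "0123456789ABCDEF";
    std::string text;
    for (unsigned char byte : bytes) {
        text += '%';
        text += kHexDigits[byte >> 4];
        text += kHexDigits[byte & 0x0F];
    }
    return text;
}
```

For the Lira sign, EncodeUtf8(0x20A4, 3) yields "%E2%82%A4" and EncodeUtf8(0x20A4, 4) yields the overlong "%F0%82%82%A4". A period forced into two bytes becomes "%C0%AE", which a decoder that skips the shortest-form check will still read as '.'.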
Unicode Character Encodings
The Unicode (UCS-2 Unicode, to be specific) encoding is much simpler than the UTF-8 encoding: Simply convert the character's hexadecimal Unicode value to a four character string, then prepend "%u". Thus a period would be "%u002E", and the Lira sign would be "%u20A4".
You're not out of the woods just yet, however. The lower and upper ASCII characters also have full-width variants that live in the 0xFF00 through 0xFFEF range. A full-width character is just another Unicode character, however, so the full-width period would be "%uFF0E". The CUnicodeMapper class (download the full code) populates a vector with the standard-to-full-width mapping for each character; its GetCharacterFromMap function takes a character (which can be either standard or full width) and returns the matching standard or full-width character. (UTF-8 can also encode full-width Unicode characters.)
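Both steps are small enough to sketch directly. EncodePercentU and ToFullWidth are hypothetical names, and instead of the lookup vector that CUnicodeMapper builds, this sketch relies on the fact that the printable ASCII characters 0x21 through 0x7E have full-width forms at a fixed offset of 0xFEE0. Other characters are returned unchanged.

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Format a 16-bit value as "%uXXXX" with four uppercase hex digits.
std::string EncodePercentU(std::uint16_t codeUnit)
{
    std::ostringstream stream;
    stream << "%u" << std::uppercase << std::hex << std::setfill('0')
           << std::setw(4) << static_cast<unsigned>(codeUnit);
    return stream.str();
}

// Map a printable ASCII character to its full-width variant (U+FF01..U+FF5E).
std::uint16_t ToFullWidth(std::uint16_t codeUnit)
{
    const bool hasFullWidthForm = codeUnit >= 0x21 && codeUnit <= 0x7E;
    return hasFullWidthForm ? static_cast<std::uint16_t>(codeUnit + 0xFEE0)
                            : codeUnit;
}
```

EncodePercentU('.') gives "%u002E", and EncodePercentU(ToFullWidth('.')) gives "%uFF0E".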
Using the Encoders
We've already covered user authentication, path navigation injection, and the specifics of how the UTF-8 and Unicode encoding schemes work, but I've mentioned only obliquely how the encoding schemes are applied to the URL and what the encoding level means in this context. Recall that not only can a plain character be encoded, but that the encoded form can itself be encoded, ad infinitum. The CCharacterEncoder class uses a specific encoder (i.e., null which does nothing, Unicode, or UTF-8) to encode a single character. If given a nonzero encoding level, it also encodes the character multiple times as specified. (No upper limit is imposed, but as just a level of three or four will generate many hundreds of test cases, you generally won't want to go too high.) Two variants are stored at each encoding level: a partial encoding, where only one character in the string is encoded; and a full encoding, where every character in the string is encoded. Thus, encoding a backslash using the UTF-8 encoder might generate the encodings in Table 2.
CStringEncoder (Listing 7 is its header; Listing 8 is its body) builds on CCharacterEncoder and is typical of how all of the encoders work. Arguments to the constructor include the string to encode, the encoding level, and the encoder to use. CreateSingleCharacterEncodings picks a random index into the string to encode, generates encodings for the character at that index, then creates a version of the original string where the source character is replaced with the generated full and partial encodings of the character. CreateEntireStringFullEncodings, on the other hand, encodes every character in the string at every level. This will quickly generate extremely long strings, so you can test for buffer overruns at the same time you're testing URL handling. Finally, CreateEntireStringRandomEncodings also encodes every character in the string, but rather than always using the encoding for the current level, it randomly selects an encoding level. Thus, the resulting strings will contain unencoded characters, singly encoded characters, doubly encoded characters, and so on.
Generating the Test Cases
Listing 9 shows how to generate and use the test cases. The action starts with EncodeUrl. EncodeUrl saves off the source URL, the encoding level (more on that in a moment), whether it should generate normal or full-width Unicode, whether it should generate overlong UTF-8 and if so how overlong to go, and the maximum number of characters to prepend to IP addresses. Then it calls EncodeUrlUsingEncoder once for each encoder (null, Unicode, and UTF-8). EncodeUrlUsingEncoder (Listing 10) wraps all the different encodings and URL manglings we have discussed, mixes in a few more, then spits out anywhere from one to hundreds of test cases.
The first thing EncodeUrlUsingEncoder does is add the raw URL to the collection of test cases, passing it through a CUrl to ensure it is fully canonicalized. If an encoding level of zero was specified, nothing else happens.
If the requested encoding level was one or higher, the raw URL will be run once through UrlEscape to encode any "dangerous" characters. MSDN defines "dangerous" as "those characters that may be altered during transport across the Internet, [which] include the (<, >, ", #, {, }, |, \, ^, ~, [, ], and ') characters" (see References). This is the most common form of altered URLs you will encounter.
Requesting an encoding level of two or higher will get you all the encodings we've talked about: User authentication will be inserted, path navigation will be injected, and the IP address will be encoded and collapsed. In addition, various special characters (e.g., dots in the domain name and IP address, dots and slashes in the path), random characters in the domain name and path, the entire domain name, and the entire path will be encoded.
Verifying URL Handling Failures
If you've jumped ahead of the rest of the class and tried these test cases on Internet Explorer (I generally use IE as a litmus test for whether a particular URL variant is valid), you've probably found that some of them don't work. Does this mean the encoding is incorrect? No, it just means IE doesn't handle that particular URL. Some of the time this is intended behavior; IE5 and later don't support dotless IP addresses, for example. Other times it's unclear why it doesn't work (I've found that neither Internet Explorer nor any of the Win32 canonicalization APIs accepts Unicode-encoded URLs, for example). Regardless, the result is you have to pick through a large number of test cases and identify the ones that don't work, and you have to do so on every combination of OS and Internet Explorer version you intend to support. To make that task easier, I've included UrlVerifier in the online source code.
UrlVerifier does two things: It generates a set of test cases by passing its command-line arguments on to UrlCanonicalizationTestCases, then it determines whether each test case is valid (see the sidebar entitled "Test Cases Everywhere" for an explanation of why I wrapped EncodeUrlUsingEncoder in UrlCanonicalizationTestCases rather than using it directly). To do so, it first repeatedly passes the URL through a canonicalizing function until it stops changing. (The most common URL canonicalization mistake is to only do one pass through the canonicalization function.) Hopefully, I've convinced you that rolling your own such function would be rather complicated; in fact, there's no reason to do so as Windows provides not one, not two, but three separate canonicalizing APIs: InternetCanonicalizeUrl, UrlCanonicalize, and UrlUnescape. Each one takes slightly different options and works in a slightly different manner, but as UrlVerifier's output shows, UrlUnescape seems to do the best job.
After canonicalizing each URL using each of the canonicalization functions, UrlVerifier attempts to download the presumably decoded file using URLDownloadToCacheFile. The results of each operation are written to a log file (see Table 3 for a portion of the output from running UrlVerifier on http://www.windevnet.com/wdm/default.html) from which you can easily determine which cases Windows thinks are valid and, thus, which cases your application should handle. However, if you use different APIs than UrlVerifier, be sure to run your own tests with those APIs rather than taking UrlVerifier's word on which variations should work. And if you're rolling your own URL support, be especially vigilant when deciding which cases to handle.
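The "repeat until it stops changing" loop is the part most often skipped, so here is a portable sketch of it. It deliberately does not call the Windows APIs: DecodeOnce is a hypothetical stand-in that only undoes "%XX" escapes, where a real implementation would call one of the canonicalizing functions named above.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Value of a hex digit, or -1 if the character is not one.
int HexValue(char c)
{
    const unsigned char u = static_cast<unsigned char>(c);
    if (!std::isxdigit(u)) return -1;
    if (std::isdigit(u)) return u - '0';
    return std::tolower(u) - 'a' + 10;
}

// Undo one level of "%XX" escapes; malformed sequences are copied through.
std::string DecodeOnce(const std::string& text)
{
    std::string decoded;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (text[i] == '%' && i + 2 < text.size() &&
            HexValue(text[i + 1]) >= 0 && HexValue(text[i + 2]) >= 0) {
            decoded += static_cast<char>(HexValue(text[i + 1]) * 16 +
                                         HexValue(text[i + 2]));
            i += 2;  // skip the two hex digits just consumed
        } else {
            decoded += text[i];
        }
    }
    return decoded;
}

// Keep decoding until a pass leaves the text unchanged. Every pass that
// changes the text also shortens it, so the loop always terminates.
std::string DecodeUntilStable(std::string text)
{
    for (;;) {
        const std::string next = DecodeOnce(text);
        if (next == text)
            return text;
        text = next;
    }
}
```

A single pass over "file%25252Etxt" only reaches "file%252Etxt"; DecodeUntilStable takes it all the way to "file.txt", which is the form any extension check should see.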
Summary
There are many different ways to mangle a URL, but the solution to handling all of these is to run every URL through InternetCanonicalizeUrl over and over until it stops changing, and to do so the moment you're handed the URL, before you've made any decisions about the URL or taken any action based on it. The Internet can be a wild and woolly place, but with the help of a few simple API calls, you can effectively tame it. The sidebar "Handling Illegal Filenames" discusses another test-case generator I've provided to help tame another area rife with pitfalls: illegal filenames.
Acknowledgment
Thanks much to Lawrence Landauer, whose patient explanation of how UTF-8 works and assistance with the UTF-8 encoding algorithm was invaluable!
References
Microsoft Security Bulletin MS98-016. http://www.microsoft.com/technet/treeview/default.asp?url=/technet/security/bulletin/ms98-016.asp.
Microsoft Security Bulletin MS01-051. http://www.microsoft.com/technet/treeview/default.asp?url=/technet/security/bulletin/ms01-051.asp.
Boost Random Number Library. http://www.boost.org/libs/random/index.html.
Boost Header lexical_cast. http://www.boost.org/libs/conversion/lexical_cast.htm.
"How to Obscure Any URL." http://www.pc-help.org/obscure.htm.
Writing Secure Code, pp. 323-324, Michael Howard and David LeBlanc. Microsoft Corporation, 2002.
RFC 2279. http://www.ietf.org/rfc/rfc2279.txt.
Microsoft Security Bulletin MS00-057. http://www.microsoft.com/technet/treeview/default.asp?url=/technet/security/bulletin/ms00-057.asp.
Michael J. Hunter is a lead developer at Humbug Reality where he works on whatever catches his interest. Michael's alter ego is a tester at a major software company who has so much fun finding bugs he doesn't mind funding Michael's flights of fancy.