InternetDomainNameExplained (original) (raw)

InternetDomainName

Introduction

InternetDomainName is a useful tool for parsing and manipulating domain names. It can be used as a validator, a component extractor, and as a value type for passing around domain names in a type-safe way.

However, there are some aspects of InternetDomainName behavior which may be surprising, and which can lead to bugs in calling code. This document addresses these concerns.

Details

Public suffixes and private domains

An InternetDomainName object is guaranteed to be syntactically valid according to relevant RFC specifications, but it is not guaranteed to correspond to an actual addressable domain on the Internet. It is impossible to do that without doing a net lookup of the domain and trying to contact it, and that is unacceptable overhead for most common cases.

Still, it is often very useful to determine whether a given domain namemight represent an actual domain on the Internet. For this purpose, we use data from the Public Suffix List (PSL), a list maintained by the Mozilla Foundation. There are methods on InternetDomainNameto determine the relationship of a given domain to the PSL. To put it in its most basic terms, if domain.hasPublicSuffix() returns true, then the domainmight correspond to a real Internet address; otherwise, it almost certainly does not.

At this point we need to back up and define some terms. There are four terms of interest:

It's worth reading the linked articles carefully before proceeding.

A major source of confusion is that people say "TLD" when they mean either "registry suffix" or "public suffix". All three of these are independent concepts. So, for example,

This confusion is especially dangerous because TLD and registry suffix have crisp, formal definitions, while public suffix does not. In the end, a public suffix is something that a credible source has asked the PSL maintainers to add to the list. Credible sources include ICANN and country-domain managers, but also include private companies offering services that share the characteristics that (fuzzily) define a public suffix -- independent subdomains andsupercookie suppression. So, for example, many Google-owned domains (e.g. blogspot.com) are included in the PSL.

Getting back to InternetDomainName, as long as we limit ourselves to usinghasPublicSuffix() to validate that the domain is a plausible Internet domain, all is well. The danger arises from the methods that identify or extract the "top private domain". From a technical point of view, the top private domain is simply the rightmost superdomain preceding the public suffix. So for example,www.foo.co.uk has a public suffix of co.uk, and a top private domain offoo.co.uk.

As noted in the documentation on isUnderPublicSuffix(),isTopPrivateDomain(), and topPrivateDomain(), the only thing these methods are (mostly) reliable for is determining where one can set cookies. However, what many people are actually trying to do is find the "real" domain, or the "owner" domain, from a subdomain. For example, in mail.google.com they would like to identify google.com as the owner domain. So they write

InternetDomainName owner = InternetDomainName.from("mail.google.com").topPrivateDomain();

...and sure enough, owner ends up with the domain google.com. Indeed, this idiom (and ones like it) work a great deal of the time. It seems intuitive that "the domain under the public suffix" should be semantically equivalent to "the owner domain".

But it's not, and therein lies the problem. Consider blogspot.com, which appears in the PSL. While it has the characteristics of a public suffix -- people can register domains under it (for their blogs), and cookies should not be allowed to be set on it (to prevent cross-blog cookie shenanigans), it is itself an addressable domain on the Internet (which happens to redirect toblogger.com at time of writing, but that could easily change).

So if one uses the idiom above on foo.blogspot.com, owner will be that same domain, foo.blogspot.com. This is the right answer for cookie-setting anymany common applications, but it is obviously surprising to many people.

For those rare applications where the goal is actually to determine the domain name that was purchased from a registry, the correct abstraction is not public suffix but registry suffix. If we alter the above code to use the parallel methods for registry suffixes:

InternetDomainName owner = InternetDomainName.from("foo.blogspot.com").topDomainUnderRegistrySuffix();

...then we end up with owner set to blogspot.com.

This handy table summarizes the differences between the various types of suffix:

InternetDomainName publicSuffix() topPrivateDomain() registrySuffix() topDomainUnderRegistrySuffix()
google.com com google.com com google.com
co.uk co.uk N/A co.uk N/A
www.google.co.uk co.uk google.co.uk co.uk google.co.uk
foo.blogspot.com blogspot.com foo.blogspot.com com blogspot.com
google.invalid N/A N/A N/A N/A

The big lessons here are:

Remember that if you do not heed this advice, your code will appear to work on a huge variety of inputs...but the failure cases are all bugs just waiting to happen, and the set of failure cases will change as PSL updates are incorporated into the code underlying InternetDomainName.