Passwords (original) (raw)

Passwords are widely used for authentication, especially on the web. What practices should be used to make passwords as secure as possible?

14.1. Risks and weaknesses of passwords

Passwords have some well-known usability shortcomings. Security experts recommend that people pick long, strong passwords, but long random passwords are harder to remember. In practice, users are more likely to choose memorable passwords, which may be easier to guess. Also, rather than using a different, independently chosen password for each site, users often reuse passwords across multiple sites, for ease of memorization. This has security consequences as well.

From a security perspective, we can identify a number of security risks associated with password authentication:

We’ll look at defenses and mitigations for each of these risks, below.

14.2. Mitigations for eavesdropping

There is a straightforward defense against eavesdropping: we can use SSL (also known as TLS). In other words, instead of connecting to the web site via http, the connection can be made over https. This will ensure that the username and password are sent over an encrypted channel, so an eavesdropper cannot learn the user’s password.

Today, many sites do use SSL, but many do not.

Another possible defense would be to use more advanced cryptographic protocols. For instance, one could imagine a challenge-response protocol where the server sends your browser a random challenge \(r\); then the browser takes the user’s password \(w\), computes \(H(w,r)\) where \(H\) is a cryptographic hash (e.g., SHA256), and sends the result to the server. In this scheme, the user’s password never leaves the browser and is never sent over the network, which defends against eavesdroppers. Such a scheme could be implemented today with Javascript on the login page, but it has little or no advantage over SSL (and it has some shortcomings compared to using SSL), so the standard defense is to simply use SSL.

14.3. Mitigations for client-side malware

It is very difficult to protect against client-side malware.

To defend against keyloggers, some people have proposed using randomized virtual keyboards: a keyboard is displayed on the screen, with the order of letters and numbers randomly permuted, and the user is asked to click on the characters of their password. This way, a keylogger (which only logs the key strokes you enter) would not learn your password. However, it is easy for malware to defeat this scheme: for instance, the malware could simply record the location of each mouse click and take a screen shot each time you click the mouse.

In practice, if you type your password into your computer and your computer has malware on it, then the attacker learns your password. It is hard to defend against this; passwords are fundamentally insecure in this threat model. The main defense is two-factor authentication, where we combine the password with some other form of authentication (e.g., a SMS sent to your phone).

14.4. Online guessing attacks

How easy are online guessing attacks? Researchers have studied the statistics of passwords as used in the field, and the results suggest that online guessing attacks are a realistic threat. According to one source, the five most commonly used passwords are 123456, password, 12345678, qwerty, abc123. Of course, a smart attacker will start by guessing the most likely possibilities for the password first before moving on to less likely possibilities. A careful measurement study found that with a dictionary of the 10 most common passwords, you can expect to find about 1% of users’ passwords. In other words, about 1% of users choose a password from among the top 10 most commonly used passwords. It also found that, with a dictionary of the \(2^{20}\) most commonly used passwords, you can expect to guess about 50% of users’ passwords: about half of all users will have a password that is in that dictionary.

One implication is that, if there are no limits on how many guesses an attacker is allowed to make, an attacker can have a good chance of guessing a user’s password correctly. We can distinguish targeted from untargeted attacks. A targeted attack is where the attacker has a particular target user in mind and wants to learn their password; an untargeted attack is where the attacker just wants to guess some user’s password, but doesn’t care which user gets hacked. An untargeted attack, for instance, might be relevant if the attacker wants to take over some existing Gmail account and send lots of spam from it.

The statistics above let us estimate the work an attacker would have to do in each of these attack settings. For an untargeted attack, the attacker might try 10 guesses at the password against each of a large list of accounts. The attacker can expect to have to try about 100 accounts, and thus make a total of about 1000 login attempts, to guess one user’s password correctly. Since the process of guessing a password and seeing if it is correct can be automated, resistance against untargeted attacks is very low, given how users tend to choose their passwords in practice.

For a targeted attack, the attacker’s workload has more variance. If the attacker is extremely lucky, he might succeed within the first 10 guesses (happens 1% of the time). If the attacker is mildly lucky, he might succeed after about one million guesses (happens half of the time). If the attacker is unlucky, it might take a lot more than one million guesses. If each attempt takes 1 second (to send the request to the server and wait for the response), making \(2^{20}\) guesses will take about 11 days, and the attack is very noticeable (easily detectable by the server). So, targeted attacks are possible, but the attacker is not guaranteed a success, and it might take quite a few attempts.

14.5. Mitigations for online guessing attacks

Let’s explore some possible mitigations for online guessing:

14.6. Mitigations for server compromise

The natural way to implement password authentication is for the website to store the passwords of all of its passwords in the clear, in its database. Unfortunately, this practice is bad for security. If the site gets hacked and the attacker downloads a copy of the database, then now all of the passwords are breached; recovery may be painful. Even worse, because users often reuse their passwords on multiple sites, such a security breach may now make it easier for the attacker to break into the user’s accounts on other websites.

For these reasons, security experts recommend that sites avoid storing passwords in the clear. Unfortunately, sites don’t always follow this advice. For example, in 2009, the Rockyou social network got hacked, and the hackers stole the passwords of all 32 million of their users and posted them on the Internet; not good. One study estimates that about 30–40% of sites still store passwords in the clear.

14.7. Password hashing

If storing passwords in the clear is not a good idea, what can we do that is better? One simple approach is to hash each password with a cryptographic hash function (say, SHA256), and store the hash value (not the password) in the database.

In more detail, when Alice creates her account and enters her password \(w\), the system can hash \(w\) to get \(H(w)\) and store \(H(w)\) in the user database. When Alice returns and attempts to log in, she provides a password, say \(w'\); the system can check whether this is correct by computing the hash \(H(w')\) of \(w'\) and checking whether \(H(w')\) matches what is in the user database.

Notice that the properties of cryptographic hash functions are very convenient for this application. Because cryptographic hash functions are one-way, it should be hard to recover the password \(w\) from the hash \(H(w)\); so if there is a security breach and the attacker steals a copy of the database, no cleartext passwords are revealed, and it should be hard for the attacker to invert the hash and find the user’s hashes. That’s the idea, anyway.

Unfortunately, this simple idea has some shortcomings:

14.8. Password hashing, done right

With these shortcomings in mind, we can now identify a better way to store passwords on the server.

First, we can eliminate the amortized guessing attack by incorporating randomness into the hashing process. When we create a new account for some user, we pick a random salt \(s\). The salt is a value whose only purpose is to be different for each user; it doesn’t need to be secret. The password hash for password \(w\) is \(H(w,s)\). Notice that the password hash depends on the salt, so even if Alice and Bob share the same password \(w\), they will likely end up with different hashes (Alice will have \(H(w,s_A)\) and Bob \(H(w,s_B)\), where most likely \(s_A\ne s_B\)). Also, to enable the server to authenticate each user in the future, the salt for each user is stored in the user database.

Instead of storing \(H(w)\) in the database, we store \(s,H(w,s)\) in the database, where \(s\) is a random salt. Notice that \(s\) is stored in cleartext, so if the attacker gets a copy of this database, the attacker will see the value of \(s\). That’s OK; the main point is that each user will have a different salt, so the attacker can no longer use the amortized guessing attack above. For instance, if the salt for Alice is \(s_A\), the attacker can try guesses \(g_1,g_2,\dots,g_n\) at her password by computing \(H(g_1,s_A),\dots,H(g_n,s_A)\) and comparing each one against her password hash \(H(w_A,s_A)\). But now when the attacker wants to guess Bob’s password, he can’t reuse any of that computation; he’ll need to compute a new, different set of hashes, i.e., \(H(g_1,s_B),\dots,H(g_n,s_B)\), where \(s_B\) is the salt for Bob.

Salting is good, because it increases the attacker’s workload to invert many password hashes. However, it is not enough. As the back-of-the-envelope calculation above illustrated, an attacker might still be able to try \(2^{20}\) guesses at the password against each of 100 million users’ password hashes in about a day. That’s not enough to prevent attacks. For instance, when LinkedIn had a security breach that exposed the password hashes of all of their users, it was discovered that they were using SHA256, and consequently one researcher was able to recover 90% of their users’ passwords in just 6 days. Not good.

So, the second improvement is to use a slow hash. The reason that offline password guessing is so efficient is because SHA256 is so fast. If we had a cryptographic hash that was very slow—say, it took 1 millisecond to compute—then offline password guessing would be much slower; an attacker could only try 1000 guesses at the password per second.

One way to take a fast hash function and make it slower is by iterating it. In other words, if \(H\) is a cryptographic hash function like SHA256, define the function \(F\) by

\[F(x) = H(H(H(\cdots(H(x)) \cdots))),\]

where we have iteratively applied \(H\) \(n\) times. Now \(F\) is a good cryptographic hash function, and evaluating \(F\) will be \(n\) times slower than evaluating \(H\). This gives us a tunable parameter that lets us choose just how slow we want the hash function to be.

Therefore, our final construction is to store \(s,F(w,s)\) in the database, where \(s\) is a randomly chosen salt, and \(F\) is a slow hash constructed as above. In other words, we store

\[s,H(H(H(\cdots(H(w,s)) \cdots)))\]

in the database.

How slow should the hash function \(F\) be? In other words, how should we choose \(n\)? On the one hand, for security, we’d like \(n\) to be as large as possible: the larger it is, the slower offline password guessing will be. On the other hand, we can’t make it too large, because that will slow down the legitimate server: each time a user tries to log in, the server needs to evaluate \(F\) on the password that was provided. With these two considerations, we can now choose the parameter \(n\) to provide as much security as possible while keeping the performance overhead of slow hashing down to something unnoticeable.

For instance, suppose we have a site that expects to see at most 10 logins per second (that would be a pretty high-traffic site). Then we could choose \(n\) so that evaluating \(F\) takes about one millisecond. Now the legitimate server can expect to spend 1% of its CPU power on performing password hashes—a small performance hit. The benefit is that, if the server should be compromised, offline password guessing attacks will take the attacker a lot longer. With the example parameters above, instead of taking 1 day to try \(2^{20}\) candidate passwords against all 100 million users, it might take the attacker about 3000 machine-years. That’s a real improvement.

In practice, there are several existing schemes for slow hashing that you can use: Scrypt, Bcrypt, or PBKDF2. They all use some variant of the “iterated hashing” trick mentioned above.

14.9. Implications for cryptography

The analysis above has implications for the use of human-memorable passwords or passphrases for cryptography.

Suppose we’re building a file encryption tool. It is tempting to prompt the user to enter in a password \(w\), hash it using a cryptographic hash function (e.g., SHA256), use \(k=H(w)\) as a symmetric key, and encrypt the file under \(k\). Unfortunately, this has poor security. An attacker could try the \(2^{20}\) most common passwords, hash each one, try decrypting under that key, and see if the decryption looks plausibly like plaintext. Since SHA256 is fast, this attack will be very fast, say one millisecond; and based upon the statistics mentioned above, this attack might succeed half of the time or so.

You can do a little bit better if you use a slow hash to generate the key instead of SHA256. Unfortunately, this isn’t enough to get strong security. For example, suppose we use a slow hash tuned to take 1 millisecond to compute the hash function. Then the attacker can make 1000 guesses per second, and it’ll take only about 15 minutes to try all \(2^{20}\) most likely passwords; 15 minutes to have a 50% chance of breaking the crypto doesn’t sound so hot.

The unavoidable conclusion is that deriving cryptographic keys from passwords, passphrases, or human-memorable secrets is usually not such a great idea. Password-based keys tend to have weak security, so they should be avoided whenever possible. Instead, it is better to use a truly random cryptographic key, e.g., a truly random 128-bit AES key, and find some way for the user to store it securely.

14.10. Alternatives to passwords

Finally, it is worth noting that there are many alternatives to passwords, for authenticating to a server. Some examples include:

We most likely won’t have time to discuss any of these further in this class, but they are worth knowing about, for situations where you need more security than passwords can provide.

14.11. Summary

The bottom line is: don’t store passwords in the clear. Instead, sites should store passwords in hashed form, using a slow cryptographic hash function and a random salt. If the user’s password is \(w\), one can store

\[s,H(H(H(\cdots(H(w,s)) \cdots)))\]

in the database, where \(s\) is a random salt chosen randomly for that user and \(H\) is a standard cryptographic hash function.