Moving from reCAPTCHA to hCaptcha (original) (raw)
2020-04-08
6 min read
We recently migrated the CAPTCHA provider we use from Google's reCAPTCHA to a service provided by the independent hCaptcha. We're excited about this change because it helps address a privacy concern inherent to relying on a Google service that we've had for some time and also gives us more flexibility to customize the CAPTCHAs we show. Since this change potentially impacts all Cloudflare customers, we wanted to walk through the rationale in more detail.
CAPTCHAs at Cloudflare
One of the services Cloudflare provides is a way to block malicious automated ("bot") traffic. We use a number of techniques to accomplish that. When we are confident something is malicious bot activity we block it entirely. When we are confident it's good human traffic (or a good bot like a search engine crawler) then we let it through. But, sometimes, when we're not 100% sure if something is malicious or good we issue it a “challenge”.
We have different types of challenges, some are entirely automatic, but one requires human intervention. Those challenges are known as CAPTCHAs. That's an acronym for Completely Automated Public Turing Test to Tell Computers and Humans Apart (a few Ts are dropped otherwise it'd be CAPTTTCHA). These are the prompts to enter squiggly letters in a box or identify traffic lights or cross walks. Generally, they're supposed to be something that's easy for humans to do but hard for machines.
Since Cloudflare's earliest days, we have used Google's reCAPTCHA service. ReCAPTCHA started as a research project out of Carnegie Mellon University in 2007. Google acquired the project in 2009, around the same time that Cloudflare was first getting started. Google provided reCAPTCHA for free in exchange for data from the service being used to train its visual identification systems. When we were looking for a CAPTCHA for Cloudflare, we chose reCAPTCHA because it was effective, could scale, and was offered for free — which was important since so many of Cloudflare's customers use our free service.
Privacy and Blocking Concerns
Since those early days, some customers have expressed concerns about using a Google service to serve CAPTCHAs. Google's business is targeting users with advertising. Cloudflare's is not. We have strict privacy commitments. We were able to get comfortable with the Privacy Policy around reCAPTCHA, but understood why some of our customers were concerned about feeding more data to Google.
We also had issues in some regions, such as China, where Google's services are intermittently blocked. China alone accounts for 25 percent of all Internet users. Given that some subset of those could not access Cloudflare's customers if they triggered a CAPTCHA was always concerning to us.
Over the years, the privacy and blocking concerns were enough to cause us to think about switching from reCAPTCHA. But, like most technology companies, it was difficult to prioritize removing something that was largely working instead of brand new features and functionality for our customers.
Google's Changing Business Model
Earlier this year, Google informed us that they were going to begin charging for reCAPTCHA. That is entirely within their right. Cloudflare, given our volume, no doubt imposed significant costs on the reCAPTCHA service, even for Google.
Again, this is entirely rational for Google. If the value of the image classification training did not exceed those costs, it makes perfect sense for Google to ask for payment for the service they provide. In our case, that would have added millions of dollars in annual costs just to continue to use reCAPTCHA for our free users. That was finally enough of an impetus for us to look for a better alternative.
A Better CAPTCHA
We evaluated a number of CAPTCHA vendors as well as building a system ourselves. In the end, hCaptcha emerged as the best alternative to reCAPTCHA. We liked a number of things about the hCaptcha solutions: 1) they don't sell personal data; they collect only minimum necessary personal data, they are transparent in describing the info they collect and how they use and/or disclose it, and they agreed to only use such data to provide the hCaptcha service to Cloudflare; 2) performance (both in speed and solve rates) was as good as or better than expected during our A/B testing; 3) it has a robust solution for visually impaired and other users with accessibility challenges; 4) it supported Privacy Pass to reduce the frequency of CAPTCHAs; 5) it worked in regions where Google was blocked; and 6) the hCaptcha team was nimble and responsive in a way that was refreshing.
The standard hCaptcha business model was similar to how reCAPTCHA started. They planned to charge customers that needed image classification data and pay publishers to install their CAPTCHA on their sites. Sounded great to us, but, unfortunately, while that may work well for most publishers, it doesn't at Cloudflare's scale.
We worked with hCaptcha in two ways. First, we are in the process of leveraging our Workers platform to bear much of the technical load of the CAPTCHAs and, in doing so, reduce their costs. And, second, we proposed that rather than them paying us we pay them. This ensured they had the resources to scale their service to meet our needs. While that has imposed some additional costs, those costs were a fraction of what reCAPTCHA would have. And, in exchange, we have a much more flexible CAPTCHA platform and a much more responsive team.
When do Customers Serve CAPTCHAs?
When we first started working on this project, the assumption was that Cloudflare Bot Management and Firewall Rules would be by far the largest consumer of CAPTCHAs. This was somewhat correct. While Firewall/Bots was the #1 consumer, it only was a bit over 50% of our CAPTCHAs served.
These are the breakdowns of when Cloudflare customers asked us to serve a CAPTCHA on their zones, by total CAPTCHAs served.
Source
Firewall and Bot Rules
54.8%
IP Firewall
18.6%
Security Level
16.8%
DDoS
6.3%
Rate Limiting
1.7%
WAF Rules
1.5%
Other
0.3%
Our Firewall and Bot Rules are at the top and are the majority of the CAPTCHAs served by Cloudflare. These are rules written by our customers that specifically throw a CAPTCHA when the rule is matched. Examples of these include firing a Captcha if a Bot Management score is below a threshold where you believe it is likely that the connection is automated, but the score is above a threshold where you are not certain. Another common rule in this bucket is to CAPTCHA 100% of all traffic behind a site or specific endpoint. Customers may be doing this to limit connections to their origins, or to slow down automated systems from doing something like credential stuffing on a login page or polluting registration data. This leads to some sites on Cloudflare serving hundreds of millions of CAPTCHAs per day.
The second most popular is our IP Firewall. This is similar to the Firewall and Bot Rules, but much less granular at the IP, ASN, or country level. The majority of CAPTCHAs for this category are done for rules written at the ASN or country level. Presumably our customers find some level of distrust with a particular ASN (for example, why would there be supposed user traffic coming from a cloud infrastructure provider?) or are being attacked from specific countries.
Next comes Security Levels. Security levels have two distinct use cases: 1) as a blunt tool for IP address reputation and 2) I’m Under Attack Mode. While we recommend to customers that they only use I’m Under Attack Mode while under an active denial of service attack, some customers leave the feature on 100% of the time as a rudimentary form of rate limiting and filtering.
The final major use of CAPTCHA is through one of our automated systems: recently our Denial of Service protection engineering team taught Gatebot to use CAPTCHAs to mitigate small floods in specific scenarios. Gatebot can now write temporary rules that result in CAPTCHAs being shown to attackers.
Lastly, there were also some customers selecting it as an override action for their Rate Limiting or Managed WAF rulesets.
We also took a look at which types of customers were serving the CAPTCHAs. Over a week’s period of time (normalizing for attacks), our free customers configured their zones to serve roughly 40-60% of the total CAPTCHAs served by Cloudflare. Of our paying customers, there is a generally even split between our pay-as-you-go and our enterprise customers. Overall, we have measured that Cloudflare will show multiple millions of CAPTCHAs per second when one or more of our customers are under attack.
Fixing Challenges
Whenever we change any part of Cloudflare's systems, it makes things significantly better for some, but temporarily worse for others. We and the hCaptcha team are committed to addressing any problems that come up. If you or your users see issues with the new hCaptcha implementation, please post on the forum or open a Support ticket with as much detail as possible.
Whenever possible, please include the RayID that usually appears on the footer of the CAPTCHA page so we can track down what went wrong.
Over time, we believe visual (and audio) CAPTCHAs are an imperfect answer to a number of difficult problems. Cloudflare is continuing work to minimize and hopefully eventually eliminate altogether the number of CAPTCHAs we issue and we will be excited to share the results of that work in this blog as we move along. The name of our internal chat room for the team making this change isn’t New CAPTCHA, it’s (No)CAPTCHA.
Cloudflare's connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.
Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.
To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions.