robots.txt – Techdirt

Crawlers And Agents And Bots, Oh My: Time To Clarify Robots.txt

from the perplexing dept

Perplexity is an up-and-coming AI company with broad ambitions to compete with Google in the search market by providing answers to user queries, with AI as its core technology.

They’ve been in the news because their news feature repurposed content from an investigative article published on the Forbes website, which severely annoyed the Forbes editorial staff and the broader media community (never a good idea) and led to accusations of willful copyright infringement from Forbes’ legal team. Now Wired is reporting that Perplexity’s web hosting provider (AWS) is investigating their practices, focused on whether they respect robots.txt, the standard governing the behavior of web crawlers. (Or is it all robots? More on that later.)

We don’t know everything about how Perplexity actually works under the hood, and I have no relationship to the company or special knowledge. The facts are still somewhat murky, and as with any dispute over the ethics or legality of digital copying, the technical details will matter. I worked on copyright policy for years at Google, and have seen this pattern play out enough times to not pass judgment too quickly.

Based on what we know today from press reports, it seems plausible to me that the fundamental issue at root here, the thing driving Perplexity to dig in its heels and the thing much of the reporting cites as Perplexity’s fundamental ethical failing, is what counts as a “crawler” for the purposes of robots.txt.

This is an ambiguity that will likely need to be addressed in years to come regardless of Perplexity’s practices, so it seems worth unpacking a little bit. (In fact similar questions are floating around Quora’s chatbot Poe.)

Why do I think this is the core issue? This snippet from today’s Wired article was instructive (Platnick is a Perplexity spokesperson):

“When a user prompts with a specific URL, that doesn’t trigger crawling behavior,” Platnick says. “The agent acts on the user’s behalf to retrieve the URL. It works the same way as if the user went to a page themselves, copied the text of the article, and then pasted it into the system.”

This description of Perplexity’s functionality confirms WIRED’s findings that its chatbot is ignoring robots.txt in certain instances.

The phrase “ignoring robots.txt in certain instances” sounds bad. There is, of course, the ethical conversation about what Perplexity is doing with news content, which is likely to be an ongoing and vigorous debate. The claim is that Perplexity is ignoring the wishes of news publishers, as expressed in robots.txt.

But we tend to codify norms and ethics into rules, and a reasonable question is: What does the robots.txt standard have to say? When is a technical system expected to comply with it, or ignore it? Could this be rooted in different interpretations of the standard?

First, a very quick history of robots.txt: In the early 90s, it was a lot more expensive to run a web server. Web servers also tended to be very prone to breaking under high loads. As companies began to crawl the web to build things like search engines (which requires accessing a lot of the website), stuff started to break, and the blessed nerds who kept the web working came up with an informal standard in the mid 90s that allowed webmasters to put up road signs directing crawlers away from certain areas. Most crawlers respected this relatively informal arrangement, and still do.
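Those road signs are just a short text file served at the root of the site. A minimal, hypothetical example (the paths and the bot name are invented for illustration):

    User-agent: *          # rules for any compliant crawler
    Disallow: /search      # stay out of expensive, dynamically generated pages
    Disallow: /archive/    # and out of this very large directory

    User-agent: ExampleBot # a stricter rule aimed at one specific crawler
    Disallow: /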

Thus, “crawler” has for decades been understood to refer to a system that accesses URLs in bulk, picking which URLs to access next based on a predetermined method written in code (presumably why it’s described as “crawling”). And the motivating issue was mainly a coordination problem: how to enable useful services like search engines, which are good for everyone including web publishers, without breaking things.

It took nearly three decades, but robots.txt was eventually codified and adopted as the Robots Exclusion Protocol, or RFC 9309, by the Internet Engineering Task Force (IETF), part of the aforementioned blessed nerd community who maintain the technical standards of the internet.

RFC 9309 does not define “crawler” or “robot” in the way a lawyer might expect a contract or statute to define a term. It says simply that “crawlers are automatic clients” with the rest left up to context clues. Most of those context clues refer to issues posed by bulk access of URIs:

It may be inconvenient for service owners if crawlers visit the entirety of their URI space. This document specifies the rules […] that crawlers are requested to honor when accessing URIs.
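For a sense of what honoring those rules looks like in code, here is a small sketch using Python’s standard-library robots.txt parser; the site name and user-agent token are made up for illustration:

    from urllib import robotparser

    # Fetch and parse the site's robots.txt (hypothetical site).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A compliant crawler checks each URL against the rules for its
    # user-agent token before fetching in bulk.
    for url in ["https://example.com/", "https://example.com/archive/page1"]:
        if rp.can_fetch("ExampleBot", url):
            print("allowed to crawl:", url)
        else:
            print("robots.txt asks us to skip:", url)

Nothing technically stops a client from skipping this check; compliance is voluntary, which is why the question of who is supposed to be doing the checking matters so much.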

Every year the web’s social footprint expands, and the pressures put on robots.txt grow with it. It’s being asked to solve a broader set of challenges, beyond protecting webmasters from the technical inconveniences of bulk access. It now increasingly arbitrates massive economic interests, as well as the social and ethical questions AI has inspired in recent years. Google, whose staff are the listed authors of RFC 9309, has already started thinking about what’s next.

And the technology landscape is shifting. Automated systems are accessing web content with a broader set of underlying intentions. We’re seeing the emergence of AI agents that actually do things on behalf of users and at their direction, intermediated by AI companies using large language models. As OpenAI says, AI agents may “substantially expand the helpful uses of AI systems, and introduce a range of new technical and social challenges.”

Automatic clients will continue to access web content. The user-agent might even reasonably have “Bot” in the name. But is it a crawler? It won’t be accessing content for the same purpose as a search engine crawler, nor at the scale and depth required for search. The ethical, economic, technical, and legal landscape for automatic AI agents will look completely different than it does for crawlers.
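The distinction can be sketched in a few lines of Python. Neither function below is anyone’s actual implementation; they are simplified illustrations of the two behaviors being argued about. The first fetches one page because a user asked for that exact URL; the second walks a frontier of links on its own schedule, which is the bulk behavior robots.txt was written to manage (a polite version would consult robots.txt before each fetch, as in the earlier sketch):

    import urllib.request
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    def fetch_for_user(url):
        # Agent-style access: one URL, chosen by a human, fetched once on their behalf.
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")

    class LinkExtractor(HTMLParser):
        # Minimal link extraction so the crawl loop below is self-contained.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl(seed_url, limit=50):
        # Crawler-style access: the program itself decides which URLs to
        # visit next and keeps going in bulk until it hits a limit.
        frontier, seen = [seed_url], set()
        while frontier and len(seen) < limit:
            url = frontier.pop(0)
            if url in seen or not url.startswith("http"):
                continue
            seen.add(url)
            extractor = LinkExtractor()
            extractor.feed(fetch_for_user(url))
            frontier += [urljoin(url, link) for link in extractor.links]
        return seen

Both end up issuing automated HTTP requests, which is exactly why the line between them is getting harder to draw.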

It may very well be sensible to expand RFC 9309 to apply to things like AI agents directed by users, or any method of automated access of web content where the user-agent isn’t directly a user’s browser. We would then have to think through the cascading implications of applying the robots.txt standard and its requirements to those systems. Or maybe we need a new set of norms and rules to govern that activity separate from RFC 9309.

Either way, disputes like this are an opportunity to consider improving and updating the rules and standards that guide actors on the web. To the extent this disagreement really is about the interpretation of “crawler” in RFC 9309, i.e. what counts as a robot or crawler and therefore what must respect listed disallows in the robots.txt file, that seems like a reasonable place to start thinking about solutions.

Alex Kozak is a tech policy consultant with Proteus Strategies, formerly gov’t affairs and regulatory strategy at Google X, global copyright policy lead at Google, and open licensing advocate at Creative Commons.

Filed Under: agents, ai, bots, crawling, generative ai, robots.txt
Companies: perplexity

Hey Doordash: Why Are You Hiding Your 'Security Notice' From Google Just Days After You Revealed A Massive Security Breach?

from the questions,-questions dept

As you might have heard, late last week, delivery company DoorDash admitted via a Medium post that there had been a large data breach exposing info on 4.9 million users of the service. The breach had actually happened months earlier, but was only just discovered earlier this month.

We take the security of our community very seriously. Earlier this month, we became aware of unusual activity involving a third-party service provider. We immediately launched an investigation and outside security experts were engaged to assess what occurred. We were subsequently able to determine that an unauthorized third party accessed some DoorDash user data on May 4, 2019. We took immediate steps to block further access by the unauthorized third party and to enhance security across our platform. We are reaching out directly to affected users.

The information accessed included names, emails, delivery addresses, order histories and phone numbers. Salted and hashed passwords were accessible too, but assuming Doordash didn’t mess up the salting/hashing, those should still be safe. Some customers also had the last four digits of their credit cards revealed.

All in all, a somewhat typical breach of the kind that happens these days. However, as TechCrunch cybersecurity reporter Zack Whittaker noticed, right around the time the breach announcement went up, DoorDash told Google to stop indexing its “SecurityNotices” page via robots.txt.

You know what's really weird? @DoorDash has no mention of its massive data breach on its homepage. There's nothing on its Twitter or Facebook page, either.

What's also weird is DoorDash's robots file hides "/securitynotice" from Google, so people can't even search for it. pic.twitter.com/m81Geafxnz

— Zack Whittaker (@zackwhittaker) September 27, 2019

He also notes that DoorDash doesn’t seem to be going out of its way to alert people to the breach — pointing out that there’s nothing on DoorDash’s front page, or on its various social media accounts. Just the blog post on Medium (and, if I’m not mistaken, Medium posts can end up behind a paywall in lots of cases). That’s pretty lame. My guess is that since DoorDash says it’s “contacting” customers impacted by the breach, it felt it didn’t need to do wider outreach. But… that seems like a huge cop out. Notifying people of such a breach is kind of important.

And, also, yanking your “securitynotices” directory from Google (even if it currently appears blank) seems super suspicious. Why do that except to hide information from people searching for info about your security issues? A breach of this nature is bad, but it happens to so many companies these days that I don’t think this kind of breach leads to much trust lost from customers. However, proactively trying to keep things quiet about this… well… that’s the kind of thing that raises eyebrows and destroys trust.
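For reference, hiding a single page this way takes only a couple of lines in robots.txt, something along these lines (an illustration of the technique, not DoorDash’s actual file):

    User-agent: *
    Disallow: /securitynotice

One wrinkle worth noting: a robots.txt disallow only asks crawlers not to fetch the page. If the URL is already known and linked from elsewhere, Google can still list the bare address without a snippet, so the main effect is to keep the notice from surfacing for people searching about the breach going forward.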

Of course, in a bit of perfect timing to distract from all of this, DoorDash happily announced today that it’s now delivering for McDonald’s, so get your Big Macs quick and ignore any lingering concerns about security…

Filed Under: breach notification, robots.txt, security, security breaches
Companies: doordash

Chilling Effects On Chilling Effects As DMCA Archive Deletes Self From Google

from the transparency-never dept

Over the weekend, TorrentFreak noted that the website Chilling Effects had apparently removed itself from Google’s search index after too many people complained.

This week, however, we were no longer able to do so. The Chilling Effects team decided to remove its entire domain from all search engines, including its homepage and other informational and educational resources.

TorrentFreak asked the site about it and was told it was done in search of “better balance.”

“After much internal discussion the Chilling Effects project recently made the decision to remove the site’s notice pages from search engines,” Berkman Center project coordinator Adam Holland informs TF.

“Our recent relaunch of the site has brought it a lot more attention, and as a result, we’re currently thinking through ways to better balance making this information available for valuable study, research, and journalism, while still addressing the concerns of people whose information appears in the database.”

[…]

“As a project, we’ve always worked to strike that balance, for example by removing personally identifying information. Removing notice pages from search engine results is the latest step in that balancing process,” Holland tells us.

“It may or may not prove to be permanent, but for now it’s the step that makes the most sense as we continue to think things through,” he adds.

Meanwhile, Chilling Effects founder Wendy Seltzer seems to insist that this was an implementation mistake and that the team never meant to remove the whole domain.

So it’s a little unclear what happened here. You’d think the folks at the Berkman Center and associated with Chilling Effects know how to properly set up a robots.txt file if they want to just exclude certain pages.
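For context, the two outcomes are only a few lines apart in a robots.txt file. Excluding just the notice pages might look something like this (the path is invented for illustration; Chilling Effects’ real site layout may differ):

    User-agent: *
    Disallow: /notices/

while taking the entire domain out of search engines, homepage and all, is the far blunter:

    User-agent: *
    Disallow: /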

Either way it seems like a massive blow for transparency, and in many ways is a “chilling effect” of its own. It’s no secret that many legacy copyright system supporters absolutely hate Chilling Effects and the transparency it brings. Sandra Aistars, of the Copyright Alliance, referred to the site as “repugnant” in Congressional testimony just a few months ago. Yes, providing transparency on censorship is “repugnant.” Says a lot about the Copyright Alliance, doesn’t it?

Others have made similar statements in the past. A few years ago, a lawyer tried to block Google from forwarding DMCA takedown notices to Chilling Effects, arguing that passing along those notices makes Google “potentially liable for the infringement.” Others have argued that the takedown notices themselves are subject to copyright and have tried to block them from appearing on Chilling Effects.

The concern, they claim, is twofold: First, the details in the takedown notice often demonstrate where infringing content actually is. That’s especially true for notices to Google or Twitter (two of the bigger suppliers of notices to Chilling Effects) who are not hosting the content, but are merely linking to it (i.e. they are “information location tools.”) In those cases, the links may get removed from the services in question, but remain on the internet itself. The second concern, as put forth by Aistars, is that people issuing DMCA takedown notices are sensitive little flowers, and publishing the fact that they’re trying to take down content opens them up to harassment and abuse.

Neither of these arguments survives much scrutiny. The idea that anyone is trawling through Chilling Effects seeking unauthorized content is fairly unlikely. And, really, if people are, those aren’t exactly the kind of people who are then going to turn around and start willfully forking over cash to the legacy entertainment industry for that same content. The Chilling Effects haters, no doubt, would argue that this is why it’s important to remove Chilling Effects itself from Google, because people searching on Google might not find the originals, but would then find the takedown notices with links back to the originals. Except, that seems unlikely. First, as has been detailed many times, people looking for unauthorized copies of works tend not to use Google that much, since it’s not very good for that purpose, and other tools tend to be much more effective. Second, the kinds of information in a takedown notice itself aren’t likely to trigger a high result for someone looking for an unauthorized download. Terms like “free” and “download” are unlikely to be found on such documents.

The other argument — that being exposed for sending takedowns leads to harassment — also seems bogus. We’ve seen little indication that people get that upset about legitimate takedowns. It’s the excessive, abusive and censorious takedowns that really seem to concern people. And those are the ones that need transparency the most.

Hopefully, the folks at Chilling Effects rethink this decision and stick by their own stated philosophy of working “to provide as much transparency as possible” about DMCA takedown notices. It would seem that blocking a key search tool from accessing the data goes directly against that principle.

Filed Under: chilling effects, copyright, copyright infringement, dmca, indexing, robots.txt, search, takedowns
Companies: chilling effects, google

from the donkeys-arguing-against-the-wheel dept

There’s been something of a battle going on in the UK over news aggregators. Obviously, we’ve all heard about the various threats by companies like News Corp. in the US to sue Google over its Google News product, but a lot of this has already been playing out on a smaller scale in the UK. Last year we wrote about newspapers in the UK threatening aggregators like NewsNow, leading some to start blocking NewsNow crawlers. This is silly in the extreme. These aggregators offer links to the news. The “issue” with NewsNow is that it sells this as a service to companies — and the newspapers claim they deserve a cut. Note that NewsNow provides just a link and a headline and the tiniest of blurbs. It’s much less than even Google News provides. The newspapers seem to think that no one can profit from advertising their own stories unless they get a direct cut.

In fact, last year the NLA (Newspaper Licensing Association) in the UK decided to start charging all such services just for linking. This is, of course, ridiculous. One of the largest services of this type is called Meltwater News, and it decided to protest this ridiculous license on linking. It was joined in this effort by the Public Relations Consultants Association (PRCA), who noted that there is no copyright on headlines and links — and the NLA’s license amounted to an illegal tax. The NLA responded by saying that Meltwater and PRCA had no right to protest these licenses.

Earlier this week, however, the Copyright Tribunal in the UK ruled in favor of the PRCA and Meltwater in protesting these new licenses, and it ordered the NLA to pay the costs of both organizations. Now there will be a full trial concerning the legality of the licenses.

What’s interesting, however, is that hours after this decision came out, the Times Online in the UK just so happened to update its robots.txt file to block Meltwater (along with NewsNow, who had already been blocked). Basically, it was a quiet threat: if you don’t pay, we’ll block you.
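For what it’s worth, that kind of targeted block is a trivial edit. A robots.txt entry aimed at particular aggregators’ crawlers looks roughly like this (the user-agent tokens here are illustrative guesses, not the Times’ actual entries):

    User-agent: NewsNow
    Disallow: /

    User-agent: Meltwater
    Disallow: /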

The newspapers are walking a very thin line here. They’re trying to charge for the most basic element of the web: linking and sharing links with others. I would imagine that if they actually win this fight, they’re going to end up regretting it even more — because if they start linking to other sites themselves, how long will it take before those linked sites start demanding money from the newspapers as well? It’s an incredibly short-sighted view for a newspaper to take, thinking that others must pay you to promote you.

Filed Under: aggregators, copyright, links, newspapers, robots.txt, uk
Companies: meltwater news, newsnow, nla, prca

Google To Newspapers: Here, Let Me Introduce You To Robots.txt

from the snappy dept

With the silly introduction last week of the AP’s attempt to create a weird and totally unnecessary new data feed to keep out aggregators and search engines, it seems that Google has gotten fed up. Google execs and employees have made similar statements on various panels and in discussions, but Senior Business Product Manager Josh Cohen put up a blog post directed at newspapers that can be summarized as: Dear newspapers: let me introduce you to a tool that’s been around forever. It’s called robots.txt. If you don’t like us indexing you, use it. Otherwise, shut up. In only slightly nicer language.
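The tool really is that simple. A paper that doesn’t want Google crawling its pages can serve, at the root of its site, something like this sketch:

    User-agent: Googlebot
    Disallow: /

Strictly speaking that asks Google’s crawler not to fetch the pages rather than removing them from the index, but paired with a standard noindex directive it accomplishes exactly what the complaining publishers say they want, no new data feed required.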

Filed Under: newspapers, robots.txt
Companies: google

Search Engines Should Ignore Bossy Publishers

from the disallow dept

James Grimmelmann has an in-depth look at ACAP, the new "standard" for website access control that we discussed last Friday. I put "standard" in scare quotes because, as Grimmelmann points out, the specs clearly weren't written by people with any experience in writing technical standards. While a well-written standard will very precisely specify which behaviors are required, which are prohibited, and under what circumstances, the ACAP spec is full of vague directives and confusing terminology. Some parts of the standard are apparently designed to “only be interpreted by prior arrangement.” Also, despite the "1.0" branding, the latest version of the specification has several sections that are labeled "not yet fully ready for implementation." It is, in short, a big mess.

Of course, this shouldn't surprise us, because it's not really a technical standard at all. Robots.txt works just fine for almost everyone, and search engines aren't clamoring to replace it. Rather, some publishers are using the trappings of a technical standard to try to micromanage the uses to which search engines put their content, and they're laying the groundwork for lawsuits if search engines fail to heed the demands embedded in ACAP files. Not only are the rules vague and confused, but the "standard" also helpfully notes that the rules "may change or be withdrawn without notice." In other words, a search engine that committed to complying with ACAP directives would be setting itself up to have its functionality micro-managed by the publishers who control the ACAP specifications.

Luckily, as Mike pointed out on Friday, search engines have the upper hand here. So here's my suggestion for search engines: instead of trying to comply with every nitpicky detail of the ACAP standard, just announce that every line of an ACAP file will be interpreted as the equivalent of a "Disallow" line in a robots.txt file. Websites would discover pretty quickly that posting ACAP directives on their sites just caused their content to disappear from search engines. As much as they might bluster about other search engines "stealing" their content, the reality is that they can't afford to give up the traffic that search engines send their way. If search engines simply refused to include ACAP-restricted pages in their index, publishers would quickly realize that those old robots.txt files aren't so bad after all.
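As a thought experiment, the suggested policy fits in a few lines of Python. The sketch below assumes only that ACAP directives can be recognized by an "ACAP-" prefix, which may not match the spec's exact field names, and the parsing is deliberately crude:

    def effective_rules(lines):
        # Suggested policy, crudely: keep ordinary robots.txt lines as-is,
        # but treat the presence of any ACAP directive as a blanket
        # "Disallow: /", so ACAP-restricted sites drop out of the index.
        rules, saw_acap = [], False
        for line in lines:
            if line.strip().lower().startswith("acap-"):
                saw_acap = True
            else:
                rules.append(line)
        if saw_acap:
            rules += ["User-agent: *", "Disallow: /"]
        return rules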

Filed Under: publishers, robots.txt, search engines
Companies: associated press, google, microsoft, yahoo

News Publishers Want To Change Robots.txt; Want To Make Sure Their Content Is Less Useful

from the deep-misunderstandings dept

Following on the speech given earlier this month by the head of the Associated Press, where it was made clear that the AP and news organizations still think that they can be gatekeepers of news, a bunch of publishers along with the AP are now trying to revise robots.txt so that they can hide content on a more selective level. Now, it is true that robots.txt can be rather broad in its sweep. But it’s rather telling that it’s the publishers who banded together and are telling search engines what changes are needed, rather than working with the search engines to come up with a reasonable solution. In the meantime, there really are some simple solutions if you don’t want content indexed by search engines — but we’ve yet to fully understand why publishers are so upset that Google, Yahoo and others are sending them so much traffic in the first place.
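To spell out those simple solutions: a robots.txt rule already lets a publisher keep compliant crawlers away from whole sections of a site (the path below is a made-up example):

    User-agent: *
    Disallow: /premium/

and a standard per-page meta tag keeps an individual article out of search results entirely:

    <meta name="robots" content="noindex">

Neither requires rewriting the standard; both have been honored by the major search engines for years.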

Filed Under: publishers, robots.txt
Companies: associated press