scraping – Techdirt

AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk

from the externalizing-your-costs-directly-into-my-face dept

The current rapid advances in generative AI are built on three things: computing power, some clever coding, and vast amounts of training data. Lots of money can buy you more of the first two, but finding the necessary training material is increasingly hard. Anyone seeking to bolster their competitive advantage through training needs to find fresh sources. This has led to the widespread deployment of AI crawlers, which scour the Internet for more data that can be downloaded and used to train AI systems. Some of the prime targets for these AI scraping bots are Wikimedia projects, which claim to be “the largest collection of open knowledge in the world”. This has now become a serious problem for them:

We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

Specifically:

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

AI crawlers seek to download as much material as possible, including the most obscure, so Wikimedia projects that are optimized for human use incur extra costs:

While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.
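To make that concrete, here is a toy simulation (my own illustration with made-up numbers, not Wikimedia’s figures) of why long-tail crawler traffic is disproportionately expensive for a cache-fronted site: human requests cluster on a small set of popular pages that stay cached, while a bulk crawl touches every page once and keeps going back to the origin servers.

```python
import random
from functools import lru_cache

origin_fetches = {"count": 0}

@lru_cache(maxsize=1000)            # stands in for the CDN / edge cache
def fetch_page(page_id):
    origin_fetches["count"] += 1    # every cache miss hits the core datacenter
    return f"content of page {page_id}"

# Human-like traffic: 100,000 requests concentrated on ~500 popular pages.
for _ in range(100_000):
    fetch_page(random.randint(0, 499))
human_misses = origin_fetches["count"]                     # roughly 500 origin fetches

# Crawler-like traffic: a single pass over 100,000 distinct pages.
for page_id in range(100_000):
    fetch_page(page_id)
crawler_misses = origin_fetches["count"] - human_misses    # roughly 99,500 origin fetches

print(human_misses, crawler_misses)
```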

Wikimedia’s analysis shows that 65% of this resource-consuming traffic is coming from bots, whereas the overall pageviews from bots are about 35% of the total. As the Diff news story notes, this is becoming a widespread problem not just for Wikimedia, but across the Internet. Some companies are responding with lawsuits, but for another important class of sites this is not a practical option.

These are the open source projects that have a Web presence with a wide range of resources. Many of them are struggling under the impact of aggressive AI crawlers, as a post by Niccolò Venerandi on the LibreNews site details. For example, Drew DeVault, the founder of the open source development platform SourceHut, wrote a blog post last month with the title “Please stop externalizing your costs directly into my face”, in which he lamented:

These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

DeVault says that he knows many other Web sites are similarly affected:

All of my sysadmin friends are dealing with the same problems. I was asking one of them for feedback on a draft of this article and our discussion was interrupted to go deal with a new wave of LLM bots on their own server. Every time I sit down for beers or dinner or to socialize with my sysadmin friends it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all. The desperation in these conversations is palpable.

The LibreNews article discusses some of the technical approaches to excluding these AI crawlers. But setting them up, monitoring and fine-tuning them requires time and energy from those running the sites, time that could have been spent more fruitfully on managing the actual projects. Similarly, the unexpected extra bandwidth costs caused by massive bot downloads come out of the small and often stretched budgets of open source projects. There is a clear danger that these LLM bots will cause open source projects to struggle, and possibly shut down completely.
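As a rough idea of what those technical approaches look like in practice (my own minimal sketch, not taken from the LibreNews article), a site operator might start by rejecting requests whose User-Agent matches a list of known AI crawler strings. The crawler names and the plain-WSGI framing here are illustrative assumptions, and, as DeVault describes above, the more aggressive bots defeat exactly this kind of check by randomizing their User-Agents and spreading requests across residential IPs.

```python
# Minimal sketch of User-Agent filtering for AI crawlers (illustrative only).
# Real deployments maintain longer, frequently updated lists and combine this
# with rate limiting, robots.txt rules, and CAPTCHA or proof-of-work challenges.

BLOCKED_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")  # assumed list

def block_ai_crawlers(app):
    """Wrap a WSGI app and return 403 for requests from listed crawlers."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(name in user_agent for name in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated crawling is not permitted.\n"]
        return app(environ, start_response)
    return middleware

# Usage with any WSGI application:
#   application = block_ai_crawlers(application)
```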

An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out:

Crawlers from academic researchers, journalists, and non-AI applications may increasingly be denied open access. Unless we can nurture an ecosystem with different rules for different data uses, we may end up with strict borders across the web, exacting a price on openness and transparency.

It’s increasingly clear that the reckless and selfish way in which AI crawlers are being deployed by companies eager to tap into today’s AI hype is bringing many sites around the Internet to their knees. As a result, AI crawlers are beginning to threaten the open Web itself, and thus the frictionless access to knowledge that it has provided to general users for the last 30 years.

Follow me @glynmoody on Mastodon and on Bluesky.

Filed Under: access to knowledge, ai, apis, bandwidth, bots, datacenter, drew devault, licensing, llms, open source, open web, publishers, scraping, sysadmins, training data, web crawlers, wikimedia
Companies: sourcehut

from the LLM-grooming dept

It’s no secret that Russia has taken advantage of the Internet’s global reach and low distribution costs to flood the online world with huge quantities of propaganda (as have other nations): Techdirt has been writing about Putin’s troll army for a decade now. Russian organizations like the Internet Research Agency have been paying large numbers of people to write blog posts, social media posts, and comments on Web sites, create YouTube videos, and edit Wikipedia entries, all pushing the Kremlin line, or undermining Russia’s adversaries through hoaxes, smears and outright lies. But technology moves on, and propaganda networks evolve too. The American Sunlight Project (ASP) has been studying one of them in particular: Pravda (Russian for “truth”), a network of sites that aggregate pro-Russian material produced elsewhere. Recently, ASP has noted some significant changes (pdf) there:

Over the past several months, ASP researchers have investigated 108 new domains and subdomains belonging to the Pravda network, a previously-established ecosystem of largely identical, automated web pages that previously targeted many countries in Europe as well as Africa and Asia with pro-Russia narratives about the war in Ukraine. ASP’s research, in combination with that of other organizations, brings the total number of associated domains and subdomains to 182. The network’s older targets largely consisted of states belonging to or aligned with the West.

According to ASP:

The top objective of the network appears to be duplicating as much pro-Russia content as widely as possible. With one click, a single article could be autotranslated and autoshared with dozens of other sites that appear to target hundreds of millions of people worldwide.

The quantity of material and the rate of posting on the Pravda network of sites is notable. ASP estimates the overall publishing rate of the network is around 20,000 articles per 48 hours, or more than 3.6 million articles per year. You would expect a propaganda network to take advantage of automation to boost its raw numbers. But ASP has noticed something odd about these new Web pages: “The network is unfriendly to human users; sites within the network boast no search function, poor formatting, and unreliable scrolling, among other usability issues.”

There are obvious benefits from flooding the Internet with pro-Russia material, and creating an illusory truth effect through the apparent existence of corroborating sources across multiple sites. But ASP suggests there may be another reason for the latest iteration of the Pravda propaganda network:

Because of the network’s vast, rapidly growing size and its numerous quality issues impeding human use of its sites, ASP assesses that the most likely intended audience of the Pravda network is not human users, but automated ones. The network and the information operations model it is built on emphasizes the mass production and duplication of preferred narratives across numerous platforms (e.g. sites, social media accounts) on the internet, likely to attract entities such as search engine web crawlers and scraping algorithms used to build LLMs [large language models] and other datasets. The malign addition of vast quantities of pro-Russia propaganda into LLMs, for example, could deeply impact the architecture of the post-AI internet. ASP is calling this technique LLM grooming.

The rapid adoption of chatbots and other AI systems by governments, businesses and individuals offers a new way to spread propaganda, one that is far more subtle than current approaches. When there are large numbers of sources supporting pro-Russian narratives online, LLM crawlers scouring the Internet for training material are more likely to incorporate those viewpoints uncritically in the machine learning datasets they build. This will embed Russian propaganda deep within the LLM that emerges from that training, but in a way that is hard to detect, not least because there is little transparency from AI companies about where they gather their datasets.

The only way to spot LLM grooming is to look for signs of targeted disinformation in chatbot output. Just such an analysis has been carried out recently by NewsGuard, an organization researching disinformation, which Techdirt wrote about last year. NewsGuard tested 10 leading chatbots with a sampling of 15 false narratives that were spread by the Pravda network. It explored how various propaganda points were dealt with by the different chatbots, although “results for the individual AI models are not publicly disclosed because of the systemic nature of the problem”:

The NewsGuard audit found that the chatbots operated by the 10 largest AI companies collectively repeated the false Russian disinformation narratives 33.55 percent of the time, provided a non-response 18.22 percent of the time, and a debunk 48.22 percent of the time.

NewsGuard points out that removing the tainted sources from LLM training datasets is no trivial matter:

The laundering of disinformation makes it impossible for AI companies to simply filter out sources labeled “Pravda.” The Pravda network is continuously adding new domains, making it a whack-a-mole game for AI developers. Even if models were programmed to block all existing Pravda sites today, new ones could emerge the following day.

Moreover, filtering out Pravda domains wouldn’t address the underlying disinformation. As mentioned above, Pravda does not generate original content but republishes falsehoods from Russian state media, pro-Kremlin influencers, and other disinformation hubs. Even if chatbots were to block Pravda sites, they would still be vulnerable to ingesting the same false narratives from the original source.

The corruption of LLM training sets, and the resulting further loss of trust in online information, is a problem for all Internet users, but particularly for those in the US, as ASP points out:

Ongoing governmental upheaval in the United States makes it and the broader world more vulnerable to disinformation and malign foreign influence. The Trump administration is currently in the process of dismantling numerous U.S. government programs that sought to limit kleptocracy and disinformation worldwide. Any current or future foreign information operations, including the Pravda network, will undoubtedly benefit from this.

This “malign foreign influence” probably won’t be coming from Russia alone. Other nations, companies or even wealthy individuals could adopt the same techniques to push their own false narratives, taking advantage of the rapidly falling costs of AI automation. However bad you think disinformation is now, expect it to get worse in the future.

Follow me @glynmoody on Bluesky and on Mastodon.

Filed Under: ai, american sunlight project, automation, disinformation, influencers, internet research agency, kleptocracy, llm grooming, llms, machine learning, newsguard, propaganda, russia, scraping, social media, training, troll army, web crawlers, wikipedia, youtube

Air Canada Would Rather Sue A Website That Helps People Book More Flights Than Hire Competent Web Engineers

from the time-to-cross-air-canada-off-the-flight-list dept

I am so frequently confused by companies that sue other companies for making their own sites and services more useful. It happens quite often. And quite often, the lawsuits are questionable CFAA claims against websites that scrape data to provide a better consumer experience, but one that still ultimately benefits the originating site.

Over the last few years various airlines have really been leading the way on this, with Southwest being particularly aggressive in suing companies that help people find Southwest flights to purchase. Unfortunately, many of these lawsuits are succeeding, to the point that a court has literally said that a travel company can’t tell others how much Southwest flights cost.

But the latest lawsuit of this nature doesn’t involve Southwest, and is quite possibly the dumbest one. Air Canada has sued the site Seats.aero, which helps users figure out the best flights for their frequent flyer miles. Seats.aero is a small operation run by the company with the best name ever: Localhost, meaning that the lawsuit is technically “Air Canada v. Localhost”, which sounds almost as dumb as this lawsuit is.

The Air Canada Group brings this action because Mr. Ian Carroll—through Defendant Localhost LLC—created a for-profit website and computer application (or “app”)— both called Seats.aero—that use substantial amounts of data unlawfully scraped from the Air Canada Group’s website and computer systems. In direct violation of the Air Canada Group’s web terms and conditions, Carroll uses automated digital robots (or “bots”) to continuously search for and harvest data from the Air Canada Group’s website and database. His intrusions are frequent and rapacious, causing multiple levels of harm, e.g., placing an immense strain on the Air Canada Group’s computer infrastructure, impairing the integrity and availability of the Air Canada Group’s data, soiling the customer experience with the Air Canada Group, interfering with the Air Canada Group’s business relations with its partners and customers, and diverting the Air Canada Group’s resources to repair the damage. Making matters worse, Carroll uses the Air Canada Group’s federally registered trademarks and logo to mislead people into believing that his site, app, and activities are connected with and/or approved by the real Air Canada Group and lending an air of legitimacy to his site and app. The Air Canada Group has tried to stop Carroll’s activities via a number of technological blocking measures. But each time, he employs subterfuge to fraudulently access and take the data—all the while boasting about his exploits and circumvention online.

Almost nothing in this makes any sense. Having third parties scrape sites for data about prices is… how the internet works. Whining about it is stupid beyond belief. And here, it’s doubly stupid, because anyone who finds a flight via seats.aero is then sent to Air Canada’s own website to book that flight. Air Canada is making money because Carroll’s company is helping people find Air Canada flights they can take.

Why are they mad?

Air Canada’s lawyers also seem technically incompetent. I mean, what the fuck is this?

Through screen scraping, Carroll extracts all of the data displayed on the website, including the text and images.

Carroll also employs the more intrusive API scraping to further feed Defendant’s website.

If the “API scraping” is “more intrusive” than screen scraping, you’re doing your APIs wrong. Is Air Canada saying that its tech team is so incompetent that its API puts more load on the site than scraping? Because, if so, Air Canada should fire its tech team. The whole point of an API is to make it easier for others to access data from your website without needing to go through the more cumbersome process of scraping.
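To see why calling API scraping “more intrusive” is backwards, here is a rough sketch of the two approaches; the endpoint, domain, and field names are hypothetical stand-ins, not Air Canada’s actual API. The API call returns a small structured JSON payload the server can generate and cache cheaply, while screen scraping forces the server to render a full browser-facing page that the scraper then picks apart.

```python
import requests                     # HTTP client
from bs4 import BeautifulSoup       # HTML parser, needed only for scraping

# API path: one small request, structured JSON back.
# URL and field names are made up for illustration.
def award_availability_via_api(origin, dest, date):
    resp = requests.get(
        "https://api.example-airline.com/v1/award-availability",
        params={"origin": origin, "dest": dest, "date": date},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["flights"]

# Screen-scraping path: fetch the whole HTML page meant for browsers,
# then dig the same numbers out of the markup.
def award_availability_via_scraping(origin, dest, date):
    resp = requests.get(
        "https://www.example-airline.com/award-search",
        params={"origin": origin, "dest": dest, "date": date},
        timeout=10,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [row.get_text(strip=True) for row in soup.select(".award-row")]
```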

And, yes, this lawsuit really calls into question Air Canada’s tech team and their ability to run a modern website. If your website can’t handle having its flights and prices scraped a few times every day, then you shouldn’t have a website. Get some modern technology, Air Canada:

Defendant’s avaricious data scraping generates frequent and myriad requests to the Air Canada Group’s database—far in excess of what the Air Canada Group’s infrastructure was designed to handle. Its scraping collects a large volume of data, including flight data within a wide date range and across extensive flight origins and destinations—multiple times per day.

Maybe… invest in better infrastructure like basically every other website that can handle some basic scraping? Or, set up your API so it doesn’t fall over when used for normal API things? Because this is embarrassing:

At times, Defendant’s voluminous requests have placed such immense burdens on the Air Canada Group’s infrastructure that it has caused “brownouts.” During a brownout, a website is unresponsive for a period of time because the capacity of requests exceeds the capacity the website was designed to accommodate. During brownouts caused by Defendant’s data scraping, legitimate customers are unable to use [the Air Canada website] or the Air Canada + Aeroplan mobile app, including to search for available rewards, redeem Aeroplan points for the rewards, search for and view reward travel availability, book reward flights, contact Aeroplan customer support, and/or obtain service through the Aeroplan contact center due to the high volume of calls during brownouts.
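And the standard engineering answer to exactly this failure mode is per-client rate limiting: return a 429 when a single client gets too aggressive, instead of letting the whole site brown out. A minimal token-bucket sketch (my own illustration, nothing to do with Air Canada’s actual systems) looks something like this:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow roughly `rate` requests/second per client, with bursts up to `capacity`."""
    def __init__(self, rate=5.0, capacity=20.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)
        self.last_seen = {}

    def allow(self, client_id):
        now = time.monotonic()
        elapsed = now - self.last_seen.get(client_id, now)
        self.last_seen[client_id] = now
        # Refill tokens for the time that has passed, capped at capacity.
        self.tokens[client_id] = min(self.capacity,
                                     self.tokens[client_id] + elapsed * self.rate)
        if self.tokens[client_id] >= 1.0:
            self.tokens[client_id] -= 1.0
            return True     # serve the request
        return False        # reject with "429 Too Many Requests" instead of browning out

# Hypothetical wiring inside an API handler:
#   limiter = TokenBucket()
#   if not limiter.allow(client_ip):
#       return "429 Too Many Requests"
```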

Air Canada’s lawyers also seem wholly unfamiliar with the concept of nominative fair use for trademarks. If you’re displaying someone’s trademarks for the sake of accurately talking about them, there’s no likelihood of confusion and no concern about the source of the information. Air Canada claiming that this is trademark infringement is ridiculous.

I guarantee that no one using Seats.aero thinks that they’re on Air Canada’s website.

The whole thing is so stupid that it makes me never want to fly Air Canada again. I don’t trust an airline that can’t set up its website/API to handle someone making its flights more attractive to buyers.

But, of course, in these crazy times with the way the CFAA has been interpreted, there’s a decent chance Air Canada could win.

For its part, Carroll says that he and his lawyers have reached out to Air Canada “repeatedly” to try to work with them on how they “retrieve availability information,” and that “Air Canada has ignored these offers.” He also notes that tons of other websites are scraping the very same information, and he has no idea why he’s been singled out. He further notes that he’s always been open to adjusting the frequency of searches and working with the airlines to make sure that his activities don’t burden the website.

But, really, the whole thing is stupid. The only thing that Carroll’s website does is help people buy more flights. It points people to the Air Canada site to buy tickets. It makes people want to fly more on Air Canada.

Why would Air Canada want to stop that, other than that it can’t admit that its website operations should all be replaced by a more competent team?

Filed Under: api, cfaa, flights, frequent fliers, scraping, screen scraping, trademark
Companies: air canada, localhost, seats.aero

WOW Fans Trick ‘AI’ ‘News’ Scraper Into Covering Fake New Game Feature

from the yes-I-can-absolutely-do-that,-Dave dept

Tue, Jul 25th 2023 05:23am - Karl Bode

Large language model technology’s (aka “AI”) introduction into journalism has been a blistering mess. And not just because the technology is undercooked (which it is), but because the folks in charge of most major media outlets are incompetent cheapskates who simply see the tech as a way to cut corners, wage war on labor, and automate all of the clickbait attention economy’s very worst impulses.

The result of that continues to go about how you’d expect, with a ton of rushed computer-generated articles filled with dumb mistakes.

But last week there was a fun wrinkle when users over at the r/wow subreddit tricked an “AI” scraping the web for news into publishing an article on a new World of Warcraft feature that doesn’t exist. The fans created an entirely new game mode and lore called Glorbo, talked about it as if it was a real thing in the subreddit, and got a website called The Portal, owned by Zleague.gg, to treat it like a real thing:

The Portal, owned by Zleague.gg, ran an SEO item on Glorbo headlined “World of Warcraft (WoW) Players Excited for Glorbo’s Introduction”, quoting the main Reddit thread directly. Though it appears The Portal has since realised its mistake and removed the post, it can still be read in full on Archive.Today. The original post does not appear to denote that the story was automated. The author byline on the piece does not lead to a bio or social media links of any kind.

While this was a fun prank related to gaming news, the same kind of lazy, rushed implementation of “AI” is also occurring in the broader field of journalism. And while the tech may improve over time, the kind of greedy, incompetent leadership we’ve seen in media generally won’t.

There are plenty of ways these large language model tools could actually help journalists do a better, more efficient job. But we’re not injecting the technology into a healthy journalism and media environment. We’re injecting it into an already very broken clickbait bullshit generation machine, effectively supercharging all of its worst tendencies.

The goal for a lot of the VC types in media is to create a giant pointless ouroboros of clickbait gibberish and ad consumption that shits money. A giant wheel of pointless, often-manufactured engagement that is largely free of any pesky concerns about silly things like paying human beings a living wage, the quality of the end product, or the health of the broader industry.

Filed Under: ai, clickbait, gaming, journalism, labor, media, news, scraping, world of warcraft
Companies: reddit, zleague.gg

Elon Musk’s ‘War’ On Possibly Imaginary Scrapers Now A Lawsuit, Which Might Actually Work

from the killing-the-open-web dept

Elon Musk seems infatuated with bots and scrapers as the root of all his problems at Twitter. Given his propensity to fire engineers who tell him things he doesn’t want to hear, it’s not difficult to believe that engineers afraid to tell Musk the truth are conveniently blaming “scraping” for the variety of problems that Twitter has had since Musk’s YOLO leadership style knocked out some of the fundamental tools that kept the site reliable in the before times.

He tried to blame bots for spam (which he’s claimed repeatedly to have dealt with, but then gone back to blaming them for other things, because he hasn’t actually stopped automated spam). His attempts to “stop the bots” have resulted in a series of horrifically stupid decisions, including believing that his non-verification Twitter Blue system would solve it (it didn’t), believing that cutting off free API access would drive away the spam bots (it drove away the good bots), and then believing that rate limiting post views would somehow magically stop scraping bots (which might only be scraping because of his earlier dumb decision to kill off the API).

The latest, though, is that last week Twitter went to court to sue ‘John Doe’ scrapers in Texas. And while I’ve long argued that scraping should be entirely legal, court precedents may be on Twitter’s side here.

Scraping is part of how the internet works and has always worked. The war on scraping is problematic for all sorts of reasons, and is an attack on the formerly open web. Unfortunately, though, courts are repeatedly coming out against scraping.

So, while I’d argue that this, from the complaint, is utter nonsense, multiple courts seem to disagree and find the argument perfectly plausible:

Scraping is a form of unauthorized data collection that uses automation and other processes to harvest data from a website or a mobile application.

Scraping interferes with the legitimate operation of websites and mobile applications, including Twitter, by placing millions of requests that tax the capacity of servers and impair the experience of actual users.

This is not how any of this should work, and is basically just an attack on the open web. Yes, scraping bots can overwhelm a site, but it’s up to the site itself to block them, not the courts.

Twitter users have no control over how data-scraping companies repackage and sell their personal information.

This sounds scary, but again is nonsense. Scraping only has access to public information. If you post information publicly, then of course users don’t have control over that information any more. That’s how information works.

The complaint says that Twitter (I’m not fucking calling it ‘X Corp.’) has discovered IP addresses engaged in “flooding Twitter’s sign-up page with automated requests.” It continues:

The volume of these requests far exceeded what any single individual could send to a server in a given period and clearly indicated that these automated requests were aimed at scraping data from Twitter.

This also feels like a stretch. It seems like the more likely reason for flooding a sign up page is to create spam accounts. That’s also bad, of course, but it’s not clear how this automatically suggests scraping.

Of course, there have been a bunch of scraping cases in the past, and there are some somewhat mixed precedents here. There was the infamous Power.com case, which said it could be a CFAA (Computer Fraud and Abuse Act) violation to scrape content from behind a registration wall (even if the user gave permission). Last year, there was the April ruling in the 9th Circuit on LinkedIn/HiQ which notably said that scraping from a public website rather than a registration-walled website could not be a CFAA violation.

Indeed, much of the reporting on Twitter’s new lawsuit is pointing to that decision. But, unfortunately, that’s the wrong decision to look at. Months later, the same court ruled again in that case (in a ruling that got way less attention) that even if the scraping wasn’t a CFAA violation, it was still a violation of LinkedIn’s terms of service, and granted an injunction against the scrapers.

Given the framing in the complaint, Twitter seems to be arguing the same thing (rather than a CFAA violation, that this is a terms of service violation). On top of that, this case is filed in Texas state court, and at least in federal court in Texas, the 5th Circuit has found that scraping data can be considered “unjust enrichment.”

In other words, as silly as this is, and as important scraping is to the open web, it seems that courts are buying the logic of this kind of lawsuit, meaning that Twitter’s case is probably stronger than it should be.

Of course, Twitter still needs to figure out who is actually behind these apparent scraping IP addresses, and then show that they actually were scraping. And who knows if the company will be able to do that. In the meantime, though, this is yet another case, following in the unfortunate pattern of Facebook, LinkedIn, and even Craigslist, to spit on the open web they were built on.

Filed Under: lawsuits, scraping, terms of service, texas
Companies: twitter

Something Stupid This Way Comes: Twitter Threatens To Sue Meta Over Threads, Because Meta Hired Some Of The People Elon Fired

from the everything-is-stupid dept

Just fucking fight it out already.

The whole stupid “cage match” brawl thing was started when Meta execs made some (accurate) cracks about Elon’s management of Twitter, and Elon couldn’t handle it. But, now with the launch of Meta’s Threads, Elon feels the need to send a ridiculously laughable legal threat to Meta.

Elon’s legal lapdog, Alex Spiro, dashed off a threat letter so dumb that even his employer, Quinn Emanuel — who is famous among powerful law firms for having no shame at all — should feel shame.

Dear Mr. Zuckerberg:

I write on behalf of X Corp., as successor in interest to Twitter, Inc. (“Twitter”). Based on recent reports regarding your recently launched “Threads” app, Twitter has serious concerns that Meta Platforms (“Meta”) has engaged in systemic, willful, and unlawful misappropriation of Twitter’s trade secrets and other intellectual property.

Lol, wut? Threads is like a dozen other microblogging type services. There are no “trade secrets” one needs to misappropriate from Twitter. I mean, seriously, who in their right mind thinks that Meta with billions of users of Facebook, Instagram, and WhatsApp is learning anything from Twitter, beyond “don’t do the dumbshit things Elon keeps doing.”

Over the past year, Meta has hired dozens of former Twitter employees. Twitter knows that these employees previously worked at Twitter; that these employees had and continue to have access to Twitter’s trade secrets and other highly confidential information; that these employees owe ongoing obligations to Twitter; and that many of these employees have improperly retained Twitter documents and electronic devices. With that knowledge, Meta deliberately assigned these employees to develop, in a matter of months, Meta’s copycat “Threads” app with the specific intent that they use Twitter’s trade secrets and other intellectual property in order to accelerate the development of Meta’s competing app, in violation of both state and federal law as well as those employees’ ongoing obligations to Twitter.

Let’s break this one down, because holy shit, is it ever stupid. The reason that Meta was able to hire a bunch of former Twitter employees most likely had to do with the fact that Elon recklessly fired 85% of the existing staff, and did so willy-nilly, destroying tons of institutional knowledge and knowhow. And yet, Musk claimed he had to get rid of these employees because they were not hardcore, and were useless to Twitter. But now we’re being told they are somehow invaluable to Threads? That doesn’t even pass the most basic laugh test.

The claim that “these employees have improperly retained Twitter documents and electronic devices” is particularly ridiculous, given that I’ve spoken to many, many, many ex-Twitter employees who have spent months trying to return their laptops, without Twitter bothering to respond to them at all. To use that against those employees is ridiculous.

And, really, what fucking “trade secrets” or “intellectual property” do Spiro and Musk honestly think that any former employees took with them to Meta? How to competently run a microblogging service? This is all bluff and bluster from Elon, who knows he’s fucked up Twitter and is scared of any competition.

On top of that, assuming any of those employees are in California, state law for the last century and a half has prohibited arguments regarding non-competes or similar, because the state has a stated policy that people should be allowed to be employed. So, to the extent that Twitter thinks it can enforce some sort of quasi-non-compete agreement, that’s just not going to fly.

Update: Also, Meta has now said that none of the small team working on Threads is a former Twitter employee anyway, so the assumptions in the letter are entirely false.

The letter continues:

Twitter intends to strictly enforce its intellectual property rights, and demands that Meta take immediate steps to stop using any Twitter trade secrets or other highly confidential information. Twitter reserves all rights, including, but not limited to, the right to seek both civil remedies and injunctive relief without further notice to prevent any further retention, disclosure, or use of its intellectual property by Meta.

In short, even as we’re not paying many of our bills and are desperately short on cash, especially compared to Meta, which has a building full of litigators, we’re ready, able, and willing to file a completely bogus, vexatious lawsuit just to try to annoy you.

Then we get to the real fear: that Meta might make it easy to recreate your Twitter social graph on Threads:

Further, Meta is expressly prohibited from engaging in any crawling or scraping of Twitter’s followers or following data. As set forth in Twitter’s Terms of Service, crawling any Twitter services — including, but not limited to, any Twitter websites, SMS, APIs, email notifications, applications, buttons, widgets, ads, and commerce services — is permissible only “if done in accordance with the provisions of the robots.txt file” available at https://twitter.com/robots.txt. The robots.txt file specifically disallows crawling of Twitter’s followers or following data. Scraping any Twitter services is expressly prohibited for any reason without Twitter’s prior consent. Twitter reserves all rights, including but not limited to, the right to seek both civil remedies or injunctive relief without further notice.
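Worth noting: the robots.txt mechanism the letter leans on is purely advisory. A well-behaved crawler downloads the file and checks each URL against its rules before fetching, roughly as in the sketch below (the follower-page path is a hypothetical example, not quoted from Twitter’s actual file); nothing in the protocol itself stops a crawler that simply ignores it, which is why the letter has to fall back on terms-of-service language.

```python
from urllib import robotparser

# A polite crawler checks robots.txt before requesting anything.
rp = robotparser.RobotFileParser()
rp.set_url("https://twitter.com/robots.txt")
rp.read()

# Hypothetical example URL; the real file's rules may differ.
url = "https://twitter.com/SomeUser/followers"
if rp.can_fetch("ExampleCrawler/1.0", url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows fetching", url)
```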

So, yeah. This letter is basically Elon publicly admitting he’s scared shitless of Threads and its potential impact on Twitter. This is a “holy shit, this is bad, we’re fucked” kinda letter. Not one from a position of strength. Honestly, this letter makes me think that Threads has a better chance than I initially expected, if Musk is so damn scared of it.

Of course, to date, I’ve seen no indication that Threads was looking to scrape Twitter or enable easy transfer of the Twitter social graph to Threads. Then again, lots of third parties often create such tools, and we’ve already seen Elon freak out over tools that helped users find their Twitter social graph on Mastodon, so I guess this is how he competes. By throwing up bogus walls.

That said, Meta can’t really say much here. After all, it set one of the horrible precedents in court regarding scraping data from websites to build services on top of them. To the extent that Twitter actually has any legal power to stop Meta from scraping, that power was given to it via a bad lawsuit that Meta itself started and pushed to completion.

Though, again, there’s been no indication that Meta actually plans to do that. The fact that it’s able to bootstrap its network off of the (much, much, much larger than Twitter) Instagram network suggests it has no need to port Twitter’s social graph over.

Again, this legal threat letter appears to be legal bluster from the much weaker party of the two.

I doubt this turns into an actual legal dispute, though with Elon, you never really know. If it does turn into a live dispute, however, assuming that Meta didn’t do something preposterously silly (like asking former Twitter employees to share internal documents), then Meta will destroy this lawsuit easily.

But, you know, if we’re going to see a cage match between these two billionaires, why not just throw this on the undercard as well.

Filed Under: alex spiro, competition, elon musk, employees, intellectual property, mark zuckerberg, scraping, threads, trade secrets
Companies: meta, threads, twitter

Surveillance Tech Firm Sued By Meta For Using Thousands Of Bogus Accounts To Scrape Data

from the breaking-the-rules-to-sell-stuff-to-cops dept

About a half-decade ago, major social media companies finally did something to prevent their platforms from being used to engage in mass surveillance. Prompted by revelations in public records, Twitter and Facebook began cutting off API access to certain data scrapers that sold their services to government agencies. Twitter blocked both Dataminr and Geofeedia from accessing its “firehose” API. Facebook did the same thing to Geofeedia, denying it access to both its core service and Instagram.

That may have had some impact on these companies’ ability to secure new government contracts, but there are plenty of others willing to fill the tiny void left by this disruption. And they’re willing to break the rules that govern users of social media platforms, just like the law enforcement agencies they sell to.

Meet Voyager Labs, first exposed late last year by The Guardian, which based its report on public records obtained by the Brennan Center. Here’s what Voyager offers to its law enforcement customers, which include the Los Angeles Police Department:

Pulling information from every part of an individual’s various social media profiles, Voyager helps police investigate and surveil people by reconstructing their entire digital lives – public and private. By relying on artificial intelligence, the company claims, its software can decipher the meaning and significance of online human behavior, and can determine whether subjects have already committed a crime, may commit a crime or adhere to certain ideologies.

But new documents, obtained through public information requests by the Brennan Center, a non-profit organization, and shared with the Guardian, show that the assumptions the software relies on to draw those conclusions may run afoul of first amendment protections. In one case, Voyager indicated that it considered using an Instagram name that showed Arab pride or tweeting about Islam to be signs of a potential inclination toward extremism.

The documents also reveal Voyager promotes a variety of ethically questionable strategies to access user information, including enabling police to use fake personas to gain access to groups or private social media profiles.

It’s that last part — the use of fake personas — that’s getting Voyager sued by Meta, Facebook’s parent company. Facebook has let law enforcement officers know — on multiple occasions — that setting up fake accounts violates its terms of use. It also informed (repeatedly) this particular enabler of ToS violations. When it was ignored to the tune of tens of thousands of bogus accounts by Voyager, it sued, as Jess Weatherbed reports for The Verge.

According to a legal filing issued on November 11th, Meta alleges that Voyager Labs created over 38,000 fake Facebook user accounts and used its surveillance software to gather data from Facebook and Instagram without authorization. Voyager Labs also collected data from sites including Twitter, YouTube, and Telegram.

Meta says Voyager Labs used fake accounts to scrape information from over 600,000 Facebook users between July 2022 and September 2022. Meta says it disabled more than 60,000 Voyager Labs-related Facebook and Instagram accounts and pages “on or about” January 12th.

The updated complaint [PDF], containing more than 1,500 pages of exhibits covering everything from Voyager’s financial statements to its communications with law enforcement users, seeks an injunction blocking Voyager from further violating Facebook’s terms of service agreement.

This is kind of a pleasant surprise. Restricting the complaint to breach of contract actions under both state and federal law keeps the oft-abused CFAA out of it. Had the CFAA been brought into this as a cause of action, it would have created the possibility that researchers, academics, and others who scrape Facebook for useful data might have been harmed by an expansive reading of the CFAA’s “unauthorized access” clause. Fortunately, the CFAA is not in play here, with Meta content to seek damages for Voyager’s repeated violations of its agreements with Facebook.

If Meta succeeds, Voyager’s “real time” scraping service will cease to be useful to its customers. And if the company gets a favorable ruling that results in the collection of damages, fewer companies will be as likely to violate rules just so they can sell stuff to cops.

Filed Under: breach of contract, fake accounts, law enforcement, scraping
Companies: meta, voyager

Federal Court Says Scraping Court Records Is Most Likely Protected By The First Amendment

from the public-access-by-any-means-necessary dept

Automated web scraping can be problematic. Just look at Clearview, which has leveraged open access to public websites to create a facial recognition program it now sells to government agencies. But web scraping can also be quite useful for people who don’t have the power or funding government agencies and their private contractors have access to.

The problem is the Computer Fraud and Abuse Act (CFAA). The act was written to give the government a way to go after malicious hackers. But instead of using it to prosecute malicious hackers, the government (and private companies allowed to file CFAA lawsuits) has gone after security researchers, academics, public interest groups, and anyone else who accesses systems in ways their creators haven’t anticipated.

Fortunately, things have been changing in recent years. In May of last year, the DOJ changed its prosecution policies, stating that it would not go after researchers and others who engaged in “good faith” efforts to notify others of data breaches or otherwise provide useful services to internet users. Web scraping wasn’t specifically addressed in this policy change, but the alteration suggested the DOJ was no longer willing to waste resources punishing people for being useful.

Web scraping is more than a CFAA issue. It’s also a constitutional issue. None other than Clearview claimed it had a First Amendment right to gather pictures, data, and other info from websites with its automated scraping.

Clearview may have a point. A few courts have found scraping of publicly available data to be something protected by the First Amendment, rather than a violation of the CFAA.

Unfortunately, all we really have is a pinkie swear from the DOJ and a handful of decisions that only have precedential weight in certain jurisdictions. But there’s more coming. As the ACLU reports, another federal court has come to the conclusion that government efforts banning web scraping violate the rights of would-be scrapers. But, as is the case in many legal actions, the details matter.

In an important victory, a federal judge in South Carolina ruled that a case to lift the categorical ban on automated data collection of online court records – known as “scraping” – can move forward. The case claims the ban violates the First Amendment.

The decision came in NAACP v. Kohn, a lawsuit filed by the American Civil Liberties Union, ACLU of South Carolina, and the NAACP on behalf of the South Carolina State Conference of the NAACP. The lawsuit asserts that the Court Administration’s blanket ban on scraping the Public Index – the state’s repository of court filings – violates the First Amendment by restricting access to, and use of, public information, and prohibiting recording public information in ways that enable subsequent speech and advocacy.

The case stems from the NAACP’s “Housing Navigator,” which scrapes publicly available info from government websites to find tenants subject to eviction in order to provide them assistance in fighting eviction orders or finding new housing. As the NAACP (and ACLU) point out, this valuable service would be impossible if the NAACP was limited to manual searches to find affected tenants.

The state of South Carolina — via a state appellate decision — claims the NAACP is only allowed limited access — the manual searches the NAACP says render its eviction assistance efforts impossible to achieve. The federal court says the state does have the power to limit access to public records, but those limits must align with the tenets of the First Amendment, which presume open access to government records by the governed.

The state comes down on the losing side here, at least for the moment. The limits proposed by the state court order nullify the services the NAACP hopes to offer. As it stands now, the state cannot escape this lawsuit because there’s enough on the record at the moment that suggests there’s a viable constitutional claim.

The NAACP alleges that without scraping, it is impossible to gather the information quickly enough to meet the ten-day deadline to request a hearing. It alleges that scraping poses at most a de minimis burden on the functionality of the website.

As discussed above, it also contends suggested alternatives to scraping, such as Rule 610, are insufficient, and that Defendants have, in any event, indicated an unwillingness to provide the information under that rule. […]

True, the evidence may eventually show that Defendants have a sufficient reason to prohibit scraping. It may indicate that the NAACP’s access to the records is unburdened by the restriction. Or, it may demonstrate that Defendants have provided sufficient alternatives to access the information. But, as alleged, the restrictions state a claim for violation of the First Amendment.

The bottom line is this: automated access to government records is almost certainly protected by the First Amendment. What will be argued going forward is how much the government can restrict this access without violating the Constitution. There’s not a lot on the record at the moment, but this early ruling seems to suggest this court will err on the side of unrestricted access, rather than give its blessing to unfettered fettering of the presumption of open access that guides citizens’ interactions with public records.

Filed Under: 1st amendment, court documents, public data, scraping
Companies: aclu, naacp

Meta Sues Scraping Firms; Is It Really Protecting Users? Or Protecting Meta?

from the potentially-problematic dept

For many years we’ve written stories regarding various lawsuits over scraping the web. Without the ability to scrape the web, we’d have no search engines, no Internet Archive, and lots of other stuff wouldn’t work right either. However, more importantly, the ability to scrape the web should result in a better overall internet, potentially reversing the trend of consolidation and internet giants that silo off your info. Most often, we’ve talked about this in the context of Facebook’s case over a decade ago against Power.com. That involved a company that was trying to build a single dashboard for multiple social media companies, allowing users to log into a single interface to see content from, and post content to, multiple platforms at once. In that case Facebook relied on the Computer Fraud and Abuse Act (the CFAA), and the courts sided with Facebook, saying that because Facebook had sent Power a cease-and-desist letter, that made the access (even with the approval of the users themselves!) somehow “unauthorized.”

Over the years, we’ve pointed out how this decision and interpretation of the CFAA is one of the biggest reasons the market for social media is not as competitive as it could be. That decision effectively said that Facebook could build its own silo, in which your data checks in but it never checks out. Other tech companies — including Craigslist and LinkedIn — have brought similar lawsuits, though in LinkedIn’s case against HiQ the court cut back the earlier Power.com ruling, and basically said that it only applied to information that was behind a registration wall. Publicly available information was legal to scrape.

More recently, Facebook parent company Meta has again gone after scraping operations. Earlier this year, we noted how the company had sued a somewhat sketchy provider of “insights” into “influencers and their audiences” that had been scraping information on Facebook. And, now, the company has announced two new lawsuits against scraping companies. Once again, neither of the defendants are as sympathetic as Power, and Meta even frames these lawsuits as “safeguarding” its users privacy.

The first lawsuit, against a company called Octopus Data, raises all sorts of questions. Octopus offers a cloud-based service called Octoparse, which allows customers to extract web data from basically any URL without having to do any coding yourself. This is actually… really really useful? Especially for researchers. The ability to scrape and extract data from webpages is not just useful, it’s how lots of services work, including search engines. But Meta is not at all happy.

Since at least March 25, 2015, and continuing to the present, Defendant Octopus Data Inc., (“Octopus”) has operated an unlawful service called Octoparse, which was designed to improperly collect or “scrape” user account profiles and other information from various websites, including Amazon, eBay, Twitter, Yelp, Google, Target, Walmart, Indeed, LinkedIn, Facebook and Instagram.

Defendant’s service used and offered multiple products to scrape data. First, Defendant offered to scrape data directly from various websites on behalf of its customers (the “Scraping Service”). Second, Defendant developed and distributed software designed to scrape data from any website, including Facebook and Instagram, using a customer’s self-compromised account (the “Scraping Software”). Defendant’s Scraping Software was capable of scraping any data accessible to a logged in Facebook and Instagram user. And Defendant designed the “premium” Scraping Software to launch scraping campaigns from Defendant’s computer network and infrastructure. Finally, Defendant claimed to use and distribute technologies to avoid being detected and blocked by Meta and other websites they scraped.

Defendant’s conduct was not authorized by Meta and it violates Meta’s and Instagram’s terms and policies, and federal and California law. Accordingly, Meta seeks damages and injunctive relief to stop Defendant’s use of its platform and products in violation of its terms and policies.

Perhaps notably, Facebook does not try to use either the CFAA or California’s state equivalent in this case. Instead, it tosses in… a copyright claim. That’s because one of the premium services of Octoparse is that it will scrape the data and store it on its own server — and Meta argues that Octoparse violates Section 1201 of the DMCA (the anti-circumvention part) because the scraping tool has to “circumvent” Meta’s technical tools put in place to block Octoparse.

Certain user generated content is also copyright protected and users grant Meta a non-exclusive, transferable, sub-licensable, royalty-free, and worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of that content consistent with the user’s privacy and application settings.

Meta uses technological measures designed to detect and disrupt automation and scraping and that also effectively control access to Meta’s and users’ copyright protected works, including requiring users to register for an account and login to the account before using those products, monitoring for the automated creation of accounts, monitoring account use patterns that are inconsistent with a human user, employing a reCAPTCHA program to distinguish between bots and human users, identifying and blocking of IP addresses of known data scrapers, disabling accounts engaged in automated activity, and setting rate and data limits.

Defendant has circumvented and is circumventing technological measures that effectively control access to copyright protected works and those of its users on Facebook and Instagram and/or portions thereof.

Defendant manufactures, provides, offers to the public, or otherwise traffics in technology, products, services, devices, components, or parts thereof, that are primarily designed or produced for the purpose of circumventing technological measures and/or protection afforded by technological measures that effectively control access to copyright protected works and/or portions thereof.

Defendant’s Octoparse Scraping Services or parts thereof, as described above, have no or limited commercially significant purpose or use other than to circumvent technological measures that effectively control access to Meta and its user’s copyrighted works and/or portions thereof in order to scrape copyright protected data from Facebook and Instagram.

So, much of that is bullshit. Octoparse seems like a pretty useful service for researchers and others looking to extract data from websites. There are tons of non-nefarious reasons for doing so, including research or building tools that let people access content on social media sites without having to set up an account and hand all their info over to Meta.

In other words, this lawsuit seems dangerous in multiple ways — an expansion of DMCA 1201, and a tool that Meta can use in a similar manner to what it did with Power and the CFAA to effectively limit competition and to build higher walls for its silos.

The second lawsuit, admittedly, involves a much, much sketchier defendant (which may be why Meta seems to be playing it up, and why much of the press coverage focuses on this lawsuit, rather than the Octoparse one). It’s against a guy named Ekrem Ates, who is apparently based in Turkey and runs (or possibly ran) a website with the evocative name of MyStalk.

MyStalk would scrape information from Instagram users, and repost it to its own site, so that users could follow an Instagram user’s stories without (1) having to log in to Instagram or (2) revealing to the original uploader who was viewing them. For semi-obvious reasons you can see why this is a bit… creepy. And stalkerish (I mean, the name doesn’t help). But, there are potentially useful reasons for such a service. I mean, in some ways it’s similar to the Nitter service that some people use to view tweets without sharing information back to Twitter.

But, again, Meta insists this is nothing but evil.

Beginning no later than July 2017 and continuing until present, Defendant Ekrem Ateş used unauthorized automation software to improperly access and collect—or “scrape”—the profiles of Instagram users, including their posts, photos, Stories, and profile information. Defendant’s automation software used thousands of automated Instagram accounts that falsely identified themselves as legitimate Instagram users connected to either the official Instagram mobile application or website. Through this fraudulent connection, Defendant scraped data from the profiles of over 350,000 Instagram users. These profiles had not been set to private by the users and, beyond a limited number of profiles and posts, were publicly viewable only to logged-in Instagram users. Defendant published the scraped data on his own websites, which allowed visitors to view and search for Instagram profiles, displayed user data scraped from Instagram, and promoted “stalking people” without their noticing. Defendant also generated revenue by displaying ads on these websites.

Meta notes that it sent Ates a cease-and-desist letter (a la Power). Ates, apparently without a lawyer (and not very wisely), replied directly to the C&D, admitting to a bunch of stuff he probably should not have admitted to. He claimed that he shut down the services he ran and deleted the data, but also that he had sold the “mystalk” domain to someone else and no longer had control over it. Meta’s lawyers asked him to say who he sold it to, and Ates tried to use that as a negotiation tactic, saying he would reveal the information if Meta promised not to take legal action against him. Meta’s lawyers were, as lawyers are, somewhat vague, suggesting that something might be worked out, but without promising anything, and after that Ates went silent — leading to this lawsuit.

Ates does admit that he made about $1,000 from the site, but says he got rid of it because it wasn’t worth it, having spent more than that maintaining it.

This lawsuit is… strange on multiple levels. Ates is clearly a small time player, and he’s based in Turkey, so it seems unlikely he’s going to show up in a US federal court. A default judgment seems like the most likely outcome.

Like the Octoparse case, this one involves breach of contract and unjust enrichment claims, but then adds in California Penal Code § 502, which is the California equivalent of the CFAA.

So, yes, obviously someone setting up a website to allow people to “stalk” others is unsympathetic. But the underlying issue remains: scraping and extracting data is also a really useful tool. It’s useful for research. It’s useful for building additional services. It’s useful for creating competition and for limiting the ability of certain internet giants to control absolutely everything.
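
To make the “useful for research” point concrete, here’s a minimal, hypothetical sketch (the URL, user-agent string, and contact address are all placeholders, not anything from these cases) of the kind of public-data scraping researchers actually do: check the site’s robots.txt, identify yourself honestly, and fetch a page that’s available to anyone with a web browser, with no login and no fake accounts involved.

```python
# Hypothetical sketch of polite, research-style scraping of a *public* page.
# The URL, user-agent, and contact address are placeholders for illustration only.
from urllib import request, robotparser

USER_AGENT = "example-research-bot/0.1 (contact: researcher@example.org)"
PAGE_URL = "https://example.org/public-profile"

# Ask the site's robots.txt whether this user-agent may fetch the page.
robots = robotparser.RobotFileParser("https://example.org/robots.txt")
robots.read()

if robots.can_fetch(USER_AGENT, PAGE_URL):
    req = request.Request(PAGE_URL, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    print(f"Fetched {len(html)} characters of public HTML")
else:
    print("robots.txt disallows this fetch; skipping")
```

Nothing in that sketch logs in, spoofs an account, or touches anything a site has put behind a gate, which is roughly the distinction the courts draw in the cases discussed here.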

Yes, it can be abused. But it really feels (yet again) like Meta/Facebook is leaning on the complaints that it doesn’t do enough to protect its users’ privacy, and using them as an excuse to get legal rulings that will increasingly shield the company from both scrutiny and competition.

Filed Under: ca penal code 502, cfaa, clone sites, competition, copyright, data, dmca 1201, ekrem ates, octoparse, research, scraping, stalking
Companies: facebook, meta, mystalk, octopus data

Appeals Court Says That Scraping Public Data Off A Website Does Not Violate Hacking Law

from the phew dept

For years now we’ve been following cases related to scraping data off of websites and the Computer Fraud and Abuse Act (CFAA). The CFAA is an extremely poorly drafted law that has been stretched by both law enforcement and civil plaintiffs alike to argue that all sorts of things are “unauthorized access” and therefore hacking. We’ve covered many of these cases over the years. The courts have at least started to push back on some of the more extreme interpretations of the law, though it’s still problematic.

Over a decade ago, we followed a case that I still think is one of the most problematic rulings for the internet: when Facebook sued a small startup called Power.com. Power made a social media aggregator, allowing you to access all your different social media accounts through one interface, and even to post messages across multiple platforms from that same interface. In order to do that, you had to provide your login to Power, which would access your social media accounts and suck out the data (or push in the data for posting). Again, this was the user willingly granting their login information. Leaving aside whether or not it’s wise to share your login info with a third party, it was still the user’s choice.

However, Facebook decided that this was hacking and in violation of the CFAA… and the courts (tragically) agreed, allowing Facebook to effectively shut down a useful service that would have prevented Facebook from locking up so much data (and becoming such a dominant player). The key reason the court sided with Facebook was its conclusion that once Facebook sent a cease-and-desist letter, any further scraping was effectively “unauthorized.” I still think that we’d see an extremely different competitive landscape today if the Power case had turned out differently. It would have significantly limited the ability of the big social media players to lock in their users. Instead, the ruling more or less turned Facebook into a roach motel where your data checks in, but it can never check out.

Other internet companies unfortunately followed suit, using similar lawsuits against websites providing useful complementary services. Craigslist went after 3taps, which made Craigslist data available to third-party apps. LinkedIn went after a company called HiQ that was scraping and making use of LinkedIn data. Here, unlike in the Power case, the courts actually ruled against LinkedIn, saying that LinkedIn could not use the CFAA to block scraping of public data. The key difference between this case and the Power one was that HiQ was scraping public info (i.e., it didn’t need to log in to LinkedIn with someone’s info to access the data). LinkedIn appealed… and lost again. LinkedIn then asked the Supreme Court to weigh in, resulting in the Supreme Court vacating the 9th Circuit’s ruling and sending it back to that court to reconsider in light of last summer’s big Van Buren ruling that limited parts of the CFAA.

So now, with yet another chance… the 9th Circuit has correctly concluded the same thing. HiQ’s scraping of public information still does not violate the CFAA. There are a few different legal issues involved here, but the CFAA claims are the main event. LinkedIn argued that it sent a cease-and-desist to HiQ, so as per the Power ruling, its continued scraping violated the law.

The panel reviewing this case goes deep into the CFAA, why it exists, and what it’s supposed to do before concluding that LinkedIn’s interpretation can’t be the correct one, noting that “the CFAA is best understood as an anti-intrusion statute and not as a ‘misappropriation statute,'” and as such accessing public information shouldn’t be a violation.

Put differently, the CFAA contemplates the existence of three kinds of computer systems: (1) computers for which access is open to the general public and permission is not required, (2) computers for which authorization is required and has been given, and (3) computers for which authorization is required but has not been given (or, in the case of the prohibition on exceeding authorized access, has not been given for the part of the system accessed). Public LinkedIn profiles, available to anyone with an Internet connection, fall into the first category. With regard to websites made freely accessible on the Internet, the “breaking and entering” analogue invoked so frequently during congressional consideration has no application, and the concept of “without authorization” is inapt.

As for reconsidering in light of the Van Buren ruling, that doesn’t change things.

Van Buren’s “gates-up-or-down inquiry” is consistent with our interpretation of the CFAA as contemplating three categories of computer systems.

[….]

Van Buren’s distinction between computer users who “can or cannot access a computer system,” suggests a baseline in which there are “limitations on access” that prevent some users from accessing the system (i.e., a “gate” exists, and can be either up or down). The Court’s “gates-up-or-down inquiry” thus applies to the latter two categories of computers we have identified: if authorization is required and has been given, the gates are up; if authorization is required and has not been given, the gates are down. As we have noted, however, a defining feature of public websites is that their publicly available sections lack limitations on access; instead, those sections are open to anyone with a web browser. In other words, applying the “gates” analogy to a computer hosting publicly available webpages, that computer has erected no gates to lift or lower in the first place. Van Buren therefore reinforces our conclusion that the concept of “without authorization” does not apply to public websites.

The court again distinguishes Power from the HiQ case by saying that Facebook limited access to the data to only those who were logged in, as opposed to the more public access available on LinkedIn.

In that case, Facebook sued Power Ventures, a social networking website that aggregated social networking information from multiple platforms, for accessing Facebook users’ data and using that data to send mass messages as part of a promotional campaign. Id. at 1062–63. After Facebook sent a cease-and-desist letter, Power Ventures continued to circumvent IP barriers and gain access to password protected Facebook member profiles. Id. at 1063. We held that after receiving an individualized cease-and-desist letter, Power Ventures had accessed Facebook computers “without authorization” and was therefore liable under the CFAA. Id. at 1067–68. But we specifically recognized that “Facebook has tried to limit and control access to its website” as to the purposes for which Power Ventures sought to use it. Id. at 1063. Indeed, Facebook requires its users to register with a unique username and password, and Power Ventures required that Facebook users provide their Facebook username and password to access their Facebook data on Power Ventures’ platform. Facebook, Inc. v. Power Ventures, Inc., 844 F. Supp. 2d 1025, 1028 (N.D. Cal. 2012). While Power Ventures was gathering user data that was protected by Facebook’s username and password authentication system, the data hiQ was scraping was available to anyone with a web browser.

And thus, this doesn’t fix the unfortunate precedent of the Power case, but at least it keeps it from getting worse, while making it clear that scraping public web pages is not hacking, even if you’re sent a cease-and-desist letter.

Filed Under: 9th circuit, cfaa, scraping, web scraping
Companies: hiq, linkedin