scraping – Techdirt

Air Canada Would Rather Sue A Website That Helps People Book More Flights Than Hire Competent Web Engineers

from the time-to-cross-air-canada-off-the-flight-list dept

I am so frequently confused by companies that sue other companies for making their own sites and services more useful. It happens quite often. And quite often, the lawsuits are questionable CFAA claims against websites that scrape data to provide a better consumer experience, but one that still ultimately benefits the originating site.

Over the last few years various airlines have really been leading the way on this, with Southwest being particularly aggressive in suing companies that help people find Southwest flights to purchase. Unfortunately, many of these lawsuits are succeeding, to the point that a court has literally said that a travel company can’t tell others how much Southwest flights cost.

But the latest lawsuit of this nature doesn’t involve Southwest, and is quite possibly the dumbest one. Air Canada has sued the site Seats.aero that helps users figure out the best flights for their frequent flyer miles. Seats.aero is a small operation run by the company with the best name ever: Localhost, meaning that the lawsuit is technically “Air Canada v. Localhost” which sounds almost as dumb as this lawsuit is.

The Air Canada Group brings this action because Mr. Ian Carroll—through Defendant Localhost LLC—created a for-profit website and computer application (or “app”)— both called Seats.aero—that use substantial amounts of data unlawfully scraped from the Air Canada Group’s website and computer systems. In direct violation of the Air Canada Group’s web terms and conditions, Carroll uses automated digital robots (or “bots”) to continuously search for and harvest data from the Air Canada Group’s website and database. His intrusions are frequent and rapacious, causing multiple levels of harm, e.g., placing an immense strain on the Air Canada Group’s computer infrastructure, impairing the integrity and availability of the Air Canada Group’s data, soiling the customer experience with the Air Canada Group, interfering with the Air Canada Group’s business relations with its partners and customers, and diverting the Air Canada Group’s resources to repair the damage. Making matters worse, Carroll uses the Air Canada Group’s federally registered trademarks and logo to mislead people into believing that his site, app, and activities are connected with and/or approved by the real Air Canada Group and lending an air of legitimacy to his site and app. The Air Canada Group has tried to stop Carroll’s activities via a number of technological blocking measures. But each time, he employs subterfuge to fraudulently access and take the data—all the while boasting about his exploits and circumvention online.

Almost nothing in this makes any sense. Having third parties scrape sites for data about prices is… how the internet works. Whining about it is stupid beyond belief. And here, it’s doubly stupid, because anyone who finds a flight via seats.aero is then sent to Air Canada’s own website to book that flight. Air Canada is making money because Carroll’s company is helping people find Air Canada flights they can take.

Why are they mad?

Air Canada’s lawyers also seem technically incompetent. I mean, what the fuck is this?

Through screen scraping, Carroll extracts all of the data displayed on the website, including the text and images.

Carroll also employs the more intrusive API scraping to further feed Defendant’s website.

If the “API scraping” is “more intrusive” than screen scraping, you’re doing your APIs wrong. Is Air Canada saying that its tech team is so incompetent that its API puts more load on the site than scraping does? Because, if so, Air Canada should fire its tech team. The whole point of an API is to make it easier for third parties to access data from your website without resorting to the far more cumbersome process of scraping.
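To illustrate the difference the complaint gets backwards: an API hands the client a small structured payload, while screen scraping forces the client to download and parse an entire rendered page to recover the same two fields. A minimal sketch with made-up payloads (none of this is Air Canada’s actual markup or API):

```python
import json
from html.parser import HTMLParser

# Hypothetical responses to the same availability query, one per access
# method. Illustrative only -- not Air Canada's real markup or API.
HTML_PAGE = """
<html><body>
  <div class="result"><span class="route">YYZ-LHR</span>
  <span class="miles">60000</span></div>
</body></html>
"""
API_RESPONSE = '{"route": "YYZ-LHR", "miles": 60000}'

class RouteScraper(HTMLParser):
    """Screen scraping: walk the whole rendered page to find two fields."""
    def __init__(self):
        super().__init__()
        self.field = None
        self.data = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("route", "miles"):
            self.field = cls

    def handle_data(self, data):
        if self.field and data.strip():
            self.data[self.field] = data.strip()
            self.field = None

scraper = RouteScraper()
scraper.feed(HTML_PAGE)
scraped = {"route": scraper.data["route"], "miles": int(scraper.data["miles"])}

# API access: one small structured payload, parsed in a single call.
via_api = json.loads(API_RESPONSE)

assert scraped == via_api  # same data; the API path is strictly cheaper
```

Same information either way; the API response is a fraction of the bytes and requires no brittle HTML parsing, which is exactly why serving it should be the *lighter* load.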

And, yes, this lawsuit really calls into question Air Canada’s tech team and their ability to run a modern website. If your website can’t handle having its flights and prices scraped a few times every day, then you shouldn’t have a website. Get some modern technology, Air Canada:

Defendant’s avaricious data scraping generates frequent and myriad requests to the Air Canada Group’s database—far in excess of what the Air Canada Group’s infrastructure was designed to handle. Its scraping collects a large volume of data, including flight data within a wide date range and across extensive flight origins and destinations—multiple times per day.

Maybe… invest in better infrastructure like basically every other website that can handle some basic scraping? Or, set up your API so it doesn’t fall over when used for normal API things? Because this is embarrassing:

At times, Defendant’s voluminous requests have placed such immense burdens on the Air Canada Group’s infrastructure that it has caused “brownouts.” During a brownout, a website is unresponsive for a period of time because the capacity of requests exceeds the capacity the website was designed to accommodate. During brownouts caused by Defendant’s data scraping, legitimate customers are unable to use [the website] or the Air Canada + Aeroplan mobile app, including to search for available rewards, redeem Aeroplan points for the rewards, search for and view reward travel availability, book reward flights, contact Aeroplan customer support, and/or obtain service through the Aeroplan contact center due to the high volume of calls during brownouts.
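Keeping one heavy caller from browning out everyone else is a solved problem: rate limiting. A minimal token-bucket limiter, sketched below with a simulated clock (this is a generic illustration, not a claim about how Air Canada’s systems do or should work):

```python
class TokenBucket:
    """Per-client token bucket: each request costs one token, tokens
    refill at a fixed rate, and bursts are capped. Over-limit callers
    get refused (e.g. HTTP 429) instead of taking the site down."""

    def __init__(self, rate_per_sec: float, burst: int, start: float = 0.0):
        self.rate = rate_per_sec       # tokens refilled per second
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)
        self.last = start

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# A client firing 100 requests in one second against a 10 req/s limit
# with a burst of 20 gets roughly 29 of them through; the rest bounce.
bucket = TokenBucket(rate_per_sec=10, burst=20)
allowed = sum(bucket.allow(now=i / 100.0) for i in range(100))
assert 20 <= allowed <= 30
```

In production this sits in front of the API (per IP, per key, or per account), so aggressive scraping degrades into polite throttling rather than a “brownout.”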

Air Canada’s lawyers also seem wholly unfamiliar with the concept of nominative fair use for trademarks. If you’re displaying someone’s trademarks for the sake of accurately talking about them, there’s no likelihood of confusion and no concern about the source of the information. Air Canada claiming that this is trademark infringement is ridiculous.

I guarantee that no one using Seats.aero thinks that they’re on Air Canada’s website.

The whole thing is so stupid that it makes me never want to fly Air Canada again. I don’t trust an airline that can’t set up its website/API to handle someone making its flights more attractive to buyers.

But, of course, in these crazy times with the way the CFAA has been interpreted, there’s a decent chance Air Canada could win.

For his part, Carroll says that he and his lawyers have reached out to Air Canada “repeatedly” to try to work with them on how they “retrieve availability information,” and that “Air Canada has ignored these offers.” He also notes that tons of other websites are scraping the very same information, and he has no idea why he’s been singled out. He further notes that he’s always been open to adjusting the frequency of searches and working with the airlines to make sure that his activities don’t burden the website.

But, really, the whole thing is stupid. The only thing that Carroll’s website does is help people buy more flights. It points people to the Air Canada site to buy tickets. It makes people want to fly more on Air Canada.

Why would Air Canada want to stop that, other than that it can’t admit its website operations should all be handed over to a more competent team?

Filed Under: api, cfaa, flights, frequent fliers, scraping, screen scraping, trademark
Companies: air canada, localhost, seats.aero

WOW Fans Trick ‘AI’ ‘News’ Scraper Into Covering Fake New Game Feature

from the yes-I-can-absolutely-do-that,-Dave dept

Tue, Jul 25th 2023 05:23am - Karl Bode

Large language model technology’s (aka “AI”) introduction into journalism has been a blistering mess. And not just because the technology is undercooked (which it is), but because the folks in charge of most major media outlets are incompetent cheapskates who simply see the tech as a way to cut corners, wage war on labor, and automate all of the clickbait attention economy’s very worst impulses.

The result of that continues to go about how you’d expect, with a ton of rushed computer-generated articles filled with dumb mistakes.

But last week there was a fun wrinkle when users over at the r/wow subreddit tricked an “AI” scraping the web for news into publishing an article on a new World of Warcraft feature that doesn’t exist. The fans created an entirely new game mode and lore called Glorbo, talked about it as if it was a real thing in the subreddit, and got a website called The Portal, owned by Zleague.gg, to treat it like a real thing:

The Portal, owned by Zleague.gg, ran an SEO item on Glorbo headlined “World of Warcraft (WoW) Players Excited for Glorbo’s Introduction”, quoting the main Reddit thread directly. Though it appears The Portal has since realised its mistake and removed the post, it can still be read in full on Archive.Today. The original post does not appear to denote that the story was automated. The author byline on the piece does not lead to a bio or social media links of any kind.

While this was a fun prank related to gaming news, the same kind of lazy rushed implementation of “AI” is also occurring in the broader field of journalism. And while the tech may improve over time, the kind of greedy, incompetent leadership we’ve seen in media generally won’t.

There are plenty of ways these large language model tools could actually help journalists do a better, more efficient job. But we’re not injecting the technology into a healthy journalism and media environment. We’re injecting it into an already very broken clickbait bullshit generation machine, effectively supercharging all of its worst tendencies.

The goal for a lot of the VC types in media is to create a giant pointless ouroboros of clickbait gibberish and ad consumption that shits money. A giant wheel of pointless, often-manufactured engagement that is largely free of any pesky concerns about silly things like paying human beings a living wage, the quality of the end product, or the health of the broader industry.

Filed Under: ai, clickbait, gaming, journalism, labor, media, news, scraping, world of warcraft
Companies: reddit, zleague.gg

Elon Musk’s ‘War’ On Possibly Imaginary Scrapers Now A Lawsuit, Which Might Actually Work

from the killing-the-open-web dept

Elon Musk seems infatuated with bots and scrapers as the root of all his problems at Twitter. Given his propensity to fire engineers who tell him things he doesn’t want to hear, it’s not difficult to believe that engineers afraid to tell Musk the truth are conveniently blaming “scraping” for the variety of problems Twitter has had since Musk’s YOLO leadership style knocked out some fundamental tools that kept the site reliable in the before times.

He tried to blame bots for spam (which he’s claimed repeatedly to have dealt with, but then gone back to blaming them for other things, because he hasn’t actually stopped automated spam). His attempts to “stop the bots” have resulted in a series of horrifically stupid decisions, including believing that his non-verification Twitter Blue system would solve it (it didn’t), believing that cutting off free API access would drive away the spam bots (it drove away the good bots), and then believing that rate limiting post views would somehow magically stop scraping bots (which might only be scraping because of his earlier dumb decision to kill off the API).

The latest, though, is that last week Twitter went to court to sue ‘John Doe’ scrapers in a Texas court. And while I’ve long argued that scraping should be entirely legal, court precedents may be on Twitter’s side here.

Scraping is part of how the internet works and has always worked. The war on scraping is problematic for all sorts of reasons, and is an attack on the formerly open web. Unfortunately, though, courts are repeatedly coming out against scraping.

So, while I’d argue that this, from the complaint, is utter nonsense, multiple courts seem to disagree and find the argument perfectly plausible:

Scraping is a form of unauthorized data collection that uses automation and other processes to harvest data from a website or a mobile application.

Scraping interferes with the legitimate operation of websites and mobile applications, including Twitter, by placing millions of requests that tax the capacity of servers and impair the experience of actual users.

This is not how any of this should work, and is basically just an attack on the open web. Yes, scraping bots can overwhelm a site, but it’s the site’s job to block them, not the courts’.

Twitter users have no control over how data-scraping companies repackage and sell their personal information.

This sounds scary, but again is nonsense. Scraping only has access to public information. If you post information publicly, then of course users don’t have control over that information any more. That’s how information works.

The complaint says that Twitter (I’m not fucking calling it ‘X Corp.’) has discovered IP addresses engaged in “flooding Twitter’s sign-up page with automated requests.” The complaint says:

The volume of these requests far exceeded what any single individual could send to a server in a given period and clearly indicated that these automated requests were aimed at scraping data from Twitter.

This also feels like a stretch. It seems like the more likely reason for flooding a sign up page is to create spam accounts. That’s also bad, of course, but it’s not clear how this automatically suggests scraping.

Of course, there have been a bunch of scraping cases in the past, and there are some somewhat mixed precedents here. There was the infamous Power.com case, that said it could be a CFAA (Computer Fraud and Abuse Act) violation to scrape content from behind a registration wall (even if the user gave permission). Last year, there was the April ruling in the 9th Circuit in the LinkedIn/hiQ case, which notably said that scraping from a public website, rather than a registration-walled one, could not be a CFAA violation.

Indeed, much of the reporting on Twitter’s new lawsuit is pointing to that decision. But, unfortunately, that’s the wrong decision to look at. Months later, the same court ruled again in that case (in a ruling that got way less attention) that even if the scraping wasn’t a CFAA violation, it was still a violation of LinkedIn’s terms of service, and granted an injunction against the scrapers.

Given the framing in the complaint, Twitter seems to be arguing the same thing (rather than a CFAA violation, that this is a terms of service violation). On top of that, this case is filed in Texas state court, and at least in federal court in Texas, the 5th Circuit has found that scraping data can be considered “unjust enrichment.”

In other words, as silly as this is, and as important scraping is to the open web, it seems that courts are buying the logic of this kind of lawsuit, meaning that Twitter’s case is probably stronger than it should be.

Of course, Twitter still needs to figure out who is actually behind these apparent scraping IP addresses, and then show that they actually were scraping. And who knows if the company will be able to do that. In the meantime, though, this is yet another case following the unfortunate pattern of Facebook, LinkedIn, and even Craigslist in spitting on the open web they were built on.

Filed Under: lawsuits, scraping, terms of service, texas
Companies: twitter

Something Stupid This Way Comes: Twitter Threatens To Sue Meta Over Threads, Because Meta Hired Some Of The People Elon Fired

from the everything-is-stupid dept

Just fucking fight it out already.

The whole stupid “cage match” brawl thing was started when Meta execs made some (accurate) cracks about Elon’s management of Twitter, and Elon couldn’t handle it. But, now with the launch of Meta’s Threads, Elon feels the need to send a ridiculously laughable legal threat to Meta.

Elon’s legal lapdog, Alex Spiro, dashed off a threat letter so dumb that even his employer, Quinn Emanuel — who is famous among powerful law firms for having no shame at all — should feel shame.

Dear Mr. Zuckerberg:

I write on behalf of X Corp., as successor in interest to Twitter, Inc. (“Twitter”). Based on recent reports regarding your recently launched “Threads” app, Twitter has serious concerns that Meta Platforms (“Meta”) has engaged in systemic, willful, and unlawful misappropriation of Twitter’s trade secrets and other intellectual property.

Lol, wut? Threads is like a dozen other microblogging type services. There are no “trade secrets” one needs to misappropriate from Twitter. I mean, seriously, who in their right mind thinks that Meta with billions of users of Facebook, Instagram, and WhatsApp is learning anything from Twitter, beyond “don’t do the dumbshit things Elon keeps doing.”

Over the past year, Meta has hired dozens of former Twitter employees. Twitter knows that these employees previously worked at Twitter; that these employees had and continue to have access to Twitter’s trade secrets and other highly confidential information; that these employees owe ongoing obligations to Twitter; and that many of these employees have improperly retained Twitter documents and electronic devices. With that knowledge, Meta deliberately assigned these employees to develop, in a matter of months, Meta’s copycat “Threads” app with the specific intent that they use Twitter’s trade secrets and other intellectual property in order to accelerate the development of Meta’s competing app, in violation of both state and federal law as well as those employees’ ongoing obligations to Twitter.

Let’s break this one down, because holy shit, is it ever stupid. The reason that Meta was able to hire a bunch of former Twitter employees most likely had to do with the fact that Elon recklessly fired 85% of the existing staff, and did so willy nilly, destroying tons of institutional knowledge and knowhow. And yet, Musk claimed he had to get rid of these employees because they were not hardcore, and were useless to Twitter. Yet, now we’re being told they are somehow invaluable to Threads? That doesn’t even pass the most basic laugh test.

The claim that “these employees have improperly retained Twitter documents and electronic devices” is particularly ridiculous, given that I’ve spoken to many, many, many ex-Twitter employees who have spent months trying to return their laptops, without Twitter bothering to respond to them at all. To use that against those employees is ridiculous.

And, really, what fucking “trade secrets” or “intellectual property” do Spiro and Musk honestly think that any former employees took with them to Meta? How to competently run a microblogging service? This is all bluff and bluster from Elon, who knows he’s fucked up Twitter and is scared of any competition.

On top of that, assuming any of those employees are in California, state law has prohibited non-competes and similar restrictions for the last century and a half, because the state has a stated policy that people should be free to work where they choose. So, to the extent that Twitter thinks it can enforce some sort of quasi-non-compete agreement, that’s just not going to fly.

Update: Also, Meta has now said that none of the small team working on Threads is a former Twitter employee anyway, so the assumptions in the letter are entirely false.

The letter continues:

Twitter intends to strictly enforce its intellectual property rights, and demands that Meta take immediate steps to stop using any Twitter trade secrets or other highly confidential information. Twitter reserves all rights, including, but not limited to, the right to seek both civil remedies and injunctive relief without further notice to prevent any further retention, disclosure, or use of its intellectual property by Meta.

In short, even as we’re not paying many of our bills and are desperately short on cash, especially compared to Meta, which has a building full of litigators, we’re ready, able, and willing to file a completely bogus, vexatious lawsuit just to try to annoy you.

Then we get to the real fear: that Meta might make it easy to recreate your Twitter social graph on Threads:

Further, Meta is expressly prohibited from engaging in any crawling or scraping of Twitter’s followers or following data. As set forth in Twitter’s Terms of Service, crawling any Twitter services — including, but not limited to, any Twitter websites, SMS, APIs, email notifications, applications, buttons, widgets, ads, and commerce services — is permissible only “if done in accordance with the provisions of the robots.txt file” available at https://twitter.com/robots.txt. The robots.txt file specifically disallows crawling of Twitter’s followers or following data. Scraping any Twitter services is expressly prohibited for any reason without Twitter’s prior consent. Twitter reserves all rights, including but not limited to, the right to seek both civil remedies or injunctive relief without further notice.

So, yeah. This letter is basically Elon publicly admitting he’s scared shitless of Threads and its potential impact on Twitter. This is a “holy shit, this is bad, we’re fucked” kinda letter. Not one from a position of strength. Honestly, this letter makes me think that Threads has a better chance than I initially expected, if Musk is so damn scared of it.
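Incidentally, the robots.txt mechanism the letter leans on is just a plain-text file of crawl rules that well-behaved crawlers parse before fetching anything. A minimal sketch using Python’s standard-library parser (the rules and URLs here are simplified placeholders, not Twitter’s actual robots.txt; note the stdlib parser does plain path-prefix matching, not wildcard patterns):

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in rules of the kind the letter describes: follower
# data off-limits, the rest crawlable. Not Twitter's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /followers
Disallow: /following
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(ROBOTS_TXT)  # parse rules directly, no network fetch needed

# A polite crawler checks before every request:
assert not rp.can_fetch("*", "https://example.com/followers/alice")
assert rp.can_fetch("*", "https://example.com/status/123")
```

Of course, robots.txt is voluntary: it only constrains crawlers that choose to honor it, which is exactly why the letter has to fall back on terms-of-service threats.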

Of course, to date, I’ve seen no indication that Threads was looking to scrape Twitter or enable easy transfer of the Twitter social graph to Threads. But lots of third parties often create such tools, and we’ve already seen Elon freak out over tools that helped users find their Twitter social graph on Mastodon, so I guess this is how he competes: by throwing up bogus walls.

That said, Meta can’t really say much here. After all, it set one of the horrible precedents in court regarding scraping data from websites to build services on top of them. To the extent that Twitter actually has any legal power to stop Meta from scraping, that power was given to it via a bad lawsuit that Meta itself started and pushed to completion.

Though, again, there’s been no indication that Meta actually plans to do that. The fact that it’s able to bootstrap its network off of the (much, much, much larger than Twitter) Instagram network suggests it has no need to port Twitter’s social graph over.

Again, this legal threat letter appears to be legal bluster from the much weaker party of the two.

I doubt this turns into an actual legal dispute, though with Elon, you never really know. If it does turn into a live dispute, however, assuming that Meta didn’t do something preposterously silly (like asking former Twitter employees to share internal documents), then Meta will destroy this lawsuit easily.

But, you know, if we’re going to see a cage match between these two billionaires, why not just throw this on the undercard as well.

Filed Under: alex spiro, competition, elon musk, employees, intellectual property, mark zuckerberg, scraping, threads, trade secrets
Companies: meta, threads, twitter

Surveillance Tech Firm Sued By Meta For Using Thousands Of Bogus Accounts To Scrape Data

from the breaking-the-rules-to-sell-stuff-to-cops dept

About a half-decade ago, major social media companies finally did something to prevent their platforms from being used to engage in mass surveillance. Prompted by revelations in public records, Twitter and Facebook began cutting off API access to certain data scrapers that sold their services to government agencies. Twitter blocked both Dataminr and Geofeedia from accessing its “firehose” API. Facebook did the same thing to Geofeedia, denying it access to both its core service and Instagram.

That may have had some impact on these companies’ ability to secure new government contracts, but there are plenty of others willing to fill the tiny void left by this disruption. And they’re willing to break the rules that govern users of social media platforms, just like the law enforcement agencies they sell to.

Meet Voyager Labs, first exposed late last year by The Guardian, which based its report on public records obtained by the Brennan Center. Here’s what Voyager offers to its law enforcement customers, which include the Los Angeles Police Department:

Pulling information from every part of an individual’s various social media profiles, Voyager helps police investigate and surveil people by reconstructing their entire digital lives – public and private. By relying on artificial intelligence, the company claims, its software can decipher the meaning and significance of online human behavior, and can determine whether subjects have already committed a crime, may commit a crime or adhere to certain ideologies.

But new documents, obtained through public information requests by the Brennan Center, a non-profit organization, and shared with the Guardian, show that the assumptions the software relies on to draw those conclusions may run afoul of first amendment protections. In one case, Voyager indicated that it considered using an Instagram name that showed Arab pride or tweeting about Islam to be signs of a potential inclination toward extremism.

The documents also reveal Voyager promotes a variety of ethically questionable strategies to access user information, including enabling police to use fake personas to gain access to groups or private social media profiles.

It’s that last part — the use of fake personas — that’s getting Voyager sued by Meta, Facebook’s parent company. Facebook has let law enforcement officers know — on multiple occasions — that setting up fake accounts violates its terms of use. It also repeatedly informed this particular enabler of ToS violations. When Voyager ignored those warnings to the tune of tens of thousands of bogus accounts, Meta sued, as Jess Weatherbed reports for The Verge.

According to a legal filing issued on November 11th, Meta alleges that Voyager Labs created over 38,000 fake Facebook user accounts and used its surveillance software to gather data from Facebook and Instagram without authorization. Voyager Labs also collected data from sites including Twitter, YouTube, and Telegram.

Meta says Voyager Labs used fake accounts to scrape information from over 600,000 Facebook users between July 2022 and September 2022. Meta says it disabled more than 60,000 Voyager Labs-related Facebook and Instagram accounts and pages “on or about” January 12th.

The updated complaint [PDF], containing more than 1,500 pages of exhibits covering everything from Voyager’s financial statements to its communications with law enforcement users, seeks an injunction blocking Voyager from further violating Facebook’s terms of service agreement.

This is kind of a pleasant surprise. Restricting the complaint to breach of contract actions under both state and federal law keeps the oft-abused CFAA out of it. Had the CFAA been brought into this as a cause of action, it would have created the possibility that researchers, academics, and others who scrape Facebook for useful data might have been harmed by an expansive reading of the CFAA’s “unauthorized access” clause. Fortunately, the CFAA is not in play here, with Meta content to seek damages for Voyager’s repeated violations of its agreements with Facebook.

If Meta succeeds, Voyager’s “real time” scraping service will cease to be useful to its customers. And if the company gets a favorable ruling that results in the collection of damages, fewer companies will be as likely to violate rules just so they can sell stuff to cops.

Filed Under: breach of contract, fake accounts, law enforcement, scraping
Companies: meta, voyager

Federal Court Says Scraping Court Records Is Most Likely Protected By The First Amendment

from the public-access-by-any-means-necessary dept

Automated web scraping can be problematic. Just look at Clearview, which has leveraged open access to public websites to create a facial recognition program it now sells to government agencies. But web scraping can also be quite useful for people who don’t have the power or funding that government agencies and their private contractors have access to.

The problem is the Computer Fraud and Abuse Act (CFAA). The act was written to give the government a way to go after malicious hackers. But instead, the government (and private companies allowed to file CFAA lawsuits) has gone after security researchers, academics, public interest groups, and anyone else who accesses systems in ways their creators haven’t anticipated.

Fortunately, things have been changing in recent years. In May of last year, the DOJ changed its prosecution policies, stating that it would not go after researchers and others who engaged in “good faith” efforts to notify others of data breaches or otherwise provide useful services to internet users. Web scraping wasn’t specifically addressed in this policy change, but the alteration suggested the DOJ was no longer willing to waste resources punishing people for being useful.

Web scraping is more than a CFAA issue. It’s also a constitutional issue. None other than Clearview claimed it had a First Amendment right to gather pictures, data, and other info from websites with its automated scraping.

Clearview may have a point. A few courts have found scraping of publicly available data to be something protected by the First Amendment, rather than a violation of the CFAA.

Unfortunately, all we really have is a pinkie swear from the DOJ and a handful of decisions that only have precedential weight in certain jurisdictions. But there’s more coming. As the ACLU reports, another federal court has come to the conclusion that government efforts banning web scraping violate the rights of would-be scrapers. But, as is the case in many legal actions, the details matter.

In an important victory, a federal judge in South Carolina ruled that a case to lift the categorical ban on automated data collection of online court records – known as “scraping” – can move forward. The case claims the ban violates the First Amendment.

The decision came in NAACP v. Kohn, a lawsuit filed by the American Civil Liberties Union, ACLU of South Carolina, and the NAACP on behalf of the South Carolina State Conference of the NAACP. The lawsuit asserts that the Court Administration’s blanket ban on scraping the Public Index – the state’s repository of court filings – violates the First Amendment by restricting access to, and use of, public information, and prohibiting recording public information in ways that enable subsequent speech and advocacy.

The case stems from the NAACP’s “Housing Navigator,” which scrapes publicly available info from government websites to find tenants subject to eviction in order to provide them assistance in fighting eviction orders or finding new housing. As the NAACP (and ACLU) point out, this valuable service would be impossible if the NAACP was limited to manual searches to find affected tenants.

The state of South Carolina — via a state appellate decision — claims the NAACP is only allowed limited access — the manual searches the NAACP says render its eviction assistance efforts impossible to achieve. The federal court says the state does have the power to limit access to public records, but those limits must align with the tenets of the First Amendment, which presume open access to government records by the governed.

The state comes down on the losing side here, at least for the moment. The limits proposed by the state court order nullify the services the NAACP hopes to offer. As it stands now, the state cannot escape this lawsuit because there’s enough on the record at the moment that suggests there’s a viable constitutional claim.

The NAACP alleges that without scraping, it is impossible to gather the information quickly enough to meet the ten-day deadline to request a hearing. It alleges that scraping poses at most a de minimis burden on the functionality of the website.

As discussed above, it also contends suggested alternatives to scraping, such as Rule 610, are insufficient, and that Defendants have, in any event, indicated an unwillingness to provide the information under that rule. […]

True, the evidence may eventually show that Defendants have a sufficient reason to prohibit scraping. It may indicate that the NAACP’s access to the records is unburdened by the restriction. Or, it may demonstrate that Defendants have provided sufficient alternatives to access the information. But, as alleged, the restrictions state a claim for violation of the First Amendment.

The bottom line is this: automated access to government records is almost certainly protected by the First Amendment. What will be argued going forward is how much the government can restrict this access without violating the Constitution. There’s not a lot on the record at the moment, but this early ruling seems to suggest this court will err on the side of unrestricted access, rather than give its blessing to unfettered fettering of the presumption of open access that guides citizens’ interactions with public records.

Filed Under: 1st amendment, court documents, public data, scraping
Companies: aclu, naacp

Meta Sues Scraping Firms; Is It Really Protecting Users? Or Protecting Meta?

from the potentially-problematic dept

For many years we’ve written stories regarding various lawsuits over scraping the web. Without the ability to scrape the web, we’d have no search engines, no Internet Archive, and lots of other stuff wouldn’t work right either. However, more importantly, the ability to scrape the web should result in a better overall internet, potentially reversing the trend of consolidation and internet giants that silo off your info. Most often, we’ve talked about this in the context of Facebook’s case over a decade ago against Power.com. That involved a company that was trying to build a single dashboard for multiple social media companies, allowing users to log into a single interface to see content from, and post content to, multiple platforms at once. In that case Facebook relied on the Computer Fraud and Abuse Act (the CFAA), and the courts sided with Facebook, saying that because Facebook had sent Power a cease-and-desist letter, that made the access (even with the approval of the users themselves!) somehow “unauthorized.”

Over the years, we’ve pointed out how this decision and interpretation of the CFAA is one of the biggest reasons the market for social media is not as competitive as it could be. That decision effectively said that Facebook could build its own silo, in which your data checks in but it never checks out. Other tech companies — including Craigslist and LinkedIn — have brought similar lawsuits, though in LinkedIn’s case against HiQ the court cut back the earlier Power.com ruling, and basically said that it only applied to information that was behind a registration wall. Publicly available information was legal to scrape.

More recently, Facebook parent company Meta has again gone after scraping operations. Earlier this year, we noted how the company had sued a somewhat sketchy provider of “insights” into “influencers and their audiences” that had been scraping information on Facebook. And, now, the company has announced two new lawsuits against scraping companies. Once again, neither of the defendants is as sympathetic as Power, and Meta even frames these lawsuits as “safeguarding” its users’ privacy.

The first lawsuit, against a company called Octopus Data, raises all sorts of questions. Octopus offers a cloud-based service called Octoparse, which allows customers to extract web data from basically any URL without having to do any coding yourself. This is actually… really really useful? Especially for researchers. The ability to scrape and extract data from webpages is not just useful, it’s how lots of services work, including search engines. But Meta is not at all happy.
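The “no coding” pitch of a tool like Octoparse amounts to separating the extraction rules from the extraction engine: the customer supplies a field-to-pattern configuration and the tool does the rest. Here is a heavily simplified sketch of that idea; the page, field names, and patterns are all made up for illustration, and the real product is far more elaborate.

```python
import re

# An invented product page; a real target would be fetched over HTTP.
PAGE = ('<div class="item"><span class="name">Widget</span>'
        '<span class="price">$9.99</span></div>')

# The "no-code" part: users write field -> pattern rules, not scraper code.
CONFIG = {
    "name": r'<span class="name">(.*?)</span>',
    "price": r'<span class="price">\$(.*?)</span>',
}

def extract(page: str, config: dict[str, str]) -> dict[str, str]:
    """Apply each configured pattern and keep the first match per field."""
    record = {}
    for field, pattern in config.items():
        match = re.search(pattern, page)
        if match:
            record[field] = match.group(1)
    return record
```

The same `CONFIG`-style rules can be pointed at any site, which is exactly what makes this kind of tool generically useful to researchers rather than purpose-built for any one platform.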

Since at least March 25, 2015, and continuing to the present, Defendant Octopus Data Inc., (“Octopus”) has operated an unlawful service called Octoparse, which was designed to improperly collect or “scrape” user account profiles and other information from various websites, including Amazon, eBay, Twitter, Yelp, Google, Target, Walmart, Indeed, LinkedIn, Facebook and Instagram.

Defendant’s service used and offered multiple products to scrape data. First, Defendant offered to scrape data directly from various websites on behalf of its customers (the “Scraping Service”). Second, Defendant developed and distributed software designed to scrape data from any website, including Facebook and Instagram, using a customer’s self-compromised account (the “Scraping Software”). Defendant’s Scraping Software was capable of scraping any data accessible to a logged in Facebook and Instagram user. And Defendant designed the “premium” Scraping Software to launch scraping campaigns from Defendant’s computer network and infrastructure. Finally, Defendant claimed to use and distribute technologies to avoid being detected and blocked by Meta and other websites they scraped.

Defendant’s conduct was not authorized by Meta and it violates Meta’s and Instagram’s terms and policies, and federal and California law. Accordingly, Meta seeks damages and injunctive relief to stop Defendant’s use of its platform and products in violation of its terms and policies.

Perhaps notably, Facebook does not try to use either the CFAA or California’s state equivalent in this case. Instead, it tosses in… a copyright claim. That’s because one of the premium services of Octoparse is that it will scrape the data and store it on its own server — and Meta argues that Octoparse violates Section 1201 of the DMCA (the anti-circumvention part) because the scraping tool has to “circumvent” Meta’s technical tools put in place to block Octoparse.

Certain user generated content is also copyright protected and users grant Meta a non-exclusive, transferable, sub-licensable, royalty-free, and worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of that content consistent with the user’s privacy and application settings.

Meta uses technological measures designed to detect and disrupt automation and scraping and that also effectively control access to Meta’s and users’ copyright protected works, including requiring users to register for an account and login to the account before using those products, monitoring for the automated creation of accounts, monitoring account use patterns that are inconsistent with a human user, employing a reCAPTCHA program to distinguish between bots and human users, identifying and blocking of IP addresses of known data scrapers, disabling accounts engaged in automated activity, and setting rate and data limits.

Defendant has circumvented and is circumventing technological measures that effectively control access to copyright protected works and those of its users on Facebook and Instagram and/or portions thereof.

Defendant manufactures, provides, offers to the public, or otherwise traffics in technology, products, services, devices, components, or parts thereof, that are primarily designed or produced for the purpose of circumventing technological measures and/or protection afforded by technological measures that effectively control access to copyright protected works and/or portions thereof.

Defendant’s Octoparse Scraping Services or parts thereof, as described above, have no or limited commercially significant purpose or use other than to circumvent technological measures that effectively control access to Meta and its user’s copyrighted works and/or portions thereof in order to scrape copyright protected data from Facebook and Instagram.

So, much of that is bullshit. Octoparse seems like a pretty useful service for researchers and others looking to extract data from websites. There are tons of non-nefarious reasons for doing so, including research or building tools to enable people to access content on social media sites without having to set up an account and give all your info to Meta.

In other words, this lawsuit seems dangerous in multiple ways — an expansion of DMCA 1201, and a tool that Meta can use in a similar manner to what it did with Power and the CFAA to effectively limit competition and to build higher walls for its silos.
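Incidentally, the “rate and data limits” Meta’s complaint lists among its blocking measures are conceptually simple. Below is a minimal sliding-window limiter sketch; the thresholds, interface, and per-IP keying are assumptions for illustration, not Meta’s actual system.

```python
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window request limiter keyed by client IP (illustrative)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> recent request timestamps

    def allow(self, client_ip: str, now: float) -> bool:
        q = self.hits[client_ip]
        # Evict timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # too many hits in the window: looks automated
        q.append(now)
        return True
```

A scraper blocked this way has an obvious incentive to rotate IPs, which is the cat-and-mouse the complaint characterizes as “technologies to avoid being detected and blocked.”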

The second lawsuit, admittedly, involves a much, much sketchier defendant (which may be why Meta seems to be playing it up, and why much of the press coverage focuses on this lawsuit, rather than the Octoparse one). It’s against a guy named Ekrem Ates, who is apparently based in Turkey and runs (or possibly ran) a website with the evocative name of MyStalk.

MyStalk would scrape information from Instagram users and repost it to its own site, so that visitors could follow an Instagram user’s stories without (1) having to log in to Instagram or (2) revealing to the original uploader who was viewing the video. For semi-obvious reasons you can see why this is a bit… creepy. And stalkerish (I mean, the name doesn’t help). But, there are potentially useful reasons for such a service. I mean, in some ways it’s similar to the Nitter service that some people use to view tweets without sharing information back to Twitter.

But, again, Meta insists this is nothing but evil.

Beginning no later than July 2017 and continuing until present, Defendant Ekrem Ateş used unauthorized automation software to improperly access and collect—or “scrape”—the profiles of Instagram users, including their posts, photos, Stories, and profile information. Defendant’s automation software used thousands of automated Instagram accounts that falsely identified themselves as legitimate Instagram users connected to either the official Instagram mobile application or website. Through this fraudulent connection, Defendant scraped data from the profiles of over 350,000 Instagram users. These profiles had not been set to private by the users and, beyond a limited number of profiles and posts, were publicly viewable only to logged-in Instagram users. Defendant published the scraped data on his own websites, which allowed visitors to view and search for Instagram profiles, displayed user data scraped from Instagram, and promoted “stalking people” without their noticing. Defendant also generated revenue by displaying ads on these websites.

Meta notes that it sent Ates a cease and desist letter (a la Power). Ates, apparently without a lawyer (and not very wisely) replied directly to the C&D, admitting to a bunch of stuff he probably should not have admitted to. He claimed that he shut down the services he ran and deleted the data, but also that he had sold the “mystalk” domain to someone else and no longer had control over it. Meta’s lawyers asked him to say who he sold it to, and Ates tried to use that as a negotiation tactic, saying he would reveal the information if Meta promised not to take legal action against him. Meta’s lawyers were, as lawyers are, somewhat vague, suggesting that something might be worked out, but without promising anything, and after that Ates went silent — leading to this lawsuit.

Ates does admit that he made about $1,000 from the site, but says he spent more than that maintaining it and got rid of it because it wasn’t worth the trouble.

This lawsuit is… strange on multiple levels. Ates is clearly a small time player, and he’s based in Turkey, so it seems unlikely he’s going to show up in a US federal court. A default judgment seems like the most likely outcome.

Like the Octoparse case, this one involves breach of contract and unjust enrichment claims, but then adds in California Penal Code § 502. This is the California equivalent of the CFAA.

So, yes, obviously someone setting up a website to allow people to “stalk” others is unsympathetic. But the underlying issue still remains: scraping data and extracting data is also a really useful tool. It’s useful for research. It’s useful for building additional services. It’s useful for creating competition and for limiting the ability of certain internet giants to control absolutely everything.

Yes, it can be abused. But it really feels here (yet again) that this is Meta/Facebook leaning hard on the fact that people keep complaining it doesn’t do enough to protect its users’ privacy as an excuse to get legal rulings that will increasingly shield the company from both scrutiny and competition.

Filed Under: ca penal code 502, cfaa, clone sites, competition, copyright, data, dmca 1201, ekrem ates, octoparse, research, scraping, stalking
Companies: facebook, meta, mystalk, octopus data

Appeals Court Says That Scraping Public Data Off A Website Does Not Violate Hacking Law

from the phew dept

For years now we’ve been following cases related to scraping data off of websites and the Computer Fraud and Abuse Act (CFAA). The CFAA is an extremely poorly drafted law that has been stretched by both law enforcement and civil plaintiffs alike to argue that all sorts of things are “unauthorized access” and therefore hacking. We’ve covered many of these cases over the years. The courts have at least started to push back on some of the more extreme interpretations of the law, though it’s still problematic.

Over a decade ago, we followed a case that I still think is one of the most problematic rulings for the internet: when Facebook sued a small startup called Power.com. Power made a social media aggregator, allowing you to access all your different social media accounts through one interface and even to post messages across multiple platforms through that single interface. In order to do that, you had to provide your login to Power, which would access your social media accounts, suck out the data (or push in the data for posting). Again, this was the user willingly granting their login information. Leaving aside whether or not it’s wise to share your login info with a third party, it was still the user’s choice.

However, Facebook decided that this was hacking and in violation of the CFAA… and the courts (tragically) agreed, allowing Facebook to effectively shut down a useful service that would have prevented Facebook from locking up so much data (and becoming such a dominant player). The key reason the court sided with Facebook was its claim that once Facebook sent a cease-and-desist letter, any further scraping was effectively “unauthorized.” I still think that we’d see an extremely different competitive landscape today if the Power case had turned out differently. It would have significantly limited the ability of the big social media players to lock in their users. Instead, the rule more or less turned Facebook into a roach motel where your data checks in, but it can never check out.

Other internet companies unfortunately followed suit, using similar lawsuits against websites providing useful complementary services. Craigslist went after 3taps, which made Craigslist data available to third party apps. LinkedIn went after a company called HiQ that was scraping and making use of LinkedIn data. Here, unlike the Power case, the courts actually ruled against LinkedIn saying that LinkedIn could not use the CFAA to block scraping of public data. The key difference between this case and the Power one was that HiQ was scraping public info (i.e., it didn’t need to log in to LinkedIn with someone’s info to access the data). LinkedIn appealed… and lost again. LinkedIn then asked the Supreme Court to weigh in, resulting in the Supreme Court vacating the 9th Circuit’s ruling and sending it back to the court to reconsider in light of last summer’s big Van Buren ruling that limited parts of the CFAA.

So now, with yet another chance… the 9th Circuit has correctly concluded the same thing. HiQ’s scraping of public information still does not violate the CFAA. There are a few different legal issues involved here, but the CFAA claims are the main event. LinkedIn argued that it sent a cease-and-desist to HiQ, so as per the Power ruling, its continued scraping violated the law.

The panel reviewing this case goes deep into the CFAA, why it exists, and what it’s supposed to do before concluding that LinkedIn’s interpretation can’t be the correct one, noting that “the CFAA is best understood as an anti-intrusion statute and not as a ‘misappropriation statute,'” and as such accessing public information shouldn’t be a violation.

Put differently, the CFAA contemplates the existence of three kinds of computer systems: (1) computers for which access is open to the general public and permission is not required (2) computers for which authorization is required and has been given, and (3) computers for which authorization is required but has not been given (or, in the case of the prohibition on exceeding authorized access, has not been given for the part of the system accessed). Public LinkedIn profiles, available to anyone with an Internet connection, fall into the first category. With regard to websites made freely accessible on the Internet, the “breaking and entering” analogue invoked so frequently during congressional consideration has no application, and the concept of “without authorization” is inapt.

As for reconsidering in light of the Van Buren ruling, that doesn’t change things.

Van Buren’s “gates-up-or-down inquiry” is consistent with our interpretation of the CFAA as contemplating three categories of computer systems

[….]

Van Buren’s distinction between computer users who “can or cannot access a computer system,” suggests a baseline in which there are “limitations on access” that prevent some users from accessing the system (i.e., a “gate” exists, and can be either up or down). The Court’s “gates-up-or-down inquiry” thus applies to the latter two categories of computers we have identified: if authorization is required and has been given, the gates are up; if authorization is required and has not been given, the gates are down. As we have noted, however, a defining feature of public websites is that their publicly available sections lack limitations on access; instead, those sections are open to anyone with a web browser. In other words, applying the “gates” analogy to a computer hosting publicly available webpages, that computer has erected no gates to lift or lower in the first place. Van Buren therefore reinforces our conclusion that the concept of “without authorization” does not apply to public websites.

The court again distinguishes Power from the HiQ case by saying that Facebook limited access to the data to only those who were logged in, as opposed to the more public access available on LinkedIn.

In that case, Facebook sued Power Ventures, a social networking website that aggregated social networking information from multiple platforms, for accessing Facebook users’ data and using that data to send mass messages as part of a promotional campaign. Id. at 1062–63. After Facebook sent a cease-and-desist letter, Power Ventures continued to circumvent IP barriers and gain access to password protected Facebook member profiles. Id. at 1063. We held that after receiving an individualized cease-and-desist letter, Power Ventures had accessed Facebook computers “without authorization” and was therefore liable under the CFAA. Id. at 1067–68. But we specifically recognized that “Facebook has tried to limit and control access to its website” as to the purposes for which Power Ventures sought to use it. Id. at 1063. Indeed, Facebook requires its users to register with a unique username and password, and Power Ventures required that Facebook users provide their Facebook username and password to access their Facebook data on Power Ventures’ platform. Facebook, Inc. v. Power Ventures, Inc., 844 F. Supp. 2d 1025, 1028 (N.D. Cal. 2012). While Power Ventures was gathering user data that was protected by Facebook’s username and password authentication system, the data hiQ was scraping was available to anyone with a web browser

And thus this doesn’t fix the unfortunate precedent in the Power case, but at least it limits it from getting worse, while making it clear that scraping public web pages is not hacking, even if you’re sent a cease-and-desist letter.

Filed Under: 9th circuit, cfaa, scraping, web scraping
Companies: hiq, linkedin

Court Says That Travel Company Can't Tell Others How Much Southwest Flights Cost

from the c'mon dept

A few months back, we wrote about Southwest Airlines’ ridiculously antagonistic legal strategy against aggregators that would scrape information on flights and prices from Southwest.com and help people find flights and prices. The case we covered was the one against Skiplagged, but it was related to a separate case against Kiwi.com. Skiplagged had argued that it didn’t violate Southwest’s terms of service since it wasn’t scraping info from Southwest… but rather had scraped it from a different site, Kiwi.com, which in turn had scraped it from Southwest.com.

Just the fact that we’re arguing over whether or not it’s legal to scrape data from publicly available websites should alert you to the fact that these lawsuits are nonsense. Factual data — such as flight routes and prices — are not protected by any intellectual property, and if you put them out there, people can (and should!) copy them and spread them elsewhere. But, unfortunately, the court ruled against Kiwi.com last fall, granting Southwest an injunction saying that Kiwi can’t scrape its site for data any more. It appears that Kiwi, realizing it was in trouble, caved in and settled the lawsuit, agreeing to no longer collect data on Southwest flights.

Given that, the court has now made the preliminary injunction a permanent injunction barring Kiwi and any of its employees from ever scraping data off of Southwest’s site. The court takes for granted that Southwest can just say in their terms of service that you can’t copy data from their website and that’s a valid contract. That seems dangerously empowering for terms of service. Can I add to Techdirt’s terms of service that by reading this site you agree to place any copyright-covered works you create into the public domain?

Southwest’s Terms & Conditions are a valid and enforceable contract, and Kiwi.com accepted those Terms & Conditions when it used the Southwest Website with knowledge of the Terms & Conditions;

Kiwi.com breached the Terms & Conditions when it, among other things, harvested and scraped data from the Southwest Website, published Southwest’s flight and fare schedules on Kiwi.com, used the Southwest Website for Kiwi.com’s own commercial purposes, and brokered and sold Southwest flights without permission;

Kiwi.com’s violations of the Terms & Conditions have caused Southwest to suffer irreparable harm, including lost traffic on its website, customer service burdens, operational disruptions, and reputational damage; and

After considering the balance of harms, the threatened injury to Southwest if the injunction was denied outweighed the harm to Kiwi.com because, among other things, Kiwi.com’s unauthorized sales of Southwest flights poses a significant disruption to its customer operations, and the public interest would be served if an injunction is granted because there is an expectation that parties to contracts will honor their contractual obligations.

Those last two paragraphs also seem like complete nonsense. If people find it easier to use a third party service than your own site, well, then that should mean you should work to improve your own site, not get to sue them in court. Lots of things lead to “lost traffic” on a website, including better service from a competitor. But we don’t say that violates the law.

Anyway, because of this no one associated with Kiwi.com can ever “extract” any information from Southwest’s website or even post data about Southwest flights on its website and I honestly don’t see how that’s possibly legal. Data is data. You shouldn’t be able to bar a company from posting data.

IT IS HEREBY ORDERED that Kiwi.com, Inc. and Kiwi.com s.r.o., as well as their officers, agents, servants, employees, and attorneys and all other persons acting who are in active concert or participation with them, are permanently prohibited, restrained, and enjoined permanently from: (1) harvesting, extracting, or scraping information from the Southwest website, www.southwest.com, or its proprietary servers, including Southwest’s flight and fare information; (2) publishing Southwest flight or fare information on the kiwi.com website, through its mobile applications or elsewhere; (3) otherwise accessing and using Southwest’s website and data for any commercial purpose; (4) selling Southwest flights; and (5) committing any other acts in violation of Southwest’s Terms & Conditions

What an unfortunate state of events — but also a very clear reminder that Southwest is anti-consumer in its practices.

Filed Under: data, flights, prices, scraping, sharing
Companies: kiwi, southwest

Southwest's Bizarrely Antagonistic Lawsuit To Stop Consumers From Finding Better Deals

from the throwing-away-goodwill dept

This lawsuit is a couple months old, but I’m clearing out some older stories, and thought it was worth writing up still. Southwest Airlines is regularly ranked as a favorite of consumers. While it’s generally relatively low cost as airlines go, it has kept up a reputation of stellar customer service — contrary to the reputations of some other low cost airlines. However, earlier this year, Southwest not only decided to be particularly anti-consumer, but to go legal about it. The company decided to sue the site Skiplagged.

If you’ve never seen it, Skiplagged is a neat service — effectively finding secret cheaper fares by exposing some of the hidden (stupid) secrets of airline fares. I discovered it years ago, after writing about some sketchy airline pricing tricks involving multi-city travel. The secret that Skiplagged realized is that you can often find cheaper flights by booking a multi-leg trip, and not taking all the flights. As Skiplagged sums it up: “As an example, a traveler who wants to go to San Francisco from New York would book a flight that is ticketed for NYC -> San Fran -> Seattle and end their travel once they arrive in San Fran and skip the leg to Seattle.”
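The trick Skiplagged automates reduces to a small search problem: among every ticket whose routing passes through your actual destination, pick the cheapest, even when it’s ticketed to somewhere further on. A toy sketch with entirely made-up fares and routes:

```python
# Invented fares, keyed by (origin, ticketed destination).
FARES = {
    ("NYC", "SFO"): 420.0,
    ("NYC", "SEA"): 280.0,  # the NYC -> SFO -> SEA routing, priced lower
}

# The ordered stops each ticket actually covers.
LEGS = {
    ("NYC", "SFO"): ["NYC", "SFO"],
    ("NYC", "SEA"): ["NYC", "SFO", "SEA"],
}

def cheapest_way_to(origin: str, true_destination: str):
    """Cheapest ticket from origin that stops at the real destination,
    including 'hidden city' tickets that are ticketed past it."""
    options = [(ticket, price) for ticket, price in FARES.items()
               if ticket[0] == origin and true_destination in LEGS[ticket]]
    return min(options, key=lambda opt: opt[1])
```

With these made-up numbers, the NYC-to-San Francisco traveler books the cheaper NYC-to-Seattle ticket and simply doesn’t board the last leg — the behavior airline fare rules prohibit and Skiplagged surfaces.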

This can create some pretty massive savings, and like those sketchy scam ads say, it’s “this one weird trick… that the airlines hate” — except that it actually works. And now Southwest has decided to go to court over it.

Now, it’s important to note that unlike many other airlines, Southwest requires people to buy tickets via its own site, and refuses to have its fares offered on aggregation sites. It also has a long and somewhat unfortunate history of suing websites that try to improve on Southwest fares in some manner. A decade ago we wrote about Southwest going after sites that help flyers track their frequent flyer mileage, and a few years back, we wrote about a ridiculous lawsuit against a website that alerted Southwest flyers if they could change their ticket to a cheaper option after they booked a flight (since Southwest has a no-charge-for-changes policy). Unfortunately, a court refused to dismiss that lawsuit under Texas’s anti-SLAPP law, leading the site to effectively agree to shut down permanently.

Here, Southwest is claiming a sort of double-whammy — saying that Skiplagged is getting data on Southwest flights via another company (who Southwest is already suing), Kiwi.com, and using those fares to find the skipped leg cheaper options (also referred to as “hidden city” tickets).

Southwest claims this violates basically all the laws: trademark violations, page scraping violations, unauthorized sales, unfair and deceptive practices and a few others as well.

On June 8, 2021, Southwest wrote a letter to Skiplagged from Texas, explaining that Skiplagged was violating the Southwest Terms & Conditions by scraping and/or using data scraped from Southwest.com, promoting “hidden city” tickets, and using Southwest’s trademarked heart logo to advertise the sale of tickets on Southwest Airlines without its authorization.

Southwest explained that Southwest had “the exclusive distribution rights to sell Southwest flights to the general public through the Southwest Website” and never authorized Skiplagged to display or sell its fares, display its trademark logos, publish its flight or fare data, or to use the Southwest Website for or in connection with offering any third-party product or service—or use Southwest’s trademarks in doing so.

Southwest further explained that Skiplagged was inducing Southwest customers to violate the Southwest Terms & Conditions and/or Contract of Carriage. Southwest included a complete copy of the Southwest Terms & Conditions, and the details of registered trademarks.

While they may have (unfortunately) a legal leg to stand on, all of this should be seen as crazy. It’s not trademark infringement, as it’s providing factual information about the flights themselves. They’re not selling counterfeit flights. The flights are real and they’re really provided by Southwest. Scraping of such public, factual information should never be illegal. Southwest is putting that information out there, and it doesn’t get to control how it’s used. And the fact that Southwest doesn’t want people to get off a flight too early is Southwest’s problem. They set the prices that way, and the fact that some people have figured out how to game that system shouldn’t be someone else’s problem. It’s only Southwest’s.

Basically, Southwest enabled all of this to happen, but is now suing because people figured out how to actually use its systems, its prices, and its information in ways that Southwest doesn’t like. That should never be a violation of the law.

The whole thing seems to be an abuse of the legal process to try to stop people from taking advantage of Southwest’s data and flights in a way that Southwest does in fact offer, but in a manner in which Southwest would prefer they not be able to. That should never be illegal. If Southwest doesn’t want people doing those hidden city flights, then it should fix its pricing. Or suck it up. Not sue. And, as some are noting, this very lawsuit seems to highlight how Southwest’s “customer friendly” persona is bullshit. Look how far the company will go to block its “valued customers” from actually finding the cheapest possible flights that Southwest does in fact offer.

Filed Under: air travel, hidden city, plane fares, plane tickets, promotion, reselling, scraping
Companies: skiplagged, southwest