AI Chatbots Are Blocked by 67% of Top News Sites, Relying Instead on Low-Quality Sources - NewsGuard

Most high-quality news sites ask AI chatbots to stay away from their content. As a result, the chatbots may be forced to rely on lower-quality, misinformation-prone sources.

In tech lingo, “garbage in, garbage out” means that if bad data goes into a system, expect bad results.

The same holds true for the accuracy of AI chatbots. A NewsGuard analysis found that 67 percent of the news websites rated as top quality by NewsGuard block AI models from accessing their journalism. This means the AI models must rely disproportionately on the low-quality news sites that allow chatbots to use their content. This helps explain why chatbots so often spread false claims and misinformation.

A NewsGuard analysis of the top 500 most-engaged news websites found that sites with lower NewsGuard Trust Scores — those more likely to have advanced false or misleading information, as assessed by NewsGuard — are more likely to be included in the training data accessed by the AI models. This is because they are less likely to ask the web crawlers that feed data to popular AI chatbots to avoid their sites. In contrast, many high-quality news websites have put up the equivalent of “No Trespassing” signs, at least until the AI companies pay for licenses to access their journalism.

This means that the world’s most popular chatbots may be pulling from untrustworthy sources more often than would typically occur on the open web, such as through traditional search. However, because the chatbot companies have not disclosed exactly how they source or use their data, we cannot know for certain which specific sources are influencing their responses. Disinformation websites from Russia, China, and Iran, conspiracy websites, and health care hoax sites peddling quack cures are only too happy to have their content train the AI models. In contrast, high-quality news sites whose journalism is worth paying for want to get paid if the AI models access their journalism, not to give away their content.

Examples of low-quality sites that do not ask chatbots’ crawlers to avoid their content include The Epoch Times (NewsGuard Trust Score: 17.5/100); ZeroHedge (Trust Score: 15/100), a finance blog that advances debunked conspiracy theories; and Bipartisan Report (Trust Score: 57.5/100), a news and commentary site that regularly mixes news and opinion without disclosing its liberal agenda. Examples of high-quality sites that do ask chatbots’ crawlers to avoid their content include NBCNews.com (Trust Score: 100/100); Today.com (Trust Score: 95/100); and TheGuardian.com (Trust Score: 100/100).

A Growing Trend: Requesting to Block Web Crawlers

Some news publishers go beyond blocking AI models and are litigating. In December 2023, for example, The New York Times (Trust Score: 87.5/100) sued OpenAI and Microsoft for copyright infringement, arguing that the companies were training chatbots on its articles without a commercial agreement; in the meantime, the Times is blocking access to its journalism.

Chatbots use data gathered from across the internet to answer questions and engage in conversations.

Web crawlers, which are bots that systematically browse and index webpages, are key to this process. They scan websites and collect information, helping to build the databases that power AI chatbots.

However, news sites are increasingly asking these crawlers to skip over them, either to protect their content and control its use or to license it directly to AI companies for revenue.

High-Quality News Sites Request to Block AI Web Crawlers, While Low-Quality Sites Allow Full Access

NewsGuard’s analysis highlights a troubling trend: Many high-quality news sites are taking steps to protect their content from AI web crawlers, while low-quality sites remain readily accessible to these bots. For this analysis, we reviewed a list of the top 500 most-engaged news websites over a recent 90-day period. The websites were grouped into three categories based on their NewsGuard Trust Scores: low-quality (0-60), medium-quality (60-80), and high-quality (80-100).

We then checked each site’s “robots.txt” file, which indicates which webpages the website does or does not want web crawlers to access. We specifically looked at how these files address seven common crawlers that gather data for AI chatbots (a sample robots.txt file follows the list):

  1. CCBot – used to build the Common Crawl dataset, which feeds many open-source models, including those of Meta, which owns Facebook
  2. GPTBot – used by OpenAI, the creator of ChatGPT
  3. ClaudeBot – used by Anthropic, an AI research company
  4. Anthropic-ai – also used by Anthropic
  5. Google-Extended – used by Google for bots like Gemini
  6. ByteSpider – used by ByteDance, the Chinese company behind TikTok, for AI products inside China
  7. PerplexityBot – used by Perplexity, an AI search tool
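For illustration, here is what such a file can look like. This is a minimal sketch, not taken from any specific site, using the user-agent tokens for the seven crawlers listed above; a site publishes it at the path /robots.txt:

```
# Illustrative robots.txt: asks each of the seven AI crawlers to skip the whole site
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ByteSpider
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

“Disallow: /” asks the named crawler to avoid every page on the site; a narrower path, such as “Disallow: /news/”, would restrict only that section.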

Requests in robots.txt files are like polite suggestions asking web crawlers not to visit certain parts of a website. The requests are optional, meaning crawlers do not have to follow them. Some AI crawlers, including PerplexityBot and ClaudeBot, have been known to ignore these requests. However, many web crawlers do pay attention to robots.txt files when deciding what content to collect.
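As a sketch of how such a check can be automated, the following Python script parses a site’s robots.txt and reports which of the seven crawlers it asks to stay away. It uses only the standard library; the site URL is a placeholder, Python 3.9+ is assumed, and, as described above, a request in the file is no guarantee that a crawler complies:

```python
# Sketch: list the AI crawlers a site's robots.txt asks not to fetch its homepage.
# Uses only the Python standard library; the URL below is a placeholder.
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = [
    "CCBot", "GPTBot", "ClaudeBot", "Anthropic-ai",
    "Google-Extended", "ByteSpider", "PerplexityBot",
]

def blocked_crawlers(site: str) -> list[str]:
    """Return the AI crawlers this site's robots.txt asks not to fetch its homepage.

    Note: this only reflects what the file requests; crawlers may ignore it.
    """
    parser = RobotFileParser()
    parser.set_url(f"{site}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, f"{site}/")]

if __name__ == "__main__":
    print(blocked_crawlers("https://example.com"))
```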

NewsGuard found that most “low-quality” and “medium-quality” sites allowed access to all seven web crawlers, while most “high-quality” news sites asked at least one crawler not to access their content.

When looking at each of the seven web crawlers analyzed by NewsGuard, high-quality sites were more proactive in restricting access.

The higher-quality sites asked an average of three of the seven crawlers not to access their content. Medium-quality sites made such a request of between one and two bots, on average, while low-quality sites made fewer than one such request on average. For example, Yahoo.com and WashingtonPost.com, which both earn a perfect 100/100 Trust Score from NewsGuard, blocked all seven crawlers.

If MSNBC.com (Trust Score: 49.5/100) — a “low-quality” site that blocked all seven crawlers — were excluded, the average number of requests from low-quality sites would drop to nearly zero (0.04).
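NewsGuard has not published the code behind these figures, but purely to illustrate the arithmetic, the per-tier averages could be computed along these lines. The Trust Scores and blocked-crawler counts below are invented placeholders, not NewsGuard’s data:

```python
# Illustrative only: average number of AI crawlers blocked per quality tier.
# The scores and counts below are invented, not NewsGuard's actual dataset.
from statistics import mean

# (trust_score, number_of_AI_crawlers_blocked) for each hypothetical site
sites = [(100, 7), (95, 3), (87.5, 2), (70, 2), (65, 1), (49.5, 7), (17.5, 0), (15, 0)]

def tier(score: float) -> str:
    """Bucket a Trust Score into the three tiers used in the analysis.
    Boundary scores (60, 80) are assigned to the higher tier, an assumption."""
    if score >= 80:
        return "high"
    if score >= 60:
        return "medium"
    return "low"

for name in ("high", "medium", "low"):
    counts = [blocked for score, blocked in sites if tier(score) == name]
    print(f"{name}-quality sites block {mean(counts):.2f} crawlers on average")
```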

Not all data is created equal, and as we’ve previously reported, chatbots often “hallucinate,” or generate inaccurate or false information — sometimes due to their reliance on lower-quality sources.

While it is not possible to quantify precisely how often AI chatbots rely on low-quality sources, NewsGuard’s findings raise concerns about the potential for misinformation to spread, underscoring the need for scrutiny of the data that is used to train these tools.

Disclosure: NewsGuard licenses its data to AI companies to help improve their responses.