How I detect fake news

How I traced the falsity of one internet meme, and what that teaches us about how an algorithm might do it.

December 7, 2016

I have a brother who is a big Donald Trump fan, and he frequently sends me articles from various right-wing media sources. Last week, he sent me a variant of the image below:

Figure 1. Fake maps claiming to correlate crime rates and Democratic votes, circulated via email.

I immediately consulted Snopes, the fact-checking site for internet hoaxes, and discovered that it was, as I expected, fake. According to Snopes, these are actually both electoral maps: “On 11 November 2016, the Facebook page “Subject Politics” published two maps purportedly comparing the results of the 2016 U.S. presidential election with the 2013 crime rate in the U.S. … The map pictured on the bottom actually shows a 2012 electoral map that was created by Mark Newman from the Department of Physics and Center for the Study of Complex Systems at the University of Michigan.” Snopes was unable to verify the source of the first map, but concluded (presumably by comparing it with known electoral maps) that it is in fact an incomplete electoral map from the 2016 election.


Snopes, which uses human editors for fact-checking, does a good job, but its editors can’t catch every fake news story. Still, when a reputable fact-checking organization like Snopes or Politifact identifies a story as false, that’s a pretty strong signal.

Continuing my research, I used Google to search for other sources that might provide more insight on the relationship between the electoral map and crime rates. I quickly found this 2013 article from Business Insider, “Nine Maps That Show How Americans Commit Crime.” It shows a very different picture:

Figure 2. Data on violent crime per one hundred thousand people, from the FBI Uniform Crime Report, 2012.

Since Business Insider told me the source of the data (the FBI Uniform Crime Report), I could go verify it for myself. Sure enough, the data on the FBI site matched the Business Insider map.
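That comparison against the cited source is also easy to automate. Here is a minimal sketch, assuming the FBI’s published figures and the values shown in an article’s map have each been exported to a small CSV; both file names and the CSV layout are hypothetical placeholders, not real datasets.

```python
import csv

def load_rates(path):
    """Read a two-column CSV (state,rate) into a dict of floats."""
    with open(path, newline="") as f:
        return {row["state"]: float(row["rate"]) for row in csv.DictReader(f)}

# Hypothetical exports: the FBI's published figures and the values shown in the article's map.
fbi = load_rates("fbi_ucr_2012_violent_crime.csv")
article = load_rates("article_map_values.csv")

# Flag any state where the article's figure drifts more than 5% from the cited source.
mismatches = {
    state: (article[state], fbi[state])
    for state in article
    if state in fbi and abs(article[state] - fbi[state]) > 0.05 * fbi[state]
}
print(f"{len(mismatches)} state(s) disagree with the cited source: {mismatches}")
```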

I tell this story of two maps to make a point: when people discuss the truth or falsity of news, and the responsibility of sites like Facebook, Google, and Twitter to help identify it, they tend to assume that determining “truth” or “falsity” is something only humans can do. But as this example shows, there are many signals of likely truth or falsity that a computer can check algorithmically, often more quickly and thoroughly than humans can: whether a reputable fact-checking site has already debunked a story, whether an image is a near-duplicate of a known image from another context, and whether the numbers a story cites actually match the source it names.
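Take the second of those signals. The bottom map turned out to be a near-duplicate of Mark Newman’s published 2012 electoral map, and that kind of match is exactly what a perceptual hash is good at finding. Here is a minimal sketch, assuming a local folder of known reference images and using the third-party Pillow and imagehash packages; the file names are hypothetical.

```python
from pathlib import Path

import imagehash               # third-party: pip install ImageHash
from PIL import Image          # third-party: pip install Pillow

def find_near_duplicate(suspect_path, reference_dir, max_distance=8):
    """Return (path, distance) of the closest known image, or None if nothing is close."""
    suspect_hash = imagehash.phash(Image.open(suspect_path))
    best = None
    for ref_path in Path(reference_dir).glob("*.png"):
        # Subtracting two perceptual hashes gives the Hamming distance between them;
        # a small distance means the images are visually near-identical, even after
        # resizing or recompression.
        distance = suspect_hash - imagehash.phash(Image.open(ref_path))
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (ref_path, distance)
    return best

# Hypothetical file names: if the "crime map" hashes within a few bits of a known
# electoral map, the label on it is almost certainly false.
match = find_near_duplicate("suspect_crime_map.png", "known_electoral_maps/")
if match:
    print(f"Near-duplicate of {match[0]} (hash distance {match[1]})")
```

A check like this scales to a large library of reference images far faster than a human can compare maps by eye.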

Note that when fake news is detected, there are a number of possible ways to respond (a rough sketch of how a single “likely false” score might drive these responses follows the list):

  1. The stories can be flagged. For example, Facebook (or Gmail, since much fake news appears to be spread by email) could show an alert, similar to a security alert, that says “This story appears likely to be false. Are you sure you want to share it?” with a link to the reasons why it is suspect, or to a story that debunks it, if that is available.
  2. The stories can be given less priority, shown lower down, or shown less often. Google does this routinely in ranking search results. And while the idea that Facebook should do this has been more controversial, Facebook already ranks stories, featuring those that drive more “engagement,” or that are related to ones we’ve already shared or liked, over those that are merely more recent. Once Facebook stopped showing stories in pure timeline order, it put itself in the position of curating the feed algorithmically. It’s about time it added source verification and other “truth” signals to the algorithm.
  3. The stories can be suppressed entirely if certainty is extremely high. We all rely on this kind of extreme prejudice every day: it is how email providers separate the email we actually want to see from the billions of spam messages sent daily.
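Here is the sketch referred to above: a minimal illustration of how one estimated probability that a story is false could drive all three responses. The Story fields, thresholds, and actions are invented for illustration and are not any platform’s actual policy.

```python
from dataclasses import dataclass

@dataclass
class Story:
    url: str
    p_false: float    # estimated probability the story is false (from some upstream model)
    base_rank: float  # score from the existing engagement-based ranker

FLAG_THRESHOLD = 0.60      # moderate suspicion: warn the user before sharing
DOWNRANK_THRESHOLD = 0.80  # high suspicion: push the story down in the feed
SUPPRESS_THRESHOLD = 0.98  # near-certainty: treat it the way spam is treated

def respond(story: Story) -> dict:
    """Map a falsity score to the flag / downrank / suppress responses described above."""
    if story.p_false >= SUPPRESS_THRESHOLD:
        return {"show": False, "reason": "suppressed as near-certain fake"}
    rank = story.base_rank
    if story.p_false >= DOWNRANK_THRESHOLD:
        rank *= (1.0 - story.p_false)  # demote in proportion to suspicion
    return {"show": True, "rank": rank, "warn_before_sharing": story.p_false >= FLAG_THRESHOLD}

print(respond(Story("http://example.com/fake-maps", p_false=0.90, base_rank=1.0)))
```

The particular numbers don’t matter; the shape of the policy does: flag at moderate suspicion, demote at high suspicion, and suppress only at the near-certainty we already accept from spam filters.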

As I wrote in my first article on the topic of fake news, Media in the age of algorithms, “The essence of algorithm design is not to eliminate all error, but to make results robust in the face of error.” Much as we stop pandemics by finding infections at their source and keeping them from finding new victims, it isn’t necessary to eliminate all fake news, but only to limit its spread.