As the most popular encyclopedia of all time — with some 6.5 million articles — Wikipedia is the default first stop in the hunt for research information, background material, or an answer to that nagging question about pop culture. Wikipedia can tell you that scientists named a new species of fungus Spongiforma squarepantsii, after the cartoon character SpongeBob SquarePants, or that Blackfeet Tribe member Joe Hipp was the first Native American to compete for the World Boxing Association’s World Heavyweight title.
But sometimes that quick search for information comes with a nagging doubt: How do we know whether what we’re reading is accurate? For instance, if you had read that entry on Joe Hipp a month ago, the Wikipedia citation for the claim would have been a webpage that didn’t even mention Hipp or boxing. Wikipedia is crowdsourced, so it usually requires that facts be corroborated; quotations, controversial statements, and contentious material about living people must include a citation. Volunteers double-check Wikipedia’s footnotes, but, as the site continues to grow, it’s challenging to keep pace with the more than 17,000 new articles added each month.
Automated tools can help identify gibberish or statements that lack citations, but helping human editors determine whether a source actually backs up a claim is a much more complex task — one that requires an AI system’s depth of understanding and analysis.
Building on Meta AI’s research and advancements, we’ve developed the first model capable of automatically scanning hundreds of thousands of citations at once to check whether they truly support the corresponding claims. It’s open-sourced here, and you can see a demo of our verifier here. As a knowledge source for our model, we created a new dataset of 134 million public webpages, an order of magnitude larger and significantly more intricate than any previously used for this sort of research. The model calls attention to questionable citations, allowing human editors to evaluate the cases most likely to be flawed without having to sift through thousands of properly cited statements. If a citation seems irrelevant, our model will suggest a more applicable source, even pointing to the specific passage that supports the claim. Eventually, our goal is to build a platform to help Wikipedia editors systematically spot citation issues and quickly fix the citation or correct the content of the corresponding article at scale.
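To make the verification step concrete, here is a minimal sketch of the underlying idea, not the released system: it frames citation checking as natural-language inference, using an off-the-shelf NLI checkpoint (facebook/bart-large-mnli) as a stand-in for the purpose-built verifier, and a naive sentence-window splitter in place of real passage retrieval. The claim, page text, and threshold are all illustrative.

```python
# Sketch: flag a citation when no passage from the cited page entails the claim.
# Uses a generic NLI model as a stand-in verifier; not the released system.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "facebook/bart-large-mnli"  # off-the-shelf NLI checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that a cited passage (premise) entails the claim (hypothesis)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[nli.config.label2id["entailment"]].item()

def split_passages(page_text: str, window: int = 3) -> list[str]:
    """Naive splitter: overlapping windows of a few sentences each."""
    sents = [s.strip() for s in page_text.split(".") if s.strip()]
    return [". ".join(sents[i:i + window]) for i in range(len(sents))] or [page_text]

def verify_citation(claim: str, page_text: str, threshold: float = 0.5):
    """Score every passage against the claim; flag the citation if none clears the bar."""
    scored = [(entailment_score(p, claim), p) for p in split_passages(page_text)]
    best_score, best_passage = max(scored)
    return best_passage, best_score, best_score >= threshold

# A claim whose cited page never mentions it should score low and be flagged
# for human review, mirroring the Joe Hipp example above.
claim = "Joe Hipp was the first Native American to compete for the WBA World Heavyweight title."
page = "This page discusses regional agriculture. It does not mention boxing at all."
passage, score, supported = verify_citation(claim, page)
print(f"supported={supported}, best score={score:.2f}")
```

In the full pipeline described above, a retriever would first pull candidate passages from the 134-million-page index so that better sources can be suggested; this sketch skips retrieval and scores only the passages of the cited page itself.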
"This is a powerful example of machine learning tools that can help scale the work of volunteers by efficiently recommending citations and accurate sources. Improving these processes will allow us to attract new editors to Wikipedia and provide better, more reliable information to billions of people around the world. I look forward to continued improvements in this area, especially as machine learning tools are able to provide more customized citations and multilingual options to serve our Wikimedia communities across more than 300 languages."
Shani Evenstein Sigalov, a researcher at Tel Aviv University and long-time Wikimedian.