Fact-checking

Automating fact-checking, or parts of it, would greatly benefit journalism and might help the public verify the credibility of various media. Fact-checking clearly requires external knowledge or detailed context.

However, to build a robust automatic fact-checking system, we must first find a way to evaluate it. For English, publicly available datasets exist that researchers can use to evaluate their systems. No systematic research has yet been conducted for West Slavic languages. We therefore establish a common ground for further research by providing large fact-checking datasets for Czech, Polish, and Slovak, together with initial experiments that reveal the complexity of the task, set a baseline using a standard machine learning approach, and set an upper bound using manually created external knowledge.

We provide three datasets for fact-checking, one for each language, downloaded from the following fact-checking websites.

Each dataset contains claims of politicians annotated with one of four classes: FALSE, TRUE, UNVERIFIABLE, and MISLEADING. The labels have the following meaning:

FALSE: The statement is not in line with publicly available numbers or information. This also covers cases where sources calculate an indicator differently, but none of them confirms the number or claim in question.
TRUE: The statement uses the right information in the right context.
UNVERIFIABLE: The source of the claim cannot be found, or the claim cannot be confirmed or refuted based on the available information.
MISLEADING: The statement uses correct facts, but in a wrong or incomplete context, or the facts are taken out of, or otherwise distorted from, their original context. This class also covers inappropriate or disproportionate comparisons.
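
As a rough illustration of how the four classes might be handled in code, the sketch below models them as an enumeration, computes a class distribution, and scores a trivial majority-class predictor. It is only a sketch under stated assumptions: the names `Label`, `parse_label`, `label_distribution`, and `majority_baseline_accuracy` are hypothetical, the toy labels are invented and not taken from the corpora, and the majority-class predictor is a simple sanity check, not the baseline reported in the article.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    """The four annotation classes described above."""
    FALSE = "FALSE"
    TRUE = "TRUE"
    UNVERIFIABLE = "UNVERIFIABLE"
    MISLEADING = "MISLEADING"


def parse_label(raw: str) -> Label:
    """Normalize a raw label string; raises ValueError for unknown labels."""
    return Label(raw.strip().upper())


def label_distribution(labels):
    """Return the relative frequency of every class in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in Label}


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for gold in test_labels if gold == majority)
    return majority, hits / len(test_labels)


# Hypothetical toy labels, invented for illustration only.
train = [parse_label(x) for x in ["true", "TRUE", "false", " misleading ", "TRUE"]]
test = [parse_label(x) for x in ["TRUE", "FALSE", "TRUE", "UNVERIFIABLE"]]

dist = label_distribution(train)
majority, accuracy = majority_baseline_accuracy(train, test)
print(dist[Label.TRUE], majority.name, accuracy)
```

Any learned classifier should be expected to beat the majority-class score on the same split. With four imbalanced classes, accuracy alone can be misleading, so a per-class metric would probably be worth reporting as well.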

Article Resources

Machine Learning Approach to Fact-checking in West Slavic Languages

Corpora

Demagog — 9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)

Licence

The corpora are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

Please cite the appropriate article if you use any of the available resources.