Machine Learning Approach to Fact-checking in West Slavic Languages

Fake news is designed to incite agitation against an individual or a group of people. Its aim is to influence and manipulate public opinion on targeted topics. Fake news detection, including fact-checking, which can be used as the first step of a detection system, is currently receiving a lot of attention in the research community and in journalism.

The automation of these tasks or their parts would greatly benefit journalism and perhaps help the public to verify the credibility of various media. It is evident that fact-checking needs external knowledge or detailed context.

However, to build a robust automatic fact-checking system, we must first find a way to evaluate it. For English, there are publicly available datasets that researchers can use to evaluate their systems. No systematic research has yet been conducted for West Slavic languages, so we establish a common ground for further research by providing large fact-checking datasets for Czech, Polish, and Slovak. We also report initial experiments that reveal the complexity of the task, set a baseline that uses a standard machine learning approach, and set an upper bound that uses manually created external knowledge.

We provide three fact-checking datasets, one for each language, downloaded from the following fact-checking websites.

Each dataset contains claims of politicians annotated with one of four classes: FALSE, TRUE, UNVERIFIABLE, and MISLEADING. The labels have the following meaning:

Article Resources

Machine Learning Approach to Fact-checking in West Slavic Languages

Corpora

Demagog — 9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)
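Before training on the corpora, it can help to check how the four classes are distributed, since the share of the most frequent class is a simple reference point for any classifier. The following is a minimal sketch only: the record layout (a dict with "claim" and "label" keys) and the helper names label_distribution and majority_share are assumptions made for illustration, not the documented format of demagog.zip, and the toy records are made up.

```python
from collections import Counter

# The four classes named in the dataset description.
LABELS = ("FALSE", "TRUE", "UNVERIFIABLE", "MISLEADING")


def label_distribution(records):
    """Count claims per class, rejecting labels outside the four known classes.

    `records` is an iterable of dicts with a "label" key. This layout is an
    assumption for illustration, not the documented archive format.
    """
    # Start every class at zero so absent classes still show up in the result.
    counts = Counter({label: 0 for label in LABELS})
    for record in records:
        label = record["label"].strip().upper()
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        counts[label] += 1
    return counts


def majority_share(counts):
    """Fraction of claims in the most frequent class (0.0 for an empty set)."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


# Made-up toy records, only to show the call shape.
toy = [
    {"claim": "example claim A", "label": "TRUE"},
    {"claim": "example claim B", "label": "false"},
    {"claim": "example claim C", "label": "TRUE"},
    {"claim": "example claim D", "label": "MISLEADING"},
]
counts = label_distribution(toy)
print(dict(counts))
print(majority_share(counts))
```

Run per language, the same tally would show whether the Czech, Polish, and Slovak sets are similarly balanced across the four classes.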

Licence

The corpora are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

Please cite the appropriate article if you use any of the available resources.