Sentiment Analysis

In our research, we focus on sentiment analysis in the Czech web environment, with special attention to social media. In our pilot paper, we created a large annotated corpus from the top 10 Czech Facebook brands and achieved a recognition accuracy of about 70% (see the paper Sentiment Analysis in Czech Social Media Using Supervised Machine Learning). The corpus is freely available for further research. Since NLP in Czech generally suffers from a large vocabulary and very rich inflection, we further improved our methods by incorporating semi-supervised features based on statistical distributional semantics (see the paper Semantic Spaces for Sentiment Analysis).

Our experiments in both the Czech and English movie review domains achieved state-of-the-art performance on a widely used dataset for the sentiment analysis task (about 92% accuracy). For details, please refer to our paper Unsupervised Improving of Sentiment Analysis Using Global Target Context.

Other datasets regarding sentiment analysis and stance detection are available here:


Aspect-Level Sentiment Analysis in Czech

Restaurant Reviews CZ ABSA — 2.15k reviews with their related target and category: (~0.3 MB)

Restaurant Reviews CZ ABSA — 1.2k reviews with their related target and category: (~0.15 MB)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Unsupervised Improving of Sentiment Analysis Using Global Target Context

CSFD CZ — 90k reviews with their related target (movie): csfd-90k-reviews-ranlp2013.tar.bz2 (11 MB)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Sentiment Analysis in Czech Social Media Using Supervised Machine Learning

CSFD CZ — Corpus contains 91,381 movie reviews (30,897 positive, 30,768 neutral, and 29,716 negative) from the Czech Movie Database.
Corpus: (~13 MB)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Facebook CZ — Corpus consists of 10,000 Facebook posts (2,587 positive, 5,174 neutral, 1,991 negative, and 248 bipolar).
Corpus: (~1.5 MB)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files, one with posts (gold-posts.txt) and one with labels (gols-labels.txt), aligned line by line.
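Because the gold posts and labels are stored on corresponding lines of two files, they can be paired by line number. The following is only a minimal sketch: the function names are hypothetical, the file paths are passed in by the caller, and UTF-8 encoding is an assumption that should be checked against the downloaded archive. The label values are not documented on this page, so the sketch treats them as opaque strings.

```python
from collections import Counter


def read_lines(path, encoding="utf-8"):
    """Read a text file and return its lines without trailing newlines."""
    # Iterating over the file splits only on real line endings, so unusual
    # characters inside a post cannot shift the alignment.
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\r\n") for line in handle]


def load_parallel_corpus(posts_path, labels_path, encoding="utf-8"):
    """Pair each post with the label found on the same line number.

    The encoding is an assumption, not something stated by the corpus
    description.
    """
    posts = read_lines(posts_path, encoding)
    labels = [label.strip() for label in read_lines(labels_path, encoding)]
    if len(posts) != len(labels):
        raise ValueError(
            f"line count mismatch: {len(posts)} posts vs {len(labels)} labels"
        )
    return list(zip(posts, labels))


def label_distribution(pairs):
    """Count how many posts carry each label."""
    return Counter(label for _, label in pairs)


# Example usage (paths are placeholders for wherever the archive was unpacked):
# pairs = load_parallel_corpus("gold-posts.txt", "path/to/labels-file.txt")
# print(label_distribution(pairs))
```

Checking the line counts before zipping guards against a silently misaligned corpus, which would otherwise attach wrong labels to every post after the first discrepancy.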

Mall CZ — Corpus consists of 145,307 user product reviews (102,977 positive, 31,943 neutral, and 10,387 negative) crawled from a large Czech e-shop.
Corpus: (~7.4 MB)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Stance Detection

Stance detection in online discussions

Czech Stance Detection v1.1 — Corpus consists of 1,460 comments from a Czech news server related to two topics:
the Czech president "Miloš Zeman" (181 In favor, 165 Against, and 301 None) and "Smoking ban in restaurants" (168 In favor, 252 Against, and 393 None).
Corpus: CzechStanceDetection-v1.1.rar (~0.1 MB)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Detecting Stance in Czech News Commentaries

Czech Stance Detection v2.0 — Corpus consists of comments from a Czech news server related to two topics:
the Czech president "Miloš Zeman", with 2,638 comments (691 In favor, 1,263 Against, and 684 Neither), and "Smoking ban in restaurants", with two subsets: ALL, 2,785 comments (744 In favor, 1,280 Against, and 761 Neither), and GOLD, 1,388 comments (272 In favor, 485 Against, and 631 Neither).
Corpus: (~0.5 MB)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Stance and Sentiment in Czech

Czech Stance Detection v2.0 with Sentiment and Targets — Corpus consists of comments from a Czech news server related to one target entity:
the Czech president "Miloš Zeman", with 2,638 comments (691 In favor, 1,263 Against, and 684 Neither). We annotated this corpus with sentiment labels (227 positive, 813 negative, and 1,598 neutral) and detected whether the target entity is present in the comment (1,487 Zeman and 732 Prezident). For more details, see the paper "Stance and Sentiment in Czech".
Corpus: (~0.4 MB)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


The corpora are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Please cite the appropriate article if you use any of the available resources.