COMICORDA: A Novel Dataset for Dialogue Act Recognition in Comics

General Information

The source of comic images for our corpus is the COMICS public dataset containing more than 1.2M extracted panels.

From this database, we downloaded 800 annotated panels (speech and narration bounding boxes together with automatic text transcriptions by Google Vision), following the authors of the EMORECOM ICDAR competition. The panels were manually verified to correct errors in the text recognized by the OCR engine. This step was necessary for evaluating the performance of the Google Vision OCR.

Furthermore, trained annotators with excellent knowledge of English assigned dialogue act labels for dialogue act recognition. We also established bounding boxes for faces and connected them to speech balloons. This annotation might be helpful for other tasks, for instance, emotion recognition.

Moreover, to cover a broader spectrum of comics, we selected another set of 638 comic panels from a different source and annotated it in the same way.

License

This dataset is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. Commercial use in any form is strictly excluded. For more information, please see Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Download

For further information about this corpus, please see the paper below:

Jiri Martinek, Josef Baloun, Martin Prantl, Ladislav Lenc, Pavel Kral, COMICORDA: Dialogue Act Recognition in Comic Books, 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20-25 May 2024, pp. 3566-3578, ELRA and ICCL, ISBN: 978-2-493814-10-4, Paper.

Please cite this paper if you use this corpus in your experiments.

Download

If you have additional questions or comments related to this corpus, please do not hesitate to ask the contact author: Jiri Martinek (jimar@kiv.zcu.cz).