Heimatkunde: Dataset for Multi-modal Historical Document Analysis

General Information

The dataset consists of images from two historical books describing political districts in the Czech Republic: Heimatkunde des Ascher Bezirkes (Local History of the Aš District) by J. Tittmann and Heimatkunde des politischen Bezirkes Plan (Local History of the Plana District) by Georg Weidl. We name the resulting dataset the Heimatkunde dataset after the titles of the books. The documents contain information about the geography, agriculture, population, administration, education, and local history of the districts at the end of the 19th century. Both books are written in German and printed in Fraktur typeface.

Each scanned image contains two pages. Most pages have a conventional one-column layout in portrait format. The scans are grayscale with a high resolution of 300 DPI, and most images are around 3400 x 2500 pixels (height x width). In total, the two books comprise 468 images (930 pages). For our dataset, we use only a subset of 329 images, which we manually annotated for the document layout analysis task.

There are 7 types of objects in the dataset. Although some of the original documents contain images, we decided not to include them, because there are only 10 images across both books and such a sample size is not enough for training or validation. Consequently, all 7 classes contain some form of text, which should be advantageous for multi-modal processing, since the model can always use both sources of information. The annotation process yielded a dataset that can be used for layout analysis in historical documents. In total, there are 4,640 annotations across 329 images.

Structure

The archive structure is as follows:

images - default folder for COCO images;
test.json and train.json - COCO annotations for the test and train splits respectively. These files are usable e.g. in Detectron2 for instance segmentation or object detection;
classifier - contains the classification dataset, i.e. the annotations mapped to our custom format, which can be loaded with the scripts from our experiments;
yolo - contains the YOLO annotations for the dataset which are also applicable to object detection or instance segmentation;
ocr - contains the OCR annotations for the dataset.
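
As a quick sanity check after extracting the archive, the COCO annotation files can be inspected with the Python standard library alone. The sketch below is only an illustration and is not part of the published experiment scripts. It assumes that train.json sits in the root of the extracted archive and follows the standard COCO layout with top-level "images", "annotations" and "categories" keys. The helper names load_coco and summarize_coco are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_coco(path):
    """Read a COCO-style annotation file into a dictionary."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize_coco(coco):
    """Return (number of images, annotation count per category name)."""
    # Map category ids to names; fall back to the id if a name is missing.
    names = {c["id"]: c.get("name", str(c["id"])) for c in coco.get("categories", [])}
    counts = Counter(
        names.get(a["category_id"], str(a["category_id"]))
        for a in coco.get("annotations", [])
    )
    return len(coco.get("images", [])), dict(counts)


if __name__ == "__main__":
    # Assumed location: train.json in the root of the extracted archive.
    train_path = Path("train.json")
    if train_path.exists():
        n_images, per_class = summarize_coco(load_coco(train_path))
        print(f"{n_images} images")
        for name, count in sorted(per_class.items()):
            print(f"{name}: {count}")
```

Running the same summary on test.json and adding up the results gives a simple way to compare the extracted files against the totals stated above.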

Experiments

Experiments performed on this dataset are hosted in a separate GitHub repository, which can be accessed at https://github.com/honzikv/multimodal-document-processing-thesis.

License

This dataset is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. Commercial use in any form is strictly excluded. For more information, please see the full text of the CC BY-NC-SA 4.0 license.

Download

For further information about this corpus, please see the paper below:

J. Baloun, V. Honzik, L. Lenc, J. Martinek and P. Kral, Heimatkunde: Dataset for Multi-modal Historical Document Analysis, 16th International Conference on Agents and Artificial Intelligence - Volume 3, Rome, Italy, 24-26 February 2024, pp. 996-1002, SciTePress, ISBN: 978-989-758-680-4, ISSN: 2184-433X.

Please cite this paper if you use this corpus in your experiments.

Download

If you have additional questions or comments related to this corpus, please do not hesitate to ask the contact author, Pavel Kral (pkral@kiv.zcu.cz).