The challenges of German archival document categorization on insufficient labeled data

Fabian Hoppe, Tabea Tietz, Danilo Dessì, Nils Meyer, Mirjam Sprau, Mehwish Alam, Harald Sack

Research output: Contribution to journal › Conference article › peer-review

Abstract

Document exploration in archives is often challenging because records are not organized into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that utilizes word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance the exploration of data. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
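To make the idea of dataless categorization with word embeddings and TF-IDF more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each document is represented as the TF-IDF-weighted average of its word vectors and is assigned to the category whose label vector is closest by cosine similarity. The toy vectors, the category names, and the helper functions `idf_weights`, `embed`, and `categorize` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

import numpy as np

# Toy 3-dimensional word vectors with made-up values (assumption).
# A real setup would load pretrained German embeddings instead.
EMBEDDINGS = {
    "vertrag": np.array([0.9, 0.1, 0.0]),
    "handel": np.array([0.8, 0.2, 0.1]),
    "wirtschaft": np.array([1.0, 0.0, 0.0]),
    "soldat": np.array([0.0, 0.9, 0.1]),
    "armee": np.array([0.1, 0.8, 0.0]),
    "krieg": np.array([0.0, 1.0, 0.0]),
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def embed(tokens, idf):
    """TF-IDF-weighted average of the vectors of all known tokens."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    if not counts:
        return None  # no token has a vector, so the document cannot be embedded
    weights = {t: tf * idf.get(t, 1.0) for t, tf in counts.items()}
    weighted = sum(w * EMBEDDINGS[t] for t, w in weights.items())
    return weighted / sum(weights.values())


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def categorize(tokens, categories, idf):
    """Return the category label whose vector is most similar to the document."""
    doc_vec = embed(tokens, idf)
    if doc_vec is None:
        return None
    scores = {c: cosine(doc_vec, EMBEDDINGS[c]) for c in categories}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    docs = [["vertrag", "handel"], ["soldat", "armee"]]
    idf = idf_weights(docs)
    for doc in docs:
        print(doc, categorize(doc, ["wirtschaft", "krieg"], idf))
```

No labeled training documents are used here: the only supervision is the category label itself, which is why the quality of the vector representations determines how well such an approach can work on short archival records.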

Original language: English
Pages (from-to): 15-20
Number of pages: 6
Journal: CEUR Workshop Proceedings
Volume: 2695
Publication status: Published - 1 Jan 2020
Externally published: Yes
Event: 3rd Workshop on Humanities in the Semantic Web, WHiSe 2020 - Virtual, Heraklion, Greece
Duration: 2 Jun 2020 → …

Keywords

  • Cultural Heritage
  • Dataless Categorization
  • Document Exploration
  • Text Categorization
