DocExtractor: An off-the-shelf historical document element extraction

Research output: Chapter in a book, report, anthology, or collection › Conference contribution › Peer-reviewed

Abstract

We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents without requiring any real data annotation. We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets and leads to results on par with the state of the art when fine-tuned. We argue that the performance obtained without fine-tuning on a specific dataset is critical for applications, in particular in digital humanities, and that the line-level page segmentation we address is the most relevant for a general-purpose element extraction engine. We rely on a fast generator of rich synthetic documents and design a fully convolutional network, which we show to generalize better than a detection-based approach. Furthermore, we introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of illustration segmentation in historical documents.

Original language: English
Title: Proceedings - 2020 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 91-96
Number of pages: 6
ISBN (Electronic): 9781728199665
DOIs
Status: Published - 1 Sept 2020
Externally published: Yes
Event: 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020 - Dortmund, Germany
Duration: 7 Sept 2020 → 10 Sept 2020

Publication series

Name: Proceedings of International Conference on Frontiers in Handwriting Recognition, ICFHR
Volume: 2020-September
ISSN (Print): 2167-6445
ISSN (Electronic): 2167-6453

Conference

Conference: 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020
Country/Territory: Germany
City: Dortmund
Period: 7/09/20 → 10/09/20
