DocExtractor: An off-the-shelf historical document element extraction

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents without requiring any real data annotation. We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets and leads to results on par with the state of the art when fine-tuned. We argue that the performance obtained without fine-tuning on a specific dataset is critical for applications, in particular in digital humanities, and that the line-level page segmentation we address is the most relevant for a general-purpose element extraction engine. We rely on a fast generator of rich synthetic documents and design a fully convolutional network, which we show to generalize better than a detection-based approach. Furthermore, we introduce a new public dataset, dubbed IlluHisDoc, dedicated to the fine evaluation of illustration segmentation in historical documents.

Original language: English
Title of host publication: Proceedings - 2020 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 91-96
Number of pages: 6
ISBN (Electronic): 9781728199665
Publication status: Published - 1 Sept 2020
Externally published: Yes
Event: 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020 - Dortmund, Germany
Duration: 7 Sept 2020 to 10 Sept 2020

Publication series

Name: Proceedings of International Conference on Frontiers in Handwriting Recognition, ICFHR
Volume: 2020-September
ISSN (Print): 2167-6445
ISSN (Electronic): 2167-6453

Conference

Conference: 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020
Country/Territory: Germany
City: Dortmund
Period: 7/09/20 to 10/09/20

Keywords

  • deep learning
  • document layout analysis
  • historical document
  • page segmentation
  • synthetic data
  • text line detection
