TY - GEN
T1 - Handwriting recognition of historical documents with few labeled data
AU - Chammas, Edgard
AU - Mokbel, Chafic
AU - Likforman-Sulem, Laurence
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/6/22
Y1 - 2018/6/22
AB - Historical documents present many challenges for offline handwriting recognition systems, among them the segmentation and labeling steps. Carefully annotated text lines are needed to train a handwritten text recognition (HTR) system. In some scenarios, transcripts are only available at the paragraph level with no text-line information. In this work, we demonstrate how to train an HTR system with few labeled data. Specifically, we train a deep convolutional recurrent neural network (CRNN) system on only 10% of manually labeled text-line data from a dataset and propose an incremental training procedure that covers the rest of the data. Performance is further increased by augmenting the training set with specially crafted multi-scale data. We also propose a model-based normalization scheme that accounts for variability in the writing scale at the recognition phase. We apply this approach to the publicly available READ dataset. Our system achieved the second-best result in the ICDAR2017 competition [1].
KW - CRNN
KW - handwriting recognition
KW - historical documents
KW - limited labeled data
KW - model-based normalization scheme
KW - multi-scale training
KW - variability
U2 - 10.1109/DAS.2018.15
DO - 10.1109/DAS.2018.15
M3 - Conference contribution
AN - SCOPUS:85050259096
T3 - Proceedings - 13th IAPR International Workshop on Document Analysis Systems, DAS 2018
SP - 43
EP - 48
BT - Proceedings - 13th IAPR International Workshop on Document Analysis Systems, DAS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th IAPR International Workshop on Document Analysis Systems, DAS 2018
Y2 - 24 April 2018 through 27 April 2018
ER -