TY - GEN
T1 - Highly fast text segmentation with pairwise Markov chains
AU - Azeraf, Elie
AU - Monfrini, Emmanuel
AU - Vignon, Emmanuel
AU - Pieczynski, Wojciech
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2020/6/5
Y1 - 2020/6/5
N2 - The current trend in Natural Language Processing (NLP) is to use ever more extra data to build the best possible models. This implies higher computational costs and longer training times, as well as deployment difficulties, and concerns about these models' carbon footprint point to a critical problem for the future. Against this trend, our goal is to develop NLP models that require no extra data and minimize training time. To do so, in this paper we explore two Markov chain models, the Hidden Markov Chain (HMC) and the Pairwise Markov Chain (PMC), for NLP segmentation tasks. We apply these models to three classic applications: Part-of-Speech (POS) tagging, Named Entity Recognition, and Chunking. We develop an original method to adapt these models to the specific challenges of text segmentation, obtaining competitive performance with very short training and execution times. PMC achieves results equivalent to those of Conditional Random Fields (CRF), one of the most widely used models for these tasks when no extra data are used. Moreover, PMC's training times are 30 times shorter than those of the CRF, which validates this model given our objectives.
AB - The current trend in Natural Language Processing (NLP) is to use ever more extra data to build the best possible models. This implies higher computational costs and longer training times, as well as deployment difficulties, and concerns about these models' carbon footprint point to a critical problem for the future. Against this trend, our goal is to develop NLP models that require no extra data and minimize training time. To do so, in this paper we explore two Markov chain models, the Hidden Markov Chain (HMC) and the Pairwise Markov Chain (PMC), for NLP segmentation tasks. We apply these models to three classic applications: Part-of-Speech (POS) tagging, Named Entity Recognition, and Chunking. We develop an original method to adapt these models to the specific challenges of text segmentation, obtaining competitive performance with very short training and execution times. PMC achieves results equivalent to those of Conditional Random Fields (CRF), one of the most widely used models for these tasks when no extra data are used. Moreover, PMC's training times are 30 times shorter than those of the CRF, which validates this model given our objectives.
KW - Chunking
KW - Hidden Markov Chain
KW - Named Entity Recognition
KW - Pairwise Markov Chain
KW - Part-Of-Speech tagging
U2 - 10.1109/CiSt49399.2021.9357304
DO - 10.1109/CiSt49399.2021.9357304
M3 - Conference contribution
AN - SCOPUS:85103855754
T3 - Colloquium in Information Science and Technology, CIST
SP - 361
EP - 366
BT - 6th International IEEE Congress on Information Science and Technology, CiSt 2020 - Proceeding
A2 - El Mohajir, Mohammed
A2 - Al Achhab, Mohammed
A2 - El Mohajir, Badr Eddine
A2 - Ane, Bernadetta Kwintiana
A2 - Jellouli, Ismail
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th International IEEE Congress on Information Science and Technology, CiSt 2020
Y2 - 5 June 2020 through 12 June 2020
ER -