Highly fast text segmentation with pairwise markov chains

Elie Azeraf, Emmanuel Monfrini, Emmanuel Vignon, Wojciech Pieczynski

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The current trend in Natural Language Processing (NLP) is to use ever more extra data to build the best possible models. This implies higher computational costs and longer training times, makes deployment more difficult, and raises concerns about these models' carbon footprint, pointing to a critical problem for the future. Against this trend, our goal is to develop NLP models that require no extra data and minimize training time. To do so, in this paper we explore two Markov chain models, the Hidden Markov Chain (HMC) and the Pairwise Markov Chain (PMC), for NLP segmentation tasks. We apply these models to three classic applications: POS tagging, Named Entity Recognition, and chunking. We develop an original method to adapt these models to the specific challenges of text segmentation and obtain relevant performance with very short training and execution times. PMC achieves results equivalent to those obtained by Conditional Random Fields (CRF), one of the most widely applied models for these tasks when no extra data are used. Moreover, PMC's training times are 30 times shorter than those of CRF, which validates this model given our objectives.
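
To make the HMC side of the abstract concrete, here is a minimal sketch of a count-based HMC tagger. It is not the authors' method: the abstract does not state their estimator, their decoding rule, or their handling of unknown words. The sketch assumes supervised estimation by counting with additive smoothing, and Viterbi decoding. The names `train_hmc`, `viterbi`, and `smoothing` are hypothetical. A PMC differs in that the pair (tag, word) is itself a Markov chain, so its transitions can also depend on the previous word. That extension is not shown here.

```python
from collections import Counter, defaultdict
import math


def train_hmc(tagged_sentences, smoothing=1e-3):
    """Estimate HMC parameters by counting, in one pass over the data.

    `smoothing` is an assumed additive-smoothing constant, not a value
    taken from the paper.
    """
    init = Counter()              # counts of the first tag of each sentence
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                init[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    vocab = {w for counts in emit.values() for w in counts}
    return {
        "tags": sorted(emit),
        "vocab_size": len(vocab) + 1,  # +1 slot for unseen words
        "init": init,
        "trans": trans,
        "emit": emit,
        "smoothing": smoothing,
    }


def _log_prob(counts, key, n_outcomes, smoothing):
    """Smoothed log relative frequency of `key` among `n_outcomes` outcomes."""
    total = sum(counts.values())
    return math.log((counts[key] + smoothing) / (total + smoothing * n_outcomes))


def viterbi(model, words):
    """Return the most probable tag sequence for `words` under the HMC."""
    if not words:
        return []
    tags, s = model["tags"], model["smoothing"]
    n_tags, n_words = len(tags), model["vocab_size"]
    # score[t] = best log-probability of any tag path ending in tag t
    score = {
        t: _log_prob(model["init"], t, n_tags, s)
        + _log_prob(model["emit"][t], words[0], n_words, s)
        for t in tags
    }
    backpointers = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for t in tags:
            def via(p, t=t):
                # score of reaching tag t through previous tag p
                return score[p] + _log_prob(model["trans"][p], t, n_tags, s)
            best_prev = max(tags, key=via)
            new_score[t] = via(best_prev) + _log_prob(model["emit"][t], word, n_words, s)
            pointers[t] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy usage on two hand-made sentences (illustrative data only)
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
toy_model = train_hmc(toy_corpus)
print(viterbi(toy_model, ["the", "cat", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

In this sketch training is a single counting pass with no iterative optimization, which gives some intuition for why Markov chain models can train much faster than a CRF fitted by gradient-based optimization.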

Original language: English
Title of host publication: 6th International IEEE Congress on Information Science and Technology, CiSt 2020 - Proceeding
Editors: Mohammed El Mohajir, Mohammed Al Achhab, Badr Eddine El Mohajir, Bernadetta Kwintiana Ane, Ismail Jellouli
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 361-366
Number of pages: 6
ISBN (Electronic): 9781728166469
DOIs
Publication status: Published - 5 Jun 2020
Event: 6th International IEEE Congress on Information Science and Technology, CiSt 2020 - Agadir - Essaouira, Morocco
Duration: 5 Jun 2020 → 12 Jun 2020

Publication series

Name: Colloquium in Information Science and Technology, CIST
Volume: 2020-June
ISSN (Print): 2327-185X
ISSN (Electronic): 2327-1884

Conference

Conference: 6th International IEEE Congress on Information Science and Technology, CiSt 2020
Country/Territory: Morocco
City: Agadir - Essaouira
Period: 5/06/20 → 12/06/20

Keywords

  • Chunking
  • Hidden Markov Chain
  • Named Entity Recognition
  • Pairwise Markov Chain
  • Part-Of-Speech tagging
