Adding missing words to regular expressions

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Regular expressions (regexes) are patterns that are used in many applications to extract words or tokens from text. However, even hand-crafted regexes may fail to match all the intended words. In this paper, we propose a novel way to generalize a given regex so that it also matches a set of missing (previously unmatched) words. Our method finds an approximate match between the missing words and the regex, and adds disjunctions for the unmatched parts where needed. We show that this method can not only improve the precision and recall of the regex, but also generate much shorter regexes than baselines and competitors on various datasets.
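To make the task concrete, the sketch below shows the naive way to repair a regex: append every still-unmatched word as a whole-word alternative. This is not the method proposed in the paper, which adds disjunctions only for the unmatched parts; it is a plain baseline written for illustration. The function name `add_missing_words` and the phone-number pattern are hypothetical.

```python
import re
from typing import Iterable


def add_missing_words(regex: str, missing: Iterable[str]) -> str:
    """Naive baseline (an assumption for illustration, not the paper's method):
    append each word the regex does not yet match as an escaped alternative."""
    # Keep only the words that the current regex fails to match in full.
    unmatched = [w for w in missing if re.fullmatch(regex, w) is None]
    if not unmatched:
        return regex
    # Escape each word so it is matched literally, then join as alternatives.
    alternatives = "|".join(re.escape(w) for w in unmatched)
    # Wrap the original regex in a non-capturing group to preserve its scope.
    return f"(?:{regex})|{alternatives}"


original = r"\d{3}-\d{4}"  # hypothetical pattern for numbers like 555-1234
generalized = add_missing_words(original, ["555 1234", "555-9999"])

print(generalized)
print(bool(re.fullmatch(generalized, "555 1234")))  # newly covered word
print(bool(re.fullmatch(generalized, "555 0000")))  # similar word, still missed
```

The repaired regex grows by one alternative per missing word and does not generalize to similar unseen words, which is the kind of shortcoming the abstract's claims about regex length, precision, and recall are measured against.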

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Proceedings
Editors: Bao Ho, Dinh Phung, Geoffrey I. Webb, Vincent S. Tseng, Mohadeseh Ganji, Lida Rashidi
Publisher: Springer Verlag
Pages: 67-79
Number of pages: 13
ISBN (Print): 9783319930367
Publication status: Published - 1 Jan 2018
Event: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018 - Melbourne, Australia
Duration: 3 Jun 2018 to 6 Jun 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10938 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018
Country/Territory: Australia
City: Melbourne
Period: 3/06/18 to 6/06/18
