TY - CHAP
T1 - Efficiently Identifying Disguised Missing Values in Heterogeneous, Text-Rich Data
AU - Bouganim, Théo
AU - Galhardas, Helena
AU - Manolescu, Ioana
N1 - Publisher Copyright: © 2022, Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2022/1/1
Y1 - 2022/1/1
N2 - Digital data is produced in many data models, ranging from highly structured (typically relational) to semi-structured models (XML, JSON) to various graph formats (RDF, property graphs) or text. Most real-world datasets contain a certain amount of null values, denoting missing, unknown, or inapplicable information. While some data models allow representing nulls by special tokens, so-called disguised missing values (DMVs, in short) are also frequently encountered: these are values that are not, syntactically speaking, nulls, but which nevertheless denote the absence, unavailability, or inapplicability of the information. In this work, we tackle the detection of a particular kind of DMV: texts freely entered by human users. This problem is not tackled by DMV detection methods focused on numeric or categorical data; further, it also escapes DMV detection methods based on value frequency, since such free texts often differ from each other, so most DMVs are unique. We encountered this problem within the ConnectionLens [6–8, 12] project, where heterogeneous data is integrated into large graphs. We present two DMV detection methods for our specific problem: (i) leveraging Information Extraction, already applied in ConnectionLens graphs; and (ii) through text embeddings and classification. We detail their performance-precision trade-offs on real-world datasets.
AB - Digital data is produced in many data models, ranging from highly structured (typically relational) to semi-structured models (XML, JSON) to various graph formats (RDF, property graphs) or text. Most real-world datasets contain a certain amount of null values, denoting missing, unknown, or inapplicable information. While some data models allow representing nulls by special tokens, so-called disguised missing values (DMVs, in short) are also frequently encountered: these are values that are not, syntactically speaking, nulls, but which nevertheless denote the absence, unavailability, or inapplicability of the information. In this work, we tackle the detection of a particular kind of DMV: texts freely entered by human users. This problem is not tackled by DMV detection methods focused on numeric or categorical data; further, it also escapes DMV detection methods based on value frequency, since such free texts often differ from each other, so most DMVs are unique. We encountered this problem within the ConnectionLens [6–8, 12] project, where heterogeneous data is integrated into large graphs. We present two DMV detection methods for our specific problem: (i) leveraging Information Extraction, already applied in ConnectionLens graphs; and (ii) through text embeddings and classification. We detail their performance-precision trade-offs on real-world datasets.
UR - https://www.scopus.com/pages/publications/85139825171
U2 - 10.1007/978-3-662-66111-6_4
DO - 10.1007/978-3-662-66111-6_4
M3 - Chapter
AN - SCOPUS:85139825171
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 97
EP - 118
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
PB - Springer Science and Business Media Deutschland GmbH
ER -