TY - JOUR
T1 - Graph integration of structured, semistructured and unstructured data for data journalism
AU - Anadiotis, Angelos Christos
AU - Balalau, Oana
AU - Conceição, Catarina
AU - Galhardas, Helena
AU - Haddad, Mhd Yamen
AU - Manolescu, Ioana
AU - Merabti, Tayeb
AU - You, Jingmao
N1 - Publisher Copyright: © 2021 Elsevier Ltd
PY - 2022/2/1
Y1 - 2022/2/1
AB - Digital data is a gold mine for modern journalism. However, the datasets that interest journalists are extremely heterogeneous, ranging from highly structured (relational databases) and semi-structured (JSON, XML, HTML) data to graphs (e.g., RDF) and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows, especially for dynamically varying sets of data sources. We describe a complete approach for integrating dynamic sets of such heterogeneous datasets into graphs: the challenges we faced in making these graphs useful and allowing their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.
KW - Data journalism
KW - Heterogeneous data integration
KW - Information extraction
U2 - 10.1016/j.is.2021.101846
DO - 10.1016/j.is.2021.101846
M3 - Article
AN - SCOPUS:85110514934
SN - 0306-4379
VL - 104
JO - Information Systems
JF - Information Systems
M1 - 101846
ER -