TY - GEN
T1 - Declarative XML data cleaning with XClean
AU - Weis, Melanie
AU - Manolescu, Ioana
PY - 2007/1/1
Y1 - 2007/1/1
N2 - Data cleaning is the process of correcting anomalies in a data source that may, for instance, be due to typographical errors or duplicate representations of an entity. It is a crucial task in customer relationship management, data mining, and data integration. With the growing amount of XML data, approaches to effectively and efficiently clean XML are needed, an issue not addressed by existing data cleaning systems, which mostly specialize in relational data. We present XClean, a data cleaning framework specifically geared towards cleaning XML data. XClean's approach is based on a set of cleaning operators whose semantics is well defined in terms of XML algebraic operators. Users specify cleaning programs by combining operators in a declarative XClean/PL program, which is then compiled into XQuery. We describe XClean's operators, language, and compilation approach, and validate its effectiveness through a series of case studies.
AB - Data cleaning is the process of correcting anomalies in a data source that may, for instance, be due to typographical errors or duplicate representations of an entity. It is a crucial task in customer relationship management, data mining, and data integration. With the growing amount of XML data, approaches to effectively and efficiently clean XML are needed, an issue not addressed by existing data cleaning systems, which mostly specialize in relational data. We present XClean, a data cleaning framework specifically geared towards cleaning XML data. XClean's approach is based on a set of cleaning operators whose semantics is well defined in terms of XML algebraic operators. Users specify cleaning programs by combining operators in a declarative XClean/PL program, which is then compiled into XQuery. We describe XClean's operators, language, and compilation approach, and validate its effectiveness through a series of case studies.
U2 - 10.1007/978-3-540-72988-4_8
DO - 10.1007/978-3-540-72988-4_8
M3 - Conference contribution
AN - SCOPUS:38149109874
SN - 9783540729877
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 96
EP - 110
BT - Advanced Information Systems Engineering - 19th International Conference, CAiSE 2007, Proceedings
PB - Springer Verlag
T2 - 19th International Conference on Advanced Information Systems Engineering, CAiSE 2007
Y2 - 11 June 2007 through 15 June 2007
ER -