Anhai Doan, Pedro Domingos, Alon Halevy. Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach. Proc. of ACM SIGMOD Conf. on Management of Data, 2001.
Abstract: A data-integration system provides access to a multitude of data
sources through a single mediated schema. A key bottleneck in building
such systems has been the laborious manual construction of semantic
mappings between the source schemas and the mediated schema. We
describe LSD, a system that employs and extends current
machine-learning techniques to semi-automatically find such
mappings. LSD first asks the user to provide the semantic mappings for
a small set of data sources, then uses these mappings together with
the sources to train a set of learners. Each learner exploits a
different type of information either in the source schemas or in their
data. Once the learners have been trained, LSD finds semantic
mappings for a new data source by applying the learners, then
combining their predictions using a meta-learner. To further improve
matching accuracy, we extend machine learning techniques so that LSD
can incorporate domain constraints as an additional source of
knowledge, and develop a novel learner that utilizes the structural
information in XML documents. Our approach is thus distinguished by
its use of multiple types of knowledge. Importantly, its
architecture is extensible to additional learners that may exploit new
kinds of information. We describe a set of experiments on several
real-world domains, and show that LSD proposes semantic mappings with
a high degree of accuracy.
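
The train-then-combine workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the LSD implementation: the names TokenLearner and MetaLearner, the token-overlap scoring, the accuracy-based weights, and the toy training data are all assumptions made for illustration. One base learner looks only at element names in the source schema, another looks only at data values, and a meta-learner weights their predictions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


class TokenLearner:
    """Hypothetical base learner: scores each mediated-schema label by
    token overlap with one feature (0 = element name, 1 = data value)."""

    def __init__(self, field):
        self.field = field

    def fit(self, examples):
        # examples: (element_name, data_value, mediated_label) triples
        self.counts = defaultdict(Counter)
        for *features, label in examples:
            self.counts[label].update(tokenize(features[self.field]))
        return self

    def predict(self, features):
        tokens = tokenize(features[self.field])
        # Add 1 so that a label with no overlapping tokens keeps a nonzero score.
        raw = {label: 1 + sum(seen[t] for t in tokens)
               for label, seen in self.counts.items()}
        total = sum(raw.values())
        return {label: score / total for label, score in raw.items()}


class MetaLearner:
    """Hypothetical combiner: weights each base learner by its accuracy on
    the training examples (a simplification chosen for this sketch)."""

    def fit(self, learners, examples):
        self.learners = learners
        self.weights = []
        for learner in learners:
            hits = sum(best(learner.predict(ex[:-1])) == ex[-1] for ex in examples)
            self.weights.append(hits / len(examples))
        return self

    def predict(self, features):
        combined = Counter()
        for weight, learner in zip(self.weights, self.learners):
            for label, score in learner.predict(features).items():
                combined[label] += weight * score
        return best(combined)


# Toy mappings a user might supply for already-matched sources (invented data).
training = [
    ("location", "Seattle WA", "ADDRESS"),
    ("phone", "(206) 555 0100", "PHONE"),
    ("address", "Miami FL", "ADDRESS"),
    ("contact-phone", "(305) 555 0199", "PHONE"),
]

learners = [TokenLearner(0).fit(training), TokenLearner(1).fit(training)]
meta = MetaLearner().fit(learners, training)

# Propose a mapping for an element of a new, unseen source.
proposed = meta.predict(("agent-phone", "(415) 555 0123"))
```

Because every learner exposes the same fit/predict interface, a further learner that exploits another kind of information could be appended to the list without changing the combiner, which mirrors the extensibility the abstract emphasizes. Domain constraints and the XML structure learner are not modeled here.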