Data Integration with Uncertainty
September 8, 2008
Dr. Xin (Luna) Dong
AT&T Research, Florham Park, NJ
Monday, Sept 8, 2008, 2:00 PM
Babbio Room 221
Stevens Institute of Technology
Many data management applications, such as managing enterprise data, scientific data, personal data, and integrating data on the web, need to manage a multitude of data sources. These data sets can be highly heterogeneous, describing the same domain using different schemas. To enable data sharing across heterogeneous sources, data integration systems specify a mediated schema, which provides an integrated and virtual view of the disparate sources, and build schema mappings from the source schemas to the mediated schema. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort and expertise in creating a mediated schema and semantic mappings between the schemas.
We posit that data integration systems need to handle uncertainty about the semantics of data, and to do so in a principled fashion. Such uncertainty can arise because there are too many schema mappings to create and maintain, or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. To this end, we propose the new concepts of probabilistic schema mappings, probabilistic mediated schemas, and probabilistic functional dependencies. We analyze their formal foundations and describe how to create them automatically from data sources and use them in answering user queries. Based on these concepts, we have built the first completely self-configuring data integration system. Our experiments show that the system can produce high-quality answers with no human intervention.
Dr. Xin (Luna) Dong is currently a researcher in the Data Management Department at AT&T Research, Florham Park, NJ. She received her Ph.D. in Computer Science and Engineering from the University of Washington in 2007. Her research interests include data integration, data cleaning, Web search, personal information management, community information management, enterprise data management, Web-service discovery and composition, and XML query optimization.