Data Quality Assessment for Integrated Information Systems on Schema-Level

Student: Lisa Ehrlinger
Supervisor: a.Univ.-Prof. DI Dr. Wolfram Wöß

Motivation and Challenges
Data is fundamental to strategic decisions in enterprises and organizations as well as to personal decisions in everyday life. In many cases, the data underlying a decision is obtained by querying an information system. The correctness of such decisions depends directly on the quality of the data and of the data extraction algorithms. Even in an era of high-performance computing and abundant data, data quality remains a bottleneck that affects the resulting output.

While the World Wide Web serves as a primary data source for personal decisions, large enterprises often store their data in several historically grown, heterogeneous information systems. In both cases, the available data has to be integrated before decision-making so that the content of the individual sources can be compared and assessed. Comprehensive research on today's data quality issues therefore has to take into account the integration aspect and its challenges: heterogeneity, distribution, and autonomy of the different information sources.

The majority of existing research on data quality refers to the actual content of an information source, but does not adequately take into account the quality of its schema. Hence, this thesis aims at complementary research on data quality at the schema level and data quality in information integration systems, a combination of two areas that has not been addressed sufficiently so far.

Objective
This work introduces an approach for automatically assessing data quality in information integration systems at the schema level. The potentially distributed and heterogeneous character of the individual sources in an integration system calls for a homogeneous description provided in machine-readable form. An existing hybrid ontology implementation serves as the starting point for the integration of schema descriptions, which are then analyzed and evaluated in this thesis.

An automatically generated quality report on an information integration schema is implemented as a proof of concept. This report considers redundant attributes and concepts, includes an analysis of the schema normalization if relational data is provided, and evaluates several quality dimensions. Quality dimension measures are computed by directly analyzing the original information sources as well as their ontological representations. Consequently, a quality rating can be assigned at different aggregation levels, including attributes, concepts, individual information sources, and the entire integration system.
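
The idea of assigning ratings at several aggregation levels can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the source, concept, and attribute names, the ratings in the range [0, 1], and the use of an unweighted mean as the aggregation function. The thesis itself may compute and combine quality dimensions differently.

```python
from statistics import mean

# Hypothetical attribute-level ratings in [0, 1],
# nested as source -> concept -> attribute.
ratings = {
    "crm_db": {
        "Customer": {"name": 1.0, "email": 0.8},
        "Order": {"date": 0.9, "total": 0.7},
    },
    "erp_db": {
        "Product": {"sku": 1.0, "price": 0.6},
    },
}


def concept_rating(attributes):
    """Aggregate attribute ratings into one concept rating (assumed: unweighted mean)."""
    return mean(attributes.values())


def source_rating(concepts):
    """Aggregate concept ratings into one rating for an information source."""
    return mean(concept_rating(attrs) for attrs in concepts.values())


def system_rating(sources):
    """Aggregate source ratings into one rating for the integration system."""
    return mean(source_rating(concepts) for concepts in sources.values())


if __name__ == "__main__":
    for source, concepts in ratings.items():
        for concept, attrs in concepts.items():
            print(f"{source}.{concept}: {concept_rating(attrs):.3f}")
        print(f"{source}: {source_rating(concepts):.3f}")
    print(f"integration system: {system_rating(ratings):.3f}")
```

An unweighted mean treats every attribute, concept, and source as equally important. A weighted variant, for example one that weights sources by their number of concepts, would be an equally plausible choice under these assumptions.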