Student: Alexander Gindlhumer (2021)
Supervisor: a.Univ.-Prof. DI Dr. Wolfram Wöß
Co-Supervisor: DI Lisa Ehrlinger, BSc
Motivation and Challenges
Data quality (DQ) is currently perceived as the greatest challenge in operational data management. To ensure high-quality query and data analytics results, it is necessary to measure and know the quality of the data used. This can be achieved by continuously observing whether the data in an information system still conforms to defined standards. Defining such standards (i.e., the desired qualitative condition) is usually a manual task performed by domain experts. An automatically generated reference data profile would be a good starting point for automating this task: it gives the domain expert an initial profile that can be verified and adjusted if necessary. Such a DB reference profile should represent the “desired” or “normal” condition of the data, for example, statistics such as the mean and standard deviation of each attribute.
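To make the idea of a per-attribute profile more concrete, the following minimal Python sketch computes a few descriptive statistics for a single attribute. The class `AttributeProfile`, the function `profile_attribute`, and the chosen fields (count, null count, mean, sample standard deviation) are illustrative assumptions and not the format defined in the thesis.

```python
import statistics
from dataclasses import dataclass


@dataclass
class AttributeProfile:
    """Hypothetical container for the statistics of one attribute."""
    name: str
    count: int        # number of rows, including missing values
    null_count: int   # number of missing values
    mean: float
    stdev: float      # sample standard deviation


def profile_attribute(name, values):
    """Profile one numeric attribute; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError(f"attribute {name!r} has no non-null values")
    return AttributeProfile(
        name=name,
        count=len(values),
        null_count=len(values) - len(present),
        mean=statistics.fmean(present),
        # stdev() needs at least two values; fall back to 0.0 otherwise.
        stdev=statistics.stdev(present) if len(present) > 1 else 0.0,
    )


# Example: an "age" attribute with one missing value.
profile = profile_attribute("age", [30, 40, None, 50])
print(profile)
```

A domain expert could inspect such a record and adjust it, for instance by tightening the expected mean or by declaring that no missing values are acceptable.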
Example data quality measures that should be calculated against the data profile:
Objective
The aim of this thesis is to develop a concept of what such a reference profile could look like, what information it contains, and in which format it should ideally be stored. In addition, a program should be implemented that automatically generates a reference data set from a relational DB. In follow-up work, it should be possible to measure the quality of a DB by calculating statistics about its deviation from this stored reference data set.
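The workflow described above (generate a profile from a relational DB, store it, and later measure deviation from it) could be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database, JSON as the storage format, and a simple "shift of the mean in reference standard deviations" as the deviation statistic. The function names `build_reference_profile` and `deviation`, the table `orders`, and the column `amount` are hypothetical and not taken from the thesis.

```python
import json
import math
import sqlite3


def build_reference_profile(conn, table, columns):
    """Compute count, mean and population std. dev. per numeric column."""
    profile = {}
    for col in columns:
        # Identifiers are interpolated directly: pass trusted names only.
        query = (
            f'SELECT COUNT("{col}"), AVG("{col}"), AVG("{col}" * "{col}") '
            f'FROM "{table}"'
        )
        n, mean, mean_sq = conn.execute(query).fetchone()
        if n == 0:
            profile[col] = {"count": 0, "mean": None, "stdev": None}
            continue
        # Var(X) = E[X^2] - E[X]^2; clamp tiny negative rounding errors.
        variance = max(mean_sq - mean * mean, 0.0)
        profile[col] = {"count": n, "mean": mean, "stdev": math.sqrt(variance)}
    return profile


def deviation(reference, current):
    """Shift of each column mean, in units of the reference std. dev."""
    result = {}
    for col, ref in reference.items():
        diff = current[col]["mean"] - ref["mean"]
        if ref["stdev"]:
            result[col] = diff / ref["stdev"]
        else:
            # Zero reference spread: any change is an unbounded deviation.
            result[col] = 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return result


# Build and store the reference profile for a small example table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?)", [(10.0,), (20.0,), (30.0,)])
reference = build_reference_profile(conn, "orders", ["amount"])
stored = json.dumps(reference)  # assumed storage format

# Later, the data changes and is compared against the stored profile.
conn.execute("INSERT INTO orders VALUES (?)", (100.0,))
current = build_reference_profile(conn, "orders", ["amount"])
drift = deviation(json.loads(stored), current)
print(drift)
```

Computing the aggregates inside the database keeps the data transfer small; a real implementation would additionally need to handle non-numeric attributes and empty columns in the comparison step.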