Facilitating Data Quality Assessment Utilizing Visual Analytics: Tackling Time, Metrics, Uncertainty, and Provenance
Visual and interactive data analysis is a large field of research whose results are successfully used in commercial tools and systems to help analysts make sense of their data. Data is often riddled with issues, which makes analysis difficult or even infeasible. Pre-processing data for downstream analysis therefore also involves resolving these issues. Visual Analytics methods can be employed to identify and correct issues and eventually wrangle the data into a usable format. Three questions are critical during issue correction: (1) how were the issues resolved, (2) to what extent did the corrections affect the dataset, and (3) did the routines used actually resolve the issues appropriately? In this thesis I employ data quality metrics and uncertainty to capture provenance from pre-processing operations and pipelines. Data quality metrics show the prevalence of errors in a dataset, and uncertainty quantifies the changes applied to data values and entries during processing. Capturing such measures as provenance and visualizing them in an exploratory environment allows analysts to determine how pre-processing steps affected a dataset and whether the initially discovered issues could be resolved in a minimal way, so that the data remains representative of the original dataset.
In the course of this thesis I employed a user-centered design methodology to develop Visual Analytics prototypes and visualization techniques that combine approaches from data quality, provenance, and uncertainty research. This work presents (1) a novel method to create and customize data quality metrics that can be employed to explore quality issues in tabular and time-oriented datasets, (2) a provenance model that captures provenance from data pre-processing, leverages data quality metrics, and uses visualization to show the development of quality throughout a pre-processing workflow, and (3) methods for quantifying and visualizing uncertainty in univariate and multivariate time series to analyze the influence of pre-processing operations on the time series. These approaches were developed using real-world use cases and scenarios and were evaluated in qualitative and quantitative user studies to validate their appropriateness. The results of the iterative design and evaluation show that data quality metrics and uncertainty quantified from data pre-processing can be used to assess the overall quality of a dataset. Data quality can furthermore be used to annotate provenance captured during data wrangling, which allows analysts to understand and track the development of quality in a dataset. Uncertainty quantified from pre-processing can be used to assess the impact that pre-processing operations have on datasets and thus supports analysts in finding a balance between necessary and excessive pre-processing.
Institute of Visual Computing and Human-Centered Technology