Facilitating Data Quality Assessment Utilizing Visual Analytics: Tackling Time, Metrics, Uncertainty, and Provenance

	Thesis
Author	Christian Bors
Advisor	Silvia Miksch
Reviewer	Kai Xu, Axel Polleres
Abstract	Visual and interactive data analysis is a large field of research that is successfully used in commercial tools and systems to allow analysts to make sense of their data. Data is often riddled with issues, which makes analysis difficult or even infeasible. Pre-processing data for downstream analysis also involves resolving these issues. We may employ Visual Analytics methods to identify and correct issues and eventually wrangle the data into a usable format. Various aspects are critical during issue correction: (1) how the issues are resolved, (2) to what extent this affected the dataset, and (3) whether the routines used actually resolved the issues appropriately. In this thesis I employ data quality metrics and uncertainty to capture provenance from pre-processing operations and pipelines. Data quality metrics are used to show the prevalence of errors in a dataset, and uncertainty can quantify the changes applied to data values and entries during processing. Capturing such measures as provenance and visualizing them in an exploratory environment can allow analysts to determine how pre-processing steps affected a dataset, and whether the issues that were initially discovered could be resolved in a minimal way, so that the data remains representative of the original dataset. Within the course of this thesis I employed a user-centered design methodology to develop Visual Analytics prototypes and visualization techniques that combine techniques from data quality, provenance, and uncertainty research.
This work presents (1) a novel method to create and customize data quality metrics that can be employed to explore quality issues in tabular and time-oriented datasets, (2) a provenance model for capturing provenance from data pre-processing, leveraging data quality metrics, and using visualization to show the development of quality throughout a pre-processing workflow, and (3) methods for quantifying and visualizing uncertainty in univariate and multivariate time series to analyze the influence of pre-processing operations on the time series. These approaches were developed using real-world use cases and scenarios and were evaluated in qualitative and quantitative user studies to validate their appropriateness. The results of the iterative design and evaluation show that data quality metrics and uncertainty quantified from data pre-processing can be used to assess the overall quality of a dataset. Data quality can furthermore be used to annotate provenance captured during data wrangling, which allows analysts to understand and track the development of quality in a dataset. Uncertainty quantified from pre-processing can be used to assess the impact that pre-processing operations have on datasets and thus support analysts in finding a balance between necessary and excessive pre-processing.
Keywords	data quality, provenance, data quality metrics, Visual analytics, data quality assessment, data uncertainty, temporal uncertainty
Year of Publication	2020
Academic Department	Institute of Visual Computing and Human-Centered Technology
Degree	PhD, Dr.-techn.
Date Published	02/2020
Thesis Type	Monograph
University	TU Wien
City	Vienna
DOI	10.34726/hss.2019.76147
Funding projects	VISSECT - Visual Segmentation and Labeling of Multivariate Time Series; CVAST - Centre for Visual Analytics Science and Technology
Attachments	thesis-cb-final.pdf
reposiTUm Handle	20.500.12708/1340