Visualizing Provenance from Data Wrangling and Data Cleansing
If properly recorded, the provenance information of a data set tells the story of its origins, which hands it passed through, and how the data has been edited, transformed, extended, and refined. Moreover, each operation may have a considerable impact on the quality of the data set, improving some quality metrics and degrading others. This information is of great value to anyone who needs to decide whether the quality of a data set is sufficient for further processing. However, current approaches to data quality assessment present only a limited amount of provenance information.
We present QualityFlow, an interactive visualization approach that shows the history of operations on a data set and their influence on the respective data quality metrics to support sensemaking.
To achieve this, we provide an interactive visualization that combines provenance information obtained from data transformation and cleansing steps with overview quality measures represented by data quality dimensions, also known as quality metrics. Additionally, we retain a vertical structure of transformation operations, a visual representation familiar to users of data wrangling and cleansing applications, to make the visualization technique easier to understand.
As a data source, we employ an extension to OpenRefine, a commonly used open-source data wrangling server with a web front end; the extension calculates quality metrics for available projects. This information is used to create the visualization and its interactions.
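To illustrate the kind of data such a visualization could draw on, the following minimal sketch pairs each operation in a history with the quality metric scores measured after it and derives the per-step change. It is not the QualityFlow implementation or the OpenRefine extension's format: the `Step` class, the `metric_deltas` function, the metric names, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded operation plus the metric scores measured after it (hypothetical shape)."""
    operation: str
    metrics: dict[str, float]


def metric_deltas(initial: dict[str, float], steps: list[Step]) -> list[tuple[str, dict[str, float]]]:
    """Return, for each step in order, the change of every metric relative to the previous state."""
    previous = initial
    rows = []
    for step in steps:
        # Compare only metrics present both before and after the step.
        delta = {
            name: round(score - previous[name], 3)
            for name, score in step.metrics.items()
            if name in previous
        }
        rows.append((step.operation, delta))
        previous = step.metrics
    return rows


# Hypothetical history: scores are made up for illustration only.
initial_scores = {"completeness": 0.80, "validity": 0.90}
history = [
    Step("trim whitespace", {"completeness": 0.80, "validity": 0.95}),
    Step("remove blank rows", {"completeness": 0.90, "validity": 0.95}),
]

# Listing the steps top to bottom mirrors the vertical operation structure.
for operation, delta in metric_deltas(initial_scores, history):
    print(operation, delta)
```

Each row corresponds to one operation, so a positive or negative delta indicates whether that step improved or degraded a given metric.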