Semi-Automated Data Cleansing of Multivariate Time Series

Problem

Many application domains, such as the energy sector and industrial quality management, involve a large number of time series. However, such data is often afflicted by data quality problems such as missing values, outliers, and other types of anomalies. For various downstream tasks, merely detecting such problems is not sufficient; the data must also be cleansed, e.g., by replacing missing values with plausible estimates. Doing this manually for regularly acquired data can become very time-consuming. On the other hand, fully automated data cleansing may cause domain experts to lose trust in the data.

Aim

The goal of this work is to design and implement a software prototype that supports a semi-automated process of cleansing time series data. Based on (existing) automated checks for detecting data quality problems, the key idea is to offer the user different mechanisms for cleansing the detected problems, which the system suggests in a context-specific way. The user's flexibility should range from a fully automated "cleanse everything" action to a detailed manual inspection of each detected problem and a corresponding individual choice of cleansing strategy. This also involves identifying novel interaction techniques for specifying data transformations, such as offsets, directly within well-known visualizations, e.g., time-series plots, scatterplots, and histograms. In addition, the prototype should keep track of which data has been modified and in which way. This provenance information should be communicated to the user in an unobtrusive way.
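To make the intended workflow more concrete, the following minimal Python sketch illustrates one possible reading of it for a single series with missing values. It is not part of the topic description: the names `Change`, `find_missing`, `interpolate`, `carry_forward`, and `cleanse` are hypothetical, missing values are assumed to be encoded as `None`, and the two strategies merely stand in for whatever cleansing mechanisms the prototype would actually offer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Series = List[Optional[float]]


@dataclass
class Change:
    """One provenance record: which value was modified, and how."""
    index: int
    old: Optional[float]
    new: float
    strategy: str


def find_missing(values: Series) -> List[int]:
    """Stand-in for an automated check: report indices of missing values."""
    return [i for i, v in enumerate(values) if v is None]


def _neighbours(values: Series, i: int) -> Tuple[Optional[int], Optional[int]]:
    """Indices of the nearest observed values to the left and right of i."""
    left = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
    right = next((j for j in range(i + 1, len(values)) if values[j] is not None), None)
    if left is None and right is None:
        raise ValueError("series contains no observed values")
    return left, right


def interpolate(values: Series, i: int) -> float:
    """Linear interpolation; falls back to the only available neighbour."""
    left, right = _neighbours(values, i)
    if left is None:
        return values[right]
    if right is None:
        return values[left]
    return values[left] + (values[right] - values[left]) * (i - left) / (right - left)


def carry_forward(values: Series, i: int) -> float:
    """Repeat the last observed value; falls back to the next one at the start."""
    left, right = _neighbours(values, i)
    return values[left] if left is not None else values[right]


STRATEGIES = {"interpolate": interpolate, "carry_forward": carry_forward}


def cleanse(values: Series,
            choices: Optional[Dict[int, str]] = None,
            default: str = "interpolate") -> Tuple[List[float], List[Change]]:
    """Apply the default strategy everywhere, except where the user chose another.

    Estimates are always computed from the original series, so the result
    does not depend on the order in which problems are processed.
    """
    choices = choices or {}
    cleaned = list(values)
    log: List[Change] = []
    for i in find_missing(values):
        name = choices.get(i, default)
        cleaned[i] = STRATEGIES[name](values, i)
        log.append(Change(index=i, old=values[i], new=cleaned[i], strategy=name))
    return cleaned, log


if __name__ == "__main__":
    series = [1.0, None, 3.0, None, None, 6.0]
    # "Cleanse everything" with the default, but override one problem manually.
    cleaned, log = cleanse(series, choices={3: "carry_forward"})
    print(cleaned)
    for change in log:
        print(change)
```

Calling `cleanse(series)` without `choices` corresponds to the fully automated end of the spectrum, while an entry in `choices` models an individual decision made after manual inspection. The returned log is the kind of provenance information a visualization could later surface, for example by marking modified points in a time-series plot.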

Other information

Starting point(s) for research:

Arbesser, C., F. Spechtenhauser, T. Mühlbacher, and H. Piringer, Visplause: Visual Data Quality Assessment of Many Time Series Using Plausibility Checks, IEEE Transactions on Visualization and Computer Graphics, 23(1):641-650, 2017.
Gschwandtner, T., W. Aigner, S. Miksch, J. Gärtner, S. Kriglstein, M. Pohl, and N. Suchy, TimeCleanser: A Visual Analytics Approach for Data Cleansing of Time-Oriented Data, 14th International Conference on Knowledge Technologies and Data-driven Business (i-KNOW 2014), Graz, Austria, ACM Press, pp. 1-8, 2014.
Bögl, M., W. Aigner, P. Filzmoser, T. Gschwandtner, T. Lammarsch, S. Miksch, and A. Rind, Visually and Statistically Guided Imputation of Missing Values in Univariate Seasonal Time Series, Poster Proceedings of the IEEE Visualization Conference 2015, Chicago, USA, 2015.

Implementation:

Design and prototypical implementation in D3 (Data-Driven Documents), Vega, Vega-Lite, or a similar framework

Previous knowledge:

Visual Analytics
Python, Java/JavaScript

Contact

Silvia Miksch

Further information

Topics

Data Cleansing, Visual Analytics, Anomaly Detection

Area

Data Quality

Visual Analytics (VA)

Language Not specified

Scope

Project page

Data Quality

Status

open