Semi-Automated Data Cleansing of Multivariate Time Series

Problem

Many application domains, such as the energy sector and industrial quality management, involve large numbers of time series. However, such data often suffers from data quality problems such as missing values, outliers, and other types of anomalies. For various downstream tasks, it is not sufficient merely to detect such problems; the data must also be cleansed, e.g., by replacing missing values with plausible estimates. Doing this manually for regularly acquired data can become very time-consuming. On the other hand, fully automated data cleansing may cause domain experts to lose trust in the data.

Aim

The goal of this work is to design and implement a software prototype that supports a semi-automated process of cleansing time series data. Based on (existing) automated checks for detecting data quality problems, the key idea is to offer the user different cleansing mechanisms, which the system suggests in a context-specific way. The options available to the user should range from a fully automated "cleanse everything" action to a detailed manual inspection of each detected problem with an individual choice of cleansing strategy. This also involves identifying novel interaction techniques for specifying data transformations, such as offsets, directly within well-known visualizations, e.g., time-series plots, scatterplots, and histograms. In addition, the prototype should keep track of which data has been modified and in which way. This provenance information should be communicated to the user in an unobtrusive way.
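To make the intended workflow more concrete, the following is a minimal sketch in Python, not a specification of the prototype. It assumes that missing values are represented as None and that linear interpolation is the only suggested strategy; the names CleansingSession, Suggestion, suggest, apply, and cleanse_everything are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Suggestion:
    """A cleansing action proposed by the system (hypothetical structure)."""
    index: int        # position of the detected problem in the series
    problem: str      # kind of problem, here always "missing"
    strategy: str     # proposed cleansing strategy
    new_value: float  # value the strategy would write


@dataclass
class CleansingSession:
    """Holds one series plus a provenance log of every modification."""
    values: List[Optional[float]]
    provenance: List[dict] = field(default_factory=list)

    def suggest(self) -> List[Suggestion]:
        """Propose linear interpolation for each missing value (None)."""
        suggestions = []
        n = len(self.values)
        for i, v in enumerate(self.values):
            if v is not None:
                continue
            # Nearest known neighbours on either side of the gap.
            left = next((j for j in range(i - 1, -1, -1)
                         if self.values[j] is not None), None)
            right = next((j for j in range(i + 1, n)
                          if self.values[j] is not None), None)
            if left is None or right is None:
                continue  # no context on one side: leave for manual inspection
            frac = (i - left) / (right - left)
            est = self.values[left] + frac * (self.values[right] - self.values[left])
            suggestions.append(Suggestion(i, "missing", "linear_interpolation", est))
        return suggestions

    def apply(self, s: Suggestion) -> None:
        """Apply a single suggestion and record what was changed and how."""
        self.provenance.append({
            "index": s.index,
            "old": self.values[s.index],
            "new": s.new_value,
            "strategy": s.strategy,
        })
        self.values[s.index] = s.new_value

    def cleanse_everything(self) -> None:
        """Fully automated path: accept every current suggestion."""
        for s in self.suggest():
            self.apply(s)
```

In this sketch, a user who wants full control would iterate over suggest() and call apply() only for the suggestions they accept, while cleanse_everything() corresponds to the fully automated end of the range. The provenance list is what a visualization could later use to mark modified values.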

Other information

Starting point(s) for research:

Collaboration:

  1. design and prototype implementation based on the Visual Analytics software platform "Visplore" (C++, OpenGL).
  2. design and prototype implementation in D3 (Data-Driven Documents), Vega, or Vega-Lite, etc.

Previous knowledge:

  • Visual Analytics
  • C++, OpenGL, Python, Java/JavaScript

Contact

Further information

Topics
Data Cleansing, Visual Analytics, Anomaly Detection
Area
Data Quality
Not specified
Scope
SE
BA
PR
MA
Project page
Status
open