Semi-Automated Data Cleansing of Multivariate Time Series
Problem
Many application domains, e.g., the energy sector and industrial quality management, involve large numbers of time series. Such data is often afflicted by data quality problems like missing values, outliers, and other types of anomalies. For various downstream tasks, it is not sufficient merely to detect such problems; the data must also be cleansed, e.g., by replacing missing values with plausible estimates. Doing this manually for regularly acquired data can become very time-consuming. Fully automated data cleansing, on the other hand, may cause domain experts to lose trust in the data.
Aim
The goal of this work is to design and implement a software prototype that supports a semi-automated process of cleansing time series data. Building on (existing) automated checks for detecting data quality problems, the key idea is to offer the user different mechanisms for cleansing the detected problems, which the system suggests in a context-specific way. The user's options should range from a fully automated "cleanse everything" action to a detailed manual inspection of each detected problem with an individual choice of cleansing strategy. This also involves identifying novel interaction techniques for specifying data transformations such as offsets directly within well-known visualizations, e.g., time-series plots, scatterplots, and histograms. In addition, the prototype should keep track of which data has been modified and in which way. This provenance information should be communicated to the user in an unobtrusive way.
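As a rough illustration of the intended workflow (detect, suggest, cleanse, record provenance), the following minimal Python sketch may help. It is not part of Visplore or any of the cited systems. All names (`Issue`, `ProvenanceRecord`, `detect_issues`, `cleanse`) are hypothetical, and the detection rule (median absolute deviation), the threshold of 5.0, and linear interpolation as the only suggested strategy are assumptions chosen purely for brevity.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Set, Tuple


@dataclass
class Issue:
    index: int
    kind: str        # "missing" or "outlier"
    suggestion: str  # name of the suggested cleansing strategy (hypothetical)


@dataclass
class ProvenanceRecord:
    index: int
    original: Optional[float]
    cleansed: float
    strategy: str


def detect_issues(values: List[Optional[float]], threshold: float = 5.0) -> List[Issue]:
    """Flag missing values and outliers (assumed rule: distance from the median in MAD units)."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median([abs(v - med) for v in present])
    issues = []
    for i, v in enumerate(values):
        if v is None:
            issues.append(Issue(i, "missing", "linear_interpolation"))
        elif mad > 0 and abs(v - med) / mad > threshold:
            issues.append(Issue(i, "outlier", "linear_interpolation"))
    return issues


def interpolate(values: List[Optional[float]], index: int, flagged: Set[int]) -> float:
    """Linearly interpolate from the nearest unflagged neighbours on each side."""
    left = next((j for j in range(index - 1, -1, -1) if j not in flagged), None)
    right = next((j for j in range(index + 1, len(values)) if j not in flagged), None)
    if left is None and right is None:
        raise ValueError("no trustworthy neighbour available")
    if left is None:
        return values[right]
    if right is None:
        return values[left]
    weight = (index - left) / (right - left)
    return values[left] + weight * (values[right] - values[left])


def cleanse(
    values: List[Optional[float]],
    issues: List[Issue],
    accepted: Optional[Set[int]] = None,
) -> Tuple[List[Optional[float]], List[ProvenanceRecord]]:
    """Apply suggestions; accepted=None means "cleanse everything",
    otherwise only the indices the user confirmed are modified."""
    flagged = {issue.index for issue in issues}
    result = list(values)
    log = []
    for issue in issues:
        if accepted is not None and issue.index not in accepted:
            continue
        new_value = interpolate(values, issue.index, flagged)
        log.append(ProvenanceRecord(issue.index, values[issue.index], new_value, issue.suggestion))
        result[issue.index] = new_value
    return result, log


series = [10.0, 11.0, None, 13.0, 95.0, 15.0, 16.0]
issues = detect_issues(series)
cleaned, log = cleanse(series, issues)                 # fully automated
partial, partial_log = cleanse(series, issues, {2})    # user accepts only index 2
```

In this sketch, passing `accepted=None` corresponds to the "cleanse everything" action, while passing a set of indices models a user who inspects each detected problem and confirms only some suggestions. The returned log retains the original value and the applied strategy for every modified point, which is the kind of provenance information a visualization could later display.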
Other information
Starting point(s) for research:
- Arbesser, C., F. Spechtenhauser, T. Mühlbacher, and H. Piringer, Visplause: Visual Data Quality Assessment of Many Time Series Using Plausibility Checks, IEEE Transactions on Visualization and Computer Graphics, 23(1):641-650, 2017.
- TimeCleanser: A Visual Analytics Approach for Data Cleansing of Time-Oriented Data, 14th International Conference on Knowledge Technologies and Data-driven Business (i-KNOW 2014), Graz, Austria, ACM Press, pp. 1-8, 2014.
- Visually and Statistically Guided Imputation of Missing Values in Univariate Seasonal Time Series, Poster Proceedings of the IEEE Visualization Conference 2015, Chicago, USA, 2015.
Collaboration:
- This project is performed in close cooperation with the Visual Analytics Group at the VRVis research center (Harald Piringer, hp [at] vrvis.at)
- There are two ways to tackle and evaluate this problem:
  - Design and prototype implementation based on the Visual Analytics software platform "Visplore" (C++, OpenGL).
    - A short description and a video about previous work on data quality assessment can be found at http://download.vrvis.at/va/papers/visplause/visplause.html
    - German PR video of VRVis illustrating the Visplore software framework for industrial manufacturing data: http://download.vrvis.at/va/video/IndustrialQM.mp4
  - Design and prototype implementation in D3 (Data-Driven Documents), Vega, Vega-Lite, or similar.
Previous knowledge:
- Visual Analytics
- C++, OpenGL, Python, Java/JavaScript