Data Profiling: Identifying Data Quality Problems

Problem: 

Suitable data formats and high data quality are essential prerequisites for almost any form of data analysis and exploration. In practice, however, hardly any real-world data set is free of wrong or missing data.

As a first step towards good data quality, the data has to be checked for various quality problems. Any problems found then need to be communicated to the user.

Aim: 

The candidate will implement a tool that checks a data table against predefined data quality checks (e.g., missing data, wrong data formats). In addition, the tool should allow users to flexibly define their own data quality checks (possibly via regular expressions). For instance, a user who wants to check all email addresses for correct syntax needs to define a quality check that makes sure all email addresses are composed like this (simplified example):

{small letters}* ‘@’ {small letters, ‘.’}*
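As a rough illustration, the simplified pattern above could be written as a Java regular expression and applied to one column of a table. This is only a sketch under assumptions: the class name `RegexCheck`, the method `findViolations`, and the representation of a column as a `List<String>` are hypothetical and not part of the topic description. Missing (null) values are reported as violations here for brevity, although the tool would probably handle them in a separate predefined check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RegexCheck {
    // Assumed translation of the simplified example:
    // {small letters}* '@' {small letters, '.'}*
    static final Pattern SIMPLE_EMAIL = Pattern.compile("[a-z]*@[a-z.]*");

    // Returns the row indices whose value is null or does not fully match the pattern.
    static List<Integer> findViolations(List<String> column, Pattern pattern) {
        List<Integer> violations = new ArrayList<>();
        for (int i = 0; i < column.size(); i++) {
            String value = column.get(i);
            if (value == null || !pattern.matcher(value).matches()) {
                violations.add(i);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<String> emails = Arrays.asList(
                "alice@example.org", "Bob@example.org", "no-at-sign", null);
        // Rows 1, 2 and 3 violate the check (capital letter, missing '@', missing value).
        System.out.println(findViolations(emails, SIMPLE_EMAIL));
    }
}
```

Because the pattern is deliberately simplified, it also accepts degenerate values such as a lone "@"; a user-defined check in the actual tool would likely be stricter.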

The problems found should be communicated to the user visually (e.g., highlighting, simple charts).
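As a stand-in for a simple chart, the number of problems per column could be rendered as a text bar chart. The sketch below is an assumption for illustration only: the class `ProblemSummary`, the method `render`, and the example counts are hypothetical, and a real implementation would more likely use a graphical toolkit such as prefuse.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProblemSummary {
    // Renders one line per column: column name, a bar of '#' characters, and the count.
    static String render(Map<String, Integer> problemsPerColumn) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : problemsPerColumn.entrySet()) {
            out.append(entry.getKey()).append(" | ");
            for (int i = 0; i < entry.getValue(); i++) {
                out.append('#');
            }
            out.append(' ').append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical counts; LinkedHashMap keeps the columns in insertion order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("email", 3);
        counts.put("age", 1);
        System.out.print(render(counts));
    }
}
```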

Topics: 
Data Analysis, Data Quality, Data Profiling
Previous knowledge: 
Java, optional: prefuse
Scope: 
BA, PR, MA
Assigned as: 
Master's thesis (Diplomarbeit)
Contact: 
Theresia Gschwandtner, by appointment, gschwandtner [at] ifs.tuwien.ac.at
Area: 
Visual Analytics (VA)
Status: 
in progress