Regex complexity at its finest ...

Market Driven Data Quality (Data Darwinism)

Summary:

A contrarian idea to improve data quality: let “market forces” like peer pressure, earn-in, and competitive prioritization encourage data owners to do their own quality monitoring.

Just trying a little contrarian thought this week …

Have you ever noticed how much time and energy goes into data validation? I think it stems from visual forms development and the wide variety of clever data entry controls that are available – everyone wants to write an app that gets the “oooo, cool!” vote of approval. But how much of that energy spills over from “value added” to “feature creep”?

When your IT peers are showing off their internally developed tools, or when other internal teams put so much creativity into their departmental data collection apps, try stepping back for a moment and taking a look at the amount of development, documentation, training, and maintenance work that gets generated. These amazing, subtle, and visually compelling methods for gathering and validating data can grow into complex validation processes that try to guarantee that only pristine data is ever added to the list.

Is all of this really necessary? Is there real value-add to this approach? Over time, the validation rules get so complex that the code becomes fragile and a burden on future maintenance programmers. Another common problem: many specialized, departmental, and/or narrowly vertical applications accept broad ranges of data, so the rules for permissible values need to be wildly flexible and adaptive.

But how about NOT validating the input? Why not let “market forces” take over?

I am talking about instances where people are trying to get data into a System That Makes Some Problem Visible – for example, a database of projects or technical resource requests that have to be prioritized, or financial data that has to successfully post into a centralized data collection / aggregation system.

It might be easier to just document the requirements for the data, and then let the best quality data survive …

For your Project / Resource Prioritization application, a project will not get added to the prioritization list until all the data is complete and correct. Even when the data is complete, it helps to make the project description easy to understand, compelling, and business relevant – or else someone else will get the resources.

Your monthly data submission has to conform to these [data structure] rules. If it does not conform, it will be kicked out and flagged with errors. You are responsible for getting your data cleaned up, compliant with the specification, and submitted by [the deadline] – otherwise your submission will be late.

Yes, this puts pressure on you to document the data formats and requirements clearly – but that work is probably faster and easier than creating a gallery of brilliant, high-quality, tested, automated rule checkers to validate input. And once the documentation is proven to be complete, correct, and sufficient (i.e. not too complex), it would make a pretty good spec for an automated data validation program.
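To make that concrete, here is a minimal sketch (Python, with made-up field names and rules) of what such a checker might look like once the documentation exists. It reads a submitted CSV file against three documented rules and reports the problems back to the submitter, rather than trying to block bad input at entry time:

import csv
import re
import sys
from datetime import datetime

# Hypothetical rules, copied straight from an imagined one-page spec:
#   cost_center  must look like AA-9999
#   amount       must be a number
#   date         must be YYYY-MM-DD
COST_CENTER = re.compile(r"^[A-Z]{2}-\d{4}$")

def check_row(row, line_no):
    """Return human-readable problems for one submitted row."""
    problems = []
    cc = row.get("cost_center") or ""
    if not COST_CENTER.match(cc):
        problems.append(f"line {line_no}: cost_center '{cc}' is not AA-9999")
    try:
        float(row.get("amount") or "")
    except ValueError:
        problems.append(f"line {line_no}: amount '{row.get('amount')}' is not a number")
    try:
        datetime.strptime(row.get("date") or "", "%Y-%m-%d")
    except ValueError:
        problems.append(f"line {line_no}: date '{row.get('date')}' is not YYYY-MM-DD")
    return problems

def check_submission(path):
    """Flag the whole file; cleaning it up is the submitter's job."""
    errors = []
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            errors.extend(check_row(row, line_no))
    return errors

if __name__ == "__main__":
    errors = check_submission(sys.argv[1])
    for e in errors:
        print(e)
    print("ACCEPTED" if not errors else f"KICKED BACK: {len(errors)} problem(s)")

The point is not the code; it is that the hard work was writing the three rules down in plain language, and the checker just repeats them.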

Just a wacky idea – as system designers, we don’t have to control the world. Try making market forces work in your favor, just like content struggling for readership on the internet or new products looking for sales …

… may the cleanest data win!

Thanks to ex-parrot for the regex in the illustration.

19 February, 2011
