Monday, November 11 2013
Keywords: Data Quality, Accuracy, Completeness, Conformity, Consistency, Duplication, Validity
The six pillars of Data Quality
We live in an “Agile” world…developers, PMs, POs and managers are all concerned about getting the “features” done…database developers are pressed to comply with whichever agile methodology the company has implemented and to develop and fix bugs as quickly as possible…there should be no database programming bottlenecks! Proclaims the young Jedi Scrum master…most developers get their hands dirty with the database because they don’t want to wait for a specialist who seems to be the bottleneck…everyone is thinking about the Beta Release... Tables are created in the database on daily basis by enthusiastic members of the agile team…everything looks fine at first glance…but one thing has got completely neglected and only the DBA knows that: DATA QUALITY!
I present to you, the six pillars of data quality. To provide examples, let’s create an imaginary company called ‘Bird Watchers Ltd.’ that publishes information in various formats including a hard copy called “Bird Watching!” magazine. They would like to send a Thank You $10 rebate coupon to all their customers in South America who have been a paid member for at least 5 years.
Do the columns in the table represent factual values? “Client X has been a paid member for 8 years.” 8 is a reasonable and valid number (as opposed to 208 or 1039). However the fact is that client X has only been a member for 4 years thus not meriting a coupon. The number is valid…yes…but not accurate at all. Client Y lives in Peru and has been a member for 6 years. But wait! He doesn’t really live in Peru, he used to…he’s been living in Sweden for the last 6 years…so Peru is a valid country name…but it is not accurate that the client has been living there for the last 6 years.
How complete is our Clients database? We lost the history of 25% of our clients due to a disaster? Then the seniority field is only 75% complete. We don’t have the full mailing address of 20% of our South American members? Then our address field is only 80% complete. We only know the residing country of 50% of our client base? Then our country field is only 50% complete.
Your ASP.NET’s txtCountry.text field is returning the name of the country that client had literally typed in, but our customer service group uses a different app that returns country codes, So Chile and 173 are both referring to the same country. We have a conformity problem on hand. What’s more is that for publication type, different systems had input different values, per example: magazine, ‘paper mag’, 3, ‘Yes’, ‘True’ and finally the actual proper name of the publication in some cases.
Peru, perou, pe and pérou are all referring to the same country, but they have different spelling. So do the words mag, magazine, magazin, revue, paper format, paper mag, and mze.
Are there multiple records in the database representing the same exact customer? Perhaps the customer moved and has a different address. Fuzzy Logic should be used here to find and eliminate as much duplicates as possible.
Values should be within a reasonable and valid range. Per example if country codes start from 1, then there couldn’t be a country code 400. There can’t be a client with age of 200. There couldn’t be a client who has membership to -2 magazines. And a client cannot be a member since 130 years. Constraints should be used wherever possible, not only to validate data but also using them optimizes the performance as the SQL Server optimizer engine is very well aware of their existence and does use them in its algorithms in a good way to enhance performance.
Written by Ramin Haghighat