Before you launch into your data quality initiative, agree on the exact terms you'll be using for the programme. Bryn Davies explains why this is so important and provides a guideline.
One of the biggest stumbling blocks we see in data quality initiatives is the lack of a common, agreed vocabulary within an organisation for the associated activities. Whilst highly sophisticated and very capable data quality technology is available today, if everyone isn't on the same page regarding the terminology used for a task, no tool can make the initiative run efficiently.
Profiling or Assessing
The type of work involved in improving data quality is new to many people, so everyone arrives with their own preconceived ideas and associated language. For example, data profiling makes up only a small portion of measuring data quality, and is focused mainly on the shape and form of the data, whilst a full-blown data quality assessment must measure that data for compliance against both generic and client-specific rules.
When it comes to data "cleansing", there are a number of very specific activities involved in what is typically a highly complex process. When different people use the same word with different meanings, this naturally leads to unintended miscommunication and misunderstandings, which cause project delays and frustration. If the data cleansing is part of, for example, a data migration project, which in turn is part of a large ERP implementation, this seemingly small point of a common language can have serious consequences. It is therefore recommended to break 'cleansing' down into discrete steps such as standardising, de-noising and parsing, as each performs a very specific action on the data, and all must relate back to, and be kept in sync with, the validation rules described above.
In order to help avoid costly mistakes, delays and budget overruns, InfoBluePrint has developed a glossary of terminology which can be used to align everybody upfront. However, since all organisations and projects are different, this is a guideline that may be tweaked and tuned to fit your specific circumstances, your existing terminology and the terms used by your chosen data quality tool. The point is to make sure that everybody uses the same vocabulary from the get-go. It is also important to note that the activities involved in "assessing" and "cleansing" data are closely associated with the specific data quality dimensions that need to be focused on, and these in turn need to be agreed up front. These dimensions will be the subject of a future article.
What follows is a general guideline that can help kick-start your data quality programme, and which you can customise to suit your own needs. We hope that you find it useful - please feel free to provide feedback and to engage with us.
Profiling
Measuring the quality of the data across a range of standard dimensions, such as completeness, datatype, maximum value, minimum value, pattern formats and uniqueness. For example:
- Does the ID Number field contain only unique numbers?
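A basic profiling pass of this kind can be sketched in a few lines of Python. The column values and field names below are illustrative, not taken from any real data set:

```python
def profile_column(values):
    """Compute basic profile metrics for a list of field values."""
    non_blank = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "completeness": len(non_blank) / len(values) if values else 0.0,
        "unique": len(set(non_blank)) == len(non_blank),  # e.g. the ID Number check
        "min": min(non_blank) if non_blank else None,
        "max": max(non_blank) if non_blank else None,
    }

# Example: an ID Number column with one duplicate and one blank
ids = ["7205010198082", "8001015009087", "7205010198082", ""]
profile = profile_column(ids)
# profile["unique"] is False because one ID appears twice
```

Note that this only describes the shape of the data; it says nothing about whether any individual ID number is actually valid, which is where assessment comes in.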
Assessing
Checking data for conformance to business rules across a range of generic complex dimensions as well as client-specific rules. These checks may or may not require comparison to reliable reference data sets and/or, in the case of the accuracy dimension, to the actual real-world object. For example:
- Does the data contain any codes that are not on the reference list of valid Product Codes?
- Does the data conform to the set standard format?
- Combination of fields: does the suburb match the postcode?
- Does the ID number belong to the person?
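A reference-data check of the kind described above could be sketched as follows. The product codes and field name are invented for illustration:

```python
# Illustrative reference list of valid Product Codes
VALID_PRODUCT_CODES = {"P100", "P200", "P300"}

def invalid_product_codes(records):
    """Return every code that does not appear on the reference list."""
    return [r["product_code"] for r in records
            if r["product_code"] not in VALID_PRODUCT_CODES]

records = [{"product_code": "P100"}, {"product_code": "P999"}]
failures = invalid_product_codes(records)  # ["P999"] fails the check
```

Checks against the real-world object (accuracy) cannot be automated this way; they need an authoritative external source or manual verification.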
Calculating a value (%) to provide an indication of the reliability of an attribute, record or data set. For example:
- A particular record has an overall score of 85% reliability, and can therefore be considered for a marketing campaign
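One simple way to produce such a score is a weighted percentage of passed quality checks. The check names and weights here are assumptions chosen so that the example record scores 85%:

```python
def record_score(checks, weights):
    """Weighted percentage of passed quality checks for one record."""
    total = sum(weights.values())
    passed = sum(weights[name] for name, ok in checks.items() if ok)
    return round(100 * passed / total)

checks = {"id_valid": True, "address_complete": True, "phone_valid": False}
weights = {"id_valid": 50, "address_complete": 35, "phone_valid": 15}
score = record_score(checks, weights)  # 85 -> usable for the marketing campaign
```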
Identifying the source record or data field and routing it to its destination record or field, for example:
- Source field ADD4 = Destination field ADD_CITY
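In its simplest form such a mapping is just a lookup table from source field names to destination field names. ADD4 and ADD_CITY come from the example above; the second pair is an invented addition:

```python
# Source-to-destination field map; ADD1/ADD_STREET is illustrative
FIELD_MAP = {"ADD4": "ADD_CITY", "ADD1": "ADD_STREET"}

def map_record(source):
    """Route each source field to its destination field, passing unmapped fields through."""
    return {FIELD_MAP.get(k, k): v for k, v in source.items()}

mapped = map_record({"ADD4": "Cape Town"})
# {"ADD_CITY": "Cape Town"}
```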
Parsing
Resolving data into its component parts and locating each part in its correct field, for example:
- Name split into First Name, Surname
- Remove ID number from the Address3 field and place it in the SAID field
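The name-splitting example above might be sketched as follows. This is deliberately naive (it treats the last token as the surname); real-world name parsing needs far more sophisticated handling of titles, initials and multi-word surnames:

```python
def parse_name(full_name):
    """Split a full name into first name and surname (naive: last token is the surname)."""
    parts = full_name.strip().split()
    return {"first_name": " ".join(parts[:-1]), "surname": parts[-1]}

parsed = parse_name("Bryn Davies")
# {"first_name": "Bryn", "surname": "Davies"}
```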
De-noising
Removing irrelevant data such as blanks or special characters, for example:
- Leading and/or trailing blanks
- "/" in a telephone number
- "*" in the name of a natural person
Standardising
Applying minor modifications so that the data conforms to a standard, for example:
- Mr, mr or MR = Mr.
- Rd or Rd. = Road
- Cc, c c, c.c = c.c.
- Replace "Kaapstad" with "Cape Town"
- Correct "Clairmont" to "Claremont"
Allocating data to a distinct category. For example:
- Postal Address vs Physical Address
- ID type = "SAID" or "Passport"
Matching
Identifying relationships to determine record linkages. There are two types of matching. Identical: a hard identifier is used to categorically establish the relationship. This may involve the comparison of one or more data fields, sometimes through the construction of a hash key for faster comparisons. For example:
- ID number in record 23 = ID number in record 350
- Bank account number and VAT number are the same across three different records
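The hash-key technique mentioned above can be sketched like this, using Python's standard hashlib; the field names are illustrative:

```python
import hashlib

def match_key(record, fields=("id_number",)):
    """Build a hash key over one or more hard-identifier fields for fast comparison."""
    raw = "|".join(str(record[f]).strip() for f in fields)
    return hashlib.sha256(raw.encode()).hexdigest()

r23 = {"id_number": "7205010198082"}
r350 = {"id_number": "7205010198082"}
# Identical hash keys => categorical match between record 23 and record 350
same = match_key(r23) == match_key(r350)  # True
```

Comparing short fixed-length keys scales much better than comparing several raw fields across millions of record pairs.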
Fuzzy: one or more data fields containing similar values are used to identify potential matches. A match is based on a percentage threshold. For example:
- If the company name and address are greater than 95% similar, then they relate to the same instance
- If the company name and address are less than 60% similar, then they are unique
- If the company name and address are between 60% and 95% similar, then manual review is required
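The three-band decision above can be sketched with Python's standard difflib similarity measure. Real data quality tools use more specialised algorithms, and the 60%/95% thresholds are simply the ones from the example:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Percentage similarity (0-100) between two strings."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify_pair(a, b, lower=60, upper=95):
    """Apply the three-band fuzzy matching rule: match, unique, or manual review."""
    score = similarity(a, b)
    if score > upper:
        return "same instance"
    if score < lower:
        return "unique"
    return "manual review"

classify_pair("Standard Bank", "Standard Bank")  # "same instance"
```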
Consolidating data from separate records that have been matched and identified as duplicates into one record, for example:
- Combining two records for the same Customer into one Golden Record
The rules for which record is chosen as the Golden Record, and which attributes from which duplicate records should be used in the Golden Record, are known as Survivorship
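Survivorship can be expressed as a per-attribute rule that picks the surviving value from the duplicates. The "last non-empty value wins" rule below is just one illustrative choice; real rules might prefer the most recently updated or most trusted source:

```python
def merge_records(duplicates, survivorship):
    """Build a Golden Record by applying a survivorship rule per attribute.
    survivorship maps each attribute to a function that chooses the surviving value."""
    return {attr: choose([r.get(attr) for r in duplicates])
            for attr, choose in survivorship.items()}

def last_non_empty(values):
    """Illustrative survivorship rule: the last non-empty value wins."""
    return next(v for v in reversed(values) if v)

dups = [
    {"name": "J. Smith", "phone": "0215551234"},
    {"name": "John Smith", "phone": ""},
]
golden = merge_records(dups, {"name": last_non_empty, "phone": last_non_empty})
# {"name": "John Smith", "phone": "0215551234"}
```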
Physically deleting or purging separate records that have been matched and identified as duplicates, for example:
- Deleting two out of three duplicate records
The rules for which record is chosen as the Surviving Record, and which attributes from which duplicate records should be used in the Surviving Record, are known as Survivorship
Sustaining, via for example keys, the relationship between duplicate and/or related records. For example:
- Identifying one record as a Master Record and permanently linking it to the remaining matched records.
Identifying, through matching, a relationship between records in some meaningful way. For example:
- Grouping all records with similar surnames at the same address
- Grouping all records with the same bank account number
- Building a hierarchy of corporate customers. For example Distell is a subsidiary of SAB, or Standard Bank Adderley Street is a branch of Standard Bank SA
Adding new information from an existing available source (no extra cost), for example:
- Assigning a Region Code to a Customer Record
Adding new information from a new source (usually with additional costs), for example:
- Engaging a third-party supplier to obtain LSM data or SIC codes for customers
Creating data based on another source, for example:
- ID No = 7205010198082, therefore DOB = 01/05/1972
- ID No = 7205010198082, therefore gender = female
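The two derivations above follow from the structure of a South African ID number (YYMMDD SSSS C A Z, where the sequence digits SSSS below 5000 indicate a female). A minimal sketch, with a naive century rule flagged as an assumption:

```python
from datetime import date

def derive_from_said(id_number):
    """Derive date of birth and gender from a South African ID number.
    The century is inferred naively from the two-digit year (an assumption;
    production rules need more care around the 1900/2000 boundary)."""
    yy, mm, dd = int(id_number[0:2]), int(id_number[2:4]), int(id_number[4:6])
    year = 1900 + yy if yy > date.today().year % 100 else 2000 + yy
    gender = "female" if int(id_number[6:10]) < 5000 else "male"
    return date(year, mm, dd), gender

dob, gender = derive_from_said("7205010198082")
# dob = 1 May 1972, gender = "female" (sequence 0198 < 5000)
```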