This article is part of a three-part series which addresses the end-to-end process of improving Master Data, from the perspective of a Data Analyst. This series covers the following topics across three articles:
Part 1: Identifying Data Quality issues
Part 2: Data profiling and assessment, and
Part 3: Data cleansing.
Data cleansing is defined as "...process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted". This section covers what the Data Analyst does in each of these steps of the data cleansing process.
• Develop the cleansing logical data model and mapping from the assessment model
The Data Analyst creates a logical cleansing data model and maps the assessment model to the cleansing model. The logical cleansing data model is used by the Architect to create the physical cleansing tables that would store the cleansed data. The cleansing model must adhere to the table and column naming standard that is used in the physical database tables and columns.
• Define cleansing rules
The Data Analyst analyses the assessment results at a rule level to identify opportunities for automated cleansing. For example, if the data rule: “Company name must be in title case” had records failing the rule, then create the cleansing rule: “Set name to title case”. This cleansing rule must be documented in the same way as the data rule and also be linked to the same business rule that the data rule is linked to. Not all data rule failures can be cleansed automatically; therefore, only document cleansing rules can be automated.
Cleansing rules must be designed for reusability. For example, if the data rule: “Trading name must be in title case” exists, then the same cleansing rule “Set name to title case” can be re-used.
The Data Analyst must also design generic cleansing rules like “Remove leading spaces” or “Remove multiple spaces between words” and re-use them when documenting cleansing rules per attribute.
The Business Data Steward uses this document to evaluate all the cleansing rules before implementation, in the same way as he/she evaluated and approved data rules. Doing so ensures that the cleansing rules correlate with business expectations before they are developed.
• Define cleansing design
Cleansing rules have to be executed in a predefined sequence of steps per attribute. Even the sequence in which attributes are cleansed needs to be documented. It is good practice to start cleansing each attribute by removing leading and trailing spaces. Cleansing normally follows the sequence below:
o Customer name: “ABC Store Pty (Ltd) T/A My ^Store“
Remove spurious characters
o Customer Name: “ABC Store Pty (Ltd) T/A My Store“
o Customer Name: “ABC Store Pty (Ltd) “
Trading Name: “T/A My Store “
o Customer Name: “ABC Store (Pty) Ltd “
Trading Name: “My Store “
o Retrieve the VAT number from an external source using the cleansed Company name: “ABC Store (Pty) Ltd”
VAT number: 4123456789
The Data Analyst must document the order in which all the cleansing rules must be executed at the entity and attribute level. This document must be very clear, noting the entity, attribute, rule identification, rule short definition and steps. The Developer uses this document, together with the documented cleansing rules to create and implement the cleansing rules in the database incrementally. When the rules are executed according to the cleansing design, the last step of the last column cleansed per table will result in a thoroughly cleansed record.
• Develop and Present cleansing results
The Data Analyst must present the data before and after cleansing to the Business Data Steward. If the Data Quality tool used to cleanse the data is not adequately suited for use by the Business Data Steward, the Data Analyst must design a “Cleansing Results” report to show the before and after cleansing results per attribute. The Developer uses this design to create the report, and the Business Data Steward uses this report to inspect the changes made to the data.
The Data Analyst must also demonstrate to business stakeholders that cleansing the data improved the data quality. To achieve this, the cleansed data needs to be re-assessed, using the existing data rules and comparing the data scorecards before and after cleansing.
• Define Audit Report
Because cleansing the data is done incrementally, it is particularly important to keep track of changes made during the cleansing process. The Data analyst must design an audit report to show the data changes of all the attributes included for automated cleaning per step, per entity. This report is useful for problem-solving in an ongoing Data Quality program.
Once the data has been satisfactorly cleansed, it will be possible, if necessary, to move onto the next major phase of the overall master data improvement process, which is that of identifying duplicates eg same customer but different account. It may be useful to understand which customers have multiple accounts and should therefore be merged or linked in some way so as to have a better overall view of the customer, aka a 360 degree view. This matching phase is separate and distinct from the cleansing described above, and should not be performed as part of cleansing, which would result in inconsistency in both data cleansing as well as the match/merge results. Once a confident match is identified, the subsequent action could either be to remove the duplicates (this is seldom done as there may be a good business reason for a customer to have multiple accounts), or to merge into one single master record, or to merely link the matched records using a new group key, which could be the primary key of that record deemed to be the master account. This discussion expands on the “merge” option.
The Data Analyst must define the principles for matching and merging and present it to the Business Data Steward for approval before starting the deduplication process. Here’s an example of a match principle: “Only active customers can be matched”, and here’s one of a merging principle: “The most recent value of an attribute survives” (for two customers with the same name, but different VAT numbers, the most recently captured VAT number survives). These principles are documented in the deduplication strategy developed by the Architect.
• Define match rules
The Data Analyst must analyse the cleansed data for identification of duplicate records and then define the criteria for matching the records. This is done via a set of matching rules. For instance, if one or more customers have the same company name, they are likely to be duplicated. Still, if one or more customers have the same VAT number and company registration number, they are also likely to be duplicated even though the company names might not be exactly the same. Identifying matching rules is done jointly by the Architect and the Data Analyst. The matching rules are defined as exact matching or fuzzy logic matching or a combination of both. It is highly recommended to use a Data Quality tool for matching duplicate records. The level of refinement in the match rules specified by the Data Analyst is subject to the ability of the tool to implement those rules.
The Developer then uses the defined rules to implement and execute the rules against the data. This is an iterative process until the Architect is satisfied that the rules deliver the required matches, eliminating false positives. When using a Data Quality tool, the Business Data Steward would be able to resolve probable matches manually (that being a match or not).
• Define merge rules
The merge principles inform the Data Analyst on how to merge the matched records into a ‘golden record’, which represents the best version of the truth. The Data Analyst must define the merge rules that determine the surviving value per attribute in a set of matched records. For example, for a set of matched customers, the surviving VAT number is determined by the implemented merge rules; e.g. “If only one customer record has a VAT number, that VAT number survives” and “If more than one VAT number exists, the most recent VAT number survives”.
• Present matching and merging
The Data Analyst must present the deduplication of data sets to the business using an example. The presentation must include a set of data that is matched, together with the applicable match rules used to match these records. The Data Analyst must then show the golden record with the applicable merge rules used to create the golden record.
The Data Analyst must also show Data Quality improvement. This is done by re-assessing the golden records and comparing the data scores of the golden records to the scores before cleansing.
It is the responsibility of the Data Analyst to check that all the rules, scorecards, designs and reports are implemented and function as designed.
The Data Analyst uses a test harness (that the Developer created) to set up test scenarios ensuring that each rule functions as designed. Testing is done after completing each Data Quality Improvement process step, ensuring that:
We hope you found this in-depth look at the end-to-end process of improving Master Data insightful.
Need to discover how we can help your organisation attain business success through well-managed data?
Please get in touch for more information. Call Bryn Davies on +27 (0)82 941 9634 or email us email@example.com
Sign up to receive regular Data Management articles and other notifications. *
*By clicking subscribe you consent to receiving communications from InfoBluePrint.