This article is part of a three-part series that addresses the end-to-end process of improving Master Data from a Data Analyst's perspective. The series covers the following topics:
Part 1: Identifying Data Quality issues
Part 2: Data profiling and assessment, and
Part 3: Data cleansing.
Data profiling entails gathering data statistics from the source system's database (we highly recommend using a Data Quality profiling tool for this). These statistics can provide useful information, such as the distribution of customers across regions (e.g. 80% of customers are from Gauteng) or the completeness of key attributes (e.g. 60% of customers do not have an email address).
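As a minimal illustration of what such profiling statistics look like (assuming the data has been extracted into a pandas DataFrame; the table and column names are hypothetical, and a dedicated profiling tool would produce far richer output):

```python
import pandas as pd

# Hypothetical extract of a customer table; in practice this would be
# read from the source database or produced by a profiling tool.
customers = pd.DataFrame({
    "region": ["Gauteng", "Gauteng", "Western Cape", "Gauteng", None],
    "email":  ["a@x.co.za", None, None, "d@x.co.za", None],
})

for col in customers.columns:
    series = customers[col]
    print(f"--- {col} ---")
    print(f"populated: {series.notna().mean():.0%}")          # completeness
    print(series.value_counts(normalize=True, dropna=True))   # distribution
```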
The following elements are crucial in enabling a Data Analyst to profile data:
• Scope the data for inclusion
Ideally, all the columns in all identified tables should be profiled. Fortunately, the Data Analyst can inspect large tables via queries to the database and omit columns with no data at all, or columns containing only a default value.
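A rough sketch of this scoping filter, again assuming a pandas DataFrame extract (in practice this would more likely be a set of queries against the source database):

```python
def columns_to_omit(df):
    """Flag columns with no data at all, or with a single constant value
    (possibly a system default), as candidates for exclusion from scope.
    Whether a constant is truly a default needs human confirmation."""
    omit = []
    for col in df.columns:
        values = df[col].dropna()
        if values.empty or values.nunique() == 1:
            omit.append(col)
    return omit
```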
• Analyse the data
The analysis of the statistics produced by profiling leads the Data Analyst to ask the business stakeholders specific questions. For example, if only 0.5% of the customer base is from foreign countries, should low priority be given to cleansing that subset of customer data? The answer is not that straightforward, as those few customers could be high-value customers. In another example, if the Gender column of the customer record is sparsely populated, it might not offer value to the business and would therefore not be a priority for improvement. While analysing the data, the Data Analyst must keep a record of the findings and questions relating to the profiled tables and columns. Analysing data inevitably leads to recommendations to the business that would prevent bad data from being captured - for example, preventing free text where a drop-down list of values is applicable.
• Present statistics - paint a picture of the as-is
The Data Analyst must present the statistics with a view to demonstrating the shortcomings in the Data Quality - in other words, highlighting the areas for improvement. The statistics must be accompanied by examples of the actual incorrect data for a better understanding of the problem. The presentation should be concise and must highlight the most relevant data problems in relation to the pain points communicated by the business. The presentation should conclude with recommendations. For example, if a single company name field contains both the registered name and the trading name, the recommendation would be to store them in separate fields in the database.
• Prioritise – size matters
Improving the quality of data should be done iteratively, starting with the attributes that would be most beneficial in relieving business pain points - for example, customer contact details (telephone number and email address). The Data Analyst must also define the scope of each iteration. The trick here is to keep the scope of the first iteration small enough to show business value in the shortest possible time. This is key to the success of the whole project.
The Data Analyst must document a comprehensive overview of the profiling findings. This document must include all the profiled tables, the findings, and the recommendations to the business. Furthermore, the attributes identified for assessment and improvement must be listed and divided into iterations based on attribute priority. Any findings that led to exclusion from scope should also be noted here; this is particularly useful when disputes arise later in the project.
Whereas profiling is done on the source data itself, the assessment is done in a separate environment, where a generic model is used to assess the data attributes. For example, if a customer name is captured in more than one source system and persisted to different databases, each source's customer name is mapped to a single generic customer name attribute in the assessment model.
It is useful to create a logical data model of the in-scope source data elements and the assessment attributes, as well as to define the mapping from the source to the assessment model attributes. This model is then used to physically create the assessment tables/views, and it keeps the links back to the source data elements.
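A minimal sketch of such a source-to-assessment mapping, assuming two hypothetical source extracts (the system names, DataFrames, and column names are illustrative; a real implementation would typically be views in the assessment database):

```python
import pandas as pd

# Hypothetical extracts from two source systems, each with its own
# column naming for the customer name.
crm = pd.DataFrame({"cust_name": ["Acme (Pty) Ltd", "Beta Ltd"]})
erp = pd.DataFrame({"customer_full_name": ["Acme (Pty) Ltd t/a Acme"]})

# Source-to-assessment mapping, kept alongside the logical model so the
# link back to the source data elements is preserved.
mapping = [
    ("CRM", crm, "cust_name"),
    ("ERP", erp, "customer_full_name"),
]

# Build the generic assessment attribute "customer_name".
assessment = pd.concat(
    [df[[col]].rename(columns={col: "customer_name"}).assign(source=src)
     for src, df, col in mapping],
    ignore_index=True,
)
print(assessment)
```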
The Data Analyst uses the logical assessment model entity and attribute names to proceed to the next steps in the data assessment process:
• Define business and data assessment rules
Assessing the data is done via a set of data rules that, when executed, produce a score reflecting the quality of the data measured against those rules. At the lowest level, the score is per rule for one attribute (or set of attributes), but it can be rolled up into one score for the entire data domain. The domain-level score is most useful when reporting on data quality to executives. By executing these data rules continuously at regular intervals, the improvement or degradation of the data can be made visible.
The Data Analyst uses the data standards document, the notes from the Analysis phase, the profile overview document and the logical assessment data model to identify and document the business rules for the in-scope data elements, along with the related data rules used to assess the data. For example, to assess a company name, the business rule might be: “Company name must be authenticated” and, tied to this business rule, the data rules would be “Company name must not contain the trading name” and “Company name must be in title case”.
When the Data Analyst defines the data rules, they must also document all the relevant information - at a minimum, the rule identifier, rule definition, rule classification, rule dimension, and rule weight. This information, once implemented, is useful for creating data scorecards and reporting on data quality per dimension.
The Business Data Steward uses this document to evaluate all the rules before implementation, ensuring that the rules correlate with business expectations before they are developed. The Developer then uses the information documented for each data rule to create and execute the rule. It is highly recommended that a Data Quality tool is used to implement these rules; a tool simplifies the creation of scorecards and gives the Business Data Steward a drill-down capability to inspect, at row level, where an attribute has failed a specific data rule.
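To make the rule metadata and scoring concrete, here is a minimal sketch using the company name example above. The rule identifiers, dimensions, and weights are illustrative only; a Data Quality tool would manage all of this for you:

```python
# Each rule carries the minimum metadata described above; definitions,
# dimensions, and weights here are illustrative, not prescriptive.
RULES = [
    {"id": "CN-001",
     "definition": "Company name must not contain the trading name",
     "dimension": "Validity", "weight": 0.6,
     "check": lambda name: " t/a " not in name.lower()},
    {"id": "CN-002",
     "definition": "Company name must be in title case",
     "dimension": "Consistency", "weight": 0.4,
     "check": lambda name: name == name.title()},
]

def score_rules(values, rules):
    """Return a per-rule pass rate and a weighted roll-up score."""
    results = {}
    for rule in rules:
        passed = sum(1 for v in values if rule["check"](v))
        results[rule["id"]] = passed / len(values)
    rollup = sum(results[r["id"]] * r["weight"] for r in rules)
    return results, rollup

names = ["Acme Holdings", "beta ltd t/a Beta", "Gamma Traders"]
per_rule, domain_score = score_rules(names, RULES)
print(per_rule, f"roll-up: {domain_score:.0%}")
```

The same per-rule results can be rolled up per attribute, per entity, or per domain by adjusting the weighting, which is what makes the domain-level executive score possible.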
• Identify Data entry validation
Data rules that expose invalid values entered into the system can often also be classified as validation rules to be implemented at the point of data entry. For example, a data rule that exposes invalid email addresses can be implemented in the source system as a validation rule that rejects an email address from being saved to the database if it does not comply with a valid email format.
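A sketch of such a validation rule, using a deliberately simplified email pattern (production systems would typically rely on a vetted library or the capture application's built-in validation):

```python
import re

# Simplified format check: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_email(value: str) -> bool:
    """Return True if the value looks like a well-formed email address."""
    return bool(EMAIL_PATTERN.match(value))

# At data entry, the same rule rejects the record before it is saved:
assert validate_email("info@example.co.za")
assert not validate_email("info@example")      # missing top-level domain
```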
• Document assessment results
The Data Analyst analyses the failing data per rule and documents the findings in an assessment measurement report. These findings include a summary of the assessment results - for example, 10% of VAT numbers contain a company registration number, or 80% of South African Company Registration Numbers are not in a consistent, valid format. The assessment measurement report must also reflect the data quality scores, together with the parameters used to calculate the scores. It is important to note that the assessment results are based on data extracted from the source at a specific point in time, not on real-time data.
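The kind of summary statistic quoted above is straightforward to compute once the rules exist. A sketch, with an illustrative registration number pattern (the common YYYY/NNNNNN/NN form; confirm the exact format against the data standards document):

```python
import re

# Illustrative pattern for a South African company registration number.
REG_NO = re.compile(r"^\d{4}/\d{6}/\d{2}$")

vat_numbers = ["4123456789", "2001/123456/07", "4987654321"]

# Summary statistic of the kind quoted in the measurement report:
contains_reg_no = sum(1 for v in vat_numbers if REG_NO.match(v))
print(f"{contains_reg_no / len(vat_numbers):.0%} of VAT numbers "
      "contain a company registration number")
```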
• Present assessment results
The Data Analyst must summarise the key data element scores into three categories: “good quality”, “of concern”, and “poor quality”. The Data Analyst must also present the assessment results to business users visually, by means of an infographic, while backing it up with actual scores. The presentation of the assessment findings must also show examples of rule failures for clarity, and must demonstrate the importance of improving the data quality – always tying it back to the business problem being addressed.
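The banding itself is a simple mapping from score to category. A sketch, with thresholds that are purely illustrative and should be agreed with the business:

```python
def band(score: float) -> str:
    """Band a 0-1 data quality score into the three reporting categories.
    The 90% and 70% thresholds are assumptions, not a standard."""
    if score >= 0.9:
        return "good quality"
    if score >= 0.7:
        return "of concern"
    return "poor quality"

scores = {"email": 0.95, "vat_number": 0.78, "company_name": 0.52}
for attribute, score in scores.items():
    print(f"{attribute}: {score:.0%} -> {band(score)}")
```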
• Reports - Define failures at various levels
The Data Analyst must design reports to cover any shortfall the assessment tool may have in reporting on data rule failures - for instance, at attribute level, the number of attribute values passing all rules for that attribute versus the total number of values for the attribute; and at row level, the number of rows where all attributes pass all rules versus the total number of rows. The Developer uses the report design to develop the reports. These reports are helpful to the Business Data Steward who monitors the data quality.
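A sketch of both failure levels, assuming the per-rule outcomes have been collected into a pandas DataFrame (the attribute and rule names are hypothetical):

```python
import pandas as pd

# Hypothetical per-rule outcomes: True = the value passed the rule.
# Rows are records; columns are (attribute, rule) pairs.
outcomes = pd.DataFrame({
    ("email", "valid_format"):  [True, True, False, True],
    ("email", "not_blank"):     [True, False, False, True],
    ("vat_no", "valid_format"): [True, True, True, False],
})

# Attribute level: share of values passing all rules for that attribute.
attr_pass = outcomes.T.groupby(level=0).all().T.mean()
print(attr_pass)

# Row level: share of rows where every attribute passes every rule.
row_pass = outcomes.all(axis=1).mean()
print(f"rows fully passing: {row_pass:.0%}")
```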
Continue reading with Part 3 of 3: Improving Master Data Quality from a Data Analyst’s perspective.
Need to discover how we can help your organisation attain business success through well-managed data?
Please get in touch for more information. Call Bryn Davies on +27 (0)82 941 9634 or email us at info@infoblueprint.co.za.