PALEBLUESKY, Author at InfoBluePrint

Using Data Quality to Help Drive Data Governance

As the need for effective data governance becomes more and more critical, we are seeing many organisations battling to get traction on their data governance initiatives. This is not uncommon as, depending on an organisation’s overall data management maturity, implementing data governance can be fraught with challenges, not least some fundamental misunderstandings, difficult politics and in particular a lack of connection to clear business value. The subject of data governance is vast, and InfoBluePrint has developed a framework containing a set of clear guidelines for anyone contemplating establishing data governance in their organisations, the most critical of which is to ensure the development of a business case to clarify the value proposition which must be aligned to corporate goals.

Coupled with this in almost every case is the issue of data quality, or rather non-quality, as a common concern, and one of the most valuable insights that we have gained over the years is to leverage data quality as an effective driver of a data governance initiative, and by extension of overall data management maturity. The reason for this is simple: data quality is a data management functional area where tangible improvement can be measured and shown, supported by meaningful, business aligned reporting. As a yardstick of data governance success, data quality monitoring lends itself as a natural KPI. Furthermore, as data quality processes evolve, they uncover real and practical data management shortcomings that can and should be addressed by data governance functions, roles and responsibilities.

If this is not done, many data governance councils eventually flounder when, quite simply, their members cannot make the connection between time and money spent on formal data governance processes and actual business value. Data quality is an effective means of helping to close that gap. This is a practical approach to data management that targets the sustained improvement of the quality of the underlying data used across the organisation, driven by the bottom-up requirements of the programme whilst triggering the necessary top-down elements of governance and control as and when required. By adopting this approach, data governance roles and responsibilities will have meaningful issues to monitor, manage and address, thus helping to deliver sustainable and value-adding data governance to your organisation.

Need to discover how we can help your organisation attain business success through well-managed data? Please email us at info@infoblueprint.co.za.

Improving Master Data Quality: A Data Quality Analyst’s perspective – Part 3 of 3

This article is part of a three-part series which addresses the end-to-end process of improving Master Data, from the perspective of a Data Analyst. This series covers the following topics across three articles:

Part 1: Identifying Data Quality issues
Part 2: Data profiling and assessment, and
Part 3: Data cleansing.

Data cleansing is defined as "...process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted". This section covers what the Data Analyst does in each of these steps of the data cleansing process.

Data cleansing

• Develop the cleansing logical data model and mapping from the assessment model
The Data Analyst creates a logical cleansing data model and maps the assessment model to the cleansing model. The logical cleansing data model is used by the Architect to create the physical cleansing tables that would store the cleansed data. The cleansing model must adhere to the table and column naming standard that is used in the physical database tables and columns.

• Define cleansing rules
The Data Analyst analyses the assessment results at a rule level to identify opportunities for automated cleansing. For example, if records fail the data rule “Company name must be in title case”, then create the cleansing rule “Set name to title case”. This cleansing rule must be documented in the same way as the data rule and linked to the same business rule that the data rule is linked to. Not all data rule failures can be cleansed automatically; therefore, only document cleansing rules that can be automated.

Cleansing rules must be designed for reusability. For example, if the data rule: “Trading name must be in title case” exists, then the same cleansing rule “Set name to title case” can be re-used.

The Data Analyst must also design generic cleansing rules like “Remove leading spaces” or “Remove multiple spaces between words” and re-use them when documenting cleansing rules per attribute.
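As a rough illustration of reusable cleansing rules, the sketch below defines the generic rules once and attaches them to more than one attribute. The function names, the attribute names and the `CLEANSING_RULES` mapping are hypothetical, and the title-case rule is deliberately naive (it would, for instance, lower-case an acronym such as "ABC"), so a real rule would need exceptions agreed with the Business Data Steward.

```python
import re


def remove_outer_spaces(value: str) -> str:
    """Generic rule: remove leading and trailing spaces."""
    return value.strip()


def collapse_spaces(value: str) -> str:
    """Generic rule: remove multiple spaces between words."""
    return re.sub(r"\s+", " ", value)


def set_title_case(value: str) -> str:
    """Cleansing rule "Set name to title case" (naive, word by word)."""
    return " ".join(word.capitalize() for word in value.split(" "))


# Hypothetical mapping: each attribute lists its rules in execution order.
# The same rule functions are re-used for both attributes.
CLEANSING_RULES = {
    "company_name": [remove_outer_spaces, collapse_spaces, set_title_case],
    "trading_name": [remove_outer_spaces, collapse_spaces, set_title_case],
}


def cleanse(attribute: str, value: str) -> str:
    """Apply every cleansing rule registered for the attribute, in order."""
    for rule in CLEANSING_RULES[attribute]:
        value = rule(value)
    return value
```

Calling `cleanse("company_name", "  abc   store ")` would apply the three rules in sequence to that value.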

The Business Data Steward uses this document to evaluate all the cleansing rules before implementation, in the same way as he/she evaluated and approved data rules. Doing so ensures that the cleansing rules correlate with business expectations before they are developed.

• Define cleansing design
Cleansing rules have to be executed in a predefined sequence of steps per attribute. Even the sequence in which attributes are cleansed needs to be documented. It is good practice to start cleansing each attribute by removing leading and trailing spaces. Cleansing normally follows the sequence below:

o Raw value
Customer Name: “ABC Store Pty (Ltd) T/A My ^Store”

o Remove spurious characters
Customer Name: “ABC Store Pty (Ltd) T/A My Store”

o Parse
Customer Name: “ABC Store Pty (Ltd)”
Trading Name: “T/A My Store”

o Cleanse/standardise
Customer Name: “ABC Store (Pty) Ltd”
Trading Name: “My Store”

o Augment
Retrieve the VAT number from an external source using the cleansed Company name: “ABC Store (Pty) Ltd”
VAT number: 4123456789

The Data Analyst must document the order in which all the cleansing rules must be executed at the entity and attribute level. This document must be very clear, noting the entity, attribute, rule identification, rule short definition and steps. The Developer uses this document, together with the documented cleansing rules to create and implement the cleansing rules in the database incrementally. When the rules are executed according to the cleansing design, the last step of the last column cleansed per table will result in a thoroughly cleansed record.
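The customer name example in the sequence above could be sketched in code roughly as follows. This is a minimal sketch, not the implementation a Data Quality tool would use: the list of characters treated as spurious, the "T/A" split and the "Pty (Ltd)" correction are assumptions made for this one example, and the augment step is left out because it depends on an external source.

```python
import re


def remove_spurious(value: str) -> str:
    # Assumption: anything other than letters, digits, spaces and
    # the characters ( ) / & . ' - is treated as spurious.
    return re.sub(r"[^A-Za-z0-9 ()/&.'-]", "", value)


def parse_names(value: str) -> tuple[str, str]:
    # Split the company name from the trading name at the "T/A" marker.
    company, _, trading = value.partition("T/A")
    return company.strip(), trading.strip()


def standardise_company(value: str) -> str:
    # Assumption: "Pty (Ltd)" is a mis-keyed form of "(Pty) Ltd".
    return value.replace("Pty (Ltd)", "(Pty) Ltd")


def cleanse_customer(raw_name: str) -> dict[str, str]:
    """Run the steps in the documented order for one customer name."""
    value = raw_name.strip()                 # leading and trailing spaces first
    value = remove_spurious(value)           # remove spurious characters
    company, trading = parse_names(value)    # parse
    return {
        "customer_name": standardise_company(company),  # cleanse/standardise
        "trading_name": trading,
    }
```

Keeping each step as its own function makes it easier to document the step order per attribute and to test each rule separately.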

• Develop and Present cleansing results
The Data Analyst must present the data before and after cleansing to the Business Data Steward. If the Data Quality tool used to cleanse the data is not adequately suited for use by the Business Data Steward, the Data Analyst must design a “Cleansing Results” report to show the before and after cleansing results per attribute. The Developer uses this design to create the report, and the Business Data Steward uses this report to inspect the changes made to the data.

The Data Analyst must also demonstrate to business stakeholders that cleansing the data improved the data quality. To achieve this, the cleansed data needs to be re-assessed, using the existing data rules and comparing the data scorecards before and after cleansing.

• Define Audit Report
Because cleansing the data is done incrementally, it is particularly important to keep track of changes made during the cleansing process. The Data Analyst must design an audit report to show the data changes of all the attributes included for automated cleansing per step, per entity. This report is useful for problem-solving in an ongoing Data Quality program.
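One possible shape for the audit data behind such a report is a row per step that actually changed a value. The sketch below is only an assumption about how this could be captured; the function name `run_with_audit` and the audit row fields are hypothetical.

```python
from typing import Callable


def run_with_audit(
    entity: str,
    attribute: str,
    value: str,
    steps: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, list[dict]]:
    """Apply named cleansing steps in order and record every change made."""
    audit_rows = []
    for step_number, (rule_name, rule) in enumerate(steps, start=1):
        cleaned = rule(value)
        if cleaned != value:
            # Only steps that changed the value produce an audit row.
            audit_rows.append({
                "entity": entity,
                "attribute": attribute,
                "step": step_number,
                "rule": rule_name,
                "before": value,
                "after": cleaned,
            })
        value = cleaned
    return value, audit_rows
```

A report can then be grouped per entity and per step from the collected rows.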

Data Matching and Merging

Once the data has been satisfactorily cleansed, it will be possible, if necessary, to move on to the next major phase of the overall master data improvement process, which is that of identifying duplicates, e.g. the same customer with different accounts. It may be useful to understand which customers have multiple accounts and should therefore be merged or linked in some way so as to have a better overall view of the customer, also known as a 360-degree view. This matching phase is separate and distinct from the cleansing described above, and should not be performed as part of cleansing; doing so would result in inconsistency in both the data cleansing and the match/merge results. Once a confident match is identified, the subsequent action could either be to remove the duplicates (this is seldom done as there may be a good business reason for a customer to have multiple accounts), or to merge into one single master record, or to merely link the matched records using a new group key, which could be the primary key of that record deemed to be the master account. This discussion expands on the “merge” option.

The Data Analyst must define the principles for matching and merging and present them to the Business Data Steward for approval before starting the deduplication process. Here’s an example of a match principle: “Only active customers can be matched”, and here’s one of a merging principle: “The most recent value of an attribute survives” (for two customers with the same name, but different VAT numbers, the most recently captured VAT number survives). These principles are documented in the deduplication strategy developed by the Architect.

• Define match rules
The Data Analyst must analyse the cleansed data for identification of duplicate records and then define the criteria for matching the records. This is done via a set of matching rules. For instance, if two or more customers have the same company name, they are likely to be duplicates. Equally, if two or more customers have the same VAT number and company registration number, they are also likely to be duplicates even though the company names might not be exactly the same. Identifying matching rules is done jointly by the Architect and the Data Analyst. The matching rules are defined as exact matching or fuzzy logic matching or a combination of both. It is highly recommended to use a Data Quality tool for matching duplicate records. The level of refinement in the match rules specified by the Data Analyst is subject to the ability of the tool to implement those rules.

The Developer then uses the defined rules to implement and execute the rules against the data. This is an iterative process until the Architect is satisfied that the rules deliver the required matches, eliminating false positives. When using a Data Quality tool, the Business Data Steward would be able to resolve probable matches manually (that being a match or not).
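To make the difference between exact and fuzzy match rules concrete, here is a small sketch using only the Python standard library. It is not how a Data Quality tool matches records; the field names, the 0.85 threshold and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# Assumption: name similarity at or above this value counts as a probable match.
FUZZY_THRESHOLD = 0.85


def name_similarity(a: str, b: str) -> float:
    """Similarity of two names between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pair(a: dict, b: dict) -> str:
    """Return "match", "probable" or "no match" for two customer records."""
    # Exact rule: same VAT number and same company registration number.
    if (a["vat_number"] and a["vat_number"] == b["vat_number"]
            and a["reg_number"] and a["reg_number"] == b["reg_number"]):
        return "match"
    # Exact rule: same company name.
    if a["company_name"].lower() == b["company_name"].lower():
        return "match"
    # Fuzzy rule: similar names are left for the Business Data Steward to resolve.
    if name_similarity(a["company_name"], b["company_name"]) >= FUZZY_THRESHOLD:
        return "probable"
    return "no match"
```

Pairs classified as "probable" correspond to the matches that the Business Data Steward would resolve manually.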

• Define merge rules
The merge principles inform the Data Analyst on how to merge the matched records into a ‘golden record’, which represents the best version of the truth. The Data Analyst must define the merge rules that determine the surviving value per attribute in a set of matched records. For example, for a set of matched customers, the surviving VAT number is determined by the implemented merge rules; e.g. “If only one customer record has a VAT number, that VAT number survives” and “If more than one VAT number exists, the most recent VAT number survives”.
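The two VAT number merge rules quoted above can be expressed in a few lines. This is a minimal sketch; the record fields `vat_number` and `captured_on` are hypothetical, and "most recent" is assumed to mean the latest capture date.

```python
from datetime import date
from typing import Optional


def surviving_vat(matched_records: list[dict]) -> Optional[str]:
    """Pick the VAT number that survives into the golden record."""
    with_vat = [r for r in matched_records if r.get("vat_number")]
    if not with_vat:
        return None
    # If only one record has a VAT number, it survives; if several do,
    # the most recently captured one survives.
    return max(with_vat, key=lambda r: r["captured_on"])["vat_number"]


matched = [
    {"vat_number": "4123456789", "captured_on": date(2019, 1, 1)},
    {"vat_number": None, "captured_on": date(2021, 1, 1)},
    {"vat_number": "4987654321", "captured_on": date(2020, 6, 1)},
]
golden_vat = surviving_vat(matched)
```

The same pattern, one survivorship function per attribute, could be repeated for the other attributes of the golden record.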

• Present matching and merging
The Data Analyst must present the deduplication of data sets to the business using an example. The presentation must include a set of data that is matched, together with the applicable match rules used to match these records. The Data Analyst must then show the golden record with the applicable merge rules used to create the golden record.

The Data Analyst must also show Data Quality improvement. This is done by re-assessing the golden records and comparing the data scores of the golden records to the scores before cleansing.

Testing

It is the responsibility of the Data Analyst to check that all the rules, scorecards, designs and reports are implemented and function as designed.

The Data Analyst uses a test harness (that the Developer created) to set up test scenarios ensuring that each rule functions as designed. Testing is done after completing each Data Quality Improvement process step, ensuring that:

Data rules catch incorrect data
Cleansing rules correct the data
Matching rules find duplicates
Merging rules deliver golden records

We hope you found this in-depth look at the end-to-end process of improving Master Data insightful.

View articles 1 & 2 here:
Improving Master Data Quality from a Data Analyst’s perspective - Part 1
Improving Master Data Quality from a Data Analyst’s perspective - Part 2

Need to discover how we can help your organisation attain business success through well-managed data?

Please get in touch for more information. Call Bryn Davies on +27 (0)82 941 9634 or email us at info@infoblueprint.co.za

Improving Master Data Quality: A Data Quality Analyst’s perspective – Part 2 of 3

Part 1: Identifying Data Quality issues
Part 2: Data profiling and assessment, and
Part 3: Data cleansing.

Data profiling entails gathering data statistics from the source system’s database (we highly recommend using a Data Quality profiling tool to profile data). These statistics can provide useful information, such as the distribution of customers across regions, e.g. 80% of customers are from Gauteng or that 60% of the customers do not have an email address.

Data profiling

The following elements are crucial in enabling a Data Analyst to profile data:

• Scope the data for inclusion
Ideally, all the columns in all identified tables need to be profiled. Fortunately, the Data Analyst can inspect large tables via queries to the database and omit columns with no data at all, or columns containing only a default value.
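Where no profiling tool is available, the scoping check could be approximated along the following lines. This is a sketch under stated assumptions: the data is held as a list of dictionaries, empty strings count as missing, and a column with a single distinct value is assumed to hold only a default value.

```python
def profile_columns(rows: list[dict]) -> dict[str, dict]:
    """Return the fill rate and distinct value count for every column."""
    columns = sorted({key for row in rows for key in row})
    profile = {}
    for column in columns:
        values = [row.get(column) for row in rows]
        populated = [v for v in values if v not in (None, "")]
        profile[column] = {
            "fill_rate": len(populated) / len(rows) if rows else 0.0,
            "distinct": len(set(populated)),
        }
    return profile


def columns_in_scope(profile: dict[str, dict]) -> list[str]:
    # Omit columns with no data (0 distinct values) and columns that
    # contain only one value, assumed here to be a default.
    return [col for col, stats in profile.items() if stats["distinct"] > 1]
```

The fill rate also gives the kind of statistic mentioned earlier, such as the share of customers without an email address.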

• Analyse the data
The analysis of the statistics produced by the profiled data leads the Data Analyst to ask the business stakeholders specific questions. For example, if only 0.5% of the customer base is from foreign countries, should low priority be given to cleansing that subset of customer data? The answer is not that straightforward, as those few customers could be high-value customers. In another example, if the Gender column of the customer record is sparsely populated, it might not offer value to the business and therefore would not be a priority for improvement. While analysing the data, the Data Analyst must keep note of the findings and questions related to the relevant profiled tables and columns. Analysing data inevitably leads to recommendations to business that would prevent bad data from being captured - for example, prevent free text where a drop-down list of values is applicable.

• Present statistics - paint a picture of the as-is
The Data Analyst must present the statistics with a view to demonstrating the shortcomings in the Data Quality - in other words, highlight the areas for improvement. The statistics must be accompanied by examples of the actual incorrect data for a better understanding of the problem. The presentation should be concise and must highlight the most relevant data problems in relation to pain points communicated by the business. The presentation should conclude with recommendations. For example, if a company name field contains both the company name and the trading name, it is recommended to have separate fields in the database.

• Prioritise – size matters
Improving the quality of data should be done iteratively, starting with the attributes that would be most beneficial to relieving business pain points. This could be customer contact details (telephone number and email address). The Data Analyst must also define the scope of each iteration. The trick here is to have the first iteration scope small enough to show business value in the shortest possible time. This is key to the success of the whole project.

The Data Analyst must document a comprehensive overview of the profile findings. This document must include all the profiled tables, findings as well as the recommendations to the business. Furthermore, the attributes identified for assessment and improvement must be listed and divided into iterations based on the attribute priority. Any findings that lead to exclusion from scope should also be noted here. This is particularly useful when any disputes arise later in the project.

Data assessment

Where profiling is done on the source data itself, the assessment is done in a separate environment where a generic model is used to assess the data attributes. For example, if a customer name is captured in more than one source system and persisted to different databases, the customer name is pulled into a generic customer name in the assessment model for assessment.

It is useful to create a logical data model of the in-scope source data elements and the assessment attributes; as well as define the mapping from the source to the assessment model attributes. This is then used to physically create the assessment tables/views and is useful in keeping the links to the source data elements.

The Data Analyst uses the logical assessment model entity and attribute names to proceed to the next steps in the data assessment process:

• Define business and data assessment rules
Assessing the data is done via a set of data rules, that when executed, give a score related to the quality of the data when measured against the rules. At the lowest level, the score is per rule for one attribute (or set of attributes), but it can be rolled up to one score for the entire data domain. The domain level score is most useful when reporting on data quality to the executives. By executing these data rules on a continuous basis with regular intervals, the improvement or degradation of the data can be made visible.

The Data Analyst uses the data standards document, the notes from the Analysis phase, the profile overview document and the logical assessment data model to identify and document the business rules for the in-scope data elements and the related data rules to assess the data. For example, to assess a company name, the business rule would be: “Company name must be authenticated” and tied to this business rule, the data rules would be “Company name must not contain the trading name” and “Company name must be in title case”.

When the Data Analyst defines the data rules, he/she must also document all the relevant information - at a minimum, the rule identifier, rule definition, rule classification, rule dimension, and rule weight are required. This information, when implemented, is useful for creating data scorecards and reporting on data quality per dimension.
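The roll-up from rule-level scores to a single score can be sketched as a weighted average. The text does not prescribe a formula, so the weighted average, the rule weights and the two rule functions below are assumptions for illustration only.

```python
def rule_score(values: list[str], rule) -> float:
    """Share of values (0.0 to 1.0) that pass one data rule."""
    if not values:
        return 0.0
    return sum(1 for v in values if rule(v)) / len(values)


def rolled_up_score(scores_and_weights: list[tuple[float, float]]) -> float:
    """Assumed roll-up: average of rule scores weighted by rule weight."""
    total_weight = sum(weight for _, weight in scores_and_weights)
    return sum(score * weight for score, weight in scores_and_weights) / total_weight


# Hypothetical data rules for the company name attribute.
def no_trading_name(value: str) -> bool:
    return "T/A" not in value


def is_title_case(value: str) -> bool:
    return value == value.title()


names = ["Abc Store", "abc store T/A My Store", "Xyz Traders", "DEF LTD"]
score_trading = rule_score(names, no_trading_name)  # 3 of 4 values pass
score_title = rule_score(names, is_title_case)      # 2 of 4 values pass
attribute_score = rolled_up_score([(score_trading, 2.0), (score_title, 1.0)])
```

Running the same calculation at regular intervals gives the trend that makes improvement or degradation visible.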

The Developer then uses all the relevant information documented for each data rule to create and execute the rule. It is highly recommended that a Data Quality tool is used to implement these rules. A tool simplifies the creation of scorecards and gives a drill-down capability for the Business Data Steward to inspect at a row-level where an attribute has failed a specific data rule. The Business Data Steward uses this document to evaluate all the rules before implementation. This ensures that rules correlate with business expectations before they are developed.

• Identify Data entry validation
Data rules that expose invalid values entered into the system can often also be classified as validation rules to be implemented at the data entry level. For example, a data rule that exposes invalid email addresses can be implemented in the source system as a validation rule to reject an email from being saved to the database if it does not comply with the valid format of an email address.
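As an illustration of the same check serving as both a data rule and an entry-level validation rule, consider the sketch below. The regular expression is a deliberately simple assumption (something before the "@", something after it, and at least one dot in the domain), not a complete email standard, and `save_email` is a hypothetical stand-in for the source system's save routine.

```python
import re

# Assumption: a simplified format check, not full email address validation.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_email(value: str) -> bool:
    """Data rule: the email address must have a valid format."""
    return EMAIL_PATTERN.fullmatch(value) is not None


def save_email(value: str, store: list[str]) -> None:
    """Validation rule at data entry: reject invalid values before saving."""
    if not is_valid_email(value):
        raise ValueError(f"Invalid email address: {value!r}")
    store.append(value)
```

Using one function for both purposes keeps the assessment and the data entry validation consistent with each other.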

• Document assessment results
The Data Analyst analyses the data failing rules per rule and documents the findings in an assessment measurement report. These findings include a summary of the assessment results. For example, 10% of VAT numbers contain a company registration number or 80% of the South African Company Registration Numbers are not in a consistent, valid format. The assessment measurement report must also reflect the data quality scores based on the parameters used to calculate the scores. It is important to note that the assessment results are based on the data extracted from the source at a specific point in time and not on real-time data.

• Present assessment results
The Data Analyst must summarise the key data element scores into three categories: “good quality”, “of concern”, and “poor quality”. The Data Analyst must also present the assessment results visually to business users by means of an infographic, backed up with actual scores. The presentation of the assessment findings must also show examples of rule failures for clarity and demonstrate the importance of improving the data quality, always tying it back to the business problem that is being addressed.

• Reports - Define failures at various levels
The Data Analyst must design reports to cover any shortfall that the assessment tool may have in reporting on data rule failures. For instance, at an attribute level, a report could show the number of attribute values passing all rules for that attribute versus the total number of values for the attribute. Similarly, at a row level, a report could show the number of rows where all attributes pass all rules versus the total number of rows. The Developer uses the report design to develop the reports. These reports are helpful to the Business Data Steward who monitors the data quality.
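The attribute-level and row-level figures described above can be derived from the same set of rules. The sketch below assumes rows are dictionaries and that rules are simple functions returning True for a pass; both function names are hypothetical.

```python
from typing import Callable

Rule = Callable[[str], bool]


def attribute_pass_count(rows: list[dict], attribute: str,
                         rules: list[Rule]) -> tuple[int, int]:
    """Values passing all rules for one attribute versus total values."""
    passed = sum(1 for row in rows if all(rule(row[attribute]) for rule in rules))
    return passed, len(rows)


def row_pass_count(rows: list[dict],
                   rules_by_attribute: dict[str, list[Rule]]) -> tuple[int, int]:
    """Rows where every attribute passes all of its rules versus total rows."""
    passed = sum(
        1 for row in rows
        if all(rule(row[attribute])
               for attribute, rules in rules_by_attribute.items()
               for rule in rules)
    )
    return passed, len(rows)
```

Both functions return a (passed, total) pair so that a report can show the counts as well as a percentage.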

Continue to read Part 3 of 3: Improving Master Data Quality from a Data Analyst’s perspective here - Part 3.

Improving Master Data Quality: A Data Quality Analyst’s perspective – Part 1 of 3

Part 1: Roles required and identification of Data Quality issues

Part 2: Data profiling and assessment, and

Part 3: Data cleansing.

In general, all initiatives with the objective of improving master data quality should tackle the problem both bottom-up and top-down. Simply speaking, bottom-up (correction) is all about correcting data quality issues in existing data, whilst top-down (prevention) puts measures into place to prevent issues from recurring. It is pointless to spend time and effort on cleansing data if poorly designed or managed processes and systems continue to allow low quality data back in on an ongoing basis.

For the bottom-up or corrective component, due to high volumes of data and/or complexity of data quality issues, it is frequently necessary to employ programmatic cleansing methods, sometimes using specialised data quality technology platforms. Of course there will always be those data quality issues that cannot be resolved without controlled human intervention, but the trick is to minimise this by ensuring that the programmatic approach resolves as much as possible.

This series of three articles focuses primarily on the bottom up programmatic component.

In a Master Data Improvement (MDI) project, the Data Quality Analyst has a role to play in all stages of the Data Quality Improvement process.

A simple definition of Data Quality Improvement is: "...a process of measuring the data quality by assessing it and then correcting the data. By creating a repeatable process, trends can be monitored over time."

The typical roles needed for an MDI project are: Data Architects, Analysts, Developers, Project Managers and Business Data Stewards. Each role can have one or more people assigned to it, but generally, there would only be one Project Manager and one Architect. In smaller projects, however, it’s possible to have a person assigned to more than one role.

Let’s explain how these roles would work by using a building/construction project as an analogy.

The Data Architect:

Much like in a building project, the Architect draws up the plan with all the components and how they all fit together. This plan needs to be approved by all stakeholders (like the homeowner in a building project). The Architect then oversees the construction of the building to the required specifications.

The Data Quality Analyst:

The Data Quality Analyst methodically plans the work to be done through analysing, defining, documenting, quality controlling and reporting on all processes, much like the builder in a building project.

The Developer:

The Developer does the ‘hard work’ and labour, technically implementing through code the different aspects of the project per the documentation provided by the Data Quality Analyst and Architect. The Developer is like the Brick Layer, Plumber or Electrician in a building project.

Project Manager:

The Project Manager ensures that everything works according to plan, within a specified time and budget. He/she also ensures that the project team has what they need at any given time during the project. The Project Manager liaises with the project team and the business to resolve matters as they arise, resembling much of the same duties as a Project Manager in a building project.

Business Data Steward:

The Business Data Steward understands the data and works closely with the project team and Data Owners. He/she is ultimately responsible for ensuring quality data and therefore needs to approve all the rules defined by the Data Quality Analyst before the rules are implemented and also verifies the rules post implementation. The Business Data Steward monitors Data Quality trends and manages the ongoing improvement process after the project has completed, much like a Care-Taker that would look after the property once the building project is completed.

Identifying Data Quality issues

In a Master Data Quality Improvement project, the first step in the process involves identifying the primary data domains (e.g. customer, product, supplier) and then investigating each area. The subsequent process for each data domain is the same.

The steps involved in identifying and resolving data quality issues are:

Gathering requirements
Data profiling
Data quality assessment

Gathering requirements
Discover business pain points
Companies choose to improve the quality of their data for a reason. For the Data Analyst, the trick is to decipher the business problem down to the data characteristics that contribute to the problem. For example, if the customer delivery address is incorrect, then goods would be delivered to the incorrect address. This low-quality data would impact cost, delivery times and customer satisfaction.
Define data standards
Companies typically have more than one source application where data is captured. It is, therefore, essential that the Data Quality Analyst defines and documents the data standards of all capture fields where Master Data is captured. This is done together with the Business Data Steward. Naturally, one needs to know for each data field on a system if the value is optional, mandatory or even not applicable. This process evolves as the project progresses, and more information becomes available. The data standards document is critically important for the Data Analyst; it helps to identify business rules and the related assessment and cleansing rules. This document is also useful to business in improving data capturing procedures.

Continue to read Part 2 of 3: Improving Master Data Quality from a Data Analyst’s perspective here.

Need to discover how we can help your organisation attain business success through well-managed data?

Please get in touch: infoblueprint.co.za/contact

"Big Data" has fast moved from being a buzzword to being used and understood as a business asset. An asset, when linked with effective management and rigorous data analysis, drives informed decision-making, increases business efficiency and identifies new business opportunities.

The continued management and analysis of Big Data is proving to be a significant differentiator in large organisations. We are now seeing the rise and massive potential of advanced data technologies to support the staggering growth of data that's being generated worldwide.

Below are 5 trends in Big Data for now and beyond:

1. Automated data analysis
As data analysis is further built into business processes, it has been found that automating the analysis speeds up decision-making and reduces the response time to opportunities. The International Data Corporation (IDC) predicts that by 2025, about 30% of all data will be streamed in real-time, compared to the 15% that was streamed in 2017.

2. Natural Language Analytics (NLA)
A critical element to making data-driven decisions is making the data available to be used by all decision-makers; NLA enables this. NLA allows computers to understand human language. Doing so empowers non-technical business users to query complex data using words and phrases (via voice or text). Gartner has predicted that by the end of 2020, 50% of data queries will be processed using natural language processing technology, instilling a greatly enhanced data-driven culture in organisations.

3. Edge Computing
By 2025, more than 150 billion devices will be connected across the globe, and these devices are predicted to generate 90 zettabytes of data. As this massive data growth creates data management and analysis challenges, new technologies and computing processes naturally follow. In comes Edge Computing - a distributed computing model where data is processed close to where it is generated, instead of on a centralised server or in the cloud. This innovative infrastructure uses sensors to collect data and edge servers to securely process data in real-time and on location, while also connecting other devices to the network, such as laptops and smartphones. The most significant value of edge computing is in reducing the costs that come with latency and network traffic. Gartner predicts that by 2022, approximately 75% of all data will be analysed and actioned at the edge.

4. Data-as-a-Service (DaaS)
More and more, businesses are providing data access and digital files via DaaS, both internally and commercially. DaaS is defined as "a data management strategy that uses the cloud to deliver data storage, integration, processes and/or analytics services via a network connection". It is predicted that large organisations, beyond 2020, will bundle data with business intelligence tools to boost revenue. This is possible as DaaS leverages the data ecosystem and real-time data analytics to create customised and real-time datasets - changing the game for marketers. They will further move from modelling data based on what they think a prospect might do, towards having real-time insights into the actual behaviour of a prospect.

5. Evolving from predictive to prescriptive analytics
In recent years, data has been used for predictive analysis to give insight into the possible outcome of business decisions. We are now seeing the rise of prescriptive analysis software, which suggests decision options to benefit from the predictions. Ventana Research expects that by 2021, about 67% of analytical processes will be prescriptive. For instance, prescriptive analytics give insight into whether to publish material for readers, based on search and social shares data of related topics.

Moving into the future, data will continue to be king. InfoBluePrint is excited to be a part of the revolutionary journey that will come with the proper management and analysis of high-quality Big Data.

Want to discover how we can help you attain business success through well-managed quality data?

Please get in touch - call us on +27 (0)21 551 2410 or send an email to info@infoblueprint.co.za.

In recent years the accuracy of information has become increasingly important to business success, putting a prominent focus on data quality. With the growing emphasis on Quality, businesses are looking to all employees to play a part in the Data Quality process. However, in order for employees to play that key role in the data quality process, businesses need to empower their employees by investing in training.

Employees are normally trained 'how to do the job'. Typically they are not trained in the effect their job has on the Data / Information they create, modify, and use on a daily basis. Employees need to see the bigger picture when it comes to Data Quality; they need to understand how the data is being used and how a business is impacted if the data is not used correctly, or if the data used is inaccurate.

Employees are generally not aware of:

The downstream effects of how they create the data;
Regulatory / Compliance requirements;
Data protection requirements;
The risks of inappropriate handling of information.

In short they are not aware of what Data Quality means and why it is important to successful business operations.

We are all too familiar with the phrase 'Data is an asset'. What does this actually mean? As one example, data treated as a business asset provides a single holistic view of customers. Businesses can leverage the value of this data to better understand their customers, improve the way they interact with them and, in turn, improve the overall customer experience.

Well, let's go back to basics. Employees need to be upskilled and trained within the Data Quality realm and understand why data should be treated as a business asset. Training is key to equip employees to play a part in end to end Data Quality Management.

All employees should understand:

The importance of data and information to their job and to the organisation as a whole;
The issues, risks and impact of poor data quality;
The best practices for data quality;
How to address and resolve data quality issues.

Where does one start? A well-rounded training program for various employee roles on Data Quality Management is essential in helping transform organisational culture to welcome, understand and embrace the importance of data quality.

InfoBluePrint, specialists in Information Management solutions, offers a range of courses that address these points, allowing business users to get a better understanding of Data Quality and motivating them to strive towards achieving it.

For a detailed breakdown of our range of courses, please visit InfoBluePrint Training.

One of the most prevalent characteristics of a data migration project is the significant extent of continual change that must be dealt with, often late into the programme. This is understandable because, despite best efforts to define and decide as much as possible up front, the business is generally not completely ready in the early stages. Because the expected target landscape cannot be fully comprehended and framed at the outset, decisions typically evolve organically as the programme proceeds. Another factor, from a data management perspective, is that unknown data quality and modelling issues are only discovered down the road, once the data and its relationships start becoming clearer.

Coupled with the other challenges covered in my earlier articles, this all comes down to a distinct requirement for well-defined, solid processes to manage continual change within the data migration stream. It is therefore worthwhile spending considerable time as early as possible to define the roles, ownerships, processes, artefacts and technologies that will be required to keep things under control. Particular attention must be given to the artefacts that record and help manage business and data rules across all migration activity, including not least extraction, exclusion, validation, cleansing, mapping and target front-end validation rules. Formal templates for these artefacts must be crafted, agreed and proven in practice long before they are used in earnest, when (controlled!) chaos becomes the norm down the road! It must be easy not only to keep track of all these changing rules, but also to understand the end-to-end impact of each change on the overall data models. Complicating this all is the fact that work is taking place across various technologies and databases, across data areas such as source, landing and staging, and on multiple platforms (e.g. Dev, QA, Prod, DR), and all of these must be kept in sync via a well-controlled release process.

Whilst Excel is frequently the tool of choice to record and manage business and data rules, it must be borne in mind that a high degree of automation should be built in to ensure alignment across multiple artefacts, because it is very challenging to manually keep up with the rate of change and resultant cross-dependency impacts, and of course anything manual is bound to be error prone.
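As an illustration of the kind of automation meant here, the following is a minimal Python sketch that cross-checks two rule artefacts against each other. It assumes the mapping and validation rules have been exported from their spreadsheets as CSV; the column names (source_field, target_field, rule_id, rule), the sample rows and the function names are all hypothetical and are not taken from any InfoBluePrint template.

```python
import csv
import io

# Hypothetical artefact exports (e.g. spreadsheets saved as CSV).
# Column names and rows are assumptions made for this sketch.
MAPPING_CSV = """source_field,target_field
cust_no,CustomerId
cust_name,FullName
"""

VALIDATION_CSV = """rule_id,target_field,rule
V1,CustomerId,not null
V2,EmailAddress,valid email format
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def unmapped_validation_rules(mapping_rows, validation_rows):
    """Return IDs of validation rules whose target field has no mapping rule."""
    mapped_targets = {row["target_field"] for row in mapping_rows}
    return [
        row["rule_id"]
        for row in validation_rows
        if row["target_field"] not in mapped_targets
    ]


def unvalidated_targets(mapping_rows, validation_rows):
    """Return mapped target fields that no validation rule covers."""
    validated = {row["target_field"] for row in validation_rows}
    mapped = {row["target_field"] for row in mapping_rows}
    return sorted(mapped - validated)


mapping = read_rows(MAPPING_CSV)
validation = read_rows(VALIDATION_CSV)
print("Validation rules without a mapping:", unmapped_validation_rules(mapping, validation))
print("Mapped fields without validation:", unvalidated_targets(mapping, validation))
```

Run after every change to either artefact, a check of this kind flags rules that have drifted out of alignment before they reach a release.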

Finally, it would make sense to call on professionals who have already thought through the requirements, made the mistakes, and have as part of their data migration arsenal a well-defined and proven set of templates to put to confident use as early as possible in the programme. The subject of Data Migration is vast, and I hope that this short series of five articles has assisted in understanding where to focus extra effort so as to ensure the success of the migration, and in turn of the parent programme.

Good luck!

We often hear from project stakeholders that "it is too early for the data migration team to start as we have not yet defined the target". What absolute rubbish! The reality is that there is so much data related work to do that the earlier you start the better. For starters there is the Data Migration Strategy, which needs to be fleshed out. Yes, you will not be able to complete it 100%, but the topics to be addressed are so vast that the mere mention of them will at the very least kick-start the many important discussions that need to occur. A typical Data Migration Strategy needs to cover areas such as Data Migration Architecture, Approach, Forums, Governance, Technology Choices, Extract, Load, Transform, Data Quality, Cleanse Approach, Audit Control, Reconciliation Approach, Cutover Approach, Testing and many more, and so there is a massive amount of work to be done in fleshing all of these out! Granted, a lot of this will be a work in progress for a while to come and there will be many unknowns in the early phases, but at the very least you will have identified and prioritised these!

Another recommendation is to profile the source data as early as possible, as this will provide very useful insights into what exactly you will be dealing with from a source data structure and content perspective. We all know that source system documentation is generally sorely lacking, if it exists at all, and so data profiling is a relatively quick and easy way to establish a foundation of fact, especially as regards what data gremlins are lurking in your legacy systems. The earlier you discover these and put plans in place to deal with them, the better. You want a well-informed and thoroughly considered Extract Cleanse Transform and Load solution to be built, or else you will suffer from the obvious time and budget setbacks inherent in the typical "Code, Load and Explode" solutions that we often see.
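To make the idea of profiling concrete, here is a minimal Python sketch of a single-column profile. It is only an illustration under simple assumptions: the column is held as a list of strings (with None for missing values), the sample phone numbers are invented, and the function names profile_column and to_pattern are hypothetical. Dedicated profiling tools go considerably further, for example by analysing relationships between columns and tables.

```python
from collections import Counter


def to_pattern(value):
    """Replace digits with 9 and letters with A so format variations stand out."""
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in value
    )


def profile_column(values):
    """Summarise one column: row count, blanks, distinct values, top patterns."""
    non_blank = [v.strip() for v in values if v is not None and v.strip() != ""]
    patterns = Counter(to_pattern(v) for v in non_blank)
    return {
        "rows": len(values),
        "blank": len(values) - len(non_blank),
        "distinct": len(set(non_blank)),
        "top_patterns": patterns.most_common(3),
    }


# Invented sample of a legacy "phone" column.
phone_column = ["021 555 0100", "0215550100", "", None, "n/a", "021 555 0100"]
profile = profile_column(phone_column)
for key, value in profile.items():
    print(key, value)
```

Even on this tiny sample, the pattern counts expose the sort of gremlins that matter for a migration: inconsistent formats, blanks and placeholder values such as "n/a" sitting in a field that the target system may validate strictly.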

Finally, there is a ton of work to do on business and data rules that need to be considered in the context of the new system and, even if the target is not defined, there are many that are based on simple truths that must be catered for regardless. At the very least make a start on the processes and templates that will be needed to manage rules within what will seem to be an ever changing landscape (covered in the next article). For a Data Migration there is no such thing as beginning too early. Just start. You will be amazed at how much there is to do!

There is often a lot of hype and a lot of expectation generated in the process of selling the 'new system', both from external and internal parties, as the motivations and business cases run their merry courses. Consequently, prior to every new system implementation, we hear common expectations expressed such as 'the data quality will be better' and 'we will have a 360° view of our customer'. Well, this does not just happen, and the new technology never magically 'sorts the data out': left to its own devices, the data in the target will be as good, as bad and as fragmented as it is in the source! Complicating this situation is the fact that the System Integrator (SI) selected for the implementation will, even though they may take on the actual data migration, explicitly exclude the resolution of data quality problems from their project charter. They will leave it up to you, the client, to sort out (as if you don't already have enough to do!).

What we have seen works best is to outsource key aspects of the data migration and cleansing to professionals, and to contract directly with them rather than through the SI, so that they are working directly with you the client to address what are generally very complex issues. This also means profiling the data early and often, and putting into place systems, technology and processes that regularly validate data integrity and identify data quality problems that will cause the new system to fall short of the expected objectives, and those that will likely cause the data load into the target to simply fail. This requires careful articulation, management and alignment of business and data rules (to be covered in the fifth article in this series) across all activity, including not least extraction, exclusion, validation, cleansing, mapping and target front-end validation rules. These also need to be regularly made visible to the business via the Data Migration Working Group (see previous article: Mistake #2 'Not Involving Business Early Enough') and then decisively dealt with through each iteration as the target evolves.

Another problem that regularly crops up is that most organisations expect to deal with data non-quality without specialised data quality tools. Whilst manual data quality resolution is almost always a part of a data migration, the extent thereof can be minimised by following a programmatic cleansing approach where possible. Especially with high data volumes or complex data problems (or typically a combination of both!) it will be impossible to adequately, predictably and consistently deal with data quality issues using home-grown SQL/Excel/Access type solutions. And often the objective to build a 'single view' for the new system will require sophisticated match/merge algorithms that are generally only found in specialised data quality tools.
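To give a feel for what matching involves, the following is a deliberately naive Python sketch that flags candidate duplicate customer names using the standard library's difflib.SequenceMatcher. It is not how specialised data quality tools work; their match/merge algorithms are far more sophisticated. The record IDs, names, the 0.85 threshold and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lower-case, drop full stops and commas, and collapse repeated whitespace."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def candidate_matches(records, threshold=0.85):
    """Return (id_a, id_b, score) for name pairs at least `threshold` similar."""
    pairs = []
    for (id_a, name_a), (id_b, name_b) in combinations(records.items(), 2):
        score = SequenceMatcher(None, normalise(name_a), normalise(name_b)).ratio()
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 2)))
    return pairs


# Invented sample records keyed by a hypothetical customer ID.
customers = {
    "C1": "Acme Trading (Pty) Ltd",
    "C2": "ACME Trading Pty Ltd.",
    "C3": "Blue Sky Logistics",
}
matches = candidate_matches(customers)
print(matches)
```

The sketch compares every pair of records, which becomes impractical at high volumes, and it only proposes candidates; deciding which record survives and how conflicting attributes are merged is a further problem. These limits are a large part of why home-grown approaches tend to fall short.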

The bottom line is that data migration is not simply a 'source to target mapping' exercise, and is not just about Extract Transform and Load (ETL), but about Extract Cleanse Transform and Load (ECTL). Ultimately a Data Migration sub-project needs to take a holistic approach that includes consideration of the high expectations of the organisation with regard to 'better data' in the new application. Finally, don't forget to prevent the data quality problems from happening all over again in the new system: ensure that preventative measures, matching the corrective ones taken during the cleansing, effectively protect the new database(s) from re-contamination.

In the next articles I will cover:

Mistake #4: Delaying Because "The Target is Undefined"
Mistake #5: No Processes for Managing Dynamic Business & Data Rules

The data in an organisation belongs to the business, and they have to be the ones making the business decisions about it. Coupled with this, in a data migration project there is always an expectation that "the new system" will make everyone's lives easier. But this very likely involves new or improved processes that will in turn be reliant on a well-defined and aligned data representation in the target system. Business must provide the vision and together with subject matter experts they need to guide the implementation team toward a sound target model that will support the desired business outcomes. This model, however, is typically very new and different to that found in legacy, and often relies on data elements that have never been considered or even captured! Together with the inevitable data quality problems in source (discussed in the next article), this leads to a situation where focused and timeous input and decisions are needed from business stakeholders.

We have found that the early (in the planning phase!) establishment of a formal platform for interaction with business is often the best way to ensure this. Such a forum - we like to call it the "Data Migration Working Group" - needs to meet regularly (at least weekly) with the data migration leads to be apprised of the data landscape, data related risks and migration project progress, and to help make the required business decisions and provide overall data related guidance.

In the next articles I will cover: