Data Quality Implementation in Data Pipeline Phases


By Contributing Writer Sumit Kumar Tewari  |  October 15, 2024



In modern enterprise data platforms, Data Quality and Data Governance protocols are essential to ensuring the reliability, integrity, and accuracy of an organization’s data. Enterprise data quality includes key dimensions such as completeness, validity, conformity, accuracy, consistency, timeliness, integrity, and data decay, while data governance focuses on establishing the policies, standards, and procedures that enable organizations to capture reliable data, comply with regulatory standards, and support company-wide decision-making. Implementing governance structures, automated validation methods, and continuous data monitoring increases operational efficiency and facilitates the development of advanced data-driven capabilities, including artificial intelligence (AI) and research.

To facilitate these processes, major public cloud platforms such as Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure offer data quality products that are fully customizable and can be integrated with cloud data stores or open-source tools.

Google Dataplex data quality services are ideal for validating data in production pipelines, continuously monitoring the quality of collected data, and generating regulatory reports. Rules are defined in scalable, declarative YAML (YAML Ain’t Markup Language) specifications, the serverless configuration requires no infrastructure, and automatic pushdown to BigQuery, Google Cloud’s data warehouse, ensures zero-copy execution. Data quality checks can be configured in Dataplex or integrated into external orchestrators such as Cloud Composer. Additionally, Dataplex offers a managed experience powered by the open-source CloudDQ engine.
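To give a feel for the style of rule Dataplex and CloudDQ expect, the following minimal sketch is written in plain Python rather than the actual Dataplex YAML or API, and simply expresses the same kinds of declarative expectations (completeness, validity, uniqueness). The "orders" table and its columns are hypothetical.

# Conceptual sketch only: plain-Python equivalents of the kinds of declarative
# rules a Dataplex/CloudDQ YAML specification would express (not the Dataplex API).
# The "orders" table and its columns are hypothetical.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "status":   ["NEW", "SHIPPED", None, "UNKNOWN"],
})

rules = {
    # NOT NULL expectation, comparable to a completeness rule
    "status_not_null": orders["status"].notna().all(),
    # set-membership expectation, comparable to a validity rule
    "status_in_allowed_set": orders["status"].isin(["NEW", "SHIPPED", "DELIVERED"]).all(),
    # uniqueness expectation on the business key
    "order_id_unique": orders["order_id"].is_unique,
}

for rule, passed in rules.items():
    print(f"{rule}: {'PASS' if passed else 'FAIL'}")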

Microsoft’s Data Quality Services (DQS) provides robust solutions for managing and improving data, ensuring it meets business data quality objectives. The platform uses a knowledge-based approach, combining automated and interactive tools to manage data integrity. DQS allows users to build, discover, and manage knowledge about their data, which can then be used for data cleansing, matching, profiling, and other tasks. Additionally, DQS can integrate with cloud-based reference data providers to improve data quality across industries.
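As a rough illustration of the knowledge-base idea, rather than of the DQS product itself, the short Python sketch below cleanses values against a hypothetical domain and a set of term-based corrections, the kind of knowledge a data steward would curate in a knowledge base.

# Conceptual sketch of knowledge-base-driven cleansing in the spirit of DQS
# (not the DQS product or its API). The "Country" domain and values are hypothetical.
valid_domain = {"United States", "United Kingdom", "India"}
synonym_corrections = {          # term-based corrections captured in the knowledge base
    "USA": "United States",
    "U.S.": "United States",
    "UK": "United Kingdom",
}

def cleanse(value: str) -> tuple[str, str]:
    """Return (cleansed_value, status) using the knowledge base."""
    if value in valid_domain:
        return value, "correct"
    if value in synonym_corrections:
        return synonym_corrections[value], "corrected"
    return value, "invalid"      # route to a data steward for review

for raw in ["USA", "India", "Germny"]:
    print(raw, "->", cleanse(raw))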

AWS Glue Data Quality helps build, deploy, and track data quality rules across your data layers to improve data quality scores. The product is built on the open-source Deequ framework and runs as a serverless service that uses the domain-specific Data Quality Definition Language (DQDL) to define standard data quality rules. This tool ensures seamless integration of data quality analytics and provides flexible, scalable solutions for maintaining high data standards.
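Because the service is built on Deequ, its checks can be approximated with the open-source PyDeequ wrapper. The sketch below is a minimal illustration, assuming a Spark session configured with the Deequ jars; the table and column names are hypothetical, and the comments note roughly equivalent DQDL rule types.

# Minimal PyDeequ-style sketch of the kinds of checks Glue Data Quality rules express.
# Assumes pydeequ is installed and Spark can pull the Deequ jars; names are hypothetical.
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.createDataFrame(
    [(1, "NEW"), (2, "SHIPPED"), (3, None)],
    ["order_id", "status"])

check = Check(spark, CheckLevel.Error, "order checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.isComplete("status")      # comparable to an IsComplete rule
                         .isUnique("order_id"))     # comparable to a Uniqueness rule
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)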

The question of at which phase in the data pipeline data quality measures should be applied, and to what extent, is one that is frequently discussed among engineering teams.  Ensuring data quality at the source is essential to identifying and correcting errors early in the data lifecycle, thereby reducing the spread of downstream problems. By addressing data quality at its inception, companies can prevent errors from affecting subsequent operations and improve overall system reliability.

Another phase for consideration is the transformation layer: implementing data quality checks while business rules are being applied to the data provides assurance that the data remains consistent, accurate, and usable during this phase. This step is critical to maintaining the integrity of the data as it passes through various stages of change.

Finally, at the consumption layer, data is evaluated for its usability before it is used in research or quantitative reporting. Validating data at this stage assures that end users will receive high-quality, actionable insights that drive informed decision-making. A multi-layered approach to data quality, covering source, transformation, and consumption, establishes a comprehensive framework that enhances the reliability of the enterprise data space.

Understanding data quality implementation across the data pipeline

1. Source System Layer

The data pipeline begins with the source. Establishing data quality at this primary phase is the first step in maintaining reliable, accurate, and consistent management of information organization-wide. Handling issues at the point of entry prevents errors from cascading downstream, where fixing them can be more difficult and expensive, and where compromised data may lead to inaccurate decisions and negative consequences.

In the source phase, organizations must, as a prerequisite, conduct an investment analysis of resources and compute and storage capacity, identify critical data extraction processes and data contracts with source systems, and secure commitments from source system data subject matter experts. Even though the complexity of implementing data quality at this stage is higher, due to the lack of full visibility into data usage, it still offers longer-term benefits, such as reduced downstream data correction efforts and fewer disruptions in analysis, reporting, and decision-making. Furthermore, proactively identifying and correcting data at this stage prevents regulatory compliance issues and minimizes the risk of penalties due to inaccurate data delivery in later phases.
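As a simple illustration of enforcing a data contract at the point of entry, the Python sketch below validates incoming records for field presence and type. The contract fields are hypothetical and not drawn from any particular source system.

# Sketch of enforcing a simple data contract at the point of entry
# (field presence and types). The "customer" contract below is hypothetical.
CONTRACT = {
    "customer_id": int,
    "email": str,
    "signup_date": str,   # ISO-8601 date string agreed with the source system
}

def violations(record: dict) -> list[str]:
    """Return contract violations for a single incoming record."""
    issues = []
    for field, expected_type in CONTRACT.items():
        if field not in record or record[field] is None:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            issues.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return issues

record = {"customer_id": "42", "email": "a@example.com"}
print(violations(record))   # flags the string customer_id and the missing signup_date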

2. Staging or Raw Layer

Implementing data quality at the staging or raw layer is an important step in building a robust data quality system. This phase acts as a second line of defense, where data coming from sources is initially captured before any transformations are applied. By implementing data profiling at this stage, organizations can quickly identify and correct errors and anomalies. Standard techniques such as data profiling, validation, and cleansing can be used to ensure data integrity and accuracy. This proactive approach not only improves the overall quality of information, but also simplifies downstream analysis and reporting. Additionally, it enables consistent data governance practices, ensuring that all stakeholders have access to reliable information. Investing at this level may require initial allocations, but these are generally outweighed by the long-term benefits, such as reduced maintenance costs, faster issue resolution, and containment of contamination before it reaches the next data layer.
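A minimal profiling sketch along these lines, using a hypothetical staged table and pandas, might compute null rates, duplicate business keys, and a few simple value checks:

# Sketch of basic profiling on a staged table: null rates, duplicates,
# and distinct counts. The "staged" table and its columns are hypothetical.
import pandas as pd

staged = pd.DataFrame({
    "txn_id": [101, 102, 102, 104],
    "amount": [25.0, None, 40.0, -5.0],
    "currency": ["USD", "USD", "usd", "EUR"],
})

profile = {
    "row_count": len(staged),
    "null_rate": staged.isna().mean().round(3).to_dict(),
    "duplicate_txn_ids": int(staged["txn_id"].duplicated().sum()),
    "distinct_currencies": staged["currency"].str.upper().nunique(),
    "negative_amounts": int((staged["amount"] < 0).sum()),
}
print(profile)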

Developing an effective implementation at this layer requires the application team to renegotiate data contracts with the source system based on data quality defect findings, and to design a process to reload the incorrect data as adjustments in the snapshots. Applications managing only top-of-stack data do not need to provision for data reprocessing, as staging data will be overwritten with corrected values when it flows from the source system in the next run. Although the cost of fixing data here is higher than fixing it in the source system, the complexity decreases because data profiling results are visible at the staging layer. Every application should make informed “go/no-go” decisions about data movement from this layer based on its service level objectives (SLOs) and risk objectives.
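One simple way to frame that decision is a threshold gate over the profiling metrics. The sketch below is purely illustrative; the metric names and SLO thresholds are hypothetical.

# Sketch of an SLO-driven go/no-go gate before promoting data out of staging.
# The metric names and thresholds are hypothetical examples.
SLO_THRESHOLDS = {
    "null_rate_amount": 0.05,     # at most 5% missing amounts
    "duplicate_key_rate": 0.01,   # at most 1% duplicate business keys
}

def go_no_go(metrics: dict) -> bool:
    """Return True (go) only if every metric is within its SLO threshold."""
    breaches = {m: v for m, v in metrics.items()
                if v > SLO_THRESHOLDS.get(m, float("inf"))}
    if breaches:
        print("NO GO - SLO breaches:", breaches)
        return False
    print("GO - all metrics within SLO")
    return True

go_no_go({"null_rate_amount": 0.25, "duplicate_key_rate": 0.0})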

3. Transformation or Work Layer

Because the transformation or work layer is where raw data is shaped into a useful format for analysis and decision-making, in line with business rules and requirements, it is important to build data quality implementation at this stage into an enterprise-grade data platform. The main advantage of ensuring data quality at this layer is that it secures the expected outcomes of the business logic and significantly reduces error propagation to downstream and consumption layers. By validating and cleaning the data during this phase, organizations increase the integrity and reliability of their analytics.
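A minimal sketch of this idea, built around a hypothetical tax-calculation rule, wraps the transformation step with a row-count reconciliation and an assertion on the derived column before the result is published:

# Sketch of data quality checks wrapped around a business-rule transformation:
# reconcile row counts and assert a derived column before publishing.
# The tables, columns, and the 8% tax rule are hypothetical.
import pandas as pd

raw = pd.DataFrame({"order_id": [1, 2, 3], "net": [100.0, 200.0, 50.0]})

transformed = raw.assign(gross=lambda d: d["net"] * 1.08)   # hypothetical tax rule

# Reconciliation: the transformation should not drop or duplicate rows
assert len(transformed) == len(raw), "row count changed during transformation"

# Business-rule assertion: gross must always exceed net for positive amounts
assert (transformed["gross"] > transformed["net"]).all(), "derived gross violates the tax rule"

print("transformation checks passed")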

Adding data quality checks at this stage comes with many challenges. First, the checks can complicate the transformation process and slow down data pipelines. Organizations face the challenge of balancing due diligence with the need for speed, especially in areas that require real-time analytics. Additionally, integrating quality assessment into existing workflows may require substantial rework, which is time- and resource-intensive.

Despite these challenges, the rewards are great. Optimizing data at this level improves the accuracy of reporting and analysis, and increases user confidence in the data. By prioritizing data quality in the transformation phase, organizations set themselves up for better decision-making and operational efficiencies, paving the way for successful data-driven businesses. If these challenges are embraced, they can deliver more flexible and effective data quality controls. It is also worth noting that while detecting data quality issues at this stage is relatively easy, fixing them can be more costly. An application should effectively manage customer and downstream expectations, and follow the service level agreement (SLA) document to avoid penalties or regulatory compliance issues.

4. Semantic or Consumption Layer


At the semantic layer, implementing data quality is important for delivering reliable and actionable insights to end users and downstream applications. This layer acts as the crucial bridge between staged data and the final analytics stage. Rigorous validation at this stage supplies users with more accurate and contextually relevant data, which in turn allows them to make informed decisions aligned with organizational strategies.

However considerable the benefits, implementing data quality measures at this stage also presents significant challenges. A key benefit of focusing on the semantic layer is the ability to customize records so the enterprise receives analytical business insights, offering consumers tailored information without burdening them with implementing business rules. This customization can considerably enhance consumer engagement and satisfaction. Conversely, implementing data quality here may introduce complexity into the data management of the semantic layer, particularly when dealing with numerous, heterogeneous information sources.

Another challenge at this stage is preserving a consistent, coherent view of the data across numerous reports and dashboards. To navigate these challenges, the organization must invest both in data governance frameworks and in advanced data quality tools, such as the Data Quality Assessment Framework (DQAF), Collibra Data Quality (CDQ), and Dataplex. Such investments facilitate a streamlined data quality validation approach at the enterprise level, ensuring that the information delivered is accurate, reliable, and actionable.
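One lightweight way to approach the cross-report consistency problem is to reconcile the same metric across consumption-layer datasets within an agreed tolerance. The sketch below is purely illustrative; the dataset names, figures, and tolerance are hypothetical.

# Sketch of a consistency check across two consumption-layer datasets that
# should report the same metric. Names, values, and tolerance are hypothetical.
revenue_from_sales_mart = 1_204_530.00
revenue_from_finance_dashboard = 1_204_529.40

TOLERANCE = 0.001  # allow a 0.1% relative difference for rounding

relative_diff = abs(revenue_from_sales_mart - revenue_from_finance_dashboard) / revenue_from_sales_mart
if relative_diff <= TOLERANCE:
    print(f"consistent: relative difference {relative_diff:.5%} within tolerance")
else:
    print(f"inconsistent: relative difference {relative_diff:.5%} exceeds tolerance")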

Is there a best stage for data quality implementation?

In analyzing the implementation of data quality measures at different stages of data pipelines, we observed that the cost of resolving data issues increases as data progresses through the pipeline. Therefore, implementing data quality (DQ) at the earliest stages minimizes issue resolution costs.

In similar analyses, we found that the complexity of resolving data issues decreases as data progresses through the pipeline. This is because, as more context about the data becomes available, it is easier to define data quality rules with adaptive assumptions.

While data engineers may have differing opinions on the optimal pipeline stage for data quality implementation, it is important to emphasize that checking data integrity is an ongoing and iterative process. By developing a protocol for managing data quality at all stages, from data acquisition, transformation, and storage to consumption, organizations can reduce the risk of data problems spreading through the system.

Rapid anomaly detection at every stage ensures that data-driven decisions are based on high-quality data, improves reliability, and increases operational efficiency. Automated tools and manual controls improve data accuracy and precision, especially as data complexity and volume increase, and have the added benefit of reducing the cost of managing large data applications.

Finally, availability and timeliness are critical data quality dimensions that measure how quickly data is ready for use after it is stored and how fresh it is when it is used. Analyzing the time it takes to update data in each phase, and understanding how long it takes to make it ready for consumption, contributes to effective processes that maintain high data quality throughout the enterprise system’s data pipeline.
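A simple sketch of these two measurements, using hypothetical timestamps, computes the availability lag (event time to load time) and the freshness age at the moment of use:

# Sketch of measuring the timeliness dimensions described above: availability
# lag (event time to load time) and freshness at query time. Timestamps are hypothetical.
from datetime import datetime, timezone

event_time = datetime(2024, 10, 15, 8, 0, tzinfo=timezone.utc)    # when the record was created
loaded_time = datetime(2024, 10, 15, 8, 45, tzinfo=timezone.utc)  # when it became queryable
query_time = datetime(2024, 10, 15, 10, 30, tzinfo=timezone.utc)  # when a report used it

availability_lag = loaded_time - event_time   # how quickly data is ready after creation
freshness_age = query_time - loaded_time      # how fresh the data is when used

print(f"availability lag: {availability_lag}, freshness age at use: {freshness_age}")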




Sumit Tewari, Senior Manager Data Engineering for the world’s largest retailer, has more than 20 years of experience in the data domain, including specialized expertise in designing and optimizing complex enterprise data lake systems and pipelines, modernizing software systems, and spearheading technology migrations to cloud-based platforms.

Sumit is currently based in Frisco, Texas (US), where he leads large, cross-functional, global software engineering teams in developing scalable, highly available, enterprise-grade data flows to drive process improvement and high performance across Extract, Transform and Load (ETL) architecture design, analytics solutions, End of Life (EOL) and End of Service (EOS) product Service Level Agreements/Objectives for data sets, and resilience and disaster recovery processes, among a broad range of functional imperatives. Sumit earned his Master’s degree in Computer Applications in 2001 from Jawaharlal Nehru National College of Engineering (JNNCE) in Shimoga, India. Prior to his current role, he was Vice President of Software Engineering for the fifth largest global financial institution.


