Evolving Data Quality: From Static Gates to Continuous Observability

By Sumit Kumar Tewari, Contributing Writer  |  October 22, 2025



In modern enterprise data platforms, data quality and data governance protocols are essential to ensuring the reliability, integrity, and accuracy of an organization's data. Enterprise data quality includes key dimensions such as completeness, validity, conformity, accuracy, consistency, timeliness, and integrity, while data governance focuses on establishing the policies, standards, and procedures that enable organizations to capture reliable data, comply with regulatory standards, and support company-wide decision-making.

The landscape of data quality has evolved significantly in 2025. We've seen a fundamental shift from static quality gates to continuous observability and AI-driven automation. As organizations increasingly deploy AI models and autonomous agents, the "garbage in, garbage out" principle has never been more critical. Modern data quality isn't just about validation checks anymore - it encompasses monitoring, lineage tracking, drift detection, alerting, and proactive remediation through integrated feedback loops across the entire data lifecycle.

Cloud Platform Enhancements

Major public cloud platforms have significantly enhanced their data quality offerings with ML-based anomaly detection, serverless architectures, and deeper governance integrations. For example:

  • Google Dataplex has introduced Auto Data Quality as part of its Universal Catalog, enabling rule recommendations and automated data profiling with minimal setup. The platform leverages scalable YAML-based rules with serverless configuration requiring no infrastructure management. Automatic pushdown to BigQuery ensures zero-copy execution, and you can configure quality checks directly in Dataplex or integrate them into external orchestrators like Cloud Composer.

  • Microsoft Purview now provides a unified data governance experience (GA September 2024) featuring embedded AI capabilities, "data health" scoring integrated into the catalog, and comprehensive governance integrations across hybrid and multi-cloud environments. Purview has expanded its data quality capabilities within Fabric lakehouse and Unity Catalog environments, with on-premises SQL and Oracle scanning support coming in 2025.

  • AWS Glue Data Quality has advanced substantially with ML-based anomaly detection, dynamic rule generation, and expanded rule types including file freshness monitoring and referential integrity checks. The service supports incremental evaluation for cost optimization and has broadened table format compatibility to include Apache Iceberg, Apache Hudi, and Delta Lake.

Understanding Data Quality Across Pipeline Phases in 2025

The question of where in the data pipeline to apply quality measures is frequently discussed among engineering teams. A multi-layered approach covering source, transformation, and consumption establishes a comprehensive framework that enhances reliability across the enterprise data space.

I. Source System Layer

The data pipeline begins with the source. Establishing quality at this phase is the first step in maintaining reliable, accurate information organization-wide. Handling issues at the point of entry prevents errors from cascading downstream, where fixing them becomes more difficult and expensive.

  • Data Contracts Become Standard
    The biggest shift is organizations moving from informal agreements to documented contracts between source system owners and data platform teams. These versioned contracts clarify responsibilities and enable automated checks that catch problems before bad data enters the pipeline.

  • AI-Assisted Validation
    New tools using artificial intelligence spot schema violations and data anomalies automatically. Instead of writing rules for every possible problem, these systems learn what normal data looks like and flag anything unusual, catching issues at the earliest possible point.
  • Schema Enforcement
    Modern platforms now block data that doesn't match expected formats right at the door. If source data doesn't meet contract requirements, it simply won't enter the pipeline, preventing the "we'll fix it later" problem that used to cause downstream issues.

Implementing data quality at the source is more complex than at later stages because downstream usage of the data is only partly visible there. It still pays off over the long term: versioned contracts with automated validation reduce downstream correction effort and cause fewer disruptions in analysis and reporting.
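To make the contract idea concrete, here is a minimal sketch of schema enforcement at the point of entry. It is an illustration under stated assumptions, not any particular platform's API: the `FieldSpec` and `DataContract` classes, the `admit` helper, and the `orders` contract are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldSpec:
    """One column in a hypothetical data contract."""
    name: str
    dtype: type
    required: bool = True


@dataclass(frozen=True)
class DataContract:
    """Versioned agreement between a source owner and the data platform."""
    name: str
    version: str
    owner: str
    fields: tuple[FieldSpec, ...]


def validate_record(contract: DataContract, record: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the record conforms."""
    violations = []
    for spec in contract.fields:
        value = record.get(spec.name)
        if value is None:
            if spec.required:
                violations.append(f"{spec.name}: missing required field")
            continue
        if not isinstance(value, spec.dtype):
            violations.append(
                f"{spec.name}: expected {spec.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not declare are also treated as violations.
    known = {spec.name for spec in contract.fields}
    for extra in sorted(set(record) - known):
        violations.append(f"{extra}: field not declared in contract")
    return violations


def admit(contract, records):
    """Split records into accepted ones and rejected ones with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = validate_record(contract, record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical contract, for illustration only.
ORDERS_V1 = DataContract(
    name="orders", version="1.0.0", owner="orders-team",
    fields=(
        FieldSpec("order_id", str),
        FieldSpec("amount", float),
        FieldSpec("coupon", str, required=False),
    ),
)
```

Calling `admit(ORDERS_V1, batch)` returns the conforming records plus a list of rejected records with the reason for each. A source owner can act on that list directly, which is the accountability a contract is meant to provide.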

II. Staging or Raw Layer

Implementing data quality at the staging level is an important step in building a robust system. This phase acts as a second line of defense, validating data as it lands from the sources and before any transformation is applied.

  • Smarter Scanning Strategies
    Instead of checking every single record, organizations now use incremental scans and sampling techniques. Critical data gets full validation, while less important data gets statistical sampling. This cuts costs significantly without sacrificing quality where it matters most.

  • Continuous Drift Detection
    Rather than waiting for someone to notice problems, systems now monitor constantly for changes in schemas, data distributions, and patterns. When something shifts unexpectedly, alerts go out immediately, preventing small issues from becoming big problems.
  • Serverless Architecture Benefits
    The shift to serverless, pushdown-based quality checks means data doesn't need to move around as much. Checks happen where the data already lives, cutting processing time and cost while making frequent quality checks practical.
  • Real-Time Decision Support
    New observability platforms provide dashboards showing Service Level Indicators in real time. Teams can see immediately whether data is ready to move forward or needs attention first, replacing the old "wait and see" approach with informed decisions.

The cost-benefit analysis has improved substantially with serverless quality checks that minimize data movement and processing overhead. Pipelines should make informed "go/no-go" decisions about moving data forward, based on service level objectives and risk profiles.
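A go/no-go decision of this kind can be sketched as a comparison of observed Service Level Indicators against their objectives. The indicator names, thresholds, and the `Slo` and `evaluate_gate` helpers below are assumptions made for illustration. A real deployment would pull the indicator values from its observability platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target for one service level indicator (names are illustrative)."""
    indicator: str
    threshold: float
    higher_is_better: bool = True
    blocking: bool = True  # a non-blocking breach only raises a warning


def evaluate_gate(slis: dict[str, float], slos: list[Slo]) -> tuple[str, list[str]]:
    """Return ("go" | "go_with_warnings" | "no_go", breach messages)."""
    decision = "go"
    breaches = []
    for slo in slos:
        observed = slis.get(slo.indicator)
        if observed is None:
            met = False  # an unmeasured indicator counts as a breach
        elif slo.higher_is_better:
            met = observed >= slo.threshold
        else:
            met = observed <= slo.threshold
        if met:
            continue
        breaches.append(
            f"{slo.indicator}: observed={observed}, target={slo.threshold}"
        )
        if slo.blocking:
            decision = "no_go"
        elif decision == "go":
            decision = "go_with_warnings"
    return decision, breaches


# Hypothetical objectives for a staging table.
STAGING_SLOS = [
    Slo("completeness_ratio", 0.99),
    Slo("freshness_minutes", 60, higher_is_better=False),
    Slo("duplicate_ratio", 0.01, higher_is_better=False, blocking=False),
]
```

Marking an objective as blocking or non-blocking is one way to encode the risk profile: a stale or incomplete load stops the pipeline, while a slightly elevated duplicate rate only produces a warning.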

III. Transformation or Work Layer

The transformation layer is where raw data becomes business-ready through profiling and transformation aligned with business regulations and requirements. Quality checks at this stage ensure expected outcomes and minimize error propagation to downstream layers.

  • Differential Testing
    Tools like SmartDiff now let teams compare data before and after transformations automatically. This catches bugs that slip through traditional testing, especially ones that don't cause obvious errors but quietly corrupt results. The system explains what changed and why, making debugging much faster.

  • Selective Quality Checks
    Modern tools support incremental evaluation and selective rule application. You can run comprehensive checks on critical transformations while using lighter checks on routine operations, achieving better coverage without slowing down pipelines that need to run fast.
  • AI-Suggested Validation Rules
    Large language model systems can look at transformation logic and suggest appropriate validation rules automatically. They can even generate test cases and propose fixes when checks fail, shifting quality work from manual rule writing to reviewing and refining AI suggestions.

Adding quality checks at this stage comes with challenges. The checks can complicate transformation processes and slow down pipelines. Organizations must balance thoroughness with speed, especially in areas requiring real-time analytics. However, the rewards are significant. Optimizing data at this level improves reporting accuracy and increases user confidence in the data.
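The core of differential testing can be illustrated with a small keyed comparison of two result sets, for example the output of the current transformation and the output of a candidate version run on the same input. This is a generic sketch, not the interface of any specific tool. The `diff_rows` function and its parameters are hypothetical.

```python
def diff_rows(before, after, key, ignore=()):
    """Compare two lists of dict rows that share a unique `key` column.

    Returns the keys that were added or removed, plus per-column
    (old, new) pairs for rows present on both sides that differ.
    Columns listed in `ignore` (such as load timestamps) are skipped.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    changed = {}
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        columns = (set(old) | set(new)) - set(ignore) - {key}
        delta = {
            col: (old.get(col), new.get(col))
            for col in sorted(columns)
            if old.get(col) != new.get(col)
        }
        if delta:
            changed[k] = delta
    return {"added": added, "removed": removed, "changed": changed}
```

An empty result on all three counts suggests the candidate transformation is behavior-preserving for that input. A non-empty `changed` entry points at exactly the rows and columns where a silent corruption may have crept in.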

IV. Semantic or Consumption Layer

At the semantic layer, implementing data quality is important for delivering reliable, actionable insights to end-users and downstream applications. This layer acts as the crucial bridge between transformed data and final analytics.

  • Consumer-Level Validation
    Quality checks now verify data consistency across different views and consumption patterns. Systems validate that end users see consistent information regardless of how they access it, addressing the "why do these two reports show different numbers" problem.

  • Integrated Lineage Tracking
    When quality issues appear upstream, integrated lineage tracking shows exactly which downstream reports and dashboards will be affected. Teams can fix problems in priority order based on actual business impact rather than guessing.
  • Unified Catalog and Quality Workflows
    Systems like Microsoft Purview tie catalog, quality, and governance together. When quality checks fail, the system can automatically flag datasets, notify the right people, or restrict access until someone fixes the problem, replacing manual coordination with automated workflows.

  • Semantic Validation
    Embedding-based and vector-based anomaly detection can spot contextual problems that traditional rules miss. These systems understand meaning, not just format, which matters especially for AI and machine learning applications that need semantically correct data.
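The impact analysis that lineage tracking enables is, at its simplest, a graph traversal. The sketch below assumes lineage is available as a mapping from each asset to its direct consumers. The asset names and the `downstream_impact` function are made up for illustration, and catalog products expose lineage through their own interfaces.

```python
from collections import deque


def downstream_impact(lineage, failed_asset):
    """Return every asset reachable from `failed_asset` with its hop distance.

    `lineage` maps an asset name to the assets that read from it directly.
    Breadth-first search gives the shortest number of hops to each consumer.
    """
    distance = {}
    seen = {failed_asset}
    queue = deque([(failed_asset, 0)])
    while queue:
        asset, hops = queue.popleft()
        for consumer in lineage.get(asset, ()):
            if consumer not in seen:
                seen.add(consumer)
                distance[consumer] = hops + 1
                queue.append((consumer, hops + 1))
    return distance


# Hypothetical lineage graph.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.daily_sales", "mart.customer_ltv"],
    "mart.daily_sales": ["dashboard.revenue"],
    "mart.customer_ltv": ["dashboard.revenue", "ml.churn_features"],
}
```

Given a failed check on `staging.orders`, the function lists the marts, dashboards, and feature sets that depend on it. Teams can then rank those by business importance to decide what to fix or quarantine first.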

Foundational Pillars for Ensuring Data Quality and Resilience in 2025

  1. Observability and Data Contracts

The paradigm shift from static quality gates to continuous observability represents one of the most significant evolutions in data quality practice. Rather than treating quality as isolated checkpoints, modern architectures implement integrated feedback loops that provide real-time visibility into data health across all stages.

  • Continuous Monitoring
    Instead of checking data at specific points, modern systems monitor data health constantly across all pipeline stages. Problems get caught as they happen, not days later when someone notices bad reports. Data observability encompasses comprehensive monitoring of pipelines, including automated alerting on quality metric degradation, schema changes, volumetric anomalies, and freshness violations.

  • Data Health Metrics
    Organizations now track Service Level Indicators and Service Level Objectives specifically for data quality, treating it with the same importance as application uptime. Data quality becomes measurable and reportable.

  • Schema Evolution Management
    Platforms track every schema change across the entire pipeline. When upstream changes threaten to break downstream systems, alerts go out before anything breaks. Version management lets old and new schemas coexist during transitions without compromising quality.

  • Formal Data Contracts
    Data contracts specify exactly what producers promise and consumers can expect - schemas, quality levels, update frequencies, and ownership. Combined with automated validation, these contracts make quality problems easy to trace and fix. Everyone knows their responsibilities, and the system enforces them automatically.
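As one illustration of schema evolution management, a platform can classify each detected schema change before it propagates. The sketch below uses a deliberately simple assumption: removing a column or changing its type breaks consumers, while adding a column does not. The function name and the schema representation are hypothetical, and real compatibility policies are usually richer (for example, allowing type widening).

```python
def classify_schema_change(old, new):
    """Compare two schemas given as {column: type_name} dicts.

    Returns (is_breaking, list of change descriptions). Under the
    assumption used here, removals and type changes are breaking
    and additions are not.
    """
    changes = []
    breaking = False
    for column in sorted(old.keys() - new.keys()):
        changes.append(f"removed column {column}")
        breaking = True
    for column in sorted(old.keys() & new.keys()):
        if old[column] != new[column]:
            changes.append(
                f"type of {column} changed: {old[column]} -> {new[column]}"
            )
            breaking = True
    for column in sorted(new.keys() - old.keys()):
        changes.append(f"added column {column}")
    return breaking, changes
```

A breaking result can trigger an alert to downstream owners or require a new contract version. A non-breaking result can be recorded and allowed through.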

  2. AI and Automation in Data Quality

Artificial intelligence and machine learning are increasingly embedded not only in analytics workloads but in the data quality infrastructure itself, representing a fundamental evolution from manual rule authoring to intelligent, adaptive quality systems.

Modern platforms like Google Dataplex's Auto Data Quality and AWS Glue can analyze data patterns and automatically suggest appropriate validation rules. Some systems leverage large language models to generate validation logic from natural language descriptions of data quality requirements, dramatically reducing the time and expertise required to establish comprehensive quality checks.
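The idea behind profile-driven rule recommendation can be shown with a toy profiler that inspects a sample and proposes candidate rules for a person to review. This is only a sketch of the concept. The `suggest_rules` function is hypothetical, and the rule strings are generic pseudo-SQL rather than the rule syntax of Dataplex, Glue, or any other product.

```python
def suggest_rules(rows, column):
    """Profile one column of a sample and propose candidate rules.

    The suggestions describe what the sample looks like. They are
    meant to be reviewed by a person, not applied blindly.
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    rules = []
    no_nulls = bool(values) and len(present) == len(values)
    if no_nulls:
        rules.append(f"{column} IS NOT NULL")
    if no_nulls and len(set(present)) == len(present):
        rules.append(f"{column} IS UNIQUE")
    # Propose a range only for numeric columns (bool is excluded on purpose).
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        rules.append(f"{column} BETWEEN {min(present)} AND {max(present)}")
    return rules
```

Ranges derived from a sample tend to be too tight for production data, which is one reason the article's framing of "reviewing and refining" suggestions matters.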

Rather than relying solely on predefined rules, ML models can learn normal patterns in data distributions, relationships, and temporal behaviors, then flag deviations that might indicate quality issues. This is particularly valuable for detecting subtle problems that wouldn't trigger traditional threshold-based rules.
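A simple statistical stand-in shows the principle of learning a baseline and flagging deviations. The sketch below scores a new daily row count against recent history using the median and the median absolute deviation (a modified z-score). It is far simpler than the ML models cloud services use. The function names and the 3.5 cutoff are conventional choices assumed for this example, not settings from any platform.

```python
from statistics import median


def modified_z_score(history, candidate):
    """Score `candidate` against `history` using median and MAD.

    Returns None when the history has no spread (MAD of zero),
    because the score is undefined in that case.
    """
    center = median(history)
    mad = median([abs(x - center) for x in history])
    if mad == 0:
        return None
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return 0.6745 * (candidate - center) / mad


def is_anomalous(history, candidate, threshold=3.5):
    """Flag `candidate` if it deviates strongly from the learned baseline."""
    score = modified_z_score(history, candidate)
    if score is None:
        # A constant history makes any different value unusual.
        return candidate != median(history)
    return abs(score) > threshold
```

Because the baseline is learned from the data, no one has to hand-pick a fixed "row count must exceed N" threshold, and the median-based statistics keep one earlier bad day from distorting the baseline.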

Emerging systems not only detect quality issues but propose or even automatically implement corrections. This might include filling missing values based on learned patterns, correcting formatting inconsistencies, or reconciling conflicting records across sources. AI systems can automatically classify data types, identify personally identifiable information, infer relationships between datasets, and generate documentation.

Best Practices for Implementation

While data engineers may have differing opinions on the optimal placement of quality controls, it's important to emphasize that checking data integrity is an ongoing and iterative process. By developing protocols for managing data quality at all stages - from acquisition, transformation, and storage to consumption - organizations can reduce the risk of data problems spreading through the system.

Rapid anomaly detection at every stage ensures data-driven decisions are based on high-quality data. The cost-coverage tradeoff has become more nuanced with modern tooling capabilities. Organizations can now employ strategies such as:

  • Full validation at critical checkpoints (source ingestion, pre-consumption)
  • Sampled or incremental checks at intermediate stages to balance cost and coverage
  • Pushdown execution that minimizes data movement and processing overhead
  • Dynamic rule adjustment based on observed data patterns and risk profiles

Modern observability platforms provide real-time dashboards that track quality and latency metrics for each stage, enabling teams to identify bottlenecks and optimize for both.
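The first two strategies in the list, full validation at critical checkpoints and sampling elsewhere, can be sketched in a few lines. The helper names, the 10 percent default sample rate, and the fixed seed are assumptions chosen for illustration.

```python
import random


def select_rows_to_check(rows, critical, sample_rate=0.1, seed=0):
    """Return all rows for critical datasets, else a reproducible sample.

    Each row is kept independently with probability `sample_rate`.
    A fixed seed makes the sample repeatable across reruns.
    """
    if critical:
        return list(rows)
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < sample_rate]


def failure_rate(rows, check):
    """Fraction of inspected rows failing `check` (0.0 if none inspected)."""
    if not rows:
        return 0.0
    failures = sum(1 for row in rows if not check(row))
    return failures / len(rows)
```

For a non-critical table, `failure_rate(select_rows_to_check(rows, critical=False), check)` estimates the true failure rate at a fraction of the cost. The estimate carries sampling error, so thresholds on sampled checks are usually set with some margin.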

Looking Ahead

The data quality landscape continues to evolve rapidly. The data mesh paradigm's emphasis on domain-owned data products is reshaping quality responsibilities. As enterprises operate across cloud providers and maintain on-premises systems, native support for quality validation across heterogeneous environments becomes increasingly important. Beyond structural and statistical checks, semantic validation using embeddings, knowledge graphs, and contextual AI will detect consistency issues that simpler approaches miss. Future pipelines may leverage LLMs to conduct continuous self-assessment, generating their own quality checks and adapting validation logic as data patterns evolve.

For organizations looking to update legacy data quality implementations, the journey toward comprehensive, observable data quality is ongoing. Key recommendations include evaluating existing quality checks across pipeline stages, moving beyond static gates to continuous monitoring, formalizing agreements between data producers and consumers, and taking advantage of serverless, ML-enhanced quality tools that reduce infrastructure burden. Organizations that embrace these modern practices position themselves to leverage data as a strategic asset with confidence.

Sumit Tewari, Senior Manager Data Engineering for the world's largest retailer, has more than 20 years of experience in the data domain, including specialized expertise in designing and optimizing complex enterprise data lake systems and pipelines, modernizing software systems, and spearheading technology migrations to cloud-based platforms.



Sumit is currently based in Frisco, Texas (US), where he leads large, cross-functional, global software engineering teams in developing scalable, highly available, enterprise-grade data flows to drive process improvement and high performance in ETL architecture, analytics solutions, and resilient disaster recovery processes. Sumit earned his Master's degree in Computer Applications in 2001 from Jawaharlal Nehru National College of Engineering (JNNCE) in Shimoga, India. Prior to his current role, he was Vice President of Software Engineering for the fifth-largest global financial institution.


