Data Engineering with Technology Transformation

By Contributing Writer Sanjay Puthenpariyarath  |  September 13, 2024



Data is always a top priority for application developers and data engineers, particularly when a business depends on processed data to serve its customers. The architecture for data processing can vary across organizations due to several factors, such as data volume, data availability and usage, tools and technologies in use, infrastructure, and allocated budget. Over time, a system may go through various stages of its lifecycle, and without technological upgrades or process improvements, data processing can become increasingly challenging. An optimized data processing system results in faster query times, higher output, and lower costs, while ensuring scalability, availability, adaptability, and security.

In a traditional data processing system, application logic, data quality checks, and business rules may often be centralized and executed from a single location, which can consume significant resources and impact processing speed. However, leveraging modern tools, technologies, and innovative advancements can lead to significant improvements and unexpected breakthroughs in data processing. For example, organizations often seek to reduce costs by exploring the advantages of open-source tools and technologies, aiming to eliminate the hefty expenses associated with licensing fees.

Defining the problem: a case study in outdated infrastructure

Mobile carriers, insurance providers, and the device protection industry manage billions of customer records daily, making the timely enrollment of subscribers into various programs crucial to meeting customer needs. But what happens when the system architecture is outdated, requiring hours of processing time? This was the case at one of the leading insurance providers, where billions of customer enrollment records were being processed on aging infrastructure across the organization.

In this scenario, the system relied on on-premises architecture, utilizing traditional Oracle stored procedures, packages, SQL Loader, and middleware software to process data into the Operational Data Store (ODS), or staging database. From there, the middleware publisher transferred incremental (delta) data to the application databases, where customer data setup and front-end operations were performed. The enterprise data warehouse and reporting teams pulled data from the staging database to generate critical business reports. This setup incurred significant costs due to licensing fees and the ongoing maintenance of the on-premises system.

Customer data from daily enrollment files was processed through SQL Loader using control files, with Oracle procedures playing a key role in data cleansing, transformation, loading, and the application of business rules. The system consistently performed database reads to identify changes (the delta) between the current day's data and the previous day's data. After calculating the delta, the system would store the snapshot in a designated table and log the changes in a changelog table within the staging database. Finally, the fully processed delta was published to the target system using a middleware publisher, where customer setup was completed.

However, data processing frequently failed due to high resource consumption, connection timeouts, and service call failures. Customer data setup at the target system also required data retrieval from various systems via service calls, further contributing to delays. Processing was extremely slow, often taking hours, and failures would lead to data inconsistencies across the organization, resulting in customer complaints and escalations. An additional challenge was that incomplete or unavailable data in the target system prevented premium billing and/or caused inaccurate premium calculations, leading to revenue loss and potential legal issues.

Framing an open-source solution

This outdated infrastructure clearly needs a complete overhaul. A cloud-based solution such as Amazon Web Services (AWS) could significantly reduce physical infrastructure costs, but that alone will not sufficiently address the issue. To truly solve the current problems, the entire architecture needs to be redesigned. Relying on a relational database management system (RDBMS) for data processing is inefficient, as the system is overwhelmed by data cleansing, transformation, loading, and the application of business rules, eventually bottlenecking during read and write operations against the Oracle database. Is it possible to execute data processing in a distributed manner? Absolutely. Open-source technologies like Hadoop and Spark are a good place to start.

Do we really need to write everything from the enrollment files into a database and continuously use a relational database for processing? Not necessarily. We could leverage Spark DataFrames instead: in-memory computation would eliminate the need for a relational database at this step. The business rules should be rewritten more efficiently using Spark SQL and optimized with Scala code to handle the data processing currently performed in the ODS/staging database. Additionally, we must move away from using the middleware software entirely. Data retrieval from the target system, transformation, computation, and publishing should be managed by new AWS Lambda functions in conjunction with SNS topics. We can push the changelog (delta) data to an SNS topic, then subscribe to it with an SQS queue, which would trigger AWS Lambda to perform the necessary operations. This would eliminate the need for middleware software altogether.
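To make this concrete, here is a minimal Scala sketch of rewriting a stored-procedure-style rule as Spark SQL executed against an in-memory DataFrame. The column names (subscriber_id, plan_code, status) and the rule itself are hypothetical placeholders rather than the actual enrollment schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EnrollmentRules {
  // Sketch: a business rule that previously lived in an Oracle stored
  // procedure, expressed as Spark SQL over an in-memory DataFrame.
  // Column names and the rule are hypothetical.
  def applyBusinessRules(spark: SparkSession, enrollments: DataFrame): DataFrame = {
    enrollments.createOrReplaceTempView("enrollments")
    spark.sql(
      """SELECT subscriber_id,
        |       plan_code,
        |       UPPER(status) AS status
        |FROM enrollments
        |WHERE subscriber_id IS NOT NULL
        |  AND plan_code IS NOT NULL""".stripMargin)
  }
}
```

Because the rule runs on Spark executors rather than inside the database engine, it scales out with the cluster instead of competing for Oracle resources.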

Design and implementation

Rebuilding this infrastructure requires us to create a high-speed, real-time data processing system capable of managing large volumes of customer data using advanced technologies that greatly improve data management. Key to achieving significant results in handling high data volume and velocity with minimal latency will be the selection of technologies, design of data flows, and integration with existing systems. For optimal performance in this case study, we have chosen Hadoop, Apache Spark, Scala, AWS, and Node.js to enable design and implementation.

The in-house scheduler initiates the AWS data pipeline execution based on job schedules. The AWS data pipeline plays a key role in data ingestion, data processing, transformation, and data orchestration. At a high level, the AWS data pipeline is designed to perform two main steps:

  1. Execute shell scripts on an EC2 instance to validate inbound data files, ensuring file quality, correct formatting, encoding, preservation of special characters, proper delimiter counts, and accurate record counts (the sketch following this list illustrates these checks).
  2. Run a Spark job on an EMR cluster in a distributed manner. This job, written in Scala and using Spark SQL, carries out the detailed data processing steps.
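In production, the first step runs as shell scripts on the EC2 instance; purely to illustrate the kinds of checks involved, the following Scala sketch validates the field count per record and the overall record count. The delimiter, expected field count, and the source of the expected record count are assumptions.

```scala
import scala.io.Source
import scala.util.Using

object InboundFileValidator {
  // Hypothetical expectations for the daily enrollment feed.
  val ExpectedDelimiter = '|'
  val ExpectedFieldCount = 12

  /** Returns human-readable validation errors; an empty result means the file passed. */
  def validate(path: String, expectedRecordCount: Long): Seq[String] =
    Using.resource(Source.fromFile(path, "UTF-8")) { source =>
      val lines = source.getLines().toVector

      // Every record should contain exactly ExpectedFieldCount fields.
      val badLines = lines.zipWithIndex.collect {
        case (line, idx) if line.count(_ == ExpectedDelimiter) != ExpectedFieldCount - 1 =>
          s"line ${idx + 1}: expected $ExpectedFieldCount fields, found ${line.count(_ == ExpectedDelimiter) + 1}"
      }

      // The total record count should match the count supplied with the feed.
      val countError =
        if (lines.size.toLong != expectedRecordCount)
          Seq(s"record count mismatch: expected $expectedRecordCount, found ${lines.size}")
        else Seq.empty[String]

      badLines ++ countError
    }
}
```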

The inbound flat-file data, previously processed with SQL Loader and control files, is now handled entirely by Spark. Rather than writing all of the data from the flat files into a table, it is converted into DataFrames.
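As a rough sketch, the directives that previously lived in a SQL Loader control file map onto Spark reader options; the bucket path, delimiter, and header handling below are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EnrollmentReader {
  // Read a daily enrollment flat file directly into a DataFrame --
  // no staging table, control file, or SQL Loader step required.
  def readDailyFile(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")     // first row carries column names (assumption)
      .option("delimiter", "|")     // field separator formerly declared in the control file (assumption)
      .option("encoding", "UTF-8")  // preserve special characters
      .csv(path)                    // e.g. "s3://<enrollment-bucket>/inbound/2024-09-13/"
}
```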

The Spark job will read the daily enrollment files from the AWS S3 bucket and create a DataFrame for the inbound data according to the configurations defined in the S3 bucket or DynamoDB. It will also read and construct a history DataFrame from the previous day's snapshot data stored in the S3 bucket as flat files. The job will then perform data quality checks on the inbound DataFrame using the Artificial Intelligence and Machine Learning (AI/ML) models integrated into our data framework for identifying and resolving data quality issues. The process will generate a delta DataFrame, representing changes compared to the previous day's snapshot, by comparing the inbound DataFrame with the history DataFrame. Business rules will be applied to the delta DataFrame to ensure data consistency and integrity. These operations are conducted in memory, marking a significant shift from traditional processing methods that performed extraction, loading, and transformation within the database, often leading to high memory and resource consumption.
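The delta step can be sketched as set and join operations between the two DataFrames; the subscriber_id key is a hypothetical stand-in for the real business key, and both DataFrames are assumed to share the same schema.

```scala
import org.apache.spark.sql.DataFrame

object DeltaBuilder {
  // Rows present in today's inbound data that are new or changed
  // relative to yesterday's snapshot.
  def newOrChanged(inbound: DataFrame, history: DataFrame): DataFrame =
    inbound.except(history)

  // Rows present yesterday but missing from today's feed
  // (candidate terminations). "subscriber_id" is a hypothetical key.
  def missingFromInbound(inbound: DataFrame, history: DataFrame): DataFrame =
    history.join(inbound, Seq("subscriber_id"), "left_anti")
}
```

Both operations execute entirely in memory across the EMR cluster, replacing the per-row database reads the old system used to detect changes.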

The delta data will be categorized into four transaction types: 1. New customer setup; 2. Updates to existing customers; 3. Termination of existing customers; and 4. Re-enrollment of customers into the program.
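One possible way to derive these transaction types on the delta DataFrame is sketched below; the prev_status and curr_status columns and their values are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

object TransactionTyper {
  // Hypothetical columns: prev_status is null when the subscriber was absent
  // from yesterday's snapshot; curr_status is null when absent from today's feed.
  def withTransactionType(delta: DataFrame): DataFrame =
    delta.withColumn(
      "transaction_type",
      when(col("prev_status").isNull, lit("NEW_CUSTOMER_SETUP"))
        .when(col("curr_status").isNull, lit("TERMINATION"))
        .when(col("prev_status") === "TERMINATED" && col("curr_status") === "ACTIVE",
          lit("RE_ENROLLMENT"))
        .otherwise(lit("CUSTOMER_UPDATE")))
}
```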

The delta data (changes) will be loaded into a changelog table in the Oracle database for enterprise reporting, facilitating the derivation of metrics such as daily enrollments, program changes, sales agent commission disbursal, and premium collection reports. The final snapshot data, which includes customer data changes, will be loaded into the snapshot table in the Operational Data Store/Staging database and saved as a file in the snapshot folder on AWS S3 for the next day’s processing. This approach avoids the need to pull snapshot data from the database daily, reducing overhead costs associated with data access from relational databases. Finally, the changelog delta will be prepared and published to the changelog Amazon SNS topics.
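A condensed sketch of this persistence-and-publish step follows. The JDBC connection details, table names, S3 paths, and topic ARN are placeholders, and collecting the delta to the driver before publishing is a simplification that only suits small change sets.

```scala
import java.util.Properties

import com.amazonaws.services.sns.AmazonSNSClientBuilder
import org.apache.spark.sql.{DataFrame, SaveMode}

object ChangelogPublisher {
  def persistAndPublish(delta: DataFrame, snapshot: DataFrame): Unit = {
    // 1. Changelog rows go to Oracle for enterprise reporting (placeholder credentials/URL).
    val jdbcProps = new Properties()
    jdbcProps.setProperty("user", sys.env("ODS_USER"))
    jdbcProps.setProperty("password", sys.env("ODS_PASSWORD"))
    delta.write.mode(SaveMode.Append)
      .jdbc("jdbc:oracle:thin:@//ods-host:1521/ODS", "CHANGELOG", jdbcProps)

    // 2. The full snapshot is written back to S3 as flat files for tomorrow's run.
    snapshot.write.mode(SaveMode.Overwrite)
      .option("delimiter", "|")
      .csv("s3://example-bucket/snapshot/latest/")

    // 3. Each changelog record is published to the changelog SNS topic.
    val sns = AmazonSNSClientBuilder.defaultClient()
    delta.toJSON.collect().foreach { record =>
      sns.publish("arn:aws:sns:us-east-1:123456789012:changelog-topic", record)
    }
  }
}
```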

AWS event-based Lambdas are set up for each functionality and are subscribed to the changelog SNS topics:

  1. The Process Enrollment Lambda publishes changes to the target system, where it either sets up new customers or updates existing customer data based on the transaction type (an illustrative handler sketch follows this list).
  2. The Letter Setup Lambda prepares master data for sending physical letters to customers in accordance with transaction types and legal requirements.
  3. The Billing Profile Lambda generates billing events from the changelog data and publishes them to the target system for customer premium collection.
  4. The Enterprise Reporting Lambda replicates source data into the Enterprise Data Store (EDS) and then transfers it to the Enterprise Data Warehouse (EDW). This data is utilized by Power BI, Tableau, Redshift, and the in-house reporting framework for organizational-level business reporting, including sales prediction, agent commission calculations, client product growth analysis, and training of AI/ML models.
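The Lambdas themselves are written in Node.js; solely to illustrate the shape of an SQS-triggered consumer of the changelog topic, here is a JVM-based Scala sketch. The handler class, message format, and processing logic are assumptions.

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.SQSEvent

import scala.jdk.CollectionConverters._

// Illustrative only: shows an SQS-triggered handler consuming changelog
// messages fanned out from the SNS topic.
class ProcessEnrollmentHandler extends RequestHandler[SQSEvent, Void] {
  override def handleRequest(event: SQSEvent, context: Context): Void = {
    event.getRecords.asScala.foreach { msg =>
      // Whether the body is raw JSON or wrapped in an SNS envelope is an assumption.
      val changelogRecord = msg.getBody
      context.getLogger.log(s"Processing changelog record: $changelogRecord")
      // ... call the target system here to set up or update the customer ...
    }
    null
  }
}
```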

Logs from these processes are written to Kibana, facilitating easy tracking of the entire data processing workflow.

New architecture offers new advantages

This architectural overhaul has produced a sophisticated, fault-tolerant system and significantly enhanced the scalability of our data platform. Overall, investing in this technological transformation has delivered substantial, measurable benefits to the organization, including:

  • A highly secure, high-speed data processing platform
  • $1 million saved in licensing costs by replacing vendor-specific software with open-source tools and technologies
  • 40% improvement in processing time through in-memory computation
  • 50% reduction in operational costs by transitioning from on-premises to cloud-based infrastructure
  • 70% decrease in production tickets related to data integrity, availability, and quality
  • Promotion of a data-driven culture and adherence to data governance principles.

The innovative integration of AI/ML models into our data framework for identifying and resolving data quality issues has the potential to save millions of hours of manual work, producing significant, tangible benefits for organizations as well as for the data engineering field.


Author:
Sanjay Puthenpariyarath is recognized as an expert data engineer for his original contributions and success in designing and implementing scalable data architecture solutions for Fortune 500 companies in the banking, telecom, and e-commerce industries. For nearly two decades, he has specialized in big data processing, data pipeline development, cloud data engineering, data migration, and database performance tuning, using cutting-edge technologies that have enabled him to optimize data workflows and achieve significant improvements in financial and operational outcomes. Sanjay received a Bachelor of Engineering degree in Electronics and Communication Engineering from Anna University, India, and earned a Master of Science degree in Information Technology from the University of Massachusetts, Lowell (US). As a senior leader, he enjoys mentoring data engineers, promoting data-driven organizational cultures, and delivering complex projects on time and within budget.


 