Controlling Cost Drivers for Enterprise ML Projects

By Contributing Writer Akul Dewan  |  May 07, 2025



The IBM Global AI Adoption Index 2023, published in January 2024, concluded that 42% of enterprise-level organizations have adopted Artificial Intelligence (AI) for one or more operations, and that an additional 40% are experimenting with or in the process of adopting AI. Even with a marked shortage of AI expertise, a scarcity of good, clean data, and ethical concerns, the number of enterprises embarking on AI adoption initiatives has risen sharply in a relatively short period. This can be attributed to recent growth in technology availability, among several other factors. Concurrently, hyperscale cloud vendors have delivered massive growth in the compute and storage resources critical for Machine Learning (ML) models. Finally, an expanding ecosystem of dedicated ML platforms, libraries, and pretrained models has helped expedite research and development, and consequently time-to-market.

Working within the bounds of an enterprise software system, an ML project is expected to adhere to certain principles, policies, procedures, and controls, like any other software project. By extension, ML innovators are also expected to understand the cost implications of the technologies they adopt or plan to use in order to meet the standards applied to non-ML software projects. Yet many enterprises seeking to adopt AI experience unanticipated cost inflation—commonly known as “cost creep”—over the project’s development lifecycle. Controlling cost creep requires an understanding of the possible cost drivers and their potential financial impacts.

A Common Cause of ML Cost Creep

Accepted principles of software engineering prescribe the systematic approach and discipline required for developing and executing a mature software solution. Software enterprises employ a stack of tools, resources, and processes to adopt these principles.

Traditional, non-AI/ML software enterprises often introduce ML into their ecosystems via small, focused Proof-of-Concept (POC) projects. These POCs are limited in data, scope, and observability. Pressured by time-to-market and budget demands, they risk being deployed to production without proper resources and guardrails in place, which may result in system instability, cost creep, and unforeseen resource consumption—a common pitfall. Software engineering principles are as necessary for ML projects as they are for traditional software. Moreover, the tools, technologies, and resources currently used for traditional software may not fit the extended requirements of ML projects.

Cost Impact of Software Engineering Best Practices for ML

To control unwanted expenditures, it is important to understand the principles of software engineering that have a major cost impact on ML projects. Some of these cost drivers are introduced when new tools or technologies are found to be needed during development or implementation, while others stem from the procurement of resources, both compute and human.

1. Traceability

The ability to register and link artifacts across the system promotes visibility, control, and validity. These artifacts include, but are not limited to, code, configuration, logs, and solicited and unsolicited messages. Traditional software uses a variety of tools to version, persist, and log artifacts being generated, induced, provisioned, and deprovisioned.

ML projects require the same traceability. In addition, an ML system requires a specialized repository for lifecycle management of models and model artifacts. The lifecycle of a model starts at its inception, usually within a researcher's experimental notebook, and ends when the model is superseded by an evolved model. During its lifecycle, depending on the maturity of the software development lifecycle (SDLC) processes within an enterprise, each model may pass through several milestones: development, quality assurance, staging, production, and eventually decommissioning. This requires, similar to code version control, both model version control and access control. Model version control enables traceability of a model’s journey across the system, while access control ensures that at each stage of the lifecycle, appropriate accesses are granted and revoked.

During the lifecycle, cost drivers such as procurement, compute, storage, and operational costs are affected by several factors, including the features required, open source vs. proprietary solution selection, self-managed vs. vendor-managed deployment, average model size, and others.

Although a Git-like solution might be acceptable for version control of ML artifact deployments in the short term, a Model Registry solution such as MLflow, AWS SageMaker Model Registry, or Azure ML Registry is imperative in the long run. This is especially true for projects that either employ a per-customer model or have a fast release cadence for a model. In such cases, the compatibility of the ML model with the rest of the software becomes fragile and requires continuous manual compatibility testing across multiple versions. Change management processes likewise become bulkier and impractical, as release teams have to collaborate repeatedly to ensure the new release version of the artifact is compatible with the rest of the system, and vice versa.
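As a minimal illustration of what a registry adds over a Git-like store, the sketch below uses MLflow's model registry API to register a new model version and expose it through a stable alias. The run ID, model name, and alias are hypothetical placeholders, not a prescription.

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model logged by a training run under a versioned registry name.
# "runs:/abc123/model" and "churn-scorer" are hypothetical placeholders.
version = mlflow.register_model(
    model_uri="runs:/abc123/model",
    name="churn-scorer",
)

# Point a stable alias at the new version so downstream services can resolve
# "the model to serve" without hard-coding a version number.
client.set_registered_model_alias(
    name="churn-scorer",
    alias="staging",
    version=version.version,
)

# Consumers load by alias, decoupling model releases from software releases.
model = mlflow.pyfunc.load_model("models:/churn-scorer@staging")
```

Promoting a model then becomes a registry operation (moving the alias) rather than a software release, which is what keeps the change management process lightweight.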

For example, I designed an ML pipeline for multi-tenanted software in the automation industry. Software upgrades for each tenant required the customer’s availability and prior approval, which meant several versions of the software had to operate in production simultaneously. The challenge was to ensure that a newly available model, possibly compatible only with newer software versions, did not disrupt older models running on older software versions. Model version control was used in this pipeline, allowing software and models to update in production independently.
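One hypothetical way to express that decoupling: each model version declares the minimum software version it supports, and each tenant's deployment resolves the newest model it can safely run. The version numbers and helper below are illustrative, not the production implementation.

```python
from packaging.version import Version

# Hypothetical compatibility map: each model version declares the minimum
# software version it supports.
MODEL_COMPAT = {
    "3.1.0": "2.8",   # model 3.1.0 requires software >= 2.8
    "3.0.2": "2.5",
    "2.9.9": "2.0",
}

def latest_compatible_model(software_version: str) -> str:
    """Pick the newest model whose minimum software requirement is met."""
    candidates = [model for model, min_sw in MODEL_COMPAT.items()
                  if Version(software_version) >= Version(min_sw)]
    return max(candidates, key=Version)

# A tenant still on software 2.6 resolves model 3.0.2, not 3.1.0.
print(latest_compatible_model("2.6"))  # -> 3.0.2
```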

2. Repeatability

Enterprise software needs to produce deterministic results consistently under defined circumstances. This requires replicating or mimicking a production-like environment and then testing the system's functions. The replication is achieved by several means; a few key methods include capturing and replaying customer interaction scenarios, snapshotting message flows, and stressing the software under production-like system settings. Another facet of repeatability is the consistency of results expected from repeated tasks, like deployments and benchmark testing. This is achieved by leveraging automation tools; Continuous Integration/Continuous Deployment (CI/CD) tools and integrated automated test suites are very common.

ML models are probabilistic by nature, but, as with traditional software, deterministic results are expected from them. Determinism is induced in two ways: by data version control and by enforced safety checks.

Dun & Bradstreet’s AI Survey 2025 showed that 54% of organizations using AI are concerned with the trustworthiness and quality of the data used. Similar to code version control (and model version control, as previously described), data version control enables Git-like branching of datasets. This allows researchers to curate a controlled dataset for training the model, and quality assurance validators to certify the expected functionality against exactly the same dataset. Controlled updates to the dataset can be facilitated with Git-like check-ins, reverts, and branching. Further, production snapshots can be versioned and replayed later as required. Developing and versioning an Invariance Dataset to perform this validation not only provides a single point of reference across model evolutions, but also reduces the cost of dataset procurement, labeling, and collection.
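As a minimal sketch of the idea, DVC (one of several data versioning tools) exposes a Python API for reading a dataset exactly as it existed at a given Git revision; the repository URL, file path, and tag below are hypothetical placeholders.

```python
import dvc.api
import pandas as pd

# Pin training and validation to an exact, tagged revision of the dataset,
# so any later run can reproduce the same inputs bit-for-bit.
with dvc.api.open(
    "data/training_set.csv",
    repo="https://github.com/example-org/ml-data",  # hypothetical repo
    rev="dataset-v1.4",  # Git tag acting as the dataset version
) as f:
    train_df = pd.read_csv(f)

print(train_df.shape)
```

Because the revision is recorded alongside the model version, a quality assurance run months later can certify against exactly the same rows the researchers trained on.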

Safety checks are employed on the outputs of the ML model to verify that outliers and out-of-the-ordinary outputs are filtered or blocked as required. This step is critical for ML projects deployed in medicine, banking, transportation, and other industries where system failures can have a direct negative impact on safety or finances. The mandated compliance requirements for such industries also demand controls and monitors to ensure that model results are validated. For example, SR 11-7 and OCC 2011-12 on Model Risk Management (MRM) require that such safety checks be in place.

For instance, I designed a configuration-optimizing recommendation engine for a cybersecurity product. The engine analyzed incoming traffic patterns, the enabled features, and the existing configuration of the product, and leveraged Machine Learning to produce recommendations on how the product might be re-configured. Because an inaccurate or suboptimal recommendation from an ML model can compromise the security posture of customers, potentially making them more vulnerable to security breaches, a Safety Check process was deployed to mitigate inaccuracy concerns. This process checked for edge cases, anomalies in the input or output of the system, and subject matter expert (SME)-defined rules that collectively validated, and blocked if necessary, each ML output.
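A minimal sketch of such a gate, with hypothetical rules: every recommendation passes through range checks and SME-defined predicates before it is released, and anything that fails is blocked for review.

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    rule_id: str
    confidence: float
    disables_protection: bool

# Hypothetical SME-defined rules: each returns True if the output is safe.
SME_RULES = [
    lambda r: 0.0 <= r.confidence <= 1.0,   # sanity range check
    lambda r: r.confidence >= 0.7,          # minimum confidence floor
    lambda r: not r.disables_protection,    # never weaken the security posture
]

def safety_check(rec: Recommendation) -> bool:
    """Release the output only if every rule passes; otherwise block it."""
    return all(rule(rec) for rule in SME_RULES)

rec = Recommendation(rule_id="waf-042", confidence=0.91, disables_protection=False)
print("released" if safety_check(rec) else "blocked")  # -> released
```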

In addition to CI/CD, Continuous Training (CT) is also required for ML models, extending the pipeline to CI/CD/CT.

A well-defined Continuous Training pipeline ensures the model is continually trained on newer data and improves in efficacy over time. Along with tackling the problem of concept drift, CT also reduces the time researchers must otherwise spend keeping up with changing trends in the actionable data and periodically retraining the model by hand.

In essence, repeatability is achieved by employing a data versioning solution, both on its own and in combination with CT. Solutions like Project Nessie and Pachyderm are mature offerings in this space; their primary cost driver is storage. While CT can be achieved in many ways, it requires nearly duplicating the compute and monitoring resources used for initial training, roughly doubling the training cost of the overall system.
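As a minimal sketch of a CT trigger, the function below flags drift with a deliberately simple mean-shift test; the threshold and data are illustrative, and production pipelines typically use richer statistics such as PSI or Kolmogorov-Smirnov tests.

```python
import numpy as np

def drifted(reference: np.ndarray, recent: np.ndarray, threshold: float = 0.1) -> bool:
    """Flag drift when the recent feature mean shifts by more than `threshold`
    standard deviations of the reference (training-time) window."""
    shift = abs(recent.mean() - reference.mean()) / (reference.std() + 1e-9)
    return shift > threshold

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)  # training-time distribution
recent = rng.normal(0.5, 1.0, 1_000)      # drifted production window

if drifted(reference, recent):
    print("Drift detected: trigger the Continuous Training pipeline")
```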

3. Efficiency

As mentioned earlier, compute resources are one of the major contributors to operational cost. One way to reduce compute cost is to identify the optimal resource setup required to support expected, and spiked, customer traffic. This is achieved by evaluating system performance while stress-testing the system with the production environment replicated in lower environments. Another way is to optimize the software execution itself. Optimization typically requires using instrumentation tools to monitor processor, memory, and I/O utilization and the execution latency of compute resources, especially under stress. Engineers optimize systems by reducing code complexity, patching memory leaks, and applying other strategies that streamline code execution. For example, asynchronous, batched I/O is a common technique for reducing latency caused by I/O operations blocking the executing thread.
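A minimal sketch of that technique using Python's asyncio, where the fetch function is a hypothetical stand-in for any network or disk call:

```python
import asyncio

async def fetch(item: int) -> int:
    """Hypothetical stand-in for a network or disk call."""
    await asyncio.sleep(0.01)
    return item * 2

async def process_all(items: list[int], batch_size: int = 32) -> list[int]:
    """Issue I/O in concurrent batches instead of one blocking call at a time."""
    results: list[int] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results += await asyncio.gather(*(fetch(x) for x in batch))
    return results

print(len(asyncio.run(process_all(list(range(100))))))  # -> 100
```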

Compute resource requirements change depending on the type and size of the ML model. On one hand, executing Large Language Model (LLM)-sized models on a CPU may produce results at the rate of one token per minute, or may eventually fail with an out-of-memory error; GPUs are critical for such models. On the other hand, a CPU machine can be sufficient to execute a Decision Tree or a Regression Model. The challenge for researchers is to experiment with algorithms, techniques, and libraries to reach an acceptable trade-off between computation cost and accuracy.
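A small sketch of that trade-off, using scikit-learn on a synthetic dataset; the dataset and models are illustrative, and the point is comparing training time against accuracy rather than these particular numbers.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20_000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (DecisionTreeClassifier(max_depth=6, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    acc = model.score(X_te, y_te)
    # If the cheaper model's accuracy is acceptable, it wins on compute cost.
    print(f"{type(model).__name__}: fit in {elapsed:.2f}s, accuracy {acc:.3f}")
```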

Because a majority of models require preprocessing and postprocessing of data, instrumentation supporting optimization of the preprocessing and postprocessing steps is needed to improve their efficiency. A common approach is to establish a separation of responsibility within the ML project pipeline: ML researchers are responsible for model development, optimization, and maintenance, while preprocessing, postprocessing, and control signals for the entire pipeline are the responsibility of (ML) engineers. This division of ownership promotes a healthy separation of responsibility and allows the teams to stay laser-focused on their subject matter expertise. It also requires enhancements to change management, since the source code is no longer generated purely by engineering, but by researchers and engineers together.
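One hypothetical way to encode that ownership boundary is a shared interface contract, as in the sketch below: engineering owns the pre/post steps and the pipeline flow, while research owns only the model behind the interface.

```python
from typing import Protocol, Sequence

class Model(Protocol):
    """Contract the research team implements; everything else is engineering's."""
    def predict(self, features: Sequence[float]) -> float: ...

def preprocess(raw: dict) -> list[float]:
    # Engineering-owned: validation, encoding, feature extraction.
    return [float(raw["requests_per_min"]), float(raw["error_rate"])]

def postprocess(score: float) -> dict:
    # Engineering-owned: thresholds, formatting, downstream control signals.
    return {"risk": "high" if score > 0.8 else "low", "score": score}

def serve(raw: dict, model: Model) -> dict:
    # The pipeline owns the flow; the model is a pluggable research artifact.
    return postprocess(model.predict(preprocess(raw)))
```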

As a best practice, I encourage every engineering team I work with to adopt this division, even though it is often a polarizing recommendation. Commonly, leadership focused on financial numbers and resource counts will look at it from a spreadsheet point of view. Their initial response may be to question why researchers develop pre/postprocessing code only for engineering to redevelop the same code for productionization; to them, it looks like unnecessary duplication. They would prefer researchers to be engineers, or engineers to be AI/ML experts, but neither is practically possible. Convincing enterprise leaders of the value of this approach may take some effort, but in my experience the outcome is worthwhile. For example, for the aforementioned recommendation engine project, I demonstrated with simple calculations that one year of running the efficient production code recouped the investment in engineering, redevelopment, optimization, and operation of the ML pipeline. Adopting this change management strategy for ML further ensured that updates to the ML model and processes were tracked and a paper trail was maintained.

Just as with traditional software, it is necessary to optimize ML models by finding the simplest ML algorithm that represents the problem space most appropriately (often referred to as the Occam's Razor solution). Unoptimized software code or unnecessarily complex ML models will increase computational costs on any project. Memory leaks, code complexity, and I/O-intensive operations further increase the cost of execution.

4. Compliance

Security controls, policies, and monitors are put in place to ensure that sensitive data access, storage, and transit are controlled and only provisioned for privileged users on an as-needed basis. Depending on the country and industry, several compliance requirements may also be mandated by governments.

Along with the above controls, additional data controls are required for ML. The Stanford University AI Index Report 2024 showed that U.S. regulations on AI increased by 56% within one year. These regulations impact the government sector and other regulated industries, such as banking, insurance, healthcare, energy, telecommunications, and aviation. Depending on contracts, expectations, and specific industry regulations, one enterprise may be prohibited from using customer data for ML model training, while another might be allowed to use some, or anonymized, data. Additionally, some global enterprises might also need to adhere to cross-border data access restrictions and rules, such as the General Data Protection Regulation (GDPR) and the Digital Personal Data Protection Act (DPDPA). Recently, with the advent of generative AI-based chat agents, data compliance requirements have begun to dictate the type of model that can be trained as well.

Compliance guardrails are easier to manage when the data, its access, and its controls are centralized. As shown in McKinsey’s State of AI 2025 report, a majority of organizations are adopting a centralized “hub” model for AI data governance. Here, data lake implementations or similar solutions offered by vendors like AWS, Databricks, and Cloudera are often employed. A self-managed or vendor-managed solution will carry the same cost drivers of storage and computation. Along with the individual capabilities of each offering, the decision of which solution to use should also be based on data location: data ingress and egress costs are major cost drivers, and keeping consumers and producers within the same vendor ecosystem is typically cheaper and faster.

When I worked on sentiment analysis of brand products with an enterprise that served global customers, we carefully scrubbed our data of all customer identifiers (e.g., IPs, locations, email addresses, social media handles) before using the data to train ML models. This ensured that the data did not breach local laws when it was copied to central data storage and used for training. Such a process also makes it easier to address laws such as GDPR’s Right to Be Forgotten.
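A minimal sketch of that kind of scrubbing pass follows; the patterns are illustrative and far from exhaustive, and production pipelines typically combine pattern matching with dedicated PII-detection tooling and human review.

```python
import re

# Illustrative patterns only; real scrubbing needs much broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HANDLE = re.compile(r"(?<!\w)@\w{2,}")

def scrub(text: str) -> str:
    """Replace customer identifiers with placeholders before training use."""
    text = EMAIL.sub("[EMAIL]", text)   # run before HANDLE so emails win
    text = IPV4.sub("[IP]", text)
    text = HANDLE.sub("[HANDLE]", text)
    return text

print(scrub("Contact jane.doe@example.com or @janedoe from 203.0.113.7"))
# -> Contact [EMAIL] or [HANDLE] from [IP]
```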

Controlling Cost Creep

Recognizing how the cost drivers related to the four software engineering principles of Traceability, Repeatability, Efficiency, and Compliance can influence budgets during ML adoption is critical to mitigating cost creep and producing better financial and operational outcomes. Over the last two decades, the software engineering industry has matured around these best practices and advanced tool offerings; it is imperative for ML projects to now embrace them as well. The best approach to avoiding cost creep is a well-defined roadmap that assimilates the next generation of ML-supporting tools, processes, and resources necessary for productionizing ML projects.

Akul Dewan is a Senior Product Architect at Akamai Technologies, where he is responsible for developing architecture designs for the App and API Protections team developing cybersecurity products. With more than a decade of software engineering experience, plus specialized academic and career subject matter expertise in AI and ML tools, he leads cross-functional teams in designing complex AI/ML platforms, oversees governmental regulatory and compliance initiatives, and leads projects with ML capabilities for sophisticated cyber protection applications. He has been awarded two US patents and has three patents pending for his original innovations. Akul received his bachelor’s degree in Information Technology in his native India and earned a Master of Science in Artificial Intelligence from the University of Georgia, Athens (US). The opinions expressed in this article are his own.


