The Need for Cloud-Centric Network Performance Monitoring

By Special Guest Jim Frey, VP of strategic alliances at Kentik | December 07, 2016

Digital business is forever altering the landscape for how organizations brand themselves, maintain customer loyalty, go to market, generate revenue, and more. Digital business practices have driven a rapid shift to hybrid cloud-based IT, and along with that, a resurgent and critical dependence on the role that networks and network performance play. This shift is not simply a question of IT being more agile or cost efficient; it is highly visible all the way up the chain to the C-suite, because digital business is one of the primary levers for competitiveness and profitability in today’s enterprises.

Network performance monitoring (NPM) has been around for quite a while, and recently it has been getting attention for falling behind the times. In fact, in May of this year, Gartner Research Director Sanjit Ganguli released a research note entitled “Network Performance Monitoring Tools Leave Gaps in Cloud Monitoring” that makes some insightful critiques of the state of affairs in NPM in light of how applications and their network traffic patterns have changed. In the note, he explains the gap in cloud monitoring, saying traditional NPM solutions are of a bygone era.

“Today’s typical NPMD vendors have their solutions geared toward traditional data center and branch office architecture, with the centralized hosting of applications,” he writes.

Mind the Gap

Why does this represent a gap in functionality? Well, not long ago, enterprise networks were composed almost exclusively of one or more private data centers, connected to a series of campuses and branch offices by a private WAN, commonly based on MPLS VPN technology outsourced from a major telecom carrier. In that situation, you could park some NPM appliances in major data centers, perhaps directly connected to router/switch SPAN ports where major points of traffic aggregation occurred, such as data center ingress/egress, or via a packet broker from the likes of Gigamon or Ixia. You’d put a few others (with emphasis on the word few because these appliances have never been inexpensive) at the largest campuses, so you could see a bit of what else might be going on within those LANs. Voila, you covered (most of) your network. The devices would do packet capture; derive and crank out performance alerts and summary reports; and retain a small window of raw details for closer examination if you could get back to them before they were overwritten, typically a few hours at best.

But in the age of digital business, cloud is a pervasive reality. Cloud can mean a lot of things, but in this case it refers to the move to distributed application architectures, where components are no longer all resident on the same server or in the same data center, but rather are spread across networks including, on an increasingly common basis, the internet, and accessed via API calls. The days of monolithic applications running in a data center, serving users located strictly in campuses and branch offices that are connected over a strictly private WAN, are not quite over, but that model is now the legacy architecture. Cloud also means that users aren’t just internal users. Digital business initiatives represent more and more of the revenues and profits for most companies, so users are now end customers, audiences, clients, partners, and consumers, spread out across the internet and perhaps across the globe. The effect on network traffic is profound.

As Ganguli observes in his research note: “Migration to the cloud, in its various forms, creates a fundamental shift in network traffic that traditional network performance monitoring tools fail to cover.”

The Legacy Reality

Why can’t traditional network performance monitoring tools cover the new reality? Well, if you’re using an appliance, even a virtual one, you’re assuming that you can connect to and record packets off a network interface of some sort. That assumption worked when applications were relatively monolithic and there were relatively few points in the network where important communication was happening. In a traditional scenario, that meant where the LAN and the WAN met. However, cloud realities break that assumption. There are far more connectivity points that matter. API calls may be happening between all sorts of components within a cloud that don’t have the kind of network interface to which an appliance can attach.

Ganguli observes in his note that: “Packet analysis through physical or virtual appliances do not have a place to instrument in many public cloud environments.”

In the modern environments of web enterprises, SaaS companies, cloud service providers, and the enterprise application development teams that act like them, appliances are pretty much unthinkable. But even if you could use appliances, there are so many more connectivity points that matter that you’d need to distribute physical and virtual appliances very broadly. That sounds like a lip-smacking feast for legacy NPM appliance vendors, but the problem is that it has never been economically feasible for network managers to exhaustively deploy packet-based NPM because the cost of the appliances is non-trivial.

The result is functional blindness or, to put it more mildly, it’s really hard to figure out what’s happening. Ganguli’s words illustrate this.

“Compute latency and communications latency become much harder to distinguish, forcing network teams to spend more time isolating issues between the cloud and network infrastructure,” he writes.

This is not a new problem, but the new cloud reality adds an aggravating new twist. In legacy architectures, the network was the best place to start your triage and analysis when any non-obvious application performance problem arose. But now that you can’t see or touch much of the network, you can’t easily tell if the root cause is the network or not, nor use the network viewpoint to isolate the most likely source of the issue.
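
To make the compute-versus-network distinction concrete, here is a minimal sketch of how an endpoint agent might split a single request's delay into a network part and a server part. It is an illustration under stated assumptions, not a method described in the Gartner note or used by any particular product: the `RequestTiming` fields and the `split_latency` function are hypothetical names, and the TCP handshake round trip is only a rough stand-in for network latency during the request itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RequestTiming:
    """Timestamps in seconds that a hypothetical host agent records per request."""
    syn_sent: float             # client sent the TCP SYN
    syn_ack_received: float     # client received the SYN-ACK
    request_sent: float         # last byte of the request left the client
    first_byte_received: float  # first byte of the response arrived


def split_latency(t: RequestTiming) -> Tuple[float, float]:
    """Return (network_rtt, estimated_server_time) in seconds.

    Assumption: the handshake round trip approximates the network delay
    for the request, so whatever remains of the time to first byte is
    attributed to the server (compute) side.
    """
    network_rtt = t.syn_ack_received - t.syn_sent
    time_to_first_byte = t.first_byte_received - t.request_sent
    # Clamp at zero: jitter can make the handshake RTT exceed the wait.
    server_time = max(0.0, time_to_first_byte - network_rtt)
    return network_rtt, server_time
```

With a split like this taken from real application traffic, a slow response that shows a short round trip but a long server time points away from the network, which is the triage question the paragraph above describes.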

Manual software tools like tcpdump and Wireshark won’t work, nor will traditional NPM appliances, for the reasons outlined above. And worse yet, it’s actually an existential problem. Remember digital business, revenues, profits, audiences, brand, competitiveness? They all go down the tubes when user experience goes into the can. If you’re a network manager, you’re not going to be getting the dreaded 3 a.m. call from the CIO alone; it’s going to be the CMO, CRO, CFO, and CEO hammering you too.

Cloud-Centric NPM Considerations

If we accept that there is a gap in NPM due to the realities of the cloud, what should network managers consider going forward? Let’s start by focusing on a number of key requirements for cloud-centric NPM.

  • Based on real traffic: It’s tempting to think that, with the shift to cloud and internet traffic flows, you can replace real traffic-based performance measurements with synthetic/test traffic. Over time, the industry has tried synthetic transaction approaches, and network managers have voted with their wallets that measuring real traffic is far more useful. The reason is simple: when you’re measuring actual application traffic, you don’t have to do any intermediate correlation to know if some change in abstract performance metrics is affecting your applications. Synthetic tests can still be useful, but as a secondary add-on, not a primary NPM technique.
  • Deployable and feasible in scale-out style environments: Complete and total network performance monitoring will always require some mix of sensors, probes, and host-based software agents. Even a few of those legacy NPM appliances can come in handy, on the parts of the infrastructure that you own and to which you have ready access. But on the cloud side of hybrid infrastructures, where the traditional appliance model breaks down, you’ll need software probes and agents that focus on the endpoints, since you won’t have access to the network path. Excellent options here are deploying an agent onto an application server instance or on load balancers such as HAProxy or NGINX servers.
  • Internet path and geolocation-aware: Even in the age of cloud, geographic and topological location can play a major role in network performance experience. Since so much of cloud traffic traverses the internet between API-connected applications and then on its way to customers, subscribers, and end users, it’s essential to be able to understand how performance issues correlate to location. This means establishing clear visibility into things such as the exit path to the internet being utilized, which major provider and transit networks the traffic traversed, and which networks in which geographies are having issues. The ability to rapidly pivot around and through these indicators when looking at network performance makes a huge difference in understanding if it’s the network or not.
  • Cloud-scale data analysis: Aside from the fact that they are not deployable in many cloud environments, one of the major downsides of the appliance model for NPM is that it is based on pre-cloud computing and storage assumptions. A cloud-centric NPM approach should utilize a cloud-based and cloud-scale approach to collecting and analyzing data, offering cloud or SaaS-like time to value, while being able to scale horizontally to store and rapidly analyze many billions of data records. Such solutions will provide a platform upon which it becomes possible to conduct effective volumetric traffic analytics, providing important contextual information for tracking and understanding network performance.
  • API-friendly: Cloud architectures utilize APIs liberally to create value-added integration points. Network performance monitoring needs to join the API economy as well, and break out of the box it’s been in. Most appliances, due to their compute and storage constraints, are engineered primarily around their UI, and leave API connectivity as an afterthought or to central reporting servers that cannot hope to provide access to full/raw granular data. With cloud-scale platforms, there is no such constraint, so API-friendliness should be expected as a given.
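
As a rough illustration of the geolocation-aware pivoting described in the list above, the following sketch groups latency measurements from real traffic by an arbitrary dimension and flags the groups that look slow. Everything in it is assumed for the example: the record fields (`country`, `transit_asn`, `latency_ms`), the function names, and the sample values are invented, and the AS numbers come from the range reserved for documentation. A cloud-scale platform would run this kind of aggregation over billions of records rather than an in-memory list.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Mapping

# Invented records, purely for illustration; not real measurements.
SAMPLE_RECORDS = [
    {"country": "US", "transit_asn": "AS64496", "latency_ms": 40},
    {"country": "US", "transit_asn": "AS64496", "latency_ms": 42},
    {"country": "DE", "transit_asn": "AS64497", "latency_ms": 180},
    {"country": "DE", "transit_asn": "AS64497", "latency_ms": 220},
    {"country": "DE", "transit_asn": "AS64496", "latency_ms": 200},
]


def median_latency_by(records: Iterable[Mapping], dimension: str) -> Dict[str, float]:
    """Group records by one dimension and return the median latency per group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        groups[record[dimension]].append(record["latency_ms"])
    return {key: median(values) for key, values in groups.items()}


def slow_groups(summary: Mapping[str, float], threshold_ms: float) -> List[str]:
    """Return the group keys whose median latency exceeds the threshold, sorted."""
    return sorted(key for key, value in summary.items() if value > threshold_ms)
```

Calling `median_latency_by(SAMPLE_RECORDS, "country")` and then `median_latency_by(SAMPLE_RECORDS, "transit_asn")` is the kind of pivot the list describes: the same records are viewed first by geography and then by transit network to see where a problem concentrates.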

Plenty of Options

The good news is that the cloud itself makes many of the above requirements achievable. Network managers can turn to a growing number of commercial IaaS and open-source platforms that are capable of scaling out with tons of compute and storage power. If their teams have strong development skills, there are options for standing up homegrown solutions today.

Of course, not every organization has the means or will place a priority on developing and maintaining a solution. Thankfully, while network performance monitoring has grown a bit long in the tooth, the industry is responding to fill the gap that Gartner has identified. Just as APM tools pivoted to the cloud with the emergence of vendors like New Relic and AppDynamics, the NPM tools and solutions space is cloudifying, so network managers can expect to find cloud-based network performance monitoring sooner rather than later. Such solutions are already available, and more are coming out every day.

Jim Frey is vice president of strategic alliances at Kentik.

Edited by Alicia Young