Follow the White Rabbit: CloudPhysics Exploration Mode — Strong Correlation to Cause

CloudPhysics Releases Exploration Mode

Funny tweets cause earthquakes. Stop laughing and keep us safe.

Really. Here’s an equation drawing out the correlation:
[Image: an equation "proving" that funny tweets cause earthquakes]

“Correlation is not causation.” The reprimand is so common that finding funny examples has become comic sport, and the best ones turn into internet memes, like this guy’s catalogue.

The not-so-funny aspect of correlation is that it’s widely relied upon to address vSphere operational challenges, usually in a weak form. Admins settle for that because, until now, no tool on the market has made correlation a seriously effective operational debugging technique. The introduction of CloudPhysics’ Exploration Mode on our SaaS platform changes that.

This is not to say correlation per se is wrong or not useful; it’s just been abused by tools and admins strapped for function and time to apply it effectively. Weak correlations in your operational management need to be replaced with strong correlations and context – the combination we supply in our new Exploration Mode, to lead you more accurately and faster to cause and solution. Without this form of analysis you find yourself tumbling down a rabbit hole.

So what does a weak correlation look like?

Perfmon has come a long way. It’s a part of every admin’s troubleshooting war chest. Here’s a view from Server 2012:

[Screenshot: Performance Monitor memory counters on Server 2012]

The essential problem with this view (i.e. what makes it weak) is that it doesn’t lead you to root cause. It plots many memory counters against what appears to be a shared time axis, but it’s still one-dimensional: memory utilization has no context. What is causing the variations in the different memory counters? The rabbit hole only gets deeper and more confusing when you think about what else you need to know in order to decide where to look next:

  • this is a view of only one host: what is happening across the hosts in a cluster?
  • what’s going on with the other resources (CPU, network, storage) within this host (and across the cluster)?
  • what other events or issues exist outside this resource view, and how do they relate to these resource metrics now and over time? What other data sources can I align with these resource consumption trends?

Why is weak correlation so limiting?

Weak correlation is crippling in vSphere environments, where consolidation inherently creates resource contention across a cluster and all the rabbit hole questions get deeper and darker. Simple mistakes in configuration, or changes to the infrastructure, can punish operational stability. “Performance problems” in vSphere environments show up as full-on application disruptions, and those disruptions are caused predominantly by changes introduced through human error. These errors take many forms: straightforward misconfigurations, insidious “configuration drift”, best practices not followed. Such “events” evolve unnoticed, until your phone rings. Even then, you see only the symptoms of the problem, not the root cause: unpredictable application latencies or, worse, an unexplained outage.

When all you have to go on is weak correlation, your ability to see through these symptoms to the underlying problem sources, which don’t show up in your monitoring tools or logs, is severely limited. This leads you down the wrong troubleshooting path and lengthens time to resolution. Worse, weak correlation by definition contains no context, that is, no meaningful relationships in which the correlation is set. It does nothing to help you avoid repeating the same mistake, and it fails to lengthen the time to the next app disruption:

[Figure: time between application disruptions]

Remember — it’s not enough to solve disruption problems faster (“faster TTR!”). You must also decrease the number of disruptions, their duration, and frequency. You can only do this with strong correlations set in context.
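
To make those three quantities concrete, here is a minimal sketch of how you might compute them from an incident log. The incident timestamps and the `disruption_stats` helper are hypothetical, made up purely for illustration; they are not CloudPhysics data or part of any CloudPhysics API.

```python
from datetime import datetime
from statistics import mean


def disruption_stats(disruptions):
    """Summarize (start, end) disruption intervals.

    Returns the count, the mean duration, and the mean gap between the end
    of one disruption and the start of the next, all in seconds.
    """
    ordered = sorted(disruptions)
    durations = [(end - start).total_seconds() for start, end in ordered]
    gaps = [
        (next_start - prev_end).total_seconds()
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return {
        "count": len(ordered),
        "mean_duration_s": mean(durations) if durations else 0.0,
        "mean_gap_s": mean(gaps) if gaps else None,  # None: fewer than two incidents
    }


# Hypothetical incident log, for illustration only.
incidents = [
    (datetime(2014, 3, 1, 9, 0), datetime(2014, 3, 1, 9, 30)),
    (datetime(2014, 3, 3, 9, 30), datetime(2014, 3, 3, 10, 30)),
    (datetime(2014, 3, 8, 10, 30), datetime(2014, 3, 8, 11, 0)),
]
stats = disruption_stats(incidents)
print(stats)
```

Faster TTR only shrinks `mean_duration_s`. The goal described above is also to push `count` down and `mean_gap_s` up.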

How can I generate strong correlations?

CloudPhysics’ Exploration Mode addresses these shortfalls in today’s operational management approaches by revolving multiple dimensions around the common axis of time. This lets the user quickly and visually explore many sources of data, seeing not only how they’ve changed but also the order and sequence in which the changes occurred.

What’s important to note is how many dimensions we capture in one viewing pane for a user to grok at a glance:

  • Dimension 1: context – we do not arbitrarily stack graphs of different resources; instead, we create specific views of resources (CPU, memory, network, storage) and their relationships, based on our domain expertise in the design and operation of vSphere;
  • Dimension 2: dependencies and relationships – every context contains relationships with other objects; we create the ability to navigate among these, over changes and time;
  • Dimension 3: change sequencing – with time as the fulcrum of each of our views, the user can view ordering of changes to build strong correlations for seeking causality;
  • Dimension 4: rates of change and durations – the frequency of changes and their durations can be seen in the time series by manipulating the time scope across any selected context. Frequency and duration become part of your correlation analysis.
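
As a rough illustration of Dimensions 3 and 4, the sketch below merges several event streams onto one time axis and measures how often things changed within a chosen time scope. It assumes each stream is already sorted by timestamp. The stream names, event descriptions, and helper functions are all hypothetical; this shows the general idea and is not how Exploration Mode is implemented.

```python
import heapq
from datetime import datetime

# Hypothetical, already time-sorted event streams: (timestamp, source, description).
config_changes = [
    (datetime(2014, 3, 1, 9, 0), "config", "vCPU count changed on vm-01"),
    (datetime(2014, 3, 1, 9, 40), "config", "vm-07 migrated to host-a"),
]
issues = [
    (datetime(2014, 3, 1, 9, 20), "issue", "NIC link flap on host-a"),
]
alarms = [
    (datetime(2014, 3, 1, 9, 45), "alarm", "CPU ready high on vm-01"),
]


def build_timeline(*streams):
    """Merge sorted event streams into one list ordered by timestamp."""
    return list(heapq.merge(*streams))


def changes_per_hour(timeline, start, end):
    """Rate of events inside the half-open time scope [start, end)."""
    in_scope = [event for event in timeline if start <= event[0] < end]
    hours = (end - start).total_seconds() / 3600
    return len(in_scope) / hours


timeline = build_timeline(config_changes, issues, alarms)
for ts, source, description in timeline:
    print(ts.isoformat(), source, description)

rate = changes_per_hour(
    timeline, datetime(2014, 3, 1, 9, 0), datetime(2014, 3, 1, 10, 0)
)
print(f"{rate:.1f} events per hour in scope")
```

Reading the merged timeline top to bottom gives you the ordering of changes. Narrowing or widening the `start` and `end` arguments plays the role of manipulating the time scope.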

These unique capabilities get you to the source of problems faster, but they also enable you to identify contributing factors and patterns, so you can remediate them and avoid repetitions. As a result, you diminish the number of app disruptions while lengthening the time between them. Once you discover that intermittent latency is caused by an unsupported PCIe driver, or by a known bug described in an obscure KB article you had never seen, you are spared a maddening chase down the same rabbit hole again and again.

Let’s look at an example in Exploration Mode

 

[Screenshot: CloudPhysics Exploration Mode]

In this example, Exploration Mode gives us the context to determine whether a VM is CPU-bound: CPU Ready Time vs. CPU Demand. Ready time is the VM’s view: it indicates the VM has work ready to go but is waiting for a physical CPU. CPU demand is the view of the physical host, on which multiple VMs are running and demanding CPU resources. When CPU ready time peaks together with CPU demand, you’ll want to explore this context over time to understand duration and severity, and to isolate the disruption to a CPU resource shortage. CloudPhysics Exploration Mode lets you expand and narrow the time focus for this context.
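
If you want to put a number on “peaking together”, here is a minimal sketch. It assumes CPU ready arrives as a summation in milliseconds per sampling interval (20 seconds is the interval of vCenter real-time charts) and that both series were sampled at the same timestamps. The sample values and function names are hypothetical.

```python
from math import sqrt


def ready_percent(ready_ms, interval_s=20):
    """Convert a CPU ready summation (ms) into a percentage of the interval."""
    return ready_ms * 100 / (interval_s * 1000)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps.
vm_ready_ms = [200, 400, 1800, 2600, 600]  # VM CPU ready, ms per 20 s sample
host_demand_pct = [35, 40, 88, 95, 50]     # host CPU demand, % of capacity

vm_ready_pct = [ready_percent(ms) for ms in vm_ready_ms]
r = pearson(vm_ready_pct, host_demand_pct)
print(vm_ready_pct, round(r, 3))
```

A coefficient close to 1 says the two series move together. On its own that is still the weak kind of correlation: it does not tell you what changed to drive demand up.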

Exploration Mode: The Fastest Path to Cause

But wait, there’s more. Isolating the context and the symptom (a resource shortage) does not yet put you on the path to cause. For that you need more data, so you can correlate other changes that may have preceded the resource hit. Look again at the Exploration Mode view above and notice the other information available in the changes and structural issues elements, located above the time series chart. CloudPhysics aligns these along the same time axis you are exploring, producing a fully correlated view across multiple sets of data. In this screen, we see a detailed change log containing configuration and performance events for the particular VM we’re viewing in context. At the same time (literally) you can see, to the right, issues populated with known problems and events that have occurred on this VM or its dependencies. In our example above, your attention is drawn to a red event indicating a network connection problem.
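
The idea of looking for what preceded the resource hit can be sketched in a few lines. The events, the spike time, and the `preceding_events` helper below are hypothetical and only illustrate the technique of filtering aligned events to a lookback window before a symptom.

```python
from datetime import datetime, timedelta

# Hypothetical events for one VM and its dependencies: (timestamp, pane, description).
events = [
    (datetime(2014, 3, 1, 8, 0), "change", "VM tools upgraded on vm-01"),
    (datetime(2014, 3, 1, 9, 20), "issue", "Network connection problem on host-a"),
    (datetime(2014, 3, 1, 9, 35), "change", "vm-07 migrated to host-a"),
    (datetime(2014, 3, 1, 10, 5), "change", "Snapshot taken on vm-01"),
]


def preceding_events(events, symptom_time, lookback):
    """Events in [symptom_time - lookback, symptom_time], most recent first."""
    window_start = symptom_time - lookback
    hits = [event for event in events if window_start <= event[0] <= symptom_time]
    return sorted(hits, key=lambda event: event[0], reverse=True)


spike = datetime(2014, 3, 1, 9, 45)  # assumed time at which CPU ready peaked
candidates = preceding_events(events, spike, timedelta(hours=1))
for ts, pane, description in candidates:
    print(f"{spike - ts} before spike: [{pane}] {description}")
```

Events after the spike, or long before it, drop out. What remains is a short, ordered list of candidate causes to investigate first.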

In this one view, across three correlated dataset panes, we can take immediate action to get to cause: the network connection problem is probably producing new traffic load or triggering a failover process, which increases CPU load and backs up access to CPU resources for the VMs on this host. One view in CloudPhysics Exploration Mode, and you are headed in the right direction, using the data out of vSphere to your best advantage.

Strong Correlation is the White Rabbit

As you can see, correlation can be a powerful tactic for wrestling problems to the ground. Our Exploration Mode combines multiple dimensions with their associated contexts and makes them navigable across time. This is your White Rabbit. It will get you through the Looking Glass that is vSphere. You can start your trip by going here. Follow the White Rabbit.