Noisy Neighbor Where Art Thou? VM Performance Culprit and Victim Analysis Using CloudPhysics Storage Analytics

Photo by Tim Easley on Unsplash

In my last blog post I talked about the our new Storage Analytics offering. Here, I want to focus on one of the core features of this product offering — Datastore Contention card.

We all know that virtualization helps to drive up utilization of physical resources, but it also makes resource contention almost inevitable. VMware used the term “noisy neighbor” to denote these resource contention issues. One of the commonly cited barriers for virtualization is the lack of visibility into noisy neighbor issues. Over the years, VMware administrators have trained themselves to spot CPU and memory contention issues. However, detecting datastore contention has always been really hard.

Why is datastore contention difficult to detect?

You can readily detect CPU contention by monitoring CPU ready time and memory contention by monitoring memory ballooning or swapping metrics on a single host. But what would you monitor to track datastore contention? Unlike CPU or memory resources, datastore contention does not happen within a host. A datastore is a shared resource accessed by multiple physical hosts at the same time. To determine contention hotspots one needs first to look at the I/O metrics from every host and the VMs that are connected to the datastore. Then one must aggregate and correlate them into a unified view. To achieve this today, one has to be both a PowerCLI guru and an Excel savant. All the while you are likely to put a lot stress on your vCenter to pull all that performance data.

Even if one managed to pull all the data, it is still a very tedious job to do visual correlation and analyze all the metrics. The storage performance metrics such as IOPS, Outstanding IOs, Latency and Throughput have very intricate relationships. Only a trained eye can spot issues from these metrics. In the end, VMware administrators are faced with analysis paralysis. When the analysis is overwhelming one might be tempted to simply throw more hardware at the problem in the form of disk spindles, memory or SSD cache. Or one may even simply reduce the consolidation undermining the core value of virtualization.

This is specifically why we built our datastore contention analysis card. With it you can:

  • Quickly find which datastores are experiencing contention
  • Find out the overall health and performance of all the datastores
  • Determine when contention is occurring
  • Identify culprit VM(s) and the victims

Which datastores are experiencing contention?

When you have hundreds of datastores how do you find out which ones are experiencing contention and which ones are not? In our Datastore Contention card we simplified this by automatically classifying the datastores into those that require your attention and those that are noteworthy.


Need Attention, Noteworthy Tabs

If you want ignore some parts of your infrastructure such as test and development environment, we got you covered. Using the filter menu you can select your vCenter, Datacenter, Compute cluster and storage cluster and even search for a single datastore name. If you have the habit of using specific naming conventions for your datastore names you can also leverage our support for regular expressions in the search filter.


Filter Options to Select Inventory


Overall health and performance of the datastores

We provide useful summary aggregated metrics such as overall throughput, average latency and peak latency for all the datastores. These metrics auto update as you change your datastore selection. Often many administrators wonder what is normal and what is abnormal values for some of these metrics. Some operations management tools claim that they can do anomaly detection but these tools are limited by the dataset that they have access to. If you already have a bad baseline then the anomaly detection algorithm is not that useful to spot the existing problems. One of the other powerful aspect of CloudPhysics is the access to performance metrics from a wide variety of infrastructures. Using this global dataset we can spot performance outliers much more easily even if you don’t have a good baseline in your infrastructure. For instance, if the metric values in your infrastructure not only exceed your baseline but also exceed many other similar infrastructures, we can indicate that you have a problem with much greater certainty.


Datastore Performance Chart Highlighting Contention Hotspot



Datastore Performance Chart Highlighting Contention Hotspot

When does the contention occur? Our analytics continuously monitor the storage performance metrics and identify hotspots within the last 24 hours. Our hotspot detection is not based on a simple thresholding approach. Instead, we run complex analytics in our cloud backend to identify issues. Once we identify a one or more performance hotspots we highlight the likely problematic time periods in the performance chart.

Which VMs are the Culprit and which are the Victims?

The most important feature of the datastore contention card is its ability to automatically find culprit and victim VMs. In the past, this sort of analysis would require you to have a storage performance guru spending hours scouring through all the performance data. I myself have done this analysis in the VMware performance group and I know how painful and tedious it is. This is why I’m really proud of the way we have simplified this analysis and automated down to a few simple clicks. Now all you have to do is select the datastore and click on the hotspot to identify culprits and victims.


VM Culprit and Victim List


Two weeks ago we launched our storage analytics product and we have been receiving tremendous feedback so far. I’m really excited about the direction that we are heading. I would love to hear more of your feedback and suggestions — please provide them in the comments section below or send me a tweet.