VM Bullies and Other Performance Surprises

I spent a decade at VMware chasing customers’ virtualization performance problems, and took away two key lessons:

  1. The vast majority of performance problems are related to storage infrastructure in some fashion.
  2. Problems that actually get reported are just half the story. In reality, many issues go undetected until a serious problem surfaces. Even then, problems often go unreported because, to avoid the time and hassle of root cause analysis and resolution, many users opt for quick-and-dirty workarounds or just throw more hardware at the problem.

One of the unique things about CloudPhysics is that we receive a continuous stream of metadata from our users, day in and day out, problem or not. Our analytics engine processes this incoming data stream, continuously yielding incredible insights. (We call this Collective Intelligence.)

In anticipation of today’s launch of our new Storage Analytics, we ran one of our analytics, Datastore Contention, across our global customer dataset for a 24-hour period. We’ve summarized the results in the infographic below, followed by some additional details behind these somewhat surprising findings.

Datastore contention is more common than you think

Datastores are shared not only across VMs on the same host but also across hosts. So it’s not surprising to discover performance problems due to storage resource contention. What is surprising is the extent of the problem. Within a 24-hour window, over 95% of organizations experience some sort of I/O resource contention. Then, to highlight the most severe performance impact, we filtered out contention periods where the sustained average latency is less than 37 ms, a high latency threshold. We observed that 55% of the organizations are severely impacted within the 24-hour window. That’s a lot of performance impact due to I/O contention.
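
To make the severity filter concrete, here is a minimal sketch of the idea: drop contention periods whose average latency falls below 37 ms, then count the share of organizations that still have at least one period left. The function name, the data layout, and the sample values are assumptions made up for illustration. This is not the CloudPhysics analytic or its data.

```python
from statistics import mean

SEVERE_LATENCY_MS = 37.0  # threshold quoted in the post


def severely_impacted_orgs(periods, threshold_ms=SEVERE_LATENCY_MS):
    """Return the org ids that have at least one contention period whose
    average latency is at or above the threshold.

    `periods` is a list of (org_id, [latency samples in ms]) tuples.
    This layout is hypothetical, chosen only to keep the example short.
    """
    return {
        org
        for org, latencies in periods
        if latencies and mean(latencies) >= threshold_ms
    }


# Made-up contention periods; illustrative values only.
periods = [
    ("org-a", [12.0, 18.0, 15.0]),  # mild contention, filtered out
    ("org-a", [40.0, 55.0, 48.0]),  # severe
    ("org-b", [20.0, 25.0]),        # mild contention, filtered out
    ("org-c", [37.0, 37.0]),        # exactly at the threshold, kept
]

severe = severely_impacted_orgs(periods)
all_orgs = {org for org, _ in periods}
severe_share = len(severe) / len(all_orgs)

print(sorted(severe))
print(f"{severe_share:.0%} of organizations severely impacted")
```

Note that an organization counts as severely impacted if any one of its periods crosses the threshold, which is one plausible reading of the 55% figure above.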

The data also tells us that roughly 1 in 5 datastores experience contention daily – and if you experience contention once, you’re likely to experience it again:

  • 50% experience contention again within the same day
  • 27% experience it twice more within the same day
  • 5% experience it 10 more times within the day

Datastores are severely underutilized

It’s a common perception among administrators that resource contention happens only on heavily utilized datastores. But our numbers tell a different story. Across our 24-hour window, the average throughput across all datastores in our global set is only 4.7 MB/second and the average latency per I/O is only 5.9 ms. While 4.7 MB/s average throughput may seem like a big number for a large window like 24 hours, the median value is only 0.85 MB/s. This suggests there are periods where large bursts in throughput drive up the average.
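
A small sketch shows why a gap between average and median points to bursts. The samples below are invented for illustration (they are not CloudPhysics data): seven quiet intervals and one burst are enough to pull the average well above the median.

```python
from statistics import mean, median

# Hypothetical per-interval throughput samples (MB/s) for one datastore.
# Seven quiet intervals and one large burst; values are illustrative only.
samples_mb_s = [0.5, 0.7, 0.8, 0.9, 0.9, 1.0, 1.1, 32.0]

avg = mean(samples_mb_s)
med = median(samples_mb_s)

print(f"average: {avg:.2f} MB/s")
print(f"median:  {med:.2f} MB/s")

# Dropping the single burst brings the average back in line with the median.
quiet = [s for s in samples_mb_s if s < 10.0]
quiet_avg = mean(quiet)
print(f"average without the burst: {quiet_avg:.2f} MB/s")
```

The median describes what the datastore is doing most of the time, while the average is dominated by the rare burst.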

The average and median performance numbers we see can be effectively handled by a single IDE drive. Why, then, are datastores typically created on enterprise-grade high-performance disk drives with multiple spindles in RAID configuration and plenty of cache? Admins provision storage resources to cater to peak demand, and the irony is that despite this overprovisioning they’re unable to mitigate contention or performance degradation. Our data indicates that throwing more hardware at the problem is not the right solution.

All it takes is a few VM bullies to create havoc

Another common perception among administrators is that resource contention is triggered when multiple VMs do heavy I/O simultaneously (we call these bully VMs). Our analysis reveals there are only 1.2 bullies per datastore on average. However, each VM bully impacts 5 victim VMs on average. Further, VM bullies tend to be repeat offenders: each VM bully causes 2.2 contention events per day on average. And each contention “event” affects a random set of victim VMs, so there’s no way to predict where trouble will arise. The silver lining here is that relatively few VM bullies are causing havoc per datastore, so if you can locate and isolate them, it’s easy to mitigate contention.

One of the cool things about CloudPhysics is that we automatically identify VM bullies (and their victims), helping admins resolve contention issues very quickly. You can try it out for yourself – it takes about 5 minutes to get started.

Good luck neutralizing your VM bullies!