Rightsizing workloads correctly in the data center can be a dark art. Most tools available today look at vCPU utilization and vRAM usage over a day, week, month, or longer, and measure the workload's peak usage. However, peak workload usage can be misleading.
Looking at a history of vCPU utilization, we often start to see patterns across the course of a day and over the span of a week.
Let’s look at an instance where a VM with 2 vCPUs reports a 100% peak utilization.
Note the utilizations on this VM: we have daily peaks on the sparkline ranging from 27% at the start of the month to 13.86% at the end, with a reported peak of 100% somewhere in between. While the VM may have peaked at 100% sometime over the past 30 days, the reality is that its 99th percentile is a low 4.22%, and its mean utilization is 2.89%. At first glance, something does not add up.
Digging deeper starts to reveal some more interesting patterns.
The visual pattern tells us we have a 120% usage spike every hour. Note: vSphere reports 100% potential for each core, so a 32 vCPU VM has the potential for 3200%. At 120%, we are using 1.2 vCPU cores.
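The conversion between vSphere-style aggregate percentages and core counts can be sketched in a couple of lines. The helper names here are illustrative, not a vSphere API:

```python
def percent_to_cores(cpu_percent):
    """Convert vSphere aggregate CPU percent (100% per core) to cores used."""
    return cpu_percent / 100.0

def max_percent(vcpus):
    """Maximum reportable aggregate percent for a VM with this many vCPUs."""
    return vcpus * 100

print(percent_to_cores(120))  # 1.2 cores in use
print(max_percent(32))        # a 32 vCPU VM can report up to 3200
```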
Here we find the problem with rightsizing solely to peak utilization without understanding the usage trends. In this case, an array snapshot generated by a backup forces the VM to quiesce its memory cache to disk once per hour; this action generates a 120% CPU utilization spike lasting 20 seconds each time. Even under typical load, this VM tends to require only between 5% and 40% of one CPU core throughout the day.
Looking at a data center where the primary backup takes place once an hour all day for all VMs leaves us with a false peak utilization. This is where most rightsizing analytics fail the user. Rightsizing the workload to the peak leaves most workloads in the data center oversized.
CloudPhysics collects all its performance metrics at 2-second granularity. These metrics ensure that we don't count a 10-second snapshot as 5 minutes of 100% CPU utilization. Roll-ups and peaks provide a false sense of utilization; only with fine granularity and 99th and 95th percentile analytics do we start to understand the actual usage characteristics of a workload. Those 10-second and 20-second spikes can drive a rightsizing calculation off course.
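The effect of percentiles versus peaks is easy to demonstrate. The sketch below is a minimal simulation using the article's numbers as assumptions: one hour of 2-second samples, with a single 20-second spike at 120% and a steady 20% baseline otherwise. The spike dominates the peak but barely registers in the 99th percentile or the mean:

```python
import statistics

# 1800 samples at 2-second granularity: a 20-second spike at 120%,
# then a flat 20% baseline for the rest of the hour (assumed values).
samples = [120.0 if t < 20 else 20.0 for t in range(0, 3600, 2)]

peak = max(samples)
mean = statistics.mean(samples)

# 99th percentile: the value below which 99% of samples fall.
p99 = sorted(samples)[int(0.99 * len(samples)) - 1]

print(peak, p99, round(mean, 2))  # → 120.0 20.0 20.56
```

A 5-minute roll-up that stores only each window's maximum would instead report the entire first window as a 120% interval, which is exactly the distortion fine-grained sampling avoids.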
Jump up to larger VMs of 20 to 32 vCPUs, and we see the same logic of rightsizing to peak resulting in VMs that are still over-provisioned.
The following workload has 32 vCPUs in its configuration and 16GB of vRAM. Note that we have an extended anomaly in CPU usage on Jan 24th from 12:00 p.m. to 2:00 p.m.
At its peak usage, we see it only reaches 800%. Now, most rightsizing tools would tell us that 8 to 10 cores might be ideal had this been the daily trend. If a user had selected rightsize-to-peak plus 25% overhead for growth, we would see 10 vCPUs in the rightsizing report from other tools. Again, context is everything. With a mean CPU usage of between 1.5 and 2 vCPU cores (150%-200%), rightsizing to 10 vCPUs, or 1000%, would leave the VM considerably oversized.
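The arithmetic behind that gap is worth spelling out. This sketch applies the same 25% growth headroom to the anomalous peak and to the typical trend (using 175%, the midpoint of the stated 150%-200% range, as an assumed value):

```python
import math

peak_pct = 800   # anomalous peak: 8 cores' worth of aggregate CPU
mean_pct = 175   # assumed typical usage, midpoint of the 150-200% range

# Rightsize to peak + 25% headroom, rounded up to whole vCPUs.
peak_sized = math.ceil(peak_pct * 1.25 / 100)   # -> 10 vCPUs
# The same headroom applied to the actual trend instead.
mean_sized = math.ceil(mean_pct * 1.25 / 100)   # -> 3 vCPUs

print(peak_sized, mean_sized)  # → 10 3
```

Sizing to the one-time peak allocates more than three times the vCPUs the trend actually calls for.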
In this case, investigation of the event tells us this was a one-time event caused by an operating system software update. A quick disruption with a reboot and a short CPU spike looks like a heavy workload, but it was a unique event resulting from the OS finishing its update. Looking farther back and forward in time, we see the system trends towards only one to two vCPUs' worth of utilization. Surely this is not the best use of 32 cores!
Some may say, “So what? The hypervisor can share workloads, and an idle workload will only use what it needs.”
But there is a danger in this thinking. CPU Ready and CPU Co-Stop start to come into play with larger-than-necessary workloads. The time a VM requires to swap all its cores into the scheduler for its execution time can be greatly hindered by the number of vCPUs in the VM and the number of vCPUs allocated on the host. If you have too many large VMs on a host and you start to run into CPU Ready and CPU Co-Stop spikes, they add massive amounts of performance degradation to a workload. This is where uninformed admins start adding more cores to a workload in the hopes of increasing performance, but this action only compounds the problem. In the end, the users typically ask to be moved off the virtual host and onto a dedicated physical server. Now you have a 32-core OS on an idle dedicated server—wasting resources.
Rightsizing is indeed an art that requires more knowledge than just, “What is my peak CPU usage?” Look at the trends, know the context, and observe the 99th and 95th percentiles to ensure you are making the right choice.
Look at the CloudPhysics card, Cost Calculator for On-Prem IT, to see a great example of what resources are actually required for each workload at peak, 99th, and 95th percentiles. These same rightsizing practices could spare you hardware upgrades on premises or deliver significant savings ahead of a cloud migration.