Partly Cloudy with A Chance to Fill Up Disk Space: Storage Analytics for Datastore and Guest Disk Capacity Management

If you follow tech news, you’ve probably heard the buzz about predictive analytics. Facebook can predict when couples will break up. Amazon can predict, and ship, items you’re likely to purchase before you click the buy button. Target can predict pregnancy based on shopping patterns. These organizations now use big data not only to understand behavior but also to predict it.

Given these advances in the consumer world, you’d expect to see the same predictive power at work in the world of datacenter operations. The ability to accurately forecast behavior, especially for expensive resources like storage, could translate to tremendous gains in efficiency and cost savings. Surprisingly – and unfortunately – even simple storage-related operations such as “disk space used by the virtual machine” are difficult to predict and forecast.

In this blog I’ll highlight today’s reactive approach to storage capacity management, explain why it’s so difficult to predict storage capacity behavior and needs in the virtualized datacenter, and share what CloudPhysics is doing to make it much, much easier.

Today’s storage capacity management: reactive, not predictive

In virtualized infrastructure there are two levels of storage space needs: the space required by the guest operating systems and the space required by the hypervisor to store the virtual machines. Running out of disk space in either of these could lead to costly outages, but most administrators today don’t know when they will reach their capacity limits.

As a band-aid, admins set up monitoring solutions at the guest and hypervisor layers to track disk space usage, using static (and often arbitrary) thresholds such as “% free space” to trigger alerts. These thresholds say little about actual risk. For example, a 1TB disk drive that’s 90% full still has 100GB of free space and is unlikely to cause an immediate outage, so an alert triggered by this threshold is somewhat meaningless. Also, in many cases disk space usage remains fairly constant, making static threshold-based alerting useless. A case in point: thick provisioned virtual disks don’t grow and don’t require a lot of disk space headroom, so even when they are “90% full” there is no disk space capacity risk.
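To make the 1TB example concrete, here is a minimal Python sketch that compares a static percent-full rule with a simple headroom estimate. The function names, the 90% threshold default and the growth rates are illustrative assumptions, not part of any monitoring product.

```python
def percent_full_alert(capacity_gb, used_gb, threshold_pct=90.0):
    """Static rule: fire whenever usage reaches threshold_pct of capacity."""
    return 100.0 * used_gb / capacity_gb >= threshold_pct


def days_until_full(capacity_gb, used_gb, growth_gb_per_day):
    """Days of headroom at a constant growth rate; None if usage is not growing."""
    if growth_gb_per_day <= 0:
        return None
    return (capacity_gb - used_gb) / growth_gb_per_day


# A 1TB disk (taken as 1000 GB here) that is 90% full.
print(percent_full_alert(1000, 900))    # True: the static alert fires
print(days_until_full(1000, 900, 0.0))  # None: usage is flat, so there is no fill date
print(days_until_full(1000, 900, 0.5))  # 200.0: days of headroom at 0.5 GB/day
```

The static rule fires in both cases, even though one disk will never fill and the other has more than six months of headroom.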

As a result, most alerts are considered noise or false alarms, and routinely ignored or disabled. In the rare case that an alert points to an actual capacity issue, notification is sent after the problem occurs. So instead of predicting behavior, the admin is put in the position of reacting to behaviors and fighting fires.

Overprovisioning: preventive (and expensive), not predictive

In response to the ineffectiveness of capacity monitoring and alerting, overprovisioning storage has become a standard acceptable practice among administrators – and is largely a preventive measure. According to CloudPhysics’ global data set, most organizations maintain an ongoing buffer of 35% more storage capacity than they need. But the cost of prevention is very high, not just in real dollars but opportunity cost. Take a look at this chart:

Storage cost per GB (Backblaze)

The longer you wait, the lower your storage cost.

Storage cost (price per GB) is on a steady decline for both spinning and NAND drives, while the performance, reliability and feature sets of enterprise storage have been steadily increasing. For instance, several years ago storage vendors either did not provide dedupe functionality or charged a lot for it. Today most storage vendors provide dedupe and use SSD in some form or another. The number of storage choices has also increased tremendously over time. So delaying storage purchases not only defers capital expenditure but also gives you the opportunity to buy the latest and coolest technology.

Clearly, the ability to accurately predict storage capacity needs would enable you to delay storage purchase, avoid purchasing more than what’s truly necessary to support your virtual datacenter, and at the same time preempt capacity-induced downtime.

Why predictive is hard – and how CloudPhysics makes it easy

We all know virtualized datacenters are very dynamic. VMs are provisioned and deleted, and snapshots are created and deleted. VMs can be moved from one datastore to another with Storage vMotion, either by Storage DRS to balance performance or space usage, or when you put a datastore into maintenance mode. In addition, virtual machines may use thick provisioned virtual disks, thin provisioned virtual disks or linked clones, and each of these virtual disk types has different storage usage patterns. In such a dynamic environment, predicting future storage usage is complex, since changes in disk usage tend to be very bursty. As a result, simple linear extrapolation techniques based on models of disk usage growth rates don’t work very well.
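A small sketch shows why a straight-line fit struggles with bursty usage. It fits a least-squares line to daily usage samples and extrapolates to capacity. The function name and the sample numbers are made up for illustration.

```python
def linear_days_to_full(usage_gb, capacity_gb):
    """Fit a least-squares line to daily usage samples and extrapolate to capacity.

    Returns the number of days after the last sample at which the fitted line
    reaches capacity, or None if the fitted slope is not positive.
    """
    n = len(usage_gb)
    mean_x = (n - 1) / 2
    mean_y = sum(usage_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage_gb))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    return (capacity_gb - intercept) / slope - (n - 1)


flat = [500] * 9             # nine quiet days on a 1000 GB datastore
after_burst = flat + [700]   # day ten: a hypothetical 200 GB jump, e.g. a large snapshot

print(linear_days_to_full(flat, 1000))         # None: the line sees no growth at all
print(linear_days_to_full(after_burst, 1000))  # about 39.5 days
```

One day of extra data moves the forecast from "never" to roughly 39.5 days, and neither answer says anything about when the next burst will arrive.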

With CloudPhysics, all of that dynamism and complexity is captured over time and run through simulation techniques to predict future behavior, enabling admins to anticipate and plan proactively, not reactively – eliminating the need for overprovisioning in advance.

Forecasting space usage

For example, to forecast storage space usage, we take frequent snapshots of space usage and use them to determine space usage patterns and future behavior. Our predictive analytics can then give you a forecast with a probability, much like a weather forecast (see chart), and one that’s always updated based on the latest inputs.
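To give a feel for what a probability-style forecast can look like, here is a minimal Monte Carlo sketch. It is not the CloudPhysics algorithm, which this post does not describe. It simply resamples past daily changes in usage to estimate the chance of running out of space within a given horizon. The function name, parameters and sample history are all assumptions.

```python
import random


def fill_probability(daily_deltas_gb, free_gb, horizon_days, trials=10_000, seed=0):
    """Estimate the probability that free space is exhausted within horizon_days.

    Each trial replays horizon_days of growth by drawing daily changes at random
    (with replacement) from the observed history, so rare bursts stay in the mix.
    Negative deltas (deletions) are allowed and give space back.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        consumed = 0.0
        for _ in range(horizon_days):
            consumed += rng.choice(daily_deltas_gb)
            if consumed >= free_gb:
                hits += 1
                break
    return hits / trials


# Hypothetical 30-day history: mostly flat, with two bursts.
history = [0] * 27 + [50, 0, 120]
for horizon in (7, 30, 90):
    print(horizon, fill_probability(history, free_gb=300, horizon_days=horizon))
```

Rerunning the estimate as new samples arrive is what keeps a forecast like this current, much as a weather forecast changes with each new set of readings.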

Guest disk capacity

Capacity management: VMs at High Risk

At a glance, admins can see how many VMs are at risk of running out of space in the guest.

Likewise, our predictive analytics algorithm monitors guest space usage information (as reported by VMware Tools) and uses it to predict which VMs are at risk of running out of space in the guest. At a glance you get the count of VMs at risk over different time intervals, without the need to monitor and manage each guest individually.
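The roll-up behind an at-a-glance count can be sketched in a few lines of Python. The 7, 30 and 90 day horizons, the VM names and the predicted days-to-full values are hypothetical, and the prediction itself is assumed to come from elsewhere.

```python
def count_at_risk(days_to_full_by_vm, horizons=(7, 30, 90)):
    """Count VMs predicted to fill a guest disk within each horizon (cumulative)."""
    counts = {h: 0 for h in horizons}
    for days in days_to_full_by_vm.values():
        if days is None:  # no fill predicted for this VM
            continue
        for h in horizons:
            if days <= h:
                counts[h] += 1
    return counts


predicted = {"web-01": 3, "db-01": 25, "app-01": 80, "file-01": None, "build-01": 400}
print(count_at_risk(predicted))  # {7: 1, 30: 2, 90: 3}
```

The counts are cumulative, so a VM expected to fill within a week also shows up in the 30 and 90 day buckets.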

 

And you can drill down to an individual VM to see which partitions are likely to run out of disk space, and even identify unpartitioned space that you can leverage for expansion.

Capacity Management - Guest Disk Capacity

Admins can examine individual VMs and their risk-to-fill prediction for individual disk partitions.

Datastore capacity management

Today, if you have hundreds of datastores, you have to click on each one to find out its space usage. Wouldn’t it be convenient to get the total space usage across your entire datacenter? The datastore space card gives you this view. With one click you can get the total space usage, broken down by the space used by VMs and non-VM files. You can also get an overview of total reclaimable space broken down by its components.

 

Capacity Management: Datastore Space Usage

See datastore space usage datacenter wide, and how much disk space can be reclaimed.

And for each datastore you can drill down to its individual risk profiles for different fill levels. You can also get a history of past space usage and the top 10 files currently hogging disk space in that datastore.

 

Capacity Management: Datastore Space Usage

Drill down into each datastore for more specific storage analytics.

Predictive analytics is powerful and has the potential to radically change datacenter operations management. It is our mission at CloudPhysics to eliminate the tedium and make analytics that are simple and easy for virtual administrators to use. Check out our CloudPhysics Storage Analytics (request demo) to see how they can help you be more efficient, effective and proactive in storage capacity management.

Click here to sign up for Free Edition.

Additional information:

  • Our Storage Analytics run on VMware 4.1 and higher.
  • Information about Premium and Free Editions is here
