
Managing IT Capacity

Talking to all kinds of companies over the last year, both customers and IT equipment integrators and distributors, I keep hearing the same thing: new equipment has become very expensive, its availability is limited, and no supplier is willing to guarantee delivery times or order completeness. This makes it more important than ever for large companies to maximize the use of existing infrastructure, optimize the purchase of new equipment, and plan future purchases accurately.
In addition, for many companies the paradigm of how customers consume their goods and services has changed. Retail, for example, has had to shift its focus from the offline in-store sales model to online sales, which means completely different order management, payment processing, logistics, warehouse operations, and so on. What did companies run into? Systems designed for one load suddenly received a very different one. A lack of resources, as you can easily guess, hurts the performance of IT systems, and poor performance of IT systems in turn hurts the business. The end result is dissatisfied customers.

At the same time, systems that used to be fully loaded now run in a much more relaxed mode. But you cannot simply repurpose them: first, because no one knows they are free, and second, because, for example, they were purchased for the needs of another department.

Capacity management practices

So how can a business solve such a complex problem: improve the efficiency of existing resources, optimize future purchases, and, most importantly, do all of this with reference to actual current and future business needs rather than just guessing?

One of the answers is a well-known ITSM practice: the capacity management and demand management processes. What is it about, in a nutshell? It is about having a complete picture of the resources involved, knowing who is responsible for each area, and, most importantly, making future utilization predictable.
Under normal circumstances, predictability of load is achieved through two mechanisms: tracking historical demand (if at the end of every quarter the accounting department frantically compiles reports and balances, the load on the reporting systems at the end of each quarter will obviously increase) and a process for requesting and negotiating new capacity (if two groups of testers need the same environment to test new releases, there must be a process for requesting and coordinating capacity that lets everyone do what they need without bringing the system down).
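To make the first mechanism concrete, here is a minimal Python sketch with invented monthly utilization figures (not from any real monitoring product): it estimates the quarter-end uplift from history and projects the next peak.

```python
from statistics import mean

# Hypothetical monthly CPU utilization (%) of a reporting system;
# months 3, 6, 9 and 12 are the quarter-end closing periods.
history = {1: 42, 2: 45, 3: 71, 4: 44, 5: 46, 6: 74,
           7: 43, 8: 45, 9: 76, 10: 47, 11: 49, 12: 80}

quarter_end = {3, 6, 9, 12}
baseline = mean(v for m, v in history.items() if m not in quarter_end)
peak = mean(v for m, v in history.items() if m in quarter_end)
uplift = peak / baseline

print(f"baseline load: {baseline:.0f}%, quarter-end uplift: x{uplift:.2f}")
# Naive forward forecast: apply the uplift to the latest off-peak load.
print(f"expected next quarter-end peak: {history[11] * uplift:.0f}%")
```

Real capacity tooling does this across thousands of metrics, but the idea is the same: recurring business events leave a recognizable footprint in utilization history.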

How are things now and what are the problems?

The current conditions are, of course, unusual, so relying on historical demand alone may be of little use. But a general picture of resource utilization and an established process for requesting and providing capacity, as well as for releasing and reusing capacity that is no longer needed, would be extremely useful.

Many companies have this process in one form or another, with capacity requested and allocated manually, and almost all organizations, large and small, certainly have monitoring systems.
However, a classic problem looks like this: IT gets a request for a new virtual machine. The virtual infrastructure administrator requests resources from the storage administrator according to the request, say, 2 TB. The storage administrator provides 2 TB, the virtual machine is created, and it uses... 100 GB. Or 200 GB. Meanwhile, the storage administrator does not see the real usage; he sees 2 TB of allocated disk space. As a result, almost all of it sits idle (IDC statistics put the unused share of storage infrastructure at up to 40% on average). So what is missing, and how can the process be improved?
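Once allocated and actually used capacity are visible side by side, the waste is easy to quantify. A minimal Python sketch with an invented volume inventory (names and numbers are illustrative) that flags over-allocated volumes:

```python
# Hypothetical inventory: allocated vs. actually used capacity per volume, in GB.
volumes = [
    {"name": "vm-app-01", "allocated": 2048, "used": 100},
    {"name": "vm-app-02", "allocated": 2048, "used": 200},
    {"name": "vm-db-01",  "allocated": 4096, "used": 3500},
]

total_alloc = sum(v["allocated"] for v in volumes)
total_used = sum(v["used"] for v in volumes)
print(f"fleet utilization: {total_used / total_alloc:.0%}")

# Volumes using less than 25% of what was requested are reclamation candidates.
for v in volumes:
    if v["used"] / v["allocated"] < 0.25:
        print(f"{v['name']}: {v['used']} of {v['allocated']} GB used, review allocation")
```

The hard part in practice is not the arithmetic but collecting both numbers in one place: the "allocated" figure lives with the storage team, the "used" figure inside the guest systems.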

Why don’t classic monitoring systems work?

If you have 10 servers or one storage system, managing their load forecasting is not a problem: keep a spreadsheet in Excel and coordinate new loads with your colleague over tea at the kitchen table. But if you have 10,000 servers and dozens or hundreds of storage systems, and the colleagues responsible for the relevant infrastructure areas sit in different cities, the picture is far more complicated. Service desk systems and work schedules certainly help to tidy up the process, but again, everything is manual. Monitoring systems do solve the problem of load control, but when the infrastructure is large there are often several of them, not all share a single console, and, most importantly, they usually only provide data about the load and do not help with the questions that really concern the business and operations.

What monitoring systems give us:

  • Percentage of CPU, memory, disk space usage.
  • Network bandwidth and network load.
  • The utilization and state of the virtual and container infrastructure (usually in separate systems, by the way).
  • The presence of hardware and software faults.
  • In the case of advanced systems, they will also provide analytics on the causes of failures.
  • In the case of APM systems, they will give a picture of the application and user experience.

It seems great, so much data... But what do we really want to know? Of course we want to know that the servers are up, that memory is available, that the virtual machines are running and containers are created and migrated normally, but why do we need all this?

  • First and foremost, we care whether the end service works, and that means: is the user actually getting it?
  • If the service doesn’t work, why not?
  • Can we deploy the new service on the current infrastructure?
  • How will changing business requirements affect the infrastructure? What will need to be changed?
  • How much does it cost us to run the IT infrastructure and how much should we budget for next year, given the business plans?

Unfortunately, these questions are often either left unanswered, or the answers rest solely on the expert opinion of the responsible staff in the operations departments.

It is precisely this approach that leads to an uneven allocation of resources poorly matched to the real needs of various services and applications, to losses from downtime, and, at the same time, to buying extra equipment for infrastructure segments that do not really need it.

What does a large organization really need from a resource management system?

So it would be nice to have a single console that, in addition to the resource utilization data itself (preferably aggregated from all sources, so you do not have to look in this console and then separately in, say, vCenter), would also offer the following functions:

  • tying IT system metrics to business metrics (say, the number of orders to database hits and memory load, or site visitors to CPU and memory load on web and application servers; see the sketch after this list);
  • building forecasts for various scenarios in these business terms;
  • requesting and negotiating capacity provisioning, with assessment of risks and mutual impact;
  • recommending placement of new workloads or reallocation of resources;
  • reporting on resource utilization in relation to business functions (say, departments, branches, or customers in the case of a service provider).
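To illustrate the first two functions, here is a minimal Python sketch with invented numbers: it ties a business metric (orders per hour) to an IT metric (database CPU load) via a least-squares fit, then forecasts the load for a planned business scenario.

```python
# Relate a business metric to an IT metric with a simple least-squares fit,
# then forecast the load for a planned business scenario. Data is invented.
orders = [100, 200, 300, 400, 500]   # orders per hour
cpu = [15, 27, 41, 52, 66]           # observed DB CPU utilization, %

n = len(orders)
mx, my = sum(orders) / n, sum(cpu) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(orders, cpu))
         / sum((x - mx) ** 2 for x in orders))
intercept = my - slope * mx

# Business scenario: marketing expects 800 orders/hour during the next sale.
forecast = slope * 800 + intercept
print(f"~{slope:.3f}% CPU per order/hour; at 800 orders/hour expect {forecast:.0f}% CPU")
if forecast > 80:
    # Above a typical comfort threshold: time to request additional capacity.
    print("forecast exceeds 80% CPU, request additional capacity")
```

A real system would fit many such models automatically and with far more care, but the principle stands: the conversation with the business happens in orders and visitors, and the model translates that into CPUs, gigabytes, and budgets.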

Such a solution would potentially be of great value to large companies:

  • In the current environment, it would quickly show which services will be under the most strain, which will not cope, which resources are already busy, and where resources can quickly be taken from without affecting the performance of other systems.
  • Globally, it means a serious optimization of IT procurement, moving away from the "let's buy extra now, or next year they won't give us the budget" approach.
  • Transparency of spending on infrastructure, licenses and cloud services for IT management and the company.
  • An organized process of demand and capacity management, which means reducing the risk of business service downtime due to resource shortages.