I'm currently analyzing a process that is considered too slow.
In summary, it's a task that loads a lot of data from Microsoft SQL Server, performs some basic stuff on it, and creates a report. It currently takes from five to six hours, whereas the stakeholders expect it to take three to four hours.
Basic analysis showed that:
- The application server that runs the actual task is underused. The CPU usage stays very low (less than 5%), there is plenty of memory, and the SSD is not used much either.
- The database server seem to get most of the job. Its CPU usage occasionally averages 85% (although most of the time, it stays around 20%). The memory is used at 99%, but this is expected—the default behavior of SQL Server is to fill all the memory. The SSD is used roughly at 80%.
- The network usage, measured at the database server, varies a lot—sometimes it remains at nearly zero, and sometimes it peaks at 2.5 Gbps.
Now, what do I do next to understand where the bottleneck is, given that none of the resources seem to be used at 100%?
I guess there are three possible scenarios:
The network is the issue, and the machines are simply waiting for the data to be sent/received. As both the database server and the application server are on premises, and there are lots of other things going on between all other machines in the data center, if the router is busy with other traffic, it may only allow, say, 500 Mbps between the database server and the application server, whereas a few minutes later, it would make it possible to reach 2.5 Gbps.
There is not one, but multiple factors that cause slowness. For instance, at a given moment, the data may be ready to be sent by the database, but the network is slow; a few minutes later, the network is fast, but the new data is not ready and the CPU or the SSD become the bottleneck.
The issue is somewhere else—maybe something is just idling, while waiting for a lock.
How do I figure out which one of those scenarios is correct—and eventually find how to optimize the task by improving either the hardware or the actual task?