Now, what do I do next to understand where the bottleneck is, given that none of the resources seem to be used at 100%?
A fundamental problem is latency. If the CPU makes many small requests to the disk or to a database, the total time may be dominated by fixed per-request costs, such as round-trip time, rather than the time to actually transmit the data. This waiting does not show up as load in the performance metrics, since both sides spend most of their time waiting for each other and would have capacity for other work if such work were available.
The usual solution is to make fewer but larger requests or queries to reduce the effect of latency. Simply starting to process database results as they arrive, instead of waiting for the full result set, can also help a fair bit. But concurrency and request granularity can be complicated: while they can greatly help performance, they can also be difficult to get right.
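As a rough illustration of the batching point, here is a minimal Python sketch using the built-in sqlite3 module and a made-up `orders` table. SQLite is in-process, so the absolute numbers are not representative, but the shape of the problem is the same against a client/server database, where every query in the slow loop would pay a network round trip.

```python
import sqlite3

# Throwaway in-memory table purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i) for i in range(10_000)],
)

customer_ids = list(range(100))

# Anti-pattern: one query per customer. Each call pays the full per-request
# overhead; against a real database server that is a network round trip.
totals_slow = {}
for cid in customer_ids:
    row = conn.execute(
        "SELECT SUM(total) FROM orders WHERE customer_id = ?", (cid,)
    ).fetchone()
    totals_slow[cid] = row[0]

# Better: one larger query answered in a single round trip.
placeholders = ",".join("?" * len(customer_ids))
totals_fast = {}
for cid, total in conn.execute(
    "SELECT customer_id, SUM(total) FROM orders "
    f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id",
    customer_ids,
):
    # Iterating the cursor processes rows as they arrive instead of
    # materialising the whole result set first.
    totals_fast[cid] = total

assert totals_slow == totals_fast
```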
How do I figure out which one of those scenarios is correct—and eventually find how to optimize the task by improving either the hardware or the actual task?
Use the tools available to see if you can confirm or disprove any hypothesis:
- There are tools to simulate bandwidth restrictions and added latency. If artificially making the network worse has little effect, the network is probably not the bottleneck (see the proxy sketch after this list).
- 99% memory and 80% disk load is high enough that I would be concerned. Upgrading hardware can be relatively cheap and may be enough. Simulating added memory, disk, or CPU load and checking whether that hurts performance is another way to reveal potential bottlenecks.
- If a database is involved it should be one of the primary suspects. There are specialized tools to check for common problems, like swapping, missing indexes, lots of small queries, lock contention and so on. Check the documentation for your database for the specifics (a query-plan sketch follows after this list).
- There are CPU profilers to check what an application is doing and where it spends most of its time, even if CPU time is less likely to be the problem in your case (a minimal profiling example also follows the list).
- Just reading the source code can be illuminating, even if just to get a rough understanding of the overall code quality and what kind of problems you can expect.
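On the network point: real tools exist for this, for example Linux's `tc` with the netem queueing discipline, or a dedicated proxy such as Toxiproxy. Purely to illustrate the idea, here is a sketch of a tiny TCP proxy that adds a fixed delay to every chunk it forwards; the addresses, port, and delay are made-up values you would adapt. Point the application at the proxy instead of the real service: if the overall run time barely changes, network latency is probably not what is hurting you.

```python
import socket
import threading
import time

# Hypothetical addresses: point the application at LISTEN_ADDR instead of
# the real service at TARGET_ADDR to add artificial latency.
LISTEN_ADDR = ("127.0.0.1", 15432)
TARGET_ADDR = ("127.0.0.1", 5432)
ADDED_DELAY = 0.05  # seconds injected before each forwarded chunk


def pump(src, dst):
    """Copy bytes one way, sleeping before each chunk to add latency."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            time.sleep(ADDED_DELAY)
            dst.sendall(data)
    except OSError:
        pass
    finally:
        dst.close()


def handle(client):
    upstream = socket.create_connection(TARGET_ADDR)
    threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pump, args=(upstream, client), daemon=True).start()


def main():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(LISTEN_ADDR)
    server.listen()
    while True:
        client, _ = server.accept()
        handle(client)


if __name__ == "__main__":
    main()
```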
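For the database checks, most systems can show you the plan they intend to use for a query (PostgreSQL and MySQL have EXPLAIN, for instance). Here is a minimal sketch with SQLite through Python's built-in module, again using an invented `orders` table, showing a missing index being picked up:

```python
import sqlite3

# Throwaway in-memory table; real databases have their own EXPLAIN variants.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)

query = "SELECT SUM(total) FROM orders WHERE customer_id = 42"

# Without an index the plan is a full table scan ("SCAN orders").
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index the plan becomes a search using idx_orders_customer.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```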
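For CPU profiling, Python ships cProfile; other ecosystems have their own equivalents (perf on Linux, for example). A minimal sketch with hypothetical functions, where the profile output would point at the unexpectedly expensive helper:

```python
import cProfile
import pstats


def read_settings():
    # Stand-in for an unexpectedly expensive helper, e.g. one that
    # re-reads a configuration file on every call.
    return {str(i): i for i in range(1_000)}


def process_items(items):
    return [item * read_settings()["42"] for item in items]


def main():
    process_items(list(range(10_000)))


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()
    # Print the ten entries with the largest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```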
Chances are that the main limitation is the software design and architecture. Computers are ridiculously fast when used well, but most software development stops once something works well enough, and poor scaling may start to hurt as more data is accumulated. Ensuring the solution scales well is often not considered or tested enough.
It is not that uncommon to improve performance by orders of magnitude with some fairly simple fixes. But that requires a good enough understanding of the application to identify the real problems and to make sure it still works correctly after any change. Gaining that understanding can be expensive if the project lacks documentation and automated tests, and the original developers are gone.