
Software often specifies minimum hardware requirements. So how do we figure out what specs we need when we want to scale up our own systems?

For example, I'm working on a server performance monitoring setup where I have 100 servers that need to be monitored for the following metrics:

  1. CPU busy
  2. CPU cores
  3. total reads
  4. total writes
  5. RAM usage
  6. Disk I/O
  7. etc., assuming around 20 such metrics in total.

How can I find the minimum hardware requirements for such a scenario? That is, what RAM, disk, and CPU specifications do I need if:

  1. I use a single server running Prometheus and Grafana, scraping data from these 100 separate servers?
  2. I distribute the load across 2 or 3 servers running Prometheus (which scrape data from the 100 servers) and another server running Grafana that reads from them?

I tried starting from the sample size (~2 bytes per sample, as given in the Prometheus docs).

Then, by multiplying by the total time the systems will be up, I can calculate the total bytes; that's how I can find the storage requirements for Prometheus.

Like for my case: (100 servers x 20 metrics x 2 B/sample x (16 x 3600) seconds) / (5-second scrape interval)

= 46.08 MB for 16 hours of usage.
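Put as a tiny script (a sketch of my own back-of-the-envelope math; the ~2 bytes/sample figure comes from the Prometheus docs, everything else is my assumption):

    # Rough Prometheus storage estimate; all numbers are assumptions except
    # the ~2 bytes/sample figure from the Prometheus storage docs.
    servers = 100              # monitored hosts
    metrics_per_server = 20    # assumed number of metrics per host
    bytes_per_sample = 2       # approximate on-disk cost per sample
    scrape_interval_s = 5      # scrape interval in seconds
    uptime_s = 16 * 3600       # how long the systems are up (16 hours)

    samples = servers * metrics_per_server * (uptime_s / scrape_interval_s)
    storage_bytes = samples * bytes_per_sample
    print(f"{samples:,.0f} samples -> {storage_bytes / 1e6:.2f} MB")
    # 23,040,000 samples -> 46.08 MB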

I'm not asking for an exact answer here, but rather about how to make estimates and assumptions in such scenarios.

Thanks in advance!

  • Note to our community before people vote: this question is asking for an approach to figuring this out, which is perfectly on topic in my opinion. The question is not asking us to figure out the hardware requirements for the OP (which would be off-topic). In other words, this question is asking how to approach the problem versus "please give me the hardware specs." Commented Jun 4 at 13:20
  • One factor that you should consider first is how this estimate is going to be used. If you are deploying to a flexible (e.g. cloud provider) platform, you will likely be doing this for op-ex budgeting. If this is used to procure hardware, the implications of getting it wrong are different. The precision of your estimates and which side you should err on follow from those considerations. Commented Jun 5 at 18:21

2 Answers


You have to test.

It's hard to predict the bottlenecks for any program as it scales, so you really need to run the software in question at the given scale and then muck around with the parameters to see the minimum spec you can get away with.

That proves beyond a reasonable doubt that spec X will have performance Y.

You can simulate the load via load testing.
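For example (my own sketch, not part of the original answer; the metric names and port are made up), you can stand up cheap synthetic scrape targets with nothing but the Python standard library and register a few hundred of them in Prometheus to see how the server copes:

    # Minimal fake exporter: serves random values in the Prometheus text format
    # so Prometheus can scrape it like a real node. Run many instances (or many
    # ports) to simulate your 100 servers.
    import random
    from http.server import BaseHTTPRequestHandler, HTTPServer

    METRICS = [f"synthetic_metric_{i}" for i in range(20)]  # ~20 series per target

    class FakeExporter(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_response(404)
                self.end_headers()
                return
            body = "".join(f"{m} {random.random():.6f}\n" for m in METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 9101), FakeExporter).serve_forever()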

Check out this blog post/advert for Netdata where they do a performance comparison:

https://www.netdata.cloud/blog/netdata-vs-prometheus-2025/

I'm not sure the comparison is 100% fair, but you can see they are stressing the setup to the max, with memory consumption being one of the biggest limiters, and they stress the complexity of the Prometheus setup at this scale. You may be able to get significant performance improvements by tweaking your setup.

  • I would give another "+1" just for the missing "it depends". Commented Jun 18 at 16:35
  • Alright, hardware identification is totally based on actually testing the software. There's no mathematical equation that can give an exact answer. Thanks! Commented Jun 19 at 7:34
  • Hmm, there will be a function, but it's not going to be linear, and it's going to "depend" on the details of your particular setup. In the link, for example, you can see Prometheus is using the OpenMetrics protocol and is v2; that's going to make a massive difference to the hardware requirements. Commented Jun 19 at 7:46
  • @TanmaySharma: One reason that testing is necessary is that a lot of software can be configured to require more than its stated minimum requirements - a rather infamous one would be Crysis: when it came out, you could run it, just not at the maximum settings. Hence, you effectively treat it like Crysis, get it onto your environment at the scale you want, and test whether it works at the required performance. Commented Jun 20 at 23:56
  • The JVM is not a recognised uniform... Commented Jun 21 at 10:13

It depends on the type of software you are benchmarking.

Example 1

Let's imagine that the piece of software is a mathematical computation, that is, moreover, pretty straightforward. By straightforward, I mean that:

  • It will take exactly N seconds to compute a result for a given input on a given piece of hardware, using the CPU at 100%, with memory increasing linearly from M₁ to M₂ during the process.

  • It will take 0.52 × N seconds and 0.29 × N seconds respectively when executed in parallel on respectively 2 and 4 cores.

  • It will take N/d + 25 seconds when executed on d machines in parallel, with 2 ≤ d ≤ 200. The 25 seconds here are essentially the time it takes for the map-reduce controller to distribute the load, and collect and combine the results.

It wouldn't be very hard to figure out how many machines you need to provision to do the computation in a given amount of time (see the sketch below). The only issue is that this works only for a given range of d. Put in too many servers, and things get ugly; eventually, adding more machines makes the process slower.
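As a sketch of that calculation (with N and the overhead being the hypothetical numbers above, nothing measured):

    # Toy model from the bullets above: t(d) = N / d + overhead, valid for 2 <= d <= 200.
    N = 3600.0        # hypothetical single-machine runtime, in seconds
    OVERHEAD = 25.0   # map-reduce controller: distribute, collect, combine

    def runtime(d: int) -> float:
        """Predicted wall-clock time on d machines, within the model's range."""
        if not 2 <= d <= 200:
            raise ValueError("model only holds for 2 <= d <= 200")
        return N / d + OVERHEAD

    def machines_needed(budget_s: float) -> int:
        """Smallest d in the valid range whose predicted runtime fits the budget."""
        for d in range(2, 201):
            if runtime(d) <= budget_s:
                return d
        raise ValueError("budget not reachable within the model's range")

    print(machines_needed(60.0))  # 103 machines to finish within one minute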

Example 2

Now let's imagine an ordinary web application. Say, an e-commerce website. You host it on a farm of ten servers and, for a week, collect all the metrics you consider relevant. It looks pretty solid, although you notice that it's not very linear: your servers idle most of the time at 4 a.m., but there is quite a lot of activity at 8 p.m. You also get much more activity on Wednesday evening and on Saturday.

You do the math and try to forecast the usage for the next week.

And then, things start to break apart quickly.

  • You got all your servers at 100% at 3 a.m. on Tuesday. That's right. There was a DoS attack. Luckily, it stopped before the next morning.
  • On Thursday, you got a lot of activity. Guys from marketing didn't tell you? They spammed all the potential customers. I meant—sent promotional emails.
  • On Friday, 100% again. Not because of another DoS attack; rather, five of the ten servers installed an automatic update and got stuck in an infinite reboot loop. Absolutely theoretical case, absolutely not mentioning CrowdStrike here.
  • On Saturday, server activity in the range of 0% – 1%. Actually, a new release of the website introduced a regression: every customer sees an HTTP 404 for every requested page (and yes, it doesn't take many resources to send an HTTP 404 from cache).

If you're in this situation, you can't just have a mathematical model that predicts how the application will behave with a given number of servers, because there is a huge number of parameters to consider, and some of them are not even known in advance: you usually cannot predict a DDoS attack or a CrowdStrike update bug, or the fact that you'll suddenly hit the front page of a popular social network as the greatest website to buy the stuff you sell.

What you can have is:

  • Active monitoring that tells you: “hey, you'll have a load problem right now, and by right now, I mean... in a few seconds.”
  • A way to react to this monitoring by scaling up (and then possibly scaling down to save money). You do that either by having enough spare servers sitting in your data center waiting to be turned on, or by delegating this task to a cloud provider, whichever is cheaper. (A toy sketch of such a reaction follows after this list.)
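Something like this (a toy of my own, with thresholds and limits pulled out of thin air), just to show the shape of the decision:

    # Threshold-based scaling decision: grow fast under load, shrink slowly when idle.
    def desired_servers(current: int, cpu_utilisation: float,
                        minimum: int = 2, maximum: int = 50) -> int:
        """Return the new pool size given the average CPU utilisation (0.0-1.0)."""
        if cpu_utilisation > 0.80:      # load problem "right now": add capacity quickly
            target = current * 2
        elif cpu_utilisation < 0.20:    # mostly idle: shrink one at a time to save money
            target = current - 1
        else:
            target = current
        return max(minimum, min(maximum, target))

    print(desired_servers(10, 0.92))  # -> 20
    print(desired_servers(10, 0.10))  # -> 9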
  • This scenario is like one of those edge cases, mainly about encountering faults or overuse. What I'm keen to ask is: if a piece of software is scaled up for heavy use, we will need more storage and more computation. So how can we estimate how much hardware I'll need for the program to run smoothly? And thanks for the parallel distribution you explained in example 1, really helpful! Commented Jun 6 at 6:21
  • If you're in the second case, any estimation would be inexact, probably even very inexact—again, there is too much happening to have a model that works well. This is also why the ability to scale up and down very fast is so important. Commented Jun 6 at 9:10
