
We are in the process of figuring out which memory option would be the best for our needs. Our requirements are simple:

  1. Bandwidth: BW TB/s, say greater than 20TB/s
  2. Capacity: C TB, say 20TB

Aim: which memory technology can meet these demands at the lowest power? It is alright if the power numbers are not extremely accurate.

Memory Options:

  1. DDR5
  2. HBM3 or 3e
  3. LPDDR5

There are two ways to go about calculating the power:

  1. Energy per bit (pJ/bit): take the energy needed to fetch one bit from the memory and multiply it by the total amount of data to be fetched.
  2. Average power: find an average power number, say for a single DDR5 DIMM, and multiply it by the number of DIMMs required to meet the capacity and bandwidth requirements, i.e. max(BW / (single-DIMM BW), C / (single-DIMM capacity)).

The main challenge: finding reliable power numbers, either on the internet or even through simulators.
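As a sketch, the second method can be written out directly. All per-DIMM numbers below are assumed, illustrative values; substitute real datasheet figures.

```python
import math

# Requirements from the question
BW_REQ_TBS = 20.0   # required bandwidth, TB/s
CAP_REQ_TB = 20.0   # required capacity, TB

# Hypothetical per-DIMM DDR5 figures (assumed, not from a datasheet)
DIMM_BW_TBS = 0.0512   # e.g. one DDR5-6400 channel: ~51.2 GB/s
DIMM_CAP_TB = 0.064    # e.g. a 64 GB module
DIMM_PWR_W  = 10.0     # assumed average active power per DIMM

# The DIMM count is set by whichever requirement binds (here, bandwidth)
n_dimms = max(math.ceil(BW_REQ_TBS / DIMM_BW_TBS),
              math.ceil(CAP_REQ_TB / DIMM_CAP_TB))
total_power_w = n_dimms * DIMM_PWR_W
print(n_dimms, total_power_w)
```

With these assumed numbers the bandwidth requirement dominates, which is typical when the BW/capacity ratio is as high as 1 per second.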

For example, the paper here gives the following energy-per-bit numbers:

[figure: energy-per-bit table from the cited paper]

whereas the YouTube video here gives average power at different operating clock frequencies, from which I get a figure of around 19 pJ/bit.
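The conversion behind such a figure is just energy per bit = average power / bit rate. The values below are assumed round numbers chosen to land near 19 pJ/bit, not the video's actual figures:

```python
# pJ/bit from an average-power measurement (illustrative, assumed inputs)
power_w = 4.9            # assumed average module power, W
data_rate_gbps = 256.0   # assumed sustained throughput, Gb/s

# W / (bit/s) = J/bit; scale to picojoules
pj_per_bit = power_w / (data_rate_gbps * 1e9) * 1e12
print(round(pj_per_bit, 1))
```

Note the result is only meaningful if the power and the data rate were measured under the same workload; idle power folded into the numerator inflates the pJ/bit figure.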

I understand that pJ/bit numbers vary widely with operating frequency, and even with the capacity of the memory macro.

How do I go about getting a good enough estimate for the problem above and what numbers to rely upon?

  • It might not be only about memory. The MPU (i.e. the memory controller) accounts for a big share of the energy use. My tip: find an evaluation board that matches your specs and go from there. (Commented Jan 2, 2024 at 11:08)
  • Could you elaborate on why you need TB/s levels of bandwidth? That's the sort of bandwidth I wouldn't expect to see outside of facilities like CERN, if at all. (Commented Apr 11 at 23:03)
  • Can you elaborate on what devices you want the memory connected to? E.g. FPGAs, GPUs or Intel x86_64 processors? What is "best" for the overall system will probably depend on what devices the memory needs to be connected to. (Commented Apr 12 at 8:30)

2 Answers


How do I go about getting a good enough estimate for the problem above and what numbers to rely upon?

This is a system problem, and it's about more than energy per bit. How often refresh occurs and the energy cost of moving the data should also be considered. On top of that, energy use is highly application dependent: random versus sequential access patterns can have very different energy costs on different modules. The processor matters too, because things like cache misses determine how often it needs to pull in a new page from RAM.
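As a rough sketch of that point, a first-order application-level energy model sums an access term and a refresh term over the workload's runtime. Every number below is assumed purely for illustration:

```python
# First-order memory energy model: total = access energy + refresh energy.
# All inputs are assumed, illustrative values.
bits_moved = 1e15      # bits transferred by the workload (assumed)
pj_per_bit = 19.0      # array + I/O access energy, pJ/bit (assumed)
refresh_w  = 50.0      # total refresh power across all devices, W (assumed)
runtime_s  = 100.0     # workload runtime, s (assumed)

access_j  = bits_moved * pj_per_bit * 1e-12  # pJ -> J
refresh_j = refresh_w * runtime_s
print(access_j, refresh_j, access_j + refresh_j)
```

Even this crude split shows why access pattern matters: a workload that moves few bits but runs a long time is dominated by the refresh (standby) term, not the pJ/bit figure.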

One thing to consider is the packaging. HBM is intended for in-package integration (stacked on an interposer next to the compute die), whereas DDR4/DDR5 is available both as DIMMs and soldered down. So if you require DIMMs, as you mention, HBM isn't an option. Another complication is that the two are hard to compare directly, because the DDR4/DDR5 channel interface consumes some energy of its own.

That being said, if you are doing an in-package design, I would probably select HBM3, based on this product documentation (which also helps fill in the table above with the numbers you don't have):

Micron HBM3E has an industry best data rate of >9.2 Gb/s and 24GB capacity in an 8-high cube, resulting in >1.2 TB/s bandwidth with 2.5X performance/watt compared to the previous generation HBM2E. Source: https://wccftech.com/micron-introduces-12-high-hbm3e-and-lpddr5x-based-socamm/
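Using the quoted HBM3E figures (>1.2 TB/s and 24 GB per 8-high cube), a quick sketch shows which of the question's two requirements drives the device count:

```python
import math

# Per-cube figures from the Micron HBM3E quote above
CUBE_BW_TBS = 1.2      # TB/s per 8-high cube
CUBE_CAP_TB = 0.024    # 24 GB per cube

BW_REQ_TBS, CAP_REQ_TB = 20.0, 20.0  # the question's targets

cubes_for_bw  = math.ceil(BW_REQ_TBS / CUBE_BW_TBS)    # 17 cubes
cubes_for_cap = math.ceil(CAP_REQ_TB / CUBE_CAP_TB)    # 834 cubes
print(cubes_for_bw, cubes_for_cap)
```

For HBM the capacity target is the binding constraint by a factor of nearly 50, the mirror image of the DIMM case where bandwidth binds.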

Benchmarks are also helpful, but only if the system you want to build is similar.

For Xeon processors, HBM came first in energy benchmarks:

[figures: energy benchmark results for HBM vs. DDR5 on Xeon, from the source below]
Source: https://dl.acm.org/doi/pdf/10.1109/SCW63240.2024.00182


This is not feasible.

Looking at the requested bandwidth of 20 TB/s at a capacity of 20 TB, nothing on the market today can achieve this.

To achieve the bandwidth you would need about 24 HBM3 devices; however, that would only get you 576 GB of memory, so you would actually need over 830 devices to reach the capacity!
If you look at modern graphics cards, they are in the range of 2 TB/s, so you would need 10 of them. Even then, you would still be nowhere near the 20 TB capacity requirement.

If we expand our thinking to a larger system, we can examine the interconnects to see whether a networked set of devices could provide the necessary bandwidth, with the memory on a different device and accessed over the bus. A PCIe Gen 7 x16 link (which does not yet exist) has ~256 GB/s of bandwidth; an 800GbE Ethernet link has 100 GB/s. It would take roughly 80 PCIe Gen 7 x16 links to meet the bandwidth.
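The link-count arithmetic in the paragraph above can be sketched directly, using the same per-link figures:

```python
import math

BW_REQ_TBS = 20.0  # aggregate bandwidth target, TB/s

# Per-link bandwidths from the answer (PCIe Gen 7 is not yet shipping)
PCIE7_X16_TBS = 0.256  # ~256 GB/s per x16 link
ETH800_TBS    = 0.100  # 100 GB/s per 800GbE link

pcie_links = math.ceil(BW_REQ_TBS / PCIE7_X16_TBS)
eth_links  = math.ceil(BW_REQ_TBS / ETH800_TBS)
print(pcie_links, eth_links)
```

That is roughly 80 PCIe links or 200 Ethernet links in aggregate, and it assumes perfectly striped, fully utilized links, which no real fabric delivers.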

Given that neither modern memory devices nor interconnects can support this bandwidth/capacity combination, this is not feasible.

  • You could achieve this with multiple processors in a cluster. (Commented Apr 13 at 3:23)
  • For capacity, yes. But if you needed 20 TB/s of bandwidth to a random location, the interconnect would not be capable enough. (Commented Apr 13 at 10:19)
