
I am reading about the NUMA (Non-Uniform Memory Access) architecture. It looks like this is a hardware architecture for multiprocessor systems in which each core can access its own local memory faster than remote memory.

What I don't understand is this: it looks like the main memory (RAM) is also divided between the nodes. That confuses me, because I would think all the nodes (which sit inside the same CPU package) have the same access speed to the main memory. So why does Linux divide the main memory between the nodes?

  • Particularly on big systems with multiple CPU chips (not multiple CPUs on a single chip), you will find that the main memory is divided up and each group of RAM chips is connected directly to a single CPU chip. Commented Apr 30, 2020 at 22:22
  • the "main memory" that you mentioned is RAM right? Can you give me an example of multiple cpu chips system? Commented May 3, 2020 at 18:26

1 Answer


Disclaimer: Two days ago I had no idea what NUMA was. I had to learn all of this yesterday after acquiring a NUMA platform.

The nodes do not have the same access speed

I would think all the nodes (which sit inside the same CPU package) have the same access speed to the main memory. So why does Linux divide the main memory between the nodes?

The quick answer to your question is that Linux divides the main memory between the nodes because each node has its own dedicated memory controller for its portion of the main memory. Your premise that all nodes have the same access speed is false.
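
A quick way to see this split on your own machine (assuming the numactl package is installed; node counts and sizes will of course differ per system) is to ask the kernel directly:

    # List the nodes Linux knows about, the CPUs and memory attached to each,
    # and the relative access "distance" between nodes.
    numactl --hardware

    # The same information is exposed per node in sysfs.
    cat /sys/devices/system/node/node0/meminfo
    cat /sys/devices/system/node/node0/distance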

Example

As an example I will use my single AMD Opteron 6386 SE system with 128 GiB of RAM. This 16-core1 processor really consists of two separate dies with a high-speed interconnect, and can be treated as two separate processors in the same physical package.

AMD Documentation

From AMD's description of my architecture in the catchy BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 15h Models 00h-0Fh Processors we have on page 35, figure 2:

A Dual-Node Processor

  • One package (physical, this is what you mount in the socket)
  • …containing two nodes (one node equals a separate die)
  • A node contains four compute units
  • A compute unit has two integer cores that share an FPU and an L2 cache
  • Each node has its own northbridge
  • Each northbridge has two DDR memory channels

This illustrates that, for a core in node 0 to use the RAM connected to DDR channel C or D, it must go through the northbridge in node 1.

My system has eight 16 GiB physical sticks of RAM, so each memory channel (A-D) has access to 32 GiB.
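
A rough way to observe that detour yourself is to pin a memory-bound program to node 0's cores and compare local against remote allocations. This is only a sketch, and ./membench is a placeholder for whatever memory-heavy workload you have at hand:

    # Run on node 0's cores with memory forced onto node 0 (local access).
    time numactl --cpunodebind=0 --membind=0 ./membench

    # Same cores, but memory forced onto node 1 (remote access via the other
    # node's northbridge).
    time numactl --cpunodebind=0 --membind=1 ./membench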

How it looks in Linux

Since I have enabled support for transparent NUMA in my BIOS, Linux thinks my computer looks like this:

Topology from lstopo

I generated the image using the lstopo2 command from the hwloc package.

Here the hierarchy is clear: Linux knows that I have two nodes, each having local access to a total of 64 GiB of RAM (give or take a GiB or two). We can see how it's further divided into each node sharing one L3 cache, each compute unit having its own L2 and L1 instruction cache, and each core having a dedicated L1 data cache.
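
If you prefer text to a picture, roughly the same hierarchy can be read from the terminal (lstopo-no-graphics also comes from the hwloc package, lscpu from util-linux):

    # Text-mode version of the diagram above.
    lstopo-no-graphics

    # Cache sizes plus the NUMA-node-to-CPU mapping.
    lscpu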

It is important that the kernel knows about this layout, because for a process running in "the left half" of this CPU to use memory allocated in "the right half", it has to jump through some hoops. The cost of doing so can range from barely noticeable to bogging everything down, depending on how busy the rest of the system is.
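
If you know a workload fits within one node, you can avoid those hoops by pinning both its CPUs and its memory to the same node; ./myapp below is just a placeholder:

    # Keep the process and all of its allocations on node 0.
    numactl --cpunodebind=0 --membind=0 ./myapp

    # Afterwards, check which node its pages actually ended up on.
    numastat -p $(pidof myapp)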

How it could look in Linux - Node Interleaving

There is an option on my server to "disable" NUMA and treat the whole package as one CPU with 16 cores and 128 GiB of RAM. It is my understanding that it does this by interleaving memory addresses between the nodes, so that a process running on one node will see roughly half of its memory come from node 0 and half from node 1.

Why would you want that? First, it may be necessary if your operating system or workload is not NUMA-aware3. Second, it can actually be difficult to allocate memory correctly. If your workload has a lot of inter-process communication, it is not always possible to find an optimal layout. You could also end up with a process on one node doing a lot of I/O to hardware connected to the other node. Spreading the RAM evenly across the nodes ensures that at least half of the accesses will be local, so you will not hit a worst-case scenario.
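
You do not need the BIOS option to get that behaviour for a single program: numactl can interleave allocations per process, which is a handy middle ground when you cannot reason about placement (./myapp is again a placeholder):

    # Spread this process's memory round-robin across all nodes, similar in
    # spirit to what BIOS node interleaving does system-wide.
    numactl --interleave=all ./myapp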


Footnotes:
1. Intel would call it 8/16-core, see this lawsuit
2. To make it fit I used lstopo --no-io --no-legend --no-index=PU,core --gridsize 5 --no-attrs=cache --horiz
3. The architecture is from 2011, possibly used to replace legacy hardware while keeping the same OS.

  • Many thanks. This is a great article. I have understood many things. I just found this blog post on LWN.net that explains different architectures: lwn.net/Articles/250967 Commented Jun 9, 2020 at 14:34
  • Just one more question: I ran the hwloc command on my MacBook Pro (2018) and it shows only 1 node. I searched and people say that macOS doesn't support NUMA. In my case, is that because my CPU doesn't have NUMA, or because of the operating system? I ask because my MacBook is pretty new, so I think it might have a NUMA architecture behind it. Thanks. Commented Jun 9, 2020 at 14:36
  • @hqt Pretty sure your CPU doesn't use NUMA. From what I could guess you have a Coffee Lake-H CPU. It only has two memory channels. Commented Jun 9, 2020 at 17:12
  • Understood. Thanks so much again for your great post. :D Commented Jun 9, 2020 at 18:41
  • Just another dumb question: why does a processor need more than 1 die (meaning more than 1 node in NUMA)? I guess because we cannot put many CPUs on the same die? Or, if we put many CPUs on the same die, will there be a bottleneck? Commented Jun 9, 2020 at 19:09
