Disclaimer: Two days ago I had no idea what NUMA was. I had to learn all of this yesterday after acquiring a NUMA platform.
The nodes do not have the same access speed
I think all the nodes (which are inside the same CPU) will have the same access speed to the main memory. So why does Linux divide the main memory for each node?
The quick answer to your question is that Linux divides the main memory between the nodes because each node has its own dedicated memory controllers for its share of the external (main) memory. Your premise that all nodes have the same access speed is false.
Example
As an example I will use my single AMD Opteron 6386 SE system with 128 GiB of RAM. This 16-core1 processor really consists of two separate dies with a high-speed interconnect, and can be treated as two separate processors in the same physical package.
AMD Documentation
From AMD's description of my architecture in the catchily named BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 15h Models 00h-0Fh Processors, we have, on page 35, figure 2:

- One package (physical, this is what you mount in the socket)
- …containing two nodes (one node equals a separate die)
- A node contains four compute units
- A compute unit contains two integer cores, which share an FPU and an L2 cache.
- Each node has its own northbridge
- Each northbridge has two DDR memory channels
This illustrates that, for a core in node 0 to use the RAM connected to DDR channel C or D, it must go through the northbridge in node 1.
My system has eight 16 GiB physical sticks of RAM, so each memory channel (A-D) has 32 GiB attached.
How it looks in Linux
Since I have enabled support for transparent NUMA in my BIOS, Linux thinks my computer looks like this:

I generated the image using the lstopo2 command from the hwloc package.
Here the hierarchy is clear: Linux knows that I have two nodes, each with local access to a total of 64 GiB of RAM (give or take a GiB or two). We can also see the finer divisions: all the cores in a node share one L3 cache, each compute unit has its own L2 and L1 instruction cache, and each core has a dedicated L1 data cache.
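If you would rather query this layout from a program than from lstopo, the libnuma library exposes the same information. This is just a minimal sketch of my own, not part of the setup above; it assumes libnuma and its headers are installed and compiles with something like gcc topo.c -lnuma:

```c
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "No NUMA support on this system\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    printf("configured NUMA nodes: %d\n", nodes);

    /* Per-node memory, roughly matching what lstopo draws next to each node. */
    for (int n = 0; n < nodes; n++) {
        long long free_b;
        long long size_b = numa_node_size64(n, &free_b);
        printf("node %d: %lld MiB total, %lld MiB free\n",
               n, size_b >> 20, free_b >> 20);
    }

    /* The relative access cost the kernel assumes (10 means local by convention). */
    for (int a = 0; a < nodes; a++)
        for (int b = 0; b < nodes; b++)
            printf("distance node%d -> node%d: %d\n", a, b, numa_distance(a, b));

    return 0;
}
```

On a two-node machine like this one, the distance table shows 10 for local access and a larger value for access that has to cross the interconnect.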
It is important that the kernel knows about this layout, because for a process running in "the left half" of this CPU to use memory allocated in "the right half", it has to jump through some hoops. This can range from fast enough not to notice to slow enough to bog everything down, all depending on how busy the rest of the system is.
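To make those "hoops" concrete, here is a rough sketch of my own (the buffer size is arbitrary, and it assumes the same libnuma setup as above): it pins itself to node 0, then writes to memory placed on node 0 versus node 1. On a two-die package like the Opteron above, the second write has to go through the other node's northbridge. How much slower it comes out depends on the hardware and, as noted, on how busy the rest of the system is.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define SZ (512UL * 1024 * 1024)   /* 512 MiB per buffer */

static double touch(void *buf)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memset(buf, 0xA5, SZ);          /* force every page to be written */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need at least two NUMA nodes\n");
        return 1;
    }

    numa_run_on_node(0);                          /* keep execution on node 0 */
    void *local  = numa_alloc_onnode(SZ, 0);      /* ask for RAM behind node 0 */
    void *remote = numa_alloc_onnode(SZ, 1);      /* ask for RAM behind node 1 */
    if (!local || !remote) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    printf("local  write: %.3f s\n", touch(local));
    printf("remote write: %.3f s\n", touch(remote));

    numa_free(local, SZ);
    numa_free(remote, SZ);
    return 0;
}
```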
How it could look in Linux - Node Interleaving
There is an option on my server to "disable" NUMA and treat the whole package as one CPU with 16 cores and 128 GiB of RAM. It is my understanding that it does this by interleaving memory addresses between the nodes, so that a process running on one node will get roughly half of its memory from node 0 and half from node 1.
Why would you want that? First, it may be necessary if your operating system or workload is not NUMA-aware3. Second, it can actually be difficult to allocate memory correctly: if your workload has a lot of inter-process communication, it is not always possible to find an optimal layout, and you could also end up with a process on one node doing a lot of I/O to hardware connected to the other node. Spreading the RAM evenly across the nodes ensures that at least half the accesses will be local, so that you never hit a worst-case scenario.
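A program can also opt into the same spreading per allocation instead of flipping the BIOS switch. The sketch below is again my own (arbitrary buffer size, same libnuma assumption as above); it uses libnuma's interleaved allocation so that the pages of one buffer are placed round-robin across the nodes:

```c
#include <stdio.h>
#include <string.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "No NUMA support on this system\n");
        return 1;
    }

    size_t sz = 256UL * 1024 * 1024;              /* 256 MiB */
    void *buf = numa_alloc_interleaved(sz);       /* pages spread across all nodes */
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    memset(buf, 0, sz);   /* first touch actually places the pages */
    printf("256 MiB interleaved across %d node(s)\n",
           numa_num_configured_nodes());

    numa_free(buf, sz);
    return 0;
}
```

The numactl utility can impose the same policy on an unmodified program from the command line with numactl --interleave=all.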
Footnotes:
1. Intel would call it an 8-core/16-thread part; see this lawsuit.
2. To make it fit I used lstopo --no-io --no-legend --no-index=PU,core --gridsize 5 --no-attrs=cache --horiz
3. The architecture is from 2011, possibly used to replace legacy hardware while keeping the same OS.