33

I have three arrays that need to be combined in one three-dimension array. The following code shows slow performance in Performance Explorer. Is there a faster solution?

for (int i = 0; i < sortedIndex.Length; i++) { if (i < num_in_left) { // add instance to the left child leftnode[i, 0] = sortedIndex[i]; leftnode[i, 1] = sortedInstances[i]; leftnode[i, 2] = sortedLabels[i]; } else { // add instance to the right child rightnode[i-num_in_left, 0] = sortedIndex[i]; rightnode[i-num_in_left, 1] = sortedInstances[i]; rightnode[i-num_in_left, 2] = sortedLabels[i]; } } 

Update:

I'm actually trying to do the following:

//given three 1d arrays double[] sortedIndex, sortedInstances, sortedLabels; // copy them over to a 3d array (forget about the rightnode for now) double[] leftnode = new double[sortedIndex.Length, 3]; // some magic happens here so that leftnode = {sortedIndex, sortedInstances, sortedLabels}; 
1

5 Answers 5

77

Use Buffer.BlockCopy. Its entire purpose is to perform fast (see Buffer):

This class provides better performance for manipulating primitive types than similar methods in the System.Array class.

Admittedly, I haven't done any benchmarks, but that's the documentation. It also works on multidimensional arrays; just make sure that you're always specifying how many bytes to copy, not how many elements, and also that you're working on a primitive array.

Also, I have not tested this, but you might be able to squeeze a bit more performance out of the system if you bind a delegate to System.Buffer.memcpyimpl and call that directly. The signature is:

internal static unsafe void memcpyimpl(byte* src, byte* dest, int len) 

It does require pointers, but I believe it's optimized for the highest speed possible, and so I don't think there's any way to get faster than that, even if you had assembly at hand.


Update:

Due to requests (and to satisfy my curiosity), I tested this:

using System; using System.Diagnostics; using System.Reflection; unsafe delegate void MemCpyImpl(byte* src, byte* dest, int len); static class Temp { //There really should be a generic CreateDelegate<T>() method... -___- static MemCpyImpl memcpyimpl = (MemCpyImpl)Delegate.CreateDelegate( typeof(MemCpyImpl), typeof(Buffer).GetMethod("memcpyimpl", BindingFlags.Static | BindingFlags.NonPublic)); const int COUNT = 32, SIZE = 32 << 20; //Use different buffers to help avoid CPU cache effects static byte[] aSource = new byte[SIZE], aTarget = new byte[SIZE], bSource = new byte[SIZE], bTarget = new byte[SIZE], cSource = new byte[SIZE], cTarget = new byte[SIZE]; static unsafe void TestUnsafe() { Stopwatch sw = Stopwatch.StartNew(); fixed (byte* pSrc = aSource) fixed (byte* pDest = aTarget) for (int i = 0; i < COUNT; i++) memcpyimpl(pSrc, pDest, SIZE); sw.Stop(); Console.WriteLine("Buffer.memcpyimpl: {0:N0} ticks", sw.ElapsedTicks); } static void TestBlockCopy() { Stopwatch sw = Stopwatch.StartNew(); sw.Start(); for (int i = 0; i < COUNT; i++) Buffer.BlockCopy(bSource, 0, bTarget, 0, SIZE); sw.Stop(); Console.WriteLine("Buffer.BlockCopy: {0:N0} ticks", sw.ElapsedTicks); } static void TestArrayCopy() { Stopwatch sw = Stopwatch.StartNew(); sw.Start(); for (int i = 0; i < COUNT; i++) Array.Copy(cSource, 0, cTarget, 0, SIZE); sw.Stop(); Console.WriteLine("Array.Copy: {0:N0} ticks", sw.ElapsedTicks); } static void Main(string[] args) { for (int i = 0; i < 10; i++) { TestArrayCopy(); TestBlockCopy(); TestUnsafe(); Console.WriteLine(); } } } 

The results:

Buffer.BlockCopy: 469,151 ticks Array.Copy: 469,972 ticks Buffer.memcpyimpl: 496,541 ticks Buffer.BlockCopy: 421,011 ticks Array.Copy: 430,694 ticks Buffer.memcpyimpl: 410,933 ticks Buffer.BlockCopy: 425,112 ticks Array.Copy: 420,839 ticks Buffer.memcpyimpl: 411,520 ticks Buffer.BlockCopy: 424,329 ticks Array.Copy: 420,288 ticks Buffer.memcpyimpl: 405,598 ticks Buffer.BlockCopy: 422,410 ticks Array.Copy: 427,826 ticks Buffer.memcpyimpl: 414,394 ticks 

Now change the order:

Array.Copy: 419,750 ticks Buffer.memcpyimpl: 408,919 ticks Buffer.BlockCopy: 419,774 ticks Array.Copy: 430,529 ticks Buffer.memcpyimpl: 412,148 ticks Buffer.BlockCopy: 424,900 ticks Array.Copy: 424,706 ticks Buffer.memcpyimpl: 427,861 ticks Buffer.BlockCopy: 421,929 ticks Array.Copy: 420,556 ticks Buffer.memcpyimpl: 421,541 ticks Buffer.BlockCopy: 436,430 ticks Array.Copy: 435,297 ticks Buffer.memcpyimpl: 432,505 ticks Buffer.BlockCopy: 441,493 ticks 

Now change the order again:

Buffer.memcpyimpl: 430,874 ticks Buffer.BlockCopy: 429,730 ticks Array.Copy: 432,746 ticks Buffer.memcpyimpl: 415,943 ticks Buffer.BlockCopy: 423,809 ticks Array.Copy: 428,703 ticks Buffer.memcpyimpl: 421,270 ticks Buffer.BlockCopy: 428,262 ticks Array.Copy: 434,940 ticks Buffer.memcpyimpl: 423,506 ticks Buffer.BlockCopy: 427,220 ticks Array.Copy: 431,606 ticks Buffer.memcpyimpl: 422,900 ticks Buffer.BlockCopy: 439,280 ticks Array.Copy: 432,649 ticks 

or, in other words: they're very competitive; as a general rule, memcpyimpl is fastest, but it's not necessarily worth worrying about.

Sign up to request clarification or add additional context in comments.

13 Comments

C'mon, man, benchmark it! I always assumed Buffer.BlockCopy is faster, but I'm not sure anymore. Hans Passant (way down in the page) claims the exact same CLR code executes for both: social.msdn.microsoft.com/Forums/en-US/netfxbcl/thread/…
I'm curious to know if your last suggestion works, and how well if so.
@MusiGenesis: I guess memcpyimpl is the way to go then? (Not that I've benchmarked that either, though I have used it before. I'll benchmark it right now.)
Hans Passant: answering StackOverflow questions before there even was a StackOverflow! I think that officially puts him in Jon Skeet territory.
It should be noted that the memcpyimpl method no longer exists in at least .NET 4.5.1, it's now called Memcpy and there are various overloads for it so you need to pass in the parameter types to resolve the method you want.
|
11

You can use Array.Copy.

EDIT

Array.Copy does work for multidimensional arrays: see this topic.

1 Comment

I looked at your link, but my situation is different. The source comes from three different 1d arrays. The dest array is an N by 3 array where each dimension contains one of the source array.
6

If running on .NET Core, you may consider using source.AsSpan().CopyTo(destination) (beware on Mono though).

 Method | Job | Runtime | Mean | Error | StdDev | Ratio | RatioSD | ---------------- |----- |-------- |----------:|----------:|----------:|------:|--------:| ArrayCopy | Clr | Clr | 60.08 ns | 0.8231 ns | 0.7699 ns | 1.00 | 0.00 | SpanCopy | Clr | Clr | 99.31 ns | 0.4895 ns | 0.4339 ns | 1.65 | 0.02 | BufferBlockCopy | Clr | Clr | 61.34 ns | 0.5963 ns | 0.5578 ns | 1.02 | 0.01 | | | | | | | | | ArrayCopy | Core | Core | 63.33 ns | 0.6843 ns | 0.6066 ns | 1.00 | 0.00 | SpanCopy | Core | Core | 47.41 ns | 0.5399 ns | 0.5050 ns | 0.75 | 0.01 | BufferBlockCopy | Core | Core | 59.89 ns | 0.4713 ns | 0.3936 ns | 0.94 | 0.01 | | | | | | | | | ArrayCopy | Mono | Mono | 149.82 ns | 1.6466 ns | 1.4596 ns | 1.00 | 0.00 | SpanCopy | Mono | Mono | 347.87 ns | 2.0589 ns | 1.9259 ns | 2.32 | 0.02 | BufferBlockCopy | Mono | Mono | 61.52 ns | 1.1691 ns | 1.0364 ns | 0.41 | 0.01 | 

3 Comments

Where operation per second column ?
In my own testing, BufferBlockCopy consistently wins out over SpanCopy. But this is for copying raw bytes of at most 512. So there's that. It's not a big difference but consistently 20% faster. This was tested on .NET 7. It is possible that all optimizations aren't enabled here. So I tried a .NET 8 release build (was using LINQPad) and that made SpanCopyTo about the same. So, the real story is that BufferBlockCopy is faster unless release builds are used with full optimization in which case they are the same. For this reason, I'd use BufferBlockCopy because it will be fast with less effort...
...and by that I mean that the compiler won't have to optimize your code to achieve that level of performance. It will be fast in both debug and release builds.
5

For primitive type arrays (like double) you can copy fast, even for multidimensional array with pointers.

In the code below I initialize a 2D array A[10,10] with the values 1 through 100. Then I copy these values into a 1D array B[100]

unsafe class Program { static void Main(string[] args) { double[,] A = new double[10, 10]; for(int i = 0; i < 10; i++) { for(int j = 0; j < 10; j++) { A[i, j] = 10 * i + j + 1; } } // A has { { 1 ,2 .. 10}, { 11, 12 .. 20}, .. { .. 99, 100} } double[] B = new double[10 * 10]; if (A.Length == B.Length) { fixed (double* pA = A, pB = B) { for(int i = 0; i < B.Length; i++) { pB[i] = pA[i]; } } // B has {1, 2, 3, 4 .. 100} } } } 

How fast is it. My testing has shown it to be many times faster then native C# copy and Buffer.BlockCopy(). You try it for your case and let us know.

Edit 1 I compared copying with four methods. 1) Two Nested loops, 2) One Serial loop, 3) Pointers, 4) BlockCopy. I measured the # of copies per tick for various size arrays.

N = 10x 10 (cpy/tck) Nested = 50, Serial = 33, Pointer = 100, Buffer = 16 N = 20x 20 (cpy/tck) Nested = 133, Serial = 40, Pointer = 400, Buffer = 400 N = 50x 50 (cpy/tck) Nested = 104, Serial = 40, Pointer = 2500, Buffer = 2500 N = 100x 100 (cpy/tck) Nested = 61, Serial = 41, Pointer = 10000, Buffer = 3333 N = 200x 200 (cpy/tck) Nested = 84, Serial = 41, Pointer = 40000, Buffer = 2666 N = 500x 500 (cpy/tck) Nested = 69, Serial = 41, Pointer = 125000, Buffer = 2840 N = 1000x1000 (cpy/tck) Nested = 33, Serial = 45, Pointer = 142857, Buffer = 1890 N = 2000x2000 (cpy/tck) Nested = 30, Serial = 43, Pointer = 266666, Buffer = 1826 N = 5000x5000 (cpy/tck) Nested = 21, Serial = 42, Pointer = 735294, Buffer = 1712 

It is clear here who is the winner. Pointer copy is orders of magnitudes better than any other method.

Edit 2 Apparently I was unfairly taking advantage of a compiler/JIT optimization because when I moved the loops behind delegates to equalize the playing field the numbers changed dramatically.

N = 10x 10 (cpy/tck) Nested = 0, Serial = 0, Pointer = 0, Buffer = 0 N = 20x 20 (cpy/tck) Nested = 80, Serial = 14, Pointer = 100, Buffer = 133 N = 50x 50 (cpy/tck) Nested =147, Serial = 15, Pointer = 277, Buffer = 2500 N = 100x 100 (cpy/tck) Nested = 98, Serial = 15, Pointer = 285, Buffer = 3333 N = 200x 200 (cpy/tck) Nested =106, Serial = 15, Pointer = 272, Buffer = 3076 N = 500x 500 (cpy/tck) Nested =106, Serial = 15, Pointer = 276, Buffer = 3125 N = 1000x1000 (cpy/tck) Nested =101, Serial = 11, Pointer = 199, Buffer = 1396 N = 2000x2000 (cpy/tck) Nested =105, Serial = 9, Pointer = 186, Buffer = 1804 N = 5000x5000 (cpy/tck) Nested =102, Serial = 8, Pointer = 170, Buffer = 1673 

The buffered copy is top here (thanks to @Mehrdad) with pointer copy second. The question now is why isn't pointer copy as fast as buffer methods?

3 Comments

Buffer.BlockCopy uses internally memmove, which, I guess, copies any number of bytes in one assembler command. This should be much faster than using pointers, which copy in one instruction only 1 double and need to loop many times.
Yes, it makes sense. It would make sense for the buffer to be smaller than CPU cache to keep things fast. I wonder if it adjusts according to the CPU architecture.
Most likely the Buffer Copy uses data parallel SIMD/SSE CPU instructions to move up to 512-bits at a time, which moves eight doubles per instructions
0

If a jagged array of the following form would work, then a copy could be avoided:

double[][] leftNode = new double[3][]; leftNode[0] = sortedIndex; leftNode[1] = sortedInstances; leftNode[2] = sortedLabels; 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.