Why does allocating a single 2D array take longer than a loop allocating multiple 1D arrays of the same total size and shape?

Question

I thought it would be quicker to create directly, but in fact, adding loops takes only half the time. What happened that slowed down so much?

Here is the test code

@BenchmarkMode(Mode.AverageTime) @OutputTimeUnit(TimeUnit.MICROSECONDS) public class Test_newArray { private static int num = 10000; private static int length = 10; @Benchmark public static int[][] newArray() { return new int[num][length]; } @Benchmark public static int[][] newArray2() { int[][] temps = new int[num][]; for (int i = 0; i < temps.length; i++) { temps[i] = new int[length]; } return temps; } }

The test results are as follows.

Benchmark Mode Cnt Score Error Units Test_newArray.newArray avgt 25 289.254 ± 4.982 us/op Test_newArray.newArray2 avgt 25 114.364 ± 1.446 us/op

The test environment is as follows

JMH version: 1.21

VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04

just guessing, maybe because in case of int[num][length], the continuous space of size num x length should be allocated while in case of int[num][], the arrays are allocated arbitrarily — mangusta
– mangusta, Commented Sep 29, 2019 at 19:34
Do you have any data on what happens in your environment when you vary num and length? — NPE
– NPE, Commented Sep 29, 2019 at 19:34
@mangusta There are no 2d arrays in Java, I believe your guess is wrong. stackoverflow.com/a/6631081/1570854 — Lesiak
– Lesiak, Commented Sep 29, 2019 at 19:36
If you run JMH with -prof perfasm, you might gain some helpful insights. E.g. I can see lots of ObjArrayKlass::multi_allocate present in the output of the first method, but absent in the second one. My guess: reflection overhead? — knittl
– knittl, Commented Sep 29, 2019 at 19:52
@Bohemian Doesn't JMH execute each benchmark in isolation (i.e. in its own JVM) and handle warmup for you? By default, I believe 5 forks are used per benchmark, where each fork runs 5 warmup iterations and 5 measurement iterations. — Slaw
– Slaw, Commented Sep 29, 2019 at 21:16

apangin · Accepted Answer · 2023-05-10 16:10:53Z

In Java there is a separate bytecode instruction for allocating multidimensional arrays - multianewarray.

newArray benchmark uses multianewarray bytecode;
newArray2 invokes simple newarray in the loop.

The problem is that HotSpot JVM has no fast path^* for multianewarray bytecode. This instruction is always executed in VM runtime. Therefore, the allocation is not inlined in the compiled code.

The first benchmark has to pay performance penalty of switching between Java and VM Runtime contexts. Also, the common allocation code in the VM runtime (written in C++) is not as optimized as inlined allocation in JIT-compiled code, just because it is generic, i.e. not optimized for the particular object type or for the particular call site, it performs additional runtime checks, etc.

Here are the results of profiling both benchmarks with async-profiler. I used JDK 11.0.4, but for JDK 8 the picture looks similar.

In the first case, 99% time is spent inside OptoRuntime::multianewarray2_C - the C++ code in the VM runtime.

In the second case, the most of the graph is green, meaning that the program runs mostly in Java context, actually executing JIT-compiled code optimized specifically for the given benchmark.

EDIT

^* Actually, HotSpot JVM can inline multianewarray, but only if the total number of allocations does not exceed -XX:MultiArrayExpandLimit which defaults to 6.

So, for example, new int[5][10] will be allocated inline in the JIT compiled code, while the allocation of new int[10][5] will go through the VM runtime.

Nice answer, especially including a flame graph – I wish they were more common.
Thank you for your answer. It's a good answer to my doubts. The problem is the entertainment of rest time. It's not really about optimizing the details.

Oleksandr Pyrohov · Accepted Answer · 2019-10-02 09:03:10Z

A note in the Oracle Docs under the multianewarray instruction says:

It may be more efficient to use newarray or anewarray (§newarray, §anewarray) when creating an array of a single dimension.

Further:

newArray benchmark uses multianewarray bytecode instruction.

newArray2 benchmark uses anewarray bytecode instruction.

And that is what makes a difference. Let's see the statistics obtained using the perf Linux profiler.

For the newArray benchmark the hottest methods after inlining are:

....[Hottest Methods (after inlining)].............................................................. 22.58% libjvm.so MemAllocator::allocate 14.80% libjvm.so ObjArrayAllocator::initialize 12.92% libjvm.so TypeArrayKlass::multi_allocate 10.98% libjvm.so AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<2670710ul, G1BarrierSet>, (AccessInternal::BarrierType)1, 2670710ul>::oop_access_barrier 7.38% libjvm.so ObjArrayKlass::multi_allocate 6.02% libjvm.so MemAllocator::Allocation::notify_allocation_jvmti_sampler 5.84% ld-2.27.so __tls_get_addr 5.66% libjvm.so CollectedHeap::array_allocate 5.39% libjvm.so Klass::check_array_allocation_length 4.76% libc-2.27.so __memset_avx2_unaligned_erms 0.75% libc-2.27.so __memset_avx2_erms 0.38% libjvm.so __tls_get_addr@plt 0.17% libjvm.so memset@plt 0.10% libjvm.so G1ParScanThreadState::copy_to_survivor_space 0.10% [kernel.kallsyms] update_blocked_averages 0.06% [kernel.kallsyms] native_write_msr 0.05% libjvm.so G1ParScanThreadState::trim_queue 0.05% libjvm.so Monitor::lock_without_safepoint_check 0.05% libjvm.so G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region 0.05% libjvm.so OtherRegionsTable::occupied 1.92% <...other 288 warm methods...> ....[Distribution by Source].... 87.61% libjvm.so 5.84% ld-2.27.so 5.56% libc-2.27.so 0.92% [kernel.kallsyms] 0.03% perf-27943.map 0.03% [vdso] 0.01% libpthread-2.27.so ................................ 100.00% <totals>

And for the newArray2:

....[Hottest Methods (after inlining)].............................................................. 93.45% perf-28023.map [unknown] 0.26% libjvm.so G1ParScanThreadState::copy_to_survivor_space 0.22% [kernel.kallsyms] update_blocked_averages 0.19% libjvm.so OtherRegionsTable::is_empty 0.17% libc-2.27.so __memset_avx2_erms 0.16% libc-2.27.so __memset_avx2_unaligned_erms 0.14% libjvm.so OptoRuntime::new_array_C 0.12% libjvm.so G1ParScanThreadState::trim_queue 0.11% libjvm.so G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region 0.11% libjvm.so MemAllocator::allocate_inside_tlab_slow 0.11% libjvm.so ObjArrayAllocator::initialize 0.10% libjvm.so OtherRegionsTable::occupied 0.10% libjvm.so MemAllocator::allocate 0.10% libjvm.so Monitor::lock_without_safepoint_check 0.10% [kernel.kallsyms] rt2800pci_rxdone_tasklet 0.09% libjvm.so G1Allocator::unsafe_max_tlab_alloc 0.08% libjvm.so ThreadLocalAllocBuffer::fill 0.08% ld-2.27.so __tls_get_addr 0.07% libjvm.so G1CollectedHeap::allocate_new_tlab 0.07% libjvm.so TypeArrayKlass::allocate_common 4.15% <...other 411 warm methods...> ....[Distribution by Source].... 93.45% perf-28023.map 4.31% libjvm.so 1.64% [kernel.kallsyms] 0.42% libc-2.27.so 0.08% ld-2.27.so 0.06% [vdso] 0.04% libpthread-2.27.so ................................ 100.00% <totals>

As we can see, for the slower newArray most of the time is spent in the jvm code (87.61% total):

22.58% libjvm.so MemAllocator::allocate 14.80% libjvm.so ObjArrayAllocator::initialize 12.92% libjvm.so TypeArrayKlass::multi_allocate 7.38% libjvm.so ObjArrayKlass::multi_allocate ...

While the newArray2 uses the OptoRuntime::new_array_C, spending much less time allocating the memory for arrays. The total time spent in the jvm code is only 4.31%.

Bonus statistics obtained using the perfnorm profiler:

Benchmark Mode Cnt Score Error Units newArray avgt 4 448.018 ± 80.029 us/op newArray:CPI avgt 0.359 #/op newArray:L1-dcache-load-misses avgt 10399.712 #/op newArray:L1-dcache-loads avgt 1032985.924 #/op newArray:L1-dcache-stores avgt 590756.905 #/op newArray:cycles avgt 1132753.204 #/op newArray:instructions avgt 3159465.006 #/op Benchmark Mode Cnt Score Error Units newArray2 avgt 4 125.531 ± 50.749 us/op newArray2:CPI avgt 0.532 #/op newArray2:L1-dcache-load-misses avgt 10345.720 #/op newArray2:L1-dcache-loads avgt 85185.726 #/op newArray2:L1-dcache-stores avgt 103096.223 #/op newArray2:cycles avgt 346651.432 #/op newArray2:instructions avgt 652155.439 #/op

Note the difference in the number of cycles and instructions.

Environment:

Ubuntu 18.04.3 LTS java version "12.0.2" 2019-07-16 Java(TM) SE Runtime Environment (build 12.0.2+10) Java HotSpot(TM) 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)

it would be nice if you elaborated on these results because I didn't get anything ;)
@Andrew Thank you for comment. I updated my answer with some additional details. If you still have any specific questions, please don't hesitate to ask.

Collectives™ on Stack Overflow

Why does allocating a single 2D array take longer than a loop allocating multiple 1D arrays of the same total size and shape?

2 Answers 2

2 Comments

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

2 Comments

Linked

Related