I know very little about inline assembly. The code (see here for details) is as follows:

JNIEXPORT void JNICALL Java_com_xingin_xarengine_RGBAToGrayRenderer_nCopy(JNIEnv *env, jclass clazz, jobject dstBuf, jobject srcBuf, jint sz) {
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    auto dst = (uint8_t volatile*)env->GetDirectBufferAddress(dstBuf);
    auto src = (uint8_t volatile*)env->GetDirectBufferAddress(srcBuf);
    asm volatile (
    "NEONCopyPLD: \n"
    " VLDM %[src]!,{d0-d7} \n"
    " VSTM %[dst]!,{d0-d7} \n"
    " SUBS %[sz],%[sz],#0x40 \n"
    " BGT NEONCopyPLD \n"
    : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
    LOGD("Use neon registers for memory copy");
}

It is basically used to copy memory through NEON registers. However, the compiler complains when I build my application:

Build command failed.
Error while executing process /Users/user/Library/Android/sdk/cmake/3.10.2.4988404/bin/ninja with arguments {-C /Users/user/Projects/XarEngine/android/arview/.cxx/Release/5s3f6f2r/arm64-v8a XarEngine}
ninja: Entering directory `/Users/user/Projects/XarEngine/android/arview/.cxx/Release/5s3f6f2r/arm64-v8a'
[1/2] Building CXX object CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o
FAILED: CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o
/Users/user/Library/Android/sdk/ndk/21.1.6352462/toolchains/llvm/prebuilt/darwin-x86_64/bin/clang++ --target=aarch64-none-linux-android21 --gcc-toolchain=/Users/user/Library/Android/sdk/ndk/21.1.6352462/toolchains/llvm/prebuilt/darwin-x86_64 --sysroot=/Users/user/Library/Android/sdk/ndk/21.1.6352462/toolchains/llvm/prebuilt/darwin-x86_64/sysroot -DXarEngine_EXPORTS -D__GIT_TAG__=\"1.3.3-7-g59b0706\" -I../../../../../../components/PlaneTracker/include -I../../../../../../thirdparty/rapidjson -I../../../../../../thirdparty/filament/include -I../../../../../../thirdparty/opencv_4.5.3/include -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -s -O2 -O2 -DNDEBUG -fPIC -MD -MT CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o -MF CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o.d -o CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o -c ../../../../../../XarEngine/details.cpp
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
../../../../../../XarEngine/details.cpp:175:48: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
    : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
                                               ^
../../../../../../XarEngine/details.cpp:173:12: note: use constraint modifier "w"
    " SUBS %[sz],%[sz],#0x40 \n"
           ^~~~~
           %w[sz]
../../../../../../XarEngine/details.cpp:175:48: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
    : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
                                               ^
../../../../../../XarEngine/details.cpp:173:18: note: use constraint modifier "w"
    " SUBS %[sz],%[sz],#0x40 \n"
                 ^~~~~
                 %w[sz]
../../../../../../XarEngine/details.cpp:171:6: error: vector register expected
    " VLDM %[src]!,{d0-d7} \n"
     ^
<inline asm>:2:12: note: instantiated into assembly here
 VLDM x0!,{d0-d7}
           ^
../../../../../../XarEngine/details.cpp:172:6: error: vector register expected
    " VSTM %[dst]!,{d0-d7} \n"
     ^
<inline asm>:3:13: note: instantiated into assembly here
 VSTM x21!,{d0-d7}
            ^
2 warnings and 2 errors generated.
ninja: build stopped: subcommand failed.

Can anyone help me interpret the output above?

UPDATE
Is this related to the compiler? My compiler is clang, while the inline assembly above is supposed to be GCC-compatible.

  • I hope this is just an experiment in getting the syntax right, not that you're expecting a speedup from this vs. the JVM's own memcpy. I'd expect a JVM to use NEON regs for memcpy if available, without the overhead of marshalling for a JNI call. As for the actual errors, seems weird, I'd have expected a jint to be an integer type that could use "+r". Commented Nov 24, 2022 at 7:20
  • If you want to manually vectorize pixel-format conversions, like averaging the RGB to a single gray level and packing 4 bytes down to 1, I'd suggest using intrinsics and see if the compiler does a decent job. If not, then maybe hand-optimize the asm and wrap it up in an inline asm statement. But hopefully you won't need to mess with asm directly, just read it while you tweak the C++ source, to get good performance. Commented Nov 24, 2022 at 7:23
  • The reason for using inline assembly for memcpy is somewhat complicated. Put simply, dstBuf is a DMA buffer mapped from GPU memory and is not cached on the CPU, so calling the C++ memcpy directly may be very slow on some GPUs (e.g. Mali). Copying through NEON registers works around that. Commented Nov 24, 2022 at 7:34
  • Ok, that makes some sense. Yeah, a JVM memcpy might not be optimized to read whole 64-byte chunks with a single instruction, so yeah might be super bad on uncacheable memory. And I don't think intrinsics could let you tell the compiler you want a vldm like that. Commented Nov 24, 2022 at 7:41
  • No, the cause is already clear: you wrote some code that only works for 32-bit ARM (godbolt.org/z/1Pcs7GhjE), and are compiling it for 64-bit AArch64 as part of your android project. Use #ifdef __aarch64__ to make sure you use the right inline asm. (What predefined macro can I use to detect the target architecture in Clang? / Get architecture type (ABI) to C preprocessor for Android NDK) Commented Nov 24, 2022 at 7:56

1 Answer

Duplicate of this post. As @PeterCordes says, the inline assembly above can only be compiled for 32-bit ARM, whereas the failing build targets arm64-v8a (AArch64).
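
For illustration, here is a minimal sketch of the `#ifdef __aarch64__` guard suggested in the comments. It is untested on real hardware and is not the original author's code: `neon_copy64` and `round_up_64` are hypothetical names, the AArch64 branch is my assumption of an equivalent loop using `ld1`/`st1` with four 16-byte vector registers, and the plain `memcpy` fallback is only there so the file builds on other targets. The 32-bit branch keeps the instructions from the question but swaps the named label for a local numeric label, so the asm cannot clash if the function is inlined more than once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Round a byte count up to the next multiple of 64, as the question's code does.
inline int32_t round_up_64(int32_t sz) {
    return (sz & 63) ? (sz & -64) + 64 : sz;
}

// Hypothetical helper: copies sz bytes rounded up to a multiple of 64,
// so both buffers must be at least that rounded size.
inline void neon_copy64(uint8_t* dst, const uint8_t* src, int32_t sz) {
    assert(dst != nullptr && src != nullptr);
    sz = round_up_64(sz);
    if (sz <= 0) return;  // the asm loops below always execute at least once
#if defined(__aarch64__)
    // AArch64: 64 bytes per iteration through v0-v3 (assumed equivalent).
    // %w[sz] selects the 32-bit view of the register, as clang's note suggests.
    asm volatile(
        "1:\n"
        "  ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[src]], #64\n"
        "  st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[dst]], #64\n"
        "  subs %w[sz], %w[sz], #64\n"
        "  b.gt 1b\n"
        : [dst] "+r"(dst), [src] "+r"(src), [sz] "+r"(sz)
        :
        : "v0", "v1", "v2", "v3", "cc", "memory");
#elif defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    // 32-bit ARM with NEON: the instructions from the question.
    asm volatile(
        "1:\n"
        "  vldm %[src]!, {d0-d7}\n"
        "  vstm %[dst]!, {d0-d7}\n"
        "  subs %[sz], %[sz], #64\n"
        "  bgt 1b\n"
        : [dst] "+r"(dst), [src] "+r"(src), [sz] "+r"(sz)
        :
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
#else
    // Any other target: plain memcpy so the code still compiles.
    std::memcpy(dst, src, static_cast<std::size_t>(sz));
#endif
}
```

In the JNI function from the question, the body would then reduce to fetching the two direct buffer addresses and calling `neon_copy64(dst, src, sz)`. Whether the AArch64 loop actually helps on uncached GPU-mapped memory would need to be measured on the target device.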
