2,390 questions
1 vote
0 answers
74 views
Exception 13 with AVX instruction
I have a VxWorks machine, and I'm trying to optimize my compiler flags. The chip is theoretically a Skylake architecture and should support AVX instructions. I ran some test code and got an "...
4 votes
1 answer
328 views
Are arrays of simd vectors naturally inefficient?
I'm compiling C++ code with MSVC. My sample code below is SSE2 because I'm using a dispatcher, which means Visual Studio won't dump the assembly for AVX or AVX512. My use case is AVX2 and AVX512. I've ...
3 votes
1 answer
182 views
What is the fastest way to evaluate a cubic given 4 packed double coefficients in a YMM register?
I have a YMM register containing 4 packed double coefficients. I want to compute the cubic equation that they represent as quickly as possible using SSE and AVX2 intrinsics. Throughput is most ...
0 votes
0 answers
67 views
Meaning of Q in SSE instruction mnemonics
I'm learning SSE programming and I came across this instruction: (V)CVTDQ2PS: Convert Packed Doubleword Integers to Packed Single-Precision Floating Point. I thought CVT means convert, D means Dword, ...
3 votes
2 answers
129 views
How to trigger exactly *one* SSE exception
I've written a little test program that triggers FPU exceptions through feraiseexcept(): #include <iostream> #include <cfenv> using namespace std; int main() { auto test = []( int exc,...
3 votes
1 answer
138 views
SSE4.1 slower than SSE3 on 4x4 matrix multiplication?
So I have this SSE3 implementation for matrix multiplication: /** * Loop is unwrapped for performance * @attention As opposed to non-SIMD multiplication we're using column-major */ inline void ...
1 vote
2 answers
463 views
Is floating-point math deterministic for all Intel/AMD CPUs?
Suppose I have already compiled a binary that does some floating-point calculation and outputs the result. If I provide the same input on different executions, can I assume that the result will be exactly the same (...
0 votes
1 answer
210 views
How to correctly convert _mm_hadd_epi32 to equivalent C++ code (wasm_simd128.h) with Emscripten?
I have some code written with SIMD instructions to convert RGBA color data to grayscale data, compiled to WASM thanks to Emscripten. It works fine and I haven't had any issues doing this. But I would make a ...
1 vote
1 answer
66 views
Output errors when using libmvec intrinsics for trigonometric functions manually (like cosf)
Referencing this link, I tried to integrate the libmvec intrinsics into some existing C++ code. Naturally, this involved me changing the forward declaration to extern "C" __m128 ...
3 votes
2 answers
177 views
How to optimize cell-width measuring with SIMD (find the first column to have a non-zero byte in an 8x8 block of bytes)
I have an algorithm that measures the width of each cell (8x8) in a bitmap (128x128) by counting the distance from the start of a cell to the first column within it containing only zeroes. If there is ...
0 votes
1 answer
339 views
Accelerate CRC32b using Intel processors
I have recently been studying how to accelerate CRC32 calculation, specifically CRC32b (polynomial 0x04C11DB7), but I found that the crc32 instructions in Intel's SSE4.2 are ...
1 vote
1 answer
126 views
SSE4.2 _mm_cmpistrm/_mm_cmpestrm instructions give the wrong result
I want to use the following code to compute the intersection of array a and array b: #include <nmmintrin.h> #include <cstdint> #include <cstdio> void test(uint16_t *a, uint16_t *b) { ...
2 votes
0 answers
379 views
Run TensorFlow 2.17 on CPU without AVX
I would like to install TensorFlow on a Windows system with a processor that does not seem to support AVX (Pentium J6426). I saw that a minority of people had this problem in the past with earlier ...
3 votes
1 answer
141 views
How might I optimize computing a large bilinear function exhibiting more-or-less random access?
I have a function which takes two arrays and fills a third array with bilinear combinations of the coordinates of the two arrays. The function is a Clifford product over an 8-dimensional real vector ...
0 votes
1 answer
105 views
Access violation when performing matrix product using SIMD in Rust
I'm making my own linear algebra library for my OpenGL project, and was thinking of accelerating matmul using SIMD. Minimal reproducible example: use std::arch::x86_64::*; #[derive(Debug, Clone, Copy)] ...