1 vote
0 answers
74 views

I have a VxWorks machine, and I'm trying to optimize my compiler flags. The chip is theoretically a Skylake architecture and should support AVX instructions. I ran some test code and got an "...
dividebyzero
  • 2,288
4 votes
1 answer
328 views

I'm compiling C++ code with MSVC. My sample code below is SSE2 because I'm using a dispatcher which means Visual Studio won't dump the assembly for AVX or AVX512. My use case is AVX2 and AVX512. I've ...
dts
  • 223
3 votes
1 answer
182 views

I have a YMM register containing 4 packed double coefficients. I want to compute the cubic equation that they represent as quickly as possible using SSE and AVX2 intrinsics. Throughput is most ...
Martin Brown
  • 3,626
0 votes
0 answers
67 views

I'm learning SSE programming and I come across this instruction: (V)CVTDQ2PS—Convert Packed Doubleword Integers to Packed Single-Precision Floating Point I thought CVT means convert, D means Dword, ...
user239216
3 votes
2 answers
129 views

I've written a little test program that triggers FPU exceptions through feraiseexcept(): #include <iostream> #include <cfenv> using namespace std; int main() { auto test = []( int exc,...
Edison von Myosotis
3 votes
1 answer
138 views

So I have this SSE3 implementation for matrix multiplication: /** * Loop is unwrapped for performance * @attention As opposed to non-SIMD multiplication we're using column-major */ inline void ...
Baba Dan Constantin
1 vote
2 answers
463 views

Suppose I have already compiled a binary that does some float calculation and outputs the result. If I provide the same input across different executions, can I assume that the result must be completely the same (...
song xs
  • 81
0 votes
1 answer
210 views

I have some code written with SIMD instructions that converts RGBA color data to grayscale, compiled to WASM with Emscripten. It works fine and I haven't had any issues doing this. But I would make a ...
kalwalt
  • 490
1 vote
1 answer
66 views

Referencing this link, I tried to integrate the libmvec intrinsics into some existing C++ code. Naturally, this involved me changing the forward declaration to extern "C" __m128 ...
Gabriel Lim
3 votes
2 answers
177 views

I have an algorithm that measures the width of each cell (8x8) in a bitmap (128x128) by counting the distance from the start of a cell to the first column within it containing only zeroes. If there is ...
invertedPanda
0 votes
1 answer
339 views

I have recently been studying the issue of CRC32 calculation acceleration, which is aimed at the type of CRC32b (polynomial 0x04C11DB7), but I found that the crc32 instructions in Intel's SSE4.2 are ...
Dbettkk
  • 113
1 vote
1 answer
126 views

I want to use the following code to compute the intersection of array a and array b: #include <nmmintrin.h> #include <cstdint> #include <cstdio> void test(uint16_t *a, uint16_t *b) { ...
zelin
  • 21
2 votes
0 answers
379 views

I would like to install TensorFlow on a Windows system with a processor that does not seem to support AVX (Pentium J6426). I saw that a minority of people had this problem in the past with earlier ...
Alkhwarizmi
3 votes
1 answer
141 views

I have a function which takes two arrays and fills a third array with bilinear combinations of the coordinates of the two arrays. The function is a clifford product over an 8-dimensional real vector ...
MonadMania
0 votes
1 answer
105 views

I'm making my own linalg library for my OpenGL project, and was thinking of accelerating matmul using SIMD. Minimal reproducible example: use std::arch::x86_64::*; #[derive(Debug, Clone, Copy)] ...
alco
  • 1
