Question 1

I have a VxWorks machine, and I'm trying to optimize my compiler flags. The chip is theoretically a Skylake architecture and should support AVX instructions. I ran some test code and got an "...

Question 2

I'm compiling C++ code with MSVC. My sample code below is SSE2 because I'm using a dispatcher, which means Visual Studio won't dump the assembly for AVX or AVX512. My use case is AVX2 and AVX512. I've ...

Question 3

I have a YMM register containing 4 packed double coefficients. I want to compute the cubic equation that they represent as quickly as possible using SSE and AVX2 intrinsics. Throughput is most ...

Question 4

I'm learning SSE programming and I came across this instruction: (V)CVTDQ2PS—Convert Packed Doubleword Integers to Packed Single-Precision Floating Point. I thought CVT means convert, D means Dword, ...

Question 5

I've written a little test program that triggers FPU exceptions through feraiseexcept(): #include <iostream> #include <cfenv> using namespace std; int main() { auto test = []( int exc,...

Question 6

So I have this SSE3 implementation for matrix multiplication: /** * Loop is unwrapped for performance * @attention As opposed to non-SIMD multiplication we're using column-major */ inline void ...

Question 7

Suppose I have already compiled a binary that does some floating-point calculation and outputs the result. If I provide the same input for different executions, can I assume that the result will be exactly the same (...

Question 8

I have some code written with SIMD instructions to convert RGBA color data to grayscale data, compiled to WASM with Emscripten. It works fine and I haven't had any issues doing this. But I would make a ...

Question 9

Referencing this link, I tried to integrate the libmvec intrinsics into some existing C++ code. Naturally, this involved me changing the forward declaration to extern "C" __m128 ...

Question 10

I have an algorithm that measures the width of each cell (8x8) in a bitmap (128x128) by counting the distance from the start of a cell to the first column within it containing only zeroes. If there is ...

Question 11

I have recently been studying the issue of CRC32 calculation acceleration, which is aimed at the type of CRC32b (polynomial 0x04C11DB7), but I found that the crc32 instructions in Intel's SSE4.2 are ...

Question 12

I want to use the following code to compute the intersection of array a and array b: #include <nmmintrin.h> #include <cstdint> #include <cstdio> void test(uint16_t *a, uint16_t *b) { ...

Question 13

I would like to install TensorFlow on a Windows system with a processor that does not seem to support AVX (Pentium J6426). I saw that a minority of people had this problem in the past with earlier ...

Question 14

I have a function which takes two arrays and fills a third array with bilinear combinations of the coordinates of the two arrays. The function is a Clifford product over an 8-dimensional real vector ...

Question 15

I'm making my own linalg library for my OpenGL project, and was thinking of accelerating matmul using SIMD. Minimal reproducible example: use std::arch::x86_64::*; #[derive(Debug, Clone, Copy)] ...

Exception 13 with AVX instruction

Are arrays of SIMD vectors naturally inefficient?

What is the fastest way to evaluate a cubic given 4 packed double coefficients in a YMM register?

Meaning of Q in SSE instruction mnemonics

How to trigger exactly only one SSE-exception

SSE4.1 slower than SSE3 on 4x4 matrix multiplication?

Is floating-point math deterministic for all Intel/AMD CPUs?

How to correctly convert _mm_hadd_epi32 to equivalent C++ code (wasm_simd128.h) with Emscripten?

Output errors when using libmvec intrinsics for trigonometric functions manually (like cosf)

How to optimize cell-width measuring with SIMD (find the first column to have a non-zero byte in an 8x8 block of bytes)

Accelerate CRC32b using Intel processors

SSE4.2 _mm_cmpistrm/_mm_cmpestrm instructions give wrong result

Run TensorFlow 2.17 on CPU without AVX

How might I optimize computing a large bilinear function exhibiting more-or-less random access?

Access violation when performing matrix product using SIMD in Rust