I have this piece of code that goes through a really big image, and two versions that seem super similar show a 70% speed difference. The first one is the fast one and takes around 10 s:
```c
if (clusterPntr[col] == i) {
    /* Calculate the location of the relevant pixel (rows are flipped) */
    pixel = bmp->Data + ((bmp->Header.Height - row - 1) * bytes_per_row + col * bytes_per_pixel);
    /* Get pixel's RGB values */
    b = pixel[0];
    g = pixel[1];
    r = pixel[2];
    totr += r;
    totg += g;
    totb += b;
    sizeCluster++;
}
```

The second one takes 17 s:
```c
if (clusterPntr[col] == i) {
    /* Calculate the location of the relevant pixel (rows are flipped) */
    pixel = bmp->Data + ((bmp->Header.Height - row - 1) * bytes_per_row + col * bytes_per_pixel);
    /* Get pixel's RGB values */
    /* why is this SO MUCH SLOWER */
    totr += pixel[2];
    totg += pixel[1];
    totb += pixel[0];
    sizeCluster++;
}
```

I would guess the problem lies in caching: probably one version keeps the values in registers while the other keeps going back to the data array. The CPU is an M1 Pro, so the ARM architecture might have something to do with it as well.
Possibly aliasing between `pixel` and the `tot*` accumulators (which would hinder compiler optimizations), but it's hard to tell without the full code.