Towards your original code (this does not answer the question why Mod is that slow; it just tries to get rid of it), here is my version of it, two generations later. We already identified Mod to be the bottleneck, so let's get rid of it, by precomputing the indices and by submitting them to the compiled function as additional arguments:
isingProper5 = Compile[{{initCoords, _Integer, 2}, {tSteps, _Integer}, {storeTraj, _Integer}, {up, _Integer, 1}, {down, _Integer, 1}, {left, _Integer, 1}, {right, _Integer, 1} }, Module[{totalCoords, realTraj, storedSteps, storedStepPointer = 1, coordMat = initCoords, peMat, gridNRows, gridNCols, giL, giM, gjL, gjM, c}, (*number of steps to be used in the trajectory*) realTraj = Min@{Max@{storeTraj, 1}, tSteps}; (*steps to be cached*) storedSteps = Table[Ceiling[tSteps*i/realTraj], {i, Min@{Max@{realTraj, 1}, tSteps}}]; (*preallocated trajectory block*) totalCoords = Table[initCoords, {i, realTraj}]; gridNRows = Length@initCoords; gridNCols = Dimensions[initCoords][[2]]; Do[ peMat =(*calculate PE on the grid from nearest neighbors*) Table[ giL = Compile`GetElement[up, i]; giM = Compile`GetElement[down, i]; Table[ gjL = Compile`GetElement[left, j]; gjM = Compile`GetElement[right, j]; Times[ - Compile`GetElement[coordMat, i, j], Plus[ Compile`GetElement[coordMat, giL, j], Compile`GetElement[coordMat, giM, j], Compile`GetElement[coordMat, i, gjL], Compile`GetElement[coordMat, i, gjM] ] ], {j, gridNCols}] , {i, gridNRows} ]; If[n == storedSteps[[storedStepPointer]], totalCoords[[storedStepPointer]] = coordMat; storedStepPointer++]; , {n, tSteps} ]; totalCoords ], RuntimeOptions -> "Speed", CompilationTarget -> "C" ];
Here is a timing example:
m = n = 100; isingGrid = Developer`ToPackedArray[RandomChoice[{-1, 1}, {m, n}]]; up = RotateLeft[Range[n]]; down = RotateRight[Range[n]]; left = RotateLeft[Range[m]]; right = RotateRight[Range[m]]; t5 = AbsoluteTiming[ isingProper5[isingGrid, 1000, 1, up, down, left, right]; ][[1]]/1000
0.000020027
Compared to the original function, this is about 36 times as fast:
t3 = AbsoluteTiming[ isingProper3[isingGrid, 1000, 1] ][[1]]/1000 t3/t5
0.00105735
36.2294
So far, this code is also not parallelized. You have to cast that into a Listable function in order to profit from Parallelization -> True.
I am also quite surprised about the slowness of Mod. Usually, we get advised to replace as much memory bound tasks by CPU-bound tasks on modern hardware. In this case, the opposite was more performant.
Disclaimer
I haven't checked the result for correctnes, yet.
Thinking about it even more, I wonder whether the Do-loop in the compiled function gets optimized away by the compiler. If I replace the Do by Table and make it the return variable, execution takes way longer and I doubt that this is solely due to memory allocation...
Moreover, the following isolates a single iteration from the Do-loop.
cg = Compile[{{a, _Integer, 2}, {up, _Integer, 1}, {down, _Integer, 1}, {left, _Integer, 1}, {right, _Integer, 1}}, Table[ Times[ (- Compile`GetElement[a, i, j]), Plus[ Compile`GetElement[a, Compile`GetElement[up, i], j], Compile`GetElement[a, Compile`GetElement[down, i], j], Compile`GetElement[a, i, Compile`GetElement[left, j]], Compile`GetElement[a, i, Compile`GetElement[right, j] ] ] ], {i, Dimensions[a][[1]]}, {j, Dimensions[a][[2]]}] , RuntimeOptions -> "Speed", CompilationTarget -> "C"];
Its runtime is approximately twice as the average runtime of an iteration in isingProper5.
tcg = cg[isingGrid, up, down, left, right]; // RepeatedTiming // First
0.000038
I am not sure what the reason is. It cannot be calling overhead alone, because the factor 2 remains also for m = n = 1000; and m = n = 2000;.