In addition to the answers already given, you can swap specific commands for faster variants. `Part[]` is a good candidate: it has to do a bounds check on every access. In time-critical inner loops you can switch that off by using `Compile`GetElement[]` instead. Be very cautious with this one, because an out-of-range index is then no longer caught.
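As a rough sketch of the difference, here are two versions of the same summation loop, one using `Part` and one using `Compile`GetElement`. The function names `sumChecked` and `sumUnchecked` and the test vector `data` are made up for illustration, and `CompilationTarget -> "C"` assumes a working C compiler is installed:

    (* Hypothetical example: sum a real vector in an explicit loop *)

    (* Version 1: ordinary Part, bounds-checked on every access *)
    sumChecked = Compile[{{v, _Real, 1}},
      Module[{s = 0.},
        Do[s += v[[i]], {i, Length[v]}];
        s],
      CompilationTarget -> "C", RuntimeOptions -> "Speed"];

    (* Version 2: Compile`GetElement skips the bounds check. *)
    (* The index must be guaranteed valid by the loop itself. *)
    sumUnchecked = Compile[{{v, _Real, 1}},
      Module[{s = 0.},
        Do[s += Compile`GetElement[v, i], {i, Length[v]}];
        s],
      CompilationTarget -> "C", RuntimeOptions -> "Speed"];

    data = RandomReal[1, 10^7];
    First@AbsoluteTiming[sumChecked[data]]
    First@AbsoluteTiming[sumUnchecked[data]]

Whether the second version is measurably faster depends on the loop body and the compiler, so it is worth timing both on your own code before committing to the unchecked variant.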

Another thing you might want to try (I never needed this myself) is to pass compiler optimization options, including platform-specific ones that your CPU supports:

    Needs["CompiledFunctionTools`"]
    Compile`$CCompilerOptions = {"SystemCompileOptions" -> "-fPIC -O3"}

I think the default optimization level is -O2.

Furthermore, basic arithmetic operations, for example, are already quite optimized, and calls into the runtime library should be quite fast.