In addition to the answers given, you can tweak specific commands to get better performance. Part[] is a candidate for this, because Part has to do bounds checks. In time-critical inner loops you can switch that off by using Compile`GetElement[] instead. Be very cautious with this one: an out-of-range index is then no longer caught and can crash the kernel.
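As a minimal sketch of the bounds-check point above, here are two compiled functions that sum a vector, one with Part and one with Compile`GetElement. The names sumPart, sumUnchecked and data are made up for illustration, and CompilationTarget -> "C" assumes a working C compiler is installed. Compile`GetElement is undocumented, so treat this as something to experiment with rather than a guaranteed interface.

```mathematica
(* Checked access: every v[[i]] goes through Part and its bounds check *)
sumPart = Compile[{{v, _Real, 1}},
   Module[{s = 0.},
    Do[s += v[[i]], {i, Length[v]}];
    s],
   CompilationTarget -> "C"];

(* Unchecked access: Compile`GetElement skips the bounds check, *)
(* so the loop limits must be guaranteed correct by construction *)
sumUnchecked = Compile[{{v, _Real, 1}},
   Module[{s = 0.},
    Do[s += Compile`GetElement[v, i], {i, Length[v]}];
    s],
   CompilationTarget -> "C"];

(* Hypothetical test data; compare results and timings on your machine *)
data = RandomReal[1, 10^6];
{sumPart[data], sumUnchecked[data]}
AbsoluteTiming[sumPart[data];]
AbsoluteTiming[sumUnchecked[data];]
```

Both functions should return the same sum. Whether the unchecked version is noticeably faster depends on the loop body and the compiler, so it is worth timing both before committing to the unsafe variant.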
Another thing you might want to try (I have never needed this myself) is to pass platform-specific compiler optimization options that your CPU supports:

    Needs["CompiledFunctionTools`"]
    Compile`$CCompilerOptions = {"SystemCompileOptions" -> "-fPIC -O3"}

I think the default optimization level is -O2.
Furthermore, basic arithmetic operations, for example, are already quite well optimized, and linking to the runtime library should be quite fast.