In addition to the answers given, you may tweak specific commands to get better performance. Part[] is a candidate for this: Part has to do bounds checks. In time-critical inner loops you can switch that off by using Compile`GetElement[] instead. Be very cautious with this one, since an out-of-range index is then no longer caught.
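To make the Part vs. Compile`GetElement swap concrete, here is a minimal sketch. The function names sumChecked and sumUnchecked and the test vector are made up for illustration; only the element access differs between the two versions.

```mathematica
(* Hypothetical example: sum a vector using bounds-checked Part *)
sumChecked = Compile[{{v, _Real, 1}},
   Module[{s = 0.},
    Do[s += v[[i]], {i, Length[v]}];
    s]];

(* Same loop, but the element access goes through Compile`GetElement,
   which skips the bounds check. The index i must stay within 1..Length[v]. *)
sumUnchecked = Compile[{{v, _Real, 1}},
   Module[{s = 0.},
    Do[s += Compile`GetElement[v, i], {i, Length[v]}];
    s]];

data = RandomReal[1, 10^6];

(* Compare timings and check that both versions agree *)
First@AbsoluteTiming[sumChecked[data]]
First@AbsoluteTiming[sumUnchecked[data]]
Abs[sumChecked[data] - sumUnchecked[data]]
```

As far as I know, any gain tends to show up mainly when compiling with CompilationTarget -> "C", so it is worth timing both variants on your own machine before committing to the unchecked one.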
Another thing you might want to try (never needed this myself) is to give platform specific compile optimization options that your CPU supports:
    Needs["CompiledFunctionTools`"]
    Compile`$CCompilerOptions = {"SystemCompileOptions" -> "-fPIC -O3"}

I think the default optimization level is -O2.
Furthermore, basic arithmetic operations, for example, are already quite well optimized, and linking to the runtime library should be quite fast.
Edit:
One important point I forgot: the internal optimizer will find a good way to formulate your expressions.

    Experimental`OptimizeExpression[{x^2 Sin[x^2]}]

Also, with the symbolic power available you can simplify expressions that you could never simplify by hand with pen and paper.
To see what can be compiled have a look at:
    Compile`CompilerFunctions[]

To get a warning about external symbols that are not included, you can alternatively use:
    On[Compile::noinfo]

Also, RuntimeAttributes -> Listable, together with Parallelization -> True, provides for very easy parallelization.
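A minimal sketch of the Listable route, assuming a made-up scalar function f. The compiled function is written for a single real argument and threads over a list automatically; Parallelization -> True asks the runtime to spread that threading over the available cores.

```mathematica
(* Hypothetical scalar function, compiled so that it threads over lists *)
f = Compile[{{x, _Real}},
   Sin[x]^2 + Cos[x]^3,
   RuntimeAttributes -> {Listable},
   Parallelization -> True];

(* A scalar argument returns a scalar ... *)
f[0.5]

(* ... and a list argument is processed element by element *)
f[RandomReal[1, 10]]
```

Whether the parallel version actually pays off depends on how expensive the function body is and how long the list is, so treat this as something to benchmark rather than a guaranteed speed-up.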