I'll just throw in a few random thoughts in no particular order, but this will be a rather high-level view of things.
###Typical use cases
In my opinion, Compile as an efficiency-boosting device is effective in two kinds of situations (and their mixes):
- The problem is solved most efficiently with a procedural style, because, for example, an efficient algorithm for it is formulated procedurally and does not have a simple / efficient functional counterpart (note also that functional programming in Mathematica is peculiar in many respects, reflecting the fact that the functional layer is a thin one on top of the rule-based engine; so, some algorithms which are efficient in other functional languages may be inefficient in Mathematica). A very clear sign of this is when you have to do array indexing in a loop.
- The problem can be solved by joining several Compile-able built-in functions together, but there are (perhaps several) "joints" where you face a performance hit if using top-level code, because it stays general and cannot use specialized versions of these functions, and for a few other reasons. In such cases, Compile merely makes the code more efficient by effectively type-specializing to numerical arguments and not using the main evaluator. One example that comes to mind is when we compile Select with a custom (compilable) predicate and can get a substantial performance boost (here is one example; a small sketch follows below).
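Here is a minimal sketch of that (the predicate and data are mine, for illustration only):

    selC = Compile[{{l, _Real, 1}, {threshold, _Real}},
       Select[l, # > threshold &]];
    selC[RandomReal[1, 10^6], 0.5]; // Timing

Since the predicate compiles, the whole Select runs inside the virtual machine, with no callbacks to the main evaluator.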
I use this rule of thumb when determining whether or not I will benefit from Compile: the more my code inside Compile looks like C code I'd write otherwise, the more I benefit from it (strictly speaking, this is only true for the compilation to C, not MVM).
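As an illustration, here is a sketch of a windowed running sum (the function movingSum and its details are mine; CompilationTarget -> "C" assumes a C compiler is available). It is an explicit loop with array indexing and in-place updates, essentially the C code one would write by hand:

    movingSum = Compile[{{l, _Real, 1}, {w, _Integer}},
       Module[{n = Length[l], s = 0., res = Table[0., {Length[l] - w + 1}]},
        (* sum of the first window *)
        Do[s += l[[i]], {i, w}];
        res[[1]] = s;
        (* slide the window: add the new element, drop the old one *)
        Do[s += l[[i]] - l[[i - w]]; res[[i - w + 1]] = s, {i, w + 1, n}];
        res],
       CompilationTarget -> "C"];
    movingSum[RandomReal[1, 10^6], 100]; // Timing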
It may happen that some portions of top-level code will be the major bottleneck and can not be recast in a more efficient form, for a given approach to the problem. In such a case, Compile may not really help, and it is best to rethink the whole approach and try to find another formulation for the problem. In other words, it often saves time and effort to do some profiling and get a good idea about the real places where the bottlenecks are, before turning to Compile.
###Limitations of Compile
Here is an (incomplete) list of limitations, most of which you mentioned yourself:
- Can only accept regular arrays (tensors) of numerical or boolean types. This excludes ragged arrays and more general Mathematica expressions (a small illustration follows after this list).
- In most cases, can only return a single tensor of some type
- Only machine-precision arithmetic
- No way to create functions with memory (a-la static variables in C)
- Only a small subset of functions can be compiled to byte-code (or C)
- Possibilities for writing recursive compiled functions seem to be very limited, and most interesting cases seem to be ruled out
- No decent pass-by-reference semantics, which is a big deal (to me anyways)
- You cannot really use indexed variables in Compile, although it may appear that you can.
- ...
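As a small illustration of the first limitation (my example): a compiled function expecting a tensor refuses ragged input and falls back to uncompiled evaluation:

    cf = Compile[{{l, _Real, 1}}, Total[l]];
    cf[{1., 2., 3.}]   (* 6. *)
    cf[{1., {2., 3.}}] (* CompiledFunction::cfta message; falls back to the uncompiled function *)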
###Whether or not to compile to C?
I think this depends on the circumstances. Compilation to C is expensive, so it makes sense only for performance-critical code to be used many times. There are also many cases when compilation to MVM will give similar performance while being much faster to compile. So it really depends, although more often than not compilation to C will lead to faster execution and, if you decide to use Compile at all, will be justified.
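To see which target pays off for your code, it is easy to compare directly (a sketch; the test body is mine, and CompilationTarget -> "C" assumes a C compiler is installed):

    cfMVM = Compile[{{n, _Integer}},
       Module[{s = 0.}, Do[s += Sqrt[i], {i, n}]; s]];
    cfC = Compile[{{n, _Integer}},
       Module[{s = 0.}, Do[s += Sqrt[i], {i, n}]; s],
       CompilationTarget -> "C"];
    AbsoluteTiming[cfMVM[10^7]]
    AbsoluteTiming[cfC[10^7]]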
###What to include in Compile
I share the opinion that, unless you have some specific requirements, it is best to use Compile only on minimal code fragments which would benefit from it the most, rather than have one big Compile. This is good because:
- It allows you to better understand where the real bottlenecks are
- It makes your compiled code more testable and composable
- If you really need it, you can then combine these pieces and use the "InlineCompiledFunctions" -> True option setting (a suboption of CompilationOptions) to get all the benefits that one large Compile would give you; see the sketch right after this list
- Since Compile is limited in what it can take, you will have fewer headaches over how to include some uncompilable pieces, plus less chance of overlooking a callback to the main evaluator
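A minimal sketch of that composition (the names inner and outer are mine; with inlining, CompilePrint[outer] should show no MainEvaluate call):

    inner = Compile[{{x, _Real}}, x^2 + Sin[x]];
    outer = With[{inner = inner},
       Compile[{{l, _Real, 1}},
        Table[inner[l[[i]]], {i, Length[l]}],
        CompilationOptions -> {"InlineCompiledFunctions" -> True}]];
    outer[RandomReal[1, 5]]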
That said, you may benefit from one large Compile in some situations, including:
- Cases when you want to grab the resulting C code and use it stand-alone (linked against Wolfram RTL)
- Cases when you want to run your compiled code in parallel on several kernels and don't want to think about possible definition-distribution issues, etc. (this was noted by @halirutan)
###A few more techniques
There are a number of techniques involving Compile, which are perhaps somewhat more advanced, but which sometimes allow one to solve problems for which plain Compile is not flexible enough. Some that I am aware of:
- Sometimes you can trade memory for speed: having a nested ragged list, pad it with zeros to form a tensor, and pass that to Compile (a small example follows).
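For instance, PadRight applied to a ragged list with no further arguments pads it with zeros into a full array:

    ragged = {{1, 2}, {3}, {4, 5, 6}};
    PadRight[ragged] (* {{1, 2, 0}, {3, 0, 0}, {4, 5, 6}} *)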
- Run-time generation of compiled functions, where certain run-time parameters get embedded. This may be combined with memoization (a sketch follows). One example is here; another very good example is here.
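A minimal sketch of such run-time generation combined with memoization (the name makeCf is mine): the run-time parameter n is injected into the compiled body with With, and the resulting compiled function is cached per value of n:

    ClearAll[makeCf];
    makeCf[n_Integer] := makeCf[n] =
       With[{n = n},
        Compile[{{x, _Real}},
         Module[{s = 0.}, Do[s += x^i, {i, n}]; s]]];
    makeCf[5][2.] (* 62. *)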
- One can emulate pass-by-reference and have a way of composing larger compiled functions out of smaller ones with parameters (well, sort of), without a loss of efficiency. This technique is showcased, for example, here.
- A common wisdom is that, since neither linked lists nor Sow-Reap are compilable, one has to pre-allocate large arrays most of the time. There are at least two other options (sketches of both follow below):
 - Use Internal`Bag, which is compilable (the problem, however, is that it cannot be returned as a result of Compile as of now, AFAIK)
 - It is quite easy to implement an analog of a dynamic array inside your compiled code, by setting up a variable which gives the current limit, and copying to a new larger array once more space is needed. In this way, you only allocate (at the end) as much space as is really needed, for the price of some overhead, which is often negligible.
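Here are minimal sketches of both options, on the example of selecting positive elements (the code is mine). First, Internal`Bag: while the bag itself cannot be returned, the list extracted from it with Internal`BagPart can be:

    bagSelect = Compile[{{l, _Real, 1}},
       Module[{bag = Internal`Bag[Most[{0.}]]}, (* empty bag of reals *)
        Do[If[l[[i]] > 0., Internal`StuffBag[bag, l[[i]]]], {i, Length[l]}];
        Internal`BagPart[bag, All]]];

And a hand-rolled dynamic array which doubles its capacity whenever it fills up:

    dynSelect = Compile[{{l, _Real, 1}},
       Module[{buf = Table[0., {16}], n = 0, cap = 16},
        Do[
         If[l[[i]] > 0.,
          If[n == cap, (* grow the buffer by doubling *)
           buf = Join[buf, Table[0., {cap}]]; cap *= 2];
          n++; buf[[n]] = l[[i]]],
         {i, Length[l]}];
        Take[buf, n]]];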
- One may often be able to use vectorized operations like UnitStep, Clip, Unitize etc., to replace if-else control flow in inner loops, also inside Compile. This may give a huge speed-up, particularly when compiling to the MVM target. Some examples are in my comments in this and this blog post, and one other pretty illustrative example of a vectorized binary search is here. A tiny illustration follows below.
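A tiny illustration (my example): the element-wise If can be replaced by arithmetic with UnitStep, which operates on the whole vector at once:

    l = RandomReal[{-1, 1}, 10^6];
    Map[If[# > 0., #, 0.] &, l]; // Timing (* element-wise conditional *)
    l UnitStep[l]; // Timing               (* vectorized equivalent *)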
- Using an additional list of integers as "pointers" to some lists you may have. Here I will make an exception for this post and give an explicit example illustrating the point. The following is a fairly efficient function to find a longest increasing subsequence of a list of numbers. It was developed jointly by DrMajorBob, Fred Simons and myself, in an on- and off-line MathGroup discussion (so this final form is not available publicly AFAIK, thus I include it here).
Here is the code:
    Clear[btComp];
    btComp = Compile[{{lst, _Integer, 1}},
       Module[{refs, result, endrefs = {1}, ends = {First@lst},
         len = Length@lst, n0 = 1, n1 = 1, i = 1, n, e},
        refs = result = 0 lst;
        For[i = 2, i <= len, i++,
         Which[
          lst[[i]] < First@ends,
          (ends[[1]] = lst[[i]]; endrefs[[1]] = i; refs[[i]] = 0),
          lst[[i]] > Last@ends,
          (refs[[i]] = Last@endrefs; AppendTo[ends, lst[[i]]];
           AppendTo[endrefs, i]),
          First@ends < lst[[i]] < Last@ends,
          (n0 = 1; n1 = Length@ends;
           While[n1 - n0 > 1,
            n = Floor[(n0 + n1)/2];
            If[ends[[n]] < lst[[i]], n0 = n, n1 = n]];
           ends[[n1]] = lst[[i]]; endrefs[[n1]] = i;
           refs[[i]] = endrefs[[n1 - 1]])]];
        For[i = 1; e = Last@endrefs, e != 0, (i++; e = refs[[e]]),
         result[[i]] = lst[[e]]];
        Reverse@Take[result, i - 1]],
       CompilationTarget -> "C"];

Here is an example of use (the list should not contain duplicates):
    test = RandomSample[#, Length[#]] &@ Union@RandomInteger[{1, 1000000}, 1000000];
    btComp[test] // Length // Timing

The fastest solution based on built-ins, which is indeed very fast, is still about 6 times slower for this size of list:
    LongestCommonSequence[test, Sort@test] // Short // Timing

Anyway, the point here is that this was possible because of the extra variables refs and endrefs, the use of which allowed manipulating only single integers instead of large integer lists.
###A few assorted remarks
Things to watch out for: see this discussion for some tips on that. Basically, you should avoid:

- Callbacks to the main evaluator (a quick way to detect them is sketched right after this list)
- Excessive copying of tensors (the CopyTensor instruction)
- Accidental unpacking happening in top-level functions preparing input for Compile or processing its output. This is not related to Compile proper, but it may happen that Compile does not help at all because the bottleneck is in the top-level code.
- Type conversions: I would not worry about the performance hit, but sometimes wrong types may lead to run-time errors, or to unanticipated callbacks to MainEvaluate in the compiled code.
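To detect callbacks, you can inspect the generated code with CompilePrint; the example below uses Prime, which is not compilable, to force a callback (accidental unpacking can similarly be traced by turning on packing messages with On["Packing"]):

    Needs["CompiledFunctionTools`"]
    cf = Compile[{{n, _Integer}}, Prime[n] + 1];
    CompilePrint[cf] (* the printout contains a MainEvaluate instruction *)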
Certain functions (e.g. Sort with the default comparison function, but not only) don't benefit from compilation much, or at all.

It is not clear how Compile handles Hold-attributes in compiled code, but there are indications that it does not fully preserve the standard semantics we are used to at the top level.

How to see whether or not you can effectively use Compile for a given problem: my experience is that with Compile in Mathematica you have to be "proactive" (with all the dislike I have for the word, I know of nothing better here). What I mean is that, to use it effectively, you have to search the structure of your problem / program for places where you could transform (parts of) the data into a form which can be used in Compile. In most cases (at least in my experience), except obvious ones where you already have a procedural algorithm in pseudo-code, you have to reformulate the problem, so you have to actively ask: what should I do to be able to use Compile here?