Leonid Shifrin
  • The Listable attribute for Compile works only one level deep, unlike the usual Listable attribute, which threads down to all levels.
  • It may happen that the time it takes to prepare the data to be fed into a Listable compiled function is much larger than the time the function itself runs (e.g. when we use Transpose or Partition on huge lists), which largely defeats the purpose. So it is good to estimate beforehand whether or not that will be the case.
  • A more "coarse-grained" alternative is to run a single-threaded compiled function in parallel on several Mathematica kernels, using the built-in parallel functionality (ParallelEvaluate, ParallelMap, etc.). These two possibilities are useful in different situations.
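As a minimal sketch of the Listable case (my illustration, not code from the original post):

```mathematica
(* a compiled function with the Listable runtime attribute; it threads
   over the outer list, and can process sub-lists in parallel *)
cfTotal = Compile[{{v, _Real, 1}}, Total[v],
   RuntimeAttributes -> {Listable}, Parallelization -> True];

cfTotal[{1., 2., 3.}]           (* single vector: 6. *)
cfTotal[{{1., 2.}, {3., 4.}}]   (* threads over the rows: {3., 7.} *)
```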
  • Sometimes you can trade memory for speed: given a nested ragged list, pad it with zeros to form a tensor, and pass that to Compile.

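For instance (an illustrative sketch of mine, not code from the original post), PadRight produces the rectangular, packable tensor:

```mathematica
(* pad a ragged list with zeros to get a rectangular (packable) tensor *)
ragged = {{1., 2., 3.}, {4.}, {5., 6.}};
padded = PadRight[ragged];   (* {{1.,2.,3.},{4.,0.,0.},{5.,6.,0.}} *)

(* the compiled function now takes the whole tensor at once; here the
   zero padding does not affect the per-row totals *)
cfRowTotals = Compile[{{m, _Real, 2}}, Total[Transpose[m]]];
cfRowTotals[padded]          (* -> {6., 4., 11.} *)
```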
  • Sometimes your list is general and you cannot directly process it in Compile to do what you want; however, you can reformulate the problem so that you instead process a list of element positions, which are integers. I call this "element-position duality". One example of this technique in action is here; for a larger application of this idea, see my last post in this thread (I hesitated to include this reference, because my first several posts there are incorrect solutions. Note that for that particular problem, a far more elegant and short, but somewhat less efficient, solution was given at the end of that thread).

  • Sometimes you may need some structural operations to prepare the input data for Compile, and the data contains lists (or, generally, tensors) of different types (say, integer positions and real values). To keep the list packed, it may make sense to convert the integers to reals (in this example), converting them back to integers with IntegerPart inside Compile. One such example is here.

  • Run-time generation of compiled functions, where certain run-time parameters get embedded, is another useful technique; it may be combined with memoization. One example is here; another very good example is here.

  • One can emulate pass-by-reference, and thereby have a way of composing larger compiled functions out of smaller ones with parameters (well, sort of), without a loss of efficiency. This technique is showcased, for example, here.

  • A common wisdom is that, since neither linked lists nor Sow-Reap are compilable, one has to pre-allocate large arrays most of the time to store the intermediate results. There are at least two other options:

    • Use Internal`Bag, which is compilable (the problem, however, is that it cannot currently be returned as a result of Compile, AFAIK).
    • It is quite easy to implement an analog of a dynamic array inside your compiled code, by keeping a variable that holds the current size limit and copying your array to a new, larger array once more space is needed. In this way, you only allocate (at the end) as much space as is really needed, for the price of some overhead, which is often negligible.
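Such a dynamic array can be sketched as follows (a hypothetical example of mine, not from the original post): collect the positive elements of a list, doubling the buffer capacity whenever it fills up.

```mathematica
cfCollectPositive = Compile[{{lst, _Real, 1}},
   Module[{buf = Table[0., {2}], cap = 2, n = 0},
    Do[
     If[lst[[i]] > 0.,
      n++;
      If[n > cap,   (* out of space: copy into a twice-larger buffer *)
       buf = Join[buf, Table[0., {cap}]];
       cap = 2*cap];
      buf[[n]] = lst[[i]]],
     {i, Length[lst]}];
    Take[buf, n]]];   (* return only the part actually filled *)

cfCollectPositive[{1., -2., 3., -4., 5.}]  (* -> {1., 3., 5.} *)
```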
  • One may often be able to use vectorized operations like UnitStep, Clip, Unitize, etc. to replace if-else control flow in inner loops, also inside Compile. This may give a huge speed-up, particularly when compiling to the MVM target. Some examples are in my comments on this and this blog post, and one other quite illustrative example of a vectorized binary search is in my answer in this thread.

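A minimal sketch of such a vectorized replacement (my example, not from the original post): the elementwise If[v >= 0, v, 0] written branchlessly with UnitStep:

```mathematica
cfRelu = Compile[{{v, _Real, 1}},
   UnitStep[v]*v];   (* the mask is 1 where v >= 0, and 0 elsewhere *)

cfRelu[{-1., 2., -3., 4.}]  (* -> {0., 2., 0., 4.} *)
```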
  • Use an additional list of integers as "pointers" to some lists you may have. Here I will make an exception for this post and give an explicit example illustrating the point. The following is a fairly efficient function to find a longest increasing subsequence of a list of numbers. It was developed jointly by DrMajorBob, Fred Simons and myself in an on- and off-line MathGroup discussion (so this final form is not available publicly, AFAIK, hence its inclusion here).

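The code referred to above is missing from this revision of the page. As a placeholder, here is a sketch of mine (not the original DrMajorBob / Fred Simons / Shifrin version) of the standard O(n log n) algorithm, where auxiliary integer arrays of positions serve as the "pointers" the bullet describes:

```mathematica
(* tails[[k]] holds the POSITION in lst of the smallest tail of an
   increasing subsequence of length k; prev[[i]] points back to the
   predecessor of element i, so the result is read off via integers *)
lisSketch = Compile[{{lst, _Real, 1}},
   Module[{n = Length[lst], tails, prev, len = 0,
     lo = 1, hi = 0, mid = 0, pos = 0, res},
    tails = Table[0, {Max[n, 1]}];
    prev = Table[0, {Max[n, 1]}];
    Do[
     lo = 1; hi = len;  (* binary search: leftmost tail >= lst[[i]] *)
     While[lo <= hi,
      mid = Quotient[lo + hi, 2];
      If[lst[[tails[[mid]]]] < lst[[i]], lo = mid + 1, hi = mid - 1]];
     tails[[lo]] = i;
     prev[[i]] = If[lo > 1, tails[[lo - 1]], 0];
     If[lo > len, len = lo],
     {i, n}];
    res = Table[0., {len}];
    If[len > 0, pos = tails[[len]]];
    Do[res[[k]] = lst[[pos]]; pos = prev[[pos]], {k, len, 1, -1}];
    res]];

lisSketch[{3., 1., 4., 1., 5., 9., 2., 6.}]  (* -> {1., 4., 5., 6.} *)
```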
  • The problem is solved most efficiently in a procedural style, because, for example, an efficient algorithm for it is formulated procedurally and has no simple / efficient functional counterpart (note also that functional programming in Mathematica is peculiar in many respects, reflecting the fact that the functional layer is a thin one on top of the rule-based engine; so some algorithms that are efficient in other functional languages may be inefficient in Mathematica). A very clear sign of this is when you have to do array indexing in a loop.

  • The problem can be solved by joining several compilable built-in functions together, but there are (perhaps several) "joints" where you face a performance hit when using top-level code, because it stays general and cannot use specialized versions of these functions, among other reasons. In such cases, Compile makes the code more efficient by effectively type-specializing to numerical arguments and not using the main evaluator. One example that comes to mind is compiling Select with a custom (compilable) predicate, which can give a substantial performance boost (here is one example).

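A sketch of the Select case (my illustration; the linked example may differ):

```mathematica
(* top-level: stays general and goes through the main evaluator *)
selTop[l_] := Select[l, Sin[#] > 0.5 &];

(* compiled: type-specialized to packed real vectors *)
selCf = Compile[{{l, _Real, 1}}, Select[l, Sin[#] > 0.5 &]];

data = RandomReal[{0, 2 Pi}, 10^6];
AbsoluteTiming[selTop[data];]  (* typically several times slower than... *)
AbsoluteTiming[selCf[data];]   (* ...the compiled version on packed data *)
```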
I think this depends on the circumstances. Compilation to C is expensive, so it makes sense only for performance-critical code that will be used many times. There are also many cases where compilation to MVM gives similar performance while being much faster to compile. One such example can be found in this answer, where just-in-time compilation to the MVM target led to a major speed-up, while compilation to C would likely have defeated the purpose in that particular case.

There are in fact many cases when this is important, and not all of them are as obvious as the above example. One such case was considered in a recent answer to the question of extracting numbers from a sorted list belonging to some window. The solution is short, and I will reproduce it here:

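The code of that solution did not survive in this revision; as a placeholder, here is a sketch of mine (not the original answer's code), exploiting sortedness via two binary searches:

```mathematica
(* elements of a sorted list lying in the window [xmin, xmax] *)
windowSketch = Compile[{{lst, _Real, 1}, {xmin, _Real}, {xmax, _Real}},
   Module[{lo = 1, hi = Length[lst], mid = 0, start = 0, stop = 0},
    While[lo <= hi,  (* first position with lst[[p]] >= xmin *)
     mid = Quotient[lo + hi, 2];
     If[lst[[mid]] < xmin, lo = mid + 1, hi = mid - 1]];
    start = lo;
    lo = 1; hi = Length[lst];
    While[lo <= hi,  (* last position with lst[[p]] <= xmax *)
     mid = Quotient[lo + hi, 2];
     If[lst[[mid]] > xmax, hi = mid - 1, lo = mid + 1]];
    stop = hi;
    If[start > stop, Table[0., {0}], Take[lst, {start, stop}]]]];

windowSketch[{1., 2., 3., 5., 8., 13.}, 2.5, 9.]  (* -> {3., 5., 8.} *)
```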
  • Things to watch out for: see this discussion for some tips. Basically, you should avoid:

  • Callbacks to the main evaluator

  • Excessive copying of tensors (CopyTensor instruction)

  • Accidental unpacking happening in top-level functions that prepare input for Compile or process its output. This is not related to Compile proper, but it may happen that Compile does not help at all because the bottleneck is in the top-level code.

  • Type conversions: I would not worry about their performance hit, but sometimes wrong types may lead to run-time errors, or to unanticipated callbacks to MainEvaluate in the compiled code.

  • Certain functions (e.g. Sort with the default comparison function, but not only) don't benefit much, or at all, from compilation.

  • It is not clear how Compile handles Hold-attributes in compiled code, but there are indications that it does not fully preserve the standard semantics we are used to at the top level.

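Callbacks to the main evaluator, mentioned above, can be detected with the standard CompiledFunctionTools` package (the function f below is a made-up illustration):

```mathematica
Needs["CompiledFunctionTools`"]

f[x_] := x^2;  (* a top-level function unknown to the compiler *)
cf = Compile[{{x, _Real}}, f[x]];

CompilePrint[cf]  (* the printout contains a MainEvaluate instruction,
                     revealing a callback to the main evaluator *)
```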
  • How to see whether or not you can effectively use Compile for a given problem: my experience is that with Compile in Mathematica you have to be "proactive" (with all my dislike for the word, I know of nothing better here). What I mean is that, to use it effectively, you have to search the structure of your problem / program for places where you could transform (parts of) the data into a form that can be used in Compile. In most cases (at least in my experience), except obvious ones where you already have a procedural algorithm in pseudo-code, you have to reformulate the problem, actively asking: what should I do to be able to use Compile here?

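Accidental unpacking, also mentioned above, can be tracked with standard tools (g below is deliberately undefined, to force unpacking):

```mathematica
data = RandomReal[1, 1000];
Developer`PackedArrayQ[data]       (* True: data is packed *)

unpacked = Map[g, data];           (* g has no definition: unpacks *)
Developer`PackedArrayQ[unpacked]   (* False *)

On["Packing"]   (* from now on, a message is issued whenever a packed
                   array gets unpacked, which helps locate bottlenecks *)
```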