Over the last few weeks I’ve been writing a series on optimizing Elixir code for my Game of Life. Actually, so far I’ve only written about tools: the available profiling tools and the tools I’m building for Comparing Benchmark Results. It’s time to finally start applying these tools and optimizing the code!
Integrate bmark
The first thing I did was to replace the lifebench and lifebench.cmp tasks with my standalone bmark tool. lifebench and bmark are similar, but I’ve made a few improvements to the reporting in bmark that I want to incorporate. I added bmark to my mix.exs file. At the moment the dependency references a local path, but once I fix up a few remaining issues I’ll put it on github.
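As a sketch, the dependency entry in mix.exs looks something like this (the ../bmark path is an assumption about my local directory layout):

```elixir
# In mix.exs — hypothetical local path until bmark is published on github
defp deps do
  [{:bmark, path: "../bmark"}]
end
```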
I added a bmark based benchmark using the existing Profile module. For new code I would forgo the Profile module and write the benchmarking code directly in the _bmark.ex file. Here’s my bmark:
```elixir
defmodule LifeBench do
  use Bmark

  bmark :glider do
    Profile.run_test
  end
end
```

After writing this I uncovered the first of several bugs in bmark. It’s good to run it through a somewhat realistic case like my Game of Life. A GenServer call times out because the benchmark takes too long to run. After a quick fix I run into the second bug: I need to create a directory named results to store the results file. If the directory doesn’t exist the benchmark runs but the results are lost. Another quick fix and I’m on my way.
Baseline
Now that I have bmark ready I can start measuring. I want to measure in the production environment so I start by setting
```shell
export MIX_ENV=prod
```

in my shell. You can assume I’m running production from now on.
I collect baseline results by running mix bmark:

```
lifebench.glider.results
15050451
15282246
15095683
15049299
15292709
15018099
15009078
15266227
15146941
15111238
```

I want to save these in a base directory. Hmm, I need to manually copy results to base, and I’ll need to do this copy every time. For the moment I choose to do this manually, but I’ll want to automate it soon as it will be tedious and error prone.
Consolidating Protocols in Elixir
Several weeks ago Johan Wärlander suggested that I consolidate the protocols used in my Game of Life implementation in order to reduce time spent in functions like ‘Elixir.Code’:‘ensure_compiled?’. He points out that protocol consolidation should be part of preparing an Elixir application for a production release. Before Johan’s comment I hadn’t heard of protocol consolidation, so I read up a bit on it. José Valim, in a github issue, described the process in terms of generating fast dispatches (without lookups) when all protocol implementations are known. This works well in the case of packaging a release. Protocol consolidation can be invoked through mix as
```shell
$ mix compile.protocols
```

which writes out new binary files containing the consolidated protocols. But this requires that the application be run as
```shell
$ elixir -pa _build/MIX_ENV/consolidated -S mix run
```

to set the load path to include the consolidated protocol files. I consolidated protocols and reran the benchmark:
```
lifebench.glider.results:    lifebench.glider.results:
15050451                     14887006
15282246                     15028554
15095683                     14762181
15049299                     14905275
15292709                     14577768
15018099                     14883831
15009078                     14795362
15266227                     14897746
15146941                     14763634
15111238                     14904190

15132197.1 -> 14840554.7 (-1.93%) with p < 0.0005
t = 5.60108220481427, 18 degrees of freedom
```

Another problem with bmark: I use the basename of the results files for the header in the report. But in my setup with separate results directories the files have the same name, so I can’t tell the two columns apart. Another fix to bmark and I have:
```
base/lifebench.glider.results:    consolidate/lifebench.glider.results:
15050451                          14887006
15282246                          15028554
15095683                          14762181
15049299                          14905275
15292709                          14577768
15018099                          14883831
15009078                          14795362
15266227                          14897746
15146941                          14763634
15111238                          14904190

15132197.1 -> 14840554.7 (-1.93%) with p < 0.0005
t = 5.60108220481427, 18 degrees of freedom
```

Ok, so the report looks nice and this is a performance win! Protocol consolidation has lowered the run time by almost 2%.
Going forward I need to continue to run with protocol consolidation, but this isn’t a code change so it doesn’t “stick”. I need to make sure that I
- Remember to compile the protocols before each profile run
- Remember to run with the right command line to enable the use of the consolidated protocols.
This is a lot to remember; I’d better write myself a script to do it.
```shell
#!/bin/sh

if [ -z "$1" ]; then
    echo "Usage: bmark.sh <name>"
    exit 1
fi

export MIX_ENV=prod

mkdir -p results
mix compile
mix compile.protocols
elixir -pa _build/prod/consolidated/ -S mix bmark
mv results $1
```

I also decided to go ahead and save the results directory based on a command line argument. And I set the production environment in the script just in case I start a new shell and forget to set MIX_ENV. With my new script ready I can move ahead to the next optimization.
Count Live Neighbors
Next we profile with ExProf:
```
FUNCTION                                           CALLS      %      TIME  [uS / CALLS]
--------                                           -----    ---      ----  [----------]
'Elixir.Access':get/2                             419073   0.53     85973  [      0.21]
'Elixir.Enumerable.List':reduce/3                 657360   0.54     86879  [      0.13]
'Elixir.Access':'impl_for!'/1                     419073   0.57     91691  [      0.22]
'Elixir.Life':'-list_of_neighbors/3-fun-0-'/6     369765   0.60     96884  [      0.26]
code_server:call/2                                419077   1.59    256773  [      0.61]
maps:put/3                                         49302  12.09   1949883  [     39.55]
'Elixir.Access':impl_for/1                        419073  14.65   2363776  [      5.64]
maps:find/2                                       419067  63.10  10178202  [     24.29]
```

I see that the profile hasn’t really changed that much. There is probably less runtime compilation, but I still see time spent in maps:find/2. My first question is: what is maps:find/2? It comes from Erlang, but when is it used? As a guess, I look at the Elixir.Access implementation for Map and find the get function. The get function is what implements the [] operator on Map (or anything else that implements the Access protocol). I see that Elixir.Access.get for Map calls :maps.find(key, map) to do the Map lookup. So we know that we are spending our time looking up values in Maps.
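To convince myself, here is a quick illustration of that chain (a sketch of my understanding, not the library source): the bracket operator on a map dispatches through Access, which for maps bottoms out in Erlang’s :maps.find/2:

```elixir
board = %{{0, 0} => :alive, {1, 0} => :dead}

# All three of these resolve the same lookup:
a = board[{0, 0}]              # sugar that goes through Access.get
b = Access.get(board, {0, 0})  # the dispatch layer showing up as impl_for in the profile
c = :maps.find({0, 0}, board)  # the Erlang call dominating the profile

IO.inspect({a, b, c})          # {:alive, :alive, {:ok, :alive}}
```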
Also, I see 'Elixir.Life':'-list_of_neighbors/3-fun-0-'/6 in the profile, which I know is part of count_live_neighbors/3, and I know that this path does a lot of map lookups. I decided to try optimizing this code a little. In hindsight the optimizations I tried didn’t make sense, but I’ll show them here and show how the bmark.cmp process gives a more formal way of making the determination. Remember that code performance can often be hard to predict or reason about without hard data.
Looking at list_of_neighbors/3 I wonder how fast the for comprehension is. I could try rewriting the function using a module attribute instead of generating the dx, dy values with the comprehension.
I rewrote this:
```elixir
defp list_of_neighbors(x, y, board) do
  for dx <- [-1, 0, 1],
      dy <- [-1, 0, 1],
      {dx, dy} != {0, 0},
      do: board[{x + dx, y + dy}]
end
```

as
```elixir
@deltas [
  {-1, -1}, { 0, -1}, { 1, -1},
  {-1,  0},           { 1,  0},
  {-1,  1}, { 0,  1}, { 1,  1}
]

defp list_of_neighbors(x, y, board) do
  Enum.map(@deltas, fn ({dx, dy}) -> board[{x + dx, y + dy}] end)
end
```

and then measured the results:
```
consolidate/lifebench.glider.results:    precomputed/lifebench.glider.results:
14887006                                 14642711
15028554                                 14897523
14762181                                 14698632
14905275                                 14978539
14577768                                 14775842
14883831                                 14751767
14795362                                 14859778
14897746                                 14872981
14763634                                 14648105
14904190                                 14634486

14840554.7 -> 14776036.4 (-0.44%) with p < 1
t = 1.1845853684535732, 18 degrees of freedom
```

Sadly, this is not a win: the difference is too small to be statistically significant. I decided to take this one step further and try to reduce the work in count_live_neighbors/3 by combining its several pipeline steps into a single fold operation. I started with:
```elixir
defp count_live_neighbors(x, y, board) do
  list_of_neighbors(x, y, board)
  |> Enum.map(&state_as_int/1)
  |> Enum.sum
end

defp list_of_neighbors(x, y, board) do
  Enum.map(@deltas, fn ({dx, dy}) -> board[{x + dx, y + dy}] end)
end
```

and rewrote it as:
```elixir
defp count_live_neighbors(x, y, board) do
  List.foldl(@deltas, 0, fn ({dx, dy}, acc) ->
    acc + (board[{x + dx, y + dy}] |> state_as_int)
  end)
end
```

I measured again:
```
consolidate/lifebench.glider.results:    foldl/lifebench.glider.results:
14887006                                 14742553
15028554                                 14879462
14762181                                 14712122
14905275                                 14925838
14577768                                 14772430
14883831                                 14604655
14795362                                 14682689
14897746                                 14799597
14763634                                 14634907
14904190                                 14979943

14840554.7 -> 14773419.6 (-0.45%) with p < 1
t = 1.220154927852377, 18 degrees of freedom
```

But still no win. Was it right to push on through the first failed optimization this way? I think it was; sometimes it can lead to better ideas or a code structure that lends itself to further optimization. But in this case it just didn’t pay off. I discarded these changes.
As I said, in hindsight the optimizations I tried didn’t make sense. This is because I didn’t change the number of calls to maps:find/2, which clearly dominates the profile. Instead, I was only changing the structure of the code around those calls. But I find that kind of insight hard to see at the outset.
Concurrent Computation
Elixir is all about concurrency, so I had been itching to try out a parallel version. I was pretty confident that a concurrent version would see a benefit from parallel execution on my 4 core system, but I had wanted to save this optimization until the end. I didn’t envision changing too much in converting to a concurrent version, and I wanted to make sure I was starting with the best version of the base code.
My version of the Game of Life has a Board module, and that module has a map function which I use to apply updates to each cell in the game. Building a concurrent version of the Game of Life was a simple matter of changing Board.map into a parallel map. Here’s the original version:
```elixir
def map(board, f) do
  board
  |> Map.keys
  |> Enum.map(fn (key) -> {key, f.(key, board[key])} end)
  |> List.foldr(Map.new, fn ({key, value}, acc) -> Map.put(acc, key, value) end)
end
```

The board itself is stored as a map, and I use the existing Enum.map by creating a List of all keys in the map, mapping over that list, and then consolidating the results back into a new Map using List.foldr. I wrote up a naive concurrent version, starting a process for each invocation of the function f:
```elixir
def map(board, f) do
  board
  |> Map.keys
  |> pmap(fn (key) -> {key, f.(key, board[key])} end)
  |> List.foldr(Map.new, fn ({key, value}, acc) -> Map.put(acc, key, value) end)
end

def pmap(list, f) do
  list
  |> Enum.map(fn (elem) -> Task.async(fn -> f.(elem) end) end)
  |> Enum.map(fn (task) -> Task.await(task) end)
end
```

Here I’ve replaced the use of Enum.map with my own pmap function that uses Task.async to run each function invocation in a separate process.
```
consolidate/lifebench.glider.results:    naive-pmap/lifebench.glider.results:
14887006                                 26889439
15028554                                 27339230
14762181                                 27067765
14905275                                 27026322
14577768                                 26777941
14883831                                 27135888
14795362                                 26867260
14897746                                 27028423
14763634                                 27354742
14904190                                 26881294

14840554.7 -> 27036830.4 (+82.18%) with p < 0.0005
t = 167.14386999331413, 18 degrees of freedom
```

This is a huge loss! Process creation in Erlang is supposed to be cheap, but based on the results I think that the balance of work between the function f and the process creation in Task.async is off. That f, which here is Life.apply_rules/3, does very little work compared to Task.async. While Task.async may be cheap in absolute terms, it is relatively expensive here, and the cost outweighs the benefit.
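To get a feel for the imbalance, here is a rough standalone sketch (not the post’s benchmark) comparing trivial per-element work done inline versus wrapped in one Task per element. The exact numbers will vary by machine, but the per-task overhead dwarfs the work:

```elixir
list = Enum.to_list(1..10_000)

# Trivial work done inline in a single process
{inline_us, _} = :timer.tc(fn ->
  Enum.map(list, fn x -> x + 1 end)
end)

# The same work, but with one Task spawned and awaited per element
{tasks_us, _} = :timer.tc(fn ->
  list
  |> Enum.map(fn x -> Task.async(fn -> x + 1 end) end)
  |> Enum.map(&Task.await/1)
end)

IO.puts("inline: #{inline_us}us, one task per element: #{tasks_us}us")
```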
But still, a concurrent version should be able to win. I try restructuring a bit and do the pmap in chunks of 8:
```elixir
def pmap(list, f) do
  list
  |> Enum.chunk(8, 8, [])
  |> Enum.flat_map(fn (chunk) ->
    chunk
    |> Enum.map(fn (elem) -> Task.async(fn -> f.(elem) end) end)
    |> Enum.map(fn (task) -> Task.await(task) end)
  end)
end
```

```
consolidate/lifebench.glider.results:    chunk-pmap/lifebench.glider.results:
14887006                                 24274268
15028554                                 24563751
14762181                                 24492221
14905275                                 24516553
14577768                                 24335224
14883831                                 24158102
14795362                                 24357174
14897746                                 24213098
14763634                                 24466586
14904190                                 24289248

14840554.7 -> 24366622.5 (+64.19%) with p < 0.0005
t = 163.96384320448385, 18 degrees of freedom
```

This is slightly less of a huge loss than the previous version, which I think means I still don’t have the balance right. So I increase the chunk size to 128:
```
consolidate/lifebench.glider.results:    chunk-128-pmap/lifebench.glider.results:
14887006                                 6426990
15028554                                 6416149
14762181                                 6507946
14905275                                 6453309
14577768                                 6491314
14883831                                 6405073
14795362                                 6504260
14897746                                 6449789
14763634                                 6532929
14904190                                 6509800

14840554.7 -> 6469755.9 (-56.41%) with p < 0.0005
t = 203.48398714422547, 18 degrees of freedom
```

And finally we have our win!
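Since the right chunk size clearly depends on the workload and the core count, it may be worth making it a parameter. Here’s a sketch; the default of 128 is just the value that won above, ChunkedPmap is a hypothetical module name, and I’m using Enum.chunk_every/4, the current name for Enum.chunk/4:

```elixir
defmodule ChunkedPmap do
  # Parallel map that processes the list in chunks, bounding the number of
  # in-flight tasks to chunk_size and amortizing coordination cost per chunk.
  def pmap(list, f, chunk_size \\ 128) do
    list
    |> Enum.chunk_every(chunk_size, chunk_size, [])
    |> Enum.flat_map(fn chunk ->
      chunk
      |> Enum.map(fn elem -> Task.async(fn -> f.(elem) end) end)
      |> Enum.map(&Task.await/1)
    end)
  end
end

ChunkedPmap.pmap([1, 2, 3, 4, 5], fn x -> x * x end, 2)
# => [1, 4, 9, 16, 25]
```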
Next steps
First of all, there are a few more things I want to clean up in bmark. I’m thinking that bmark should handle the protocol consolidation for the user. It doesn’t make sense to have to do this manually, though perhaps this could be controlled by a configuration option. There are also some other, larger ideas that I will discuss in a future post.
Next week, I plan to continue with the profiling and optimization process.