Use single memcpy for row-contiguous tensors in to_blob by dannote · Pull Request #97 · elixir-nx/emlx

dannote · 2026-02-22T16:27:18Z

to_blob previously used a ContiguousIterator to copy data element by element, even when the MLX array was row-contiguous in memory. For a 512×1536 f32 tensor (typical embedding batch), that meant 786,432 individual memcpy calls with iterator stepping overhead.

Now checks t->flags().row_contiguous and uses a single memcpy when possible, falling back to the element-by-element path only for non-contiguous arrays (e.g. transposed views).
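The branch described above can be sketched without MLX. This is a minimal illustration, not the code from this PR: `View2D`, `is_row_contiguous`, and `copy_to_blob` are hypothetical names, and the strided 2-D view of `float` data stands in for an MLX array whose `t->flags().row_contiguous` flag would be consulted instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an array view (NOT the MLX API).
// Strides are expressed in elements, not bytes.
struct View2D {
    const float* data;
    std::size_t rows;
    std::size_t cols;
    std::ptrdiff_t row_stride;
    std::ptrdiff_t col_stride;
};

// Conservative check: true only for a gap-free row-major layout.
// A false negative just routes the copy through the slower path.
bool is_row_contiguous(const View2D& v) {
    return v.col_stride == 1 &&
           v.row_stride == static_cast<std::ptrdiff_t>(v.cols);
}

// Copies up to `limit` elements, in logical row-major order, into a new buffer.
std::vector<float> copy_to_blob(const View2D& v, std::size_t limit) {
    assert(v.data != nullptr);
    const std::size_t n = std::min(limit, v.rows * v.cols);
    std::vector<float> out(n);
    if (n == 0) {
        return out;  // nothing to copy; avoids memcpy on an empty buffer
    }

    if (is_row_contiguous(v)) {
        // Fast path: logical order equals memory order, so one memcpy suffices.
        std::memcpy(out.data(), v.data, n * sizeof(float));
        return out;
    }

    // Fallback: walk logical indices and gather through the strides,
    // as needed for e.g. a transposed view.
    for (std::size_t i = 0; i < n; ++i) {
        const auto r = static_cast<std::ptrdiff_t>(i / v.cols);
        const auto c = static_cast<std::ptrdiff_t>(i % v.cols);
        out[i] = v.data[r * v.row_stride + c * v.col_stride];
    }
    return out;
}
```

For a 2×3 buffer `{1, 2, 3, 4, 5, 6}`, the view `{base, 2, 3, 3, 1}` takes the fast path, while the transposed view `{base, 3, 2, 1, 3}` takes the fallback and yields the elements in transposed order. The `limit` parameter mirrors the "limited reads" case mentioned in the tests.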

Benchmarks (Apple M5, 24 GB, GPU tensors, post-eval)

Shape	Before	After	Speedup
1×1536	18 µs	12 µs	1.5×
64×1536	219 µs	12 µs	18×
512×1536	1,680 µs	105 µs	16×
1024×1024	2,234 µs	118 µs	19×

Benchmark script included in bench/to_binary_bench.exs.

All 2115 existing tests pass, plus 4 new tests covering contiguous, non-contiguous, large tensor, and limited reads.

polvalente

Awesome change!

polvalente · 2026-02-22T16:28:28Z

bench/to_binary_bench.exs

@@ -0,0 +1,38 @@
+# Benchmark: Nx.to_binary for EMLX tensors


Let's remove this file

polvalente · 2026-02-22T16:29:54Z

test/emlx/to_binary_test.exs

+ assert byte_size(bin) == 64 * 1536 * 4
+
+ # Verify first and last elements
+ <<first::float-32-native, _::binary>> = bin


You could verify the whole binary by defining the same tensor in Nx.BinaryBackend

polvalente approved these changes Feb 22, 2026

Remove bench file and verify binaries against BinaryBackend

af1cce8

polvalente approved these changes Feb 22, 2026

polvalente merged commit 666ca5d into elixir-nx:main Feb 22, 2026
6 checks passed

dannote deleted the fix-to-blob-contiguous-memcpy branch February 22, 2026 17:29

polvalente merged 2 commits into elixir-nx:main from dannote:fix-to-blob-contiguous-memcpy
