Use single memcpy for row-contiguous tensors in to_blob #97

Merged
polvalente merged 2 commits into elixir-nx:main from dannote:fix-to-blob-contiguous-memcpy
Feb 22, 2026

@dannote dannote commented Feb 22, 2026

to_blob previously used a ContiguousIterator to copy data element by element, even when the MLX array was row-contiguous in memory. For a 512×1536 f32 tensor (typical embedding batch), that meant 786,432 individual memcpy calls with iterator stepping overhead.

to_blob now checks t->flags().row_contiguous and uses a single memcpy when the flag is set, falling back to the element-by-element path only for non-contiguous arrays (e.g. transposed views).
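For readers who want to see the shape of the two paths, here is a minimal, self-contained sketch of the idea. It is not the EMLX code: `View`, `is_row_contiguous`, and `copy_to_blob` are hypothetical names, the element type is fixed to `float`, and contiguity is recomputed from strides here, whereas the PR reads MLX's `row_contiguous` flag.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an array view; NOT the MLX or EMLX type.
struct View {
    const float* data;
    std::vector<std::size_t> shape;
    std::vector<std::size_t> strides;  // in elements, not bytes
};

std::size_t element_count(const View& v) {
    std::size_t n = 1;
    for (std::size_t d : v.shape) n *= d;
    return n;
}

// Row-contiguous means the innermost stride is 1 and every outer stride
// equals the product of the inner dimensions (size-1 dims are ignored).
// Assumption: the PR relies on MLX's precomputed flag instead of this check.
bool is_row_contiguous(const View& v) {
    std::size_t expected = 1;
    for (std::size_t i = v.shape.size(); i-- > 0;) {
        if (v.shape[i] != 1 && v.strides[i] != expected) return false;
        expected *= v.shape[i];
    }
    return true;
}

// Copies up to `limit` elements, in logical row-major order, into a new buffer.
std::vector<float> copy_to_blob(const View& v, std::size_t limit) {
    assert(v.shape.size() == v.strides.size());
    const std::size_t n = std::min(limit, element_count(v));
    std::vector<float> out(n);
    if (n == 0) return out;

    if (is_row_contiguous(v)) {
        // Fast path: one memcpy for the whole (possibly limited) range.
        std::memcpy(out.data(), v.data, n * sizeof(float));
        return out;
    }

    // Fallback: walk the logical indices like an odometer and gather
    // one element at a time through the strides.
    std::vector<std::size_t> idx(v.shape.size(), 0);
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t offset = 0;
        for (std::size_t i = 0; i < idx.size(); ++i) offset += idx[i] * v.strides[i];
        out[k] = v.data[offset];
        for (std::size_t i = idx.size(); i-- > 0;) {
            if (++idx[i] < v.shape[i]) break;
            idx[i] = 0;
        }
    }
    return out;
}
```

In this sketch a 2×3 buffer viewed with strides `{3, 1}` takes the memcpy path, while the same buffer viewed as 3×2 with strides `{1, 3}` (a transposed view) takes the fallback. The `limit` argument mirrors the limited-read case: on the fast path it only shortens the byte count passed to memcpy.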

Benchmarks (Apple M5, 24 GB, GPU tensors, post-eval)

| Shape     | Before   | After  | Speedup |
|-----------|----------|--------|---------|
| 1×1536    | 18 µs    | 12 µs  | 1.5×    |
| 64×1536   | 219 µs   | 12 µs  | 18×     |
| 512×1536  | 1,680 µs | 105 µs | 16×     |
| 1024×1024 | 2,234 µs | 118 µs | 19×     |

Benchmark script included in bench/to_binary_bench.exs.

All 2115 existing tests pass, plus 4 new tests covering contiguous, non-contiguous, large tensor, and limited reads.

@polvalente polvalente left a comment

Awesome change!

@@ -0,0 +1,38 @@
# Benchmark: Nx.to_binary for EMLX tensors

Let's remove this file

assert byte_size(bin) == 64 * 1536 * 4

# Verify first and last elements
<<first::float-32-native, _::binary>> = bin

You could verify the whole binary by defining the same tensor in Nx.BinaryBackend

@polvalente polvalente merged commit 666ca5d into elixir-nx:main Feb 22, 2026
6 checks passed
@dannote dannote deleted the fix-to-blob-contiguous-memcpy branch February 22, 2026 17:29