Use single memcpy for row-contiguous tensors in to_blob #97
Merged
polvalente merged 2 commits into elixir-nx:main on Feb 22, 2026
`to_blob` previously used a `ContiguousIterator` to copy data element by element, even when the MLX array was row-contiguous in memory. For a 512×1536 f32 tensor (a typical embedding batch), that meant 786,432 individual `memcpy` calls, each with iterator stepping overhead.

This change checks `t->flags().row_contiguous` and uses a single `memcpy` when possible, falling back to the element-by-element path only for non-contiguous arrays (e.g. transposed views).

Benchmarks on Apple M5, 24 GB (GPU tensors, post-eval):

| Shape     | Before   | After  | Speedup |
|-----------|----------|--------|---------|
| 1×1536    | 18 µs    | 12 µs  | 1.5×    |
| 64×1536   | 219 µs   | 12 µs  | 18×     |
| 512×1536  | 1,680 µs | 105 µs | 16×     |
| 1024×1024 | 2,234 µs | 118 µs | 19×     |
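The two copy paths described above can be sketched roughly as follows. This is a minimal illustration, not the EMLX or MLX code: `StridedView`, `element_count`, and `to_blob_sketch` are hypothetical names, and the `row_contiguous` field only stands in for the `t->flags().row_contiguous` check mentioned in the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an array view; not the MLX or EMLX type.
struct StridedView {
    const float* data;                 // base pointer of the underlying buffer
    std::vector<std::size_t> shape;    // extent of each dimension
    std::vector<std::size_t> strides;  // step per dimension, in elements
    bool row_contiguous;               // plays the role of flags().row_contiguous
};

// Number of logical elements in the view.
std::size_t element_count(const StridedView& v) {
    std::size_t n = 1;
    for (std::size_t d : v.shape) n *= d;
    return n;
}

// Copy the view into a dense row-major buffer.
std::vector<float> to_blob_sketch(const StridedView& v) {
    assert(v.shape.size() == v.strides.size());
    const std::size_t n = element_count(v);
    std::vector<float> out(n);
    if (n == 0) return out;

    if (v.row_contiguous) {
        // Fast path: the memory layout already matches the output layout,
        // so one bulk copy moves every element.
        std::memcpy(out.data(), v.data, n * sizeof(float));
        return out;
    }

    // Slow path: visit elements in row-major order and translate each
    // multi-dimensional index into a buffer offset through the strides.
    std::vector<std::size_t> index(v.shape.size(), 0);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t offset = 0;
        for (std::size_t d = 0; d < index.size(); ++d) {
            offset += index[d] * v.strides[d];
        }
        std::memcpy(&out[i], v.data + offset, sizeof(float));

        // Advance the index odometer-style, last dimension fastest.
        for (std::size_t d = index.size(); d-- > 0;) {
            if (++index[d] < v.shape[d]) break;
            index[d] = 0;
        }
    }
    return out;
}
```

As a usage example, a 2×3 buffer holding 0..5 copied through a dense view with strides {3, 1} comes out unchanged via the fast path, while the same buffer viewed as a transposed 3×2 array with strides {1, 3} goes through the slow path and yields 0, 3, 1, 4, 2, 5. The slow path issues one small copy per element, which is the per-element cost the single `memcpy` avoids.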
polvalente approved these changes Feb 22, 2026
bench/to_binary_bench.exs Outdated
    @@ -0,0 +1,38 @@
    # Benchmark: Nx.to_binary for EMLX tensors
Collaborator
Let's remove this file
polvalente reviewed Feb 22, 2026
test/emlx/to_binary_test.exs Outdated
    assert byte_size(bin) == 64 * 1536 * 4

    # Verify first and last elements
    <<first::float-32-native, _::binary>> = bin
Collaborator
You could verify the whole binary by defining the same tensor in `Nx.BinaryBackend`.
polvalente approved these changes Feb 22, 2026
Benchmark script included in `bench/to_binary_bench.exs`.

All 2115 existing tests pass, plus 4 new tests covering contiguous, non-contiguous, large tensor, and limited reads.