
use rust-native char instead of custom representation for string data#7021

Draft
benjamin-stacks wants to merge 14 commits into stacks-network:develop from benjamin-stacks:chore/utf8data-char-repr

Conversation

@benjamin-stacks
Contributor

This builds on top of Jacinta's excellent work in #6948 (and sits on top of that PR's branch), but it changes the representation of characters in UTF8Data::data to be native chars instead of four-byte arrays with pre-encoded UTF-8.

The memory footprint is exactly the same; both the now-removed Utf8Char and the built-in UTF-32 char are four bytes wide.

The advantages are:

  • It requires less custom code for things that are essentially part of the Rust standard library
  • It requires less defensive checking -- a Utf8Char could in theory contain invalid data, which required extra checks, while a Rust char is guaranteed to be valid Unicode
  • It's easier to read and understand
  • Postponing UTF-8 encoding until it's actually needed might give a slight performance improvement, although that'll likely be negligible in practice, especially in light of the major perf improvements from Jacinta's original PR (re-running the benchmarks was inconclusive)
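The size and validity claims above can be checked directly against the standard library. This is an illustrative sketch (Utf8CharRepr is a hypothetical stand-in for the removed type, not the PR's actual code):

```rust
use std::mem::size_of;

// Hypothetical stand-in for the removed Utf8Char: four pre-encoded UTF-8 bytes.
type Utf8CharRepr = [u8; 4];

fn main() {
    // Both representations occupy four bytes, so Vec<char> and
    // Vec<Utf8CharRepr> have the same memory footprint.
    assert_eq!(size_of::<char>(), 4);
    assert_eq!(size_of::<Utf8CharRepr>(), 4);

    // A char is guaranteed to be a valid Unicode scalar value;
    // conversion from an arbitrary u32 is fallible, so invalid data
    // is rejected at the boundary instead of re-checked on every use.
    assert!(char::from_u32(0x1F600).is_some()); // valid scalar value
    assert!(char::from_u32(0xD800).is_none());  // surrogate: not a char
}
```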

Diff between this PR and Jacinta's

jacinta-stacks and others added 14 commits March 2, 2026 14:05
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Contributor

@federico-stacks left a comment


Here are my thoughts on this proposal:

Using Vec<char> is definitely more ergonomic and simplifies construction and parsing. However, it also introduces additional complexity and allocations in all paths that require UTF-8 bytes (e.g., serialization, serde, hashing).

For example:

  • Serialization: both to_utf8_bytes() and to_utf8_string() now require allocating a String and then converting it into a Vec<u8>. This is also likely to be a hot path.
  • Serde: encoding requires first converting all chars into a String (allocating), then re-parsing it with char_indices to determine UTF-8 boundaries, and finally slicing based on those boundaries. This adds extra work compared to the original approach.
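For concreteness, a minimal sketch of the serde path this comment describes (the function name is illustrative, not the PR's actual code): chars are collected into a String, then char_indices re-walks it to find UTF-8 boundaries for per-character slicing.

```rust
// Illustrative sketch of the described serde encoding path.
fn encode_per_char(data: &[char]) -> Vec<Vec<u8>> {
    let s: String = data.iter().collect(); // extra allocation
    let mut out = Vec::with_capacity(data.len());
    // Re-parse with char_indices to find UTF-8 boundaries, then slice.
    let mut iter = s.char_indices().peekable();
    while let Some((start, _)) = iter.next() {
        let end = iter.peek().map_or(s.len(), |&(i, _)| i);
        out.push(s[start..end].as_bytes().to_vec());
    }
    out
}

fn main() {
    // 'é' encodes to two UTF-8 bytes, 'a' to one.
    assert_eq!(
        encode_per_char(&['a', 'é']),
        vec![vec![0x61], vec![0xC3, 0xA9]]
    );
}
```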

That said, I’m open to discussing this further, especially in light of your note that you didn’t observe any benchmark impact.

@benjamin-stacks
Copy link
Copy Markdown
Contributor Author

benjamin-stacks commented Mar 26, 2026

  • Serialization: both to_utf8_bytes() and to_utf8_string() now require allocating a String and then converting it into a Vec<u8>. This is also likely to be a hot path.

While that's true, the original approach allocated a vector and filled it iteratively via try_fold. Because the size hints from c.as_bytes()? would not have been helpful, this would also require growing the vector a couple of times. The new char_vector.iter().collect() approach allows for much better size hinting (and more generally, I would expect the implementation in the Rust standard library to be as efficient as it gets, but that's just a gut feel). It also removes the need for Result<_, _> and error handling, because unlike the old Utf8Char, a char is guaranteed to play nice.
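A sketch of the contrast being argued here (the function names are hypothetical, not the PR's code):

```rust
// Old style: encode each char into a scratch buffer and extend a Vec.
// Per-char byte lengths vary, so the Vec may regrow several times.
fn encode_iteratively(data: &[char]) -> Vec<u8> {
    data.iter().fold(Vec::new(), |mut acc, c| {
        let mut buf = [0u8; 4];
        acc.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        acc
    })
}

// New style: collecting into a String can use the iterator's size hint
// to pre-reserve, and is infallible because every char is valid Unicode.
fn encode_via_collect(data: &[char]) -> Vec<u8> {
    data.iter().collect::<String>().into_bytes()
}

fn main() {
    let data: Vec<char> = "héllo ✓".chars().collect();
    assert_eq!(encode_iteratively(&data), encode_via_collect(&data));
    assert_eq!(encode_via_collect(&data), "héllo ✓".as_bytes());
}
```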

  • Serde: encoding requires first converting all chars into a String (allocating), then re-parsing it with char_indices to determine UTF-8 boundaries, and finally slicing based on those boundaries. This adds extra work compared to the original approach.

The parsing and slicing happened in the old approach as well, via byte_len and as_bytes, except that it was a home-grown implementation instead of a (presumably well-optimized) standard library method. The conversion from chars happened in the original approach as well, except that it happened unconditionally during creation, not on-demand (and I assumed that JSON-serialization isn't going to be such a hot path -- if it were, I would strongly suggest we reconsider the slightly strange serialization as an array-of-arrays).
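To illustrate the point about standard-library boundary handling (a minimal sketch, not the PR's code): the byte-length bookkeeping the old Utf8Char did by hand is what char::len_utf8 already provides, and slicing a ready-made String walks one contiguous buffer.

```rust
fn main() {
    let s = "a✓b"; // '✓' occupies three UTF-8 bytes
    let mut pos = 0;
    for c in s.chars() {
        let len = c.len_utf8();          // standard-library byte length
        let slice = &s[pos..pos + len];  // contiguous, boundary-safe slice
        assert_eq!(slice.chars().next(), Some(c));
        pos += len;
    }
    assert_eq!(pos, s.len());
}
```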

And finally, the benefit of slicing a ready-made string is that the individual slices represent one contiguous area of memory, which I would expect to be more cache-friendly on the CPU, especially because I think it's fair to assume that most characters will be in the ASCII range, and thus the traversed memory is about a quarter of the size compared to the old approach.
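The "quarter of the size" figure for ASCII-heavy data can be sanity-checked directly (illustrative only):

```rust
fn main() {
    let text = "mostly ascii text".repeat(10);
    let chars: Vec<char> = text.chars().collect();

    // All-ASCII input: one UTF-8 byte per char in the String...
    assert_eq!(text.len(), chars.len());

    // ...versus four bytes per element in the Vec<char>, so the
    // String's buffer is a quarter the size of the Vec's.
    let vec_bytes = chars.len() * std::mem::size_of::<char>();
    assert_eq!(vec_bytes, 4 * text.len());
}
```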

All in all, it's absolutely possible that there are trade-offs here, but it's not at all obvious to me that the new approach would be worse.

