
use rust-native char instead of custom representation for string data#7021

Draft
benjamin-stacks wants to merge 14 commits into stacks-network:develop from benjamin-stacks:chore/utf8data-char-repr

Conversation

@benjamin-stacks
Contributor

This builds on top of Jacinta's excellent work in #6948 (and sits on top of that PR's branch), but it changes the representation of characters in UTF8Data::data to be native chars instead of four-byte arrays with pre-encoded UTF-8.

The memory footprint is exactly the same; both the now-removed Utf8Char and the built-in UTF-32 char are four bytes wide.

The advantages are:

  • It requires less custom code for things that are essentially part of the Rust standard library
  • It requires less defensive checking -- a Utf8Char could in theory contain invalid data, which required extra checks, while a Rust char is guaranteed to be valid Unicode
  • It's easier to read and understand
  • Postponing UTF-8 encoding until it's actually needed might give a slight performance improvement, although that'll likely be negligible in practice, especially in light of the major perf improvements from Jacinta's original PR (re-running the benchmarks was inconclusive)
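The size and validity claims above can be checked directly against the standard library. This is an illustrative sketch (Utf8CharRepr is a hypothetical stand-in for the removed type, not the PR's actual code):

```rust
use std::mem::size_of;

// Hypothetical stand-in for the removed Utf8Char: four pre-encoded UTF-8 bytes.
type Utf8CharRepr = [u8; 4];

fn main() {
    // Both representations occupy four bytes, so Vec<char> and
    // Vec<Utf8CharRepr> have the same memory footprint.
    assert_eq!(size_of::<char>(), 4);
    assert_eq!(size_of::<Utf8CharRepr>(), 4);

    // A char is guaranteed to be a valid Unicode scalar value;
    // conversion from an arbitrary u32 is fallible, so invalid data
    // is rejected at the boundary instead of re-checked on every use.
    assert!(char::from_u32(0x1F600).is_some()); // valid scalar value
    assert!(char::from_u32(0xD800).is_none());  // surrogate: not a char
}
```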

Diff between this PR and Jacinta's

jacinta-stacks and others added 14 commits March 2, 2026 14:05
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Contributor

@federico-stacks left a comment


Here are my thoughts on this proposal:

Using Vec<char> is definitely more ergonomic and simplifies construction and parsing. However, it also introduces additional complexity and allocations in all paths that require UTF-8 bytes (e.g., serialization, serde, hashing).

For example:

  • Serialization: both to_utf8_bytes() and to_utf8_string() now require allocating a String and then converting it into a Vec<u8>. This is also likely to be a hot path.
  • Serde: encoding requires first converting all chars into a String (allocating), then re-parsing it with char_indices to determine UTF-8 boundaries, and finally slicing based on those boundaries. This adds extra work compared to the original approach.
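For concreteness, a minimal sketch of the serde path this comment describes (the function name is illustrative, not the PR's actual code): chars are collected into a String, then char_indices re-walks it to find UTF-8 boundaries for per-character slicing.

```rust
// Illustrative sketch of the described serde encoding path.
fn encode_per_char(data: &[char]) -> Vec<Vec<u8>> {
    let s: String = data.iter().collect(); // extra allocation
    let mut out = Vec::with_capacity(data.len());
    // Re-parse with char_indices to find UTF-8 boundaries, then slice.
    let mut iter = s.char_indices().peekable();
    while let Some((start, _)) = iter.next() {
        let end = iter.peek().map_or(s.len(), |&(i, _)| i);
        out.push(s[start..end].as_bytes().to_vec());
    }
    out
}

fn main() {
    // 'é' encodes to two UTF-8 bytes, 'a' to one.
    assert_eq!(
        encode_per_char(&['a', 'é']),
        vec![vec![0x61], vec![0xC3, 0xA9]]
    );
}
```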

That said, I’m open to discussing this further, especially in light of your note that you didn’t observe any benchmark impact.

@benjamin-stacks
Copy link
Copy Markdown
Contributor Author

benjamin-stacks commented Mar 26, 2026

  • Serialization: both to_utf8_bytes() and to_utf8_string() now require allocating a String and then converting it into a Vec<u8>. This is also likely to be a hot path.

While that's true, the original approach allocated a vector and filled it iteratively via try_fold. Because the size hints from c.as_bytes()? would not have been helpful, this would also require growing the vector a couple of times. The new char_vector.iter().collect() approach allows for much better size hinting (and more generally, I would expect the implementation in the Rust standard library to be as efficient as it gets, but that's just a gut feel). It also removes the need for Result<_, _> and error handling, because unlike the old Utf8Char, a char is guaranteed to play nice.
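A sketch of the contrast being argued here (the function names are hypothetical, not the PR's code):

```rust
// Old style: encode each char into a scratch buffer and extend a Vec.
// Per-char byte lengths vary, so the Vec may regrow several times.
fn encode_iteratively(data: &[char]) -> Vec<u8> {
    data.iter().fold(Vec::new(), |mut acc, c| {
        let mut buf = [0u8; 4];
        acc.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        acc
    })
}

// New style: collecting into a String can use the iterator's size hint
// to pre-reserve, and is infallible because every char is valid Unicode.
fn encode_via_collect(data: &[char]) -> Vec<u8> {
    data.iter().collect::<String>().into_bytes()
}

fn main() {
    let data: Vec<char> = "héllo ✓".chars().collect();
    assert_eq!(encode_iteratively(&data), encode_via_collect(&data));
    assert_eq!(encode_via_collect(&data), "héllo ✓".as_bytes());
}
```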

  • Serde: encoding requires first converting all chars into a String (allocating), then re-parsing it with char_indices to determine UTF-8 boundaries, and finally slicing based on those boundaries. This adds extra work compared to the original approach.

The parsing and slicing happened in the old approach as well, via byte_len and as_bytes, except that it was a home-grown implementation instead of a (presumably well-optimized) standard library method. The conversion from chars happened in the original approach as well, except that it happened unconditionally during creation, not on-demand (and I assumed that JSON-serialization isn't going to be such a hot path -- if it were, I would strongly suggest we reconsider the slightly strange serialization as an array-of-arrays).
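To illustrate the point about standard-library boundary handling (a minimal sketch, not the PR's code): the byte-length bookkeeping the old Utf8Char did by hand is what char::len_utf8 already provides, and slicing a ready-made String walks one contiguous buffer.

```rust
fn main() {
    let s = "a✓b"; // '✓' occupies three UTF-8 bytes
    let mut pos = 0;
    for c in s.chars() {
        let len = c.len_utf8();          // standard-library byte length
        let slice = &s[pos..pos + len];  // contiguous, boundary-safe slice
        assert_eq!(slice.chars().next(), Some(c));
        pos += len;
    }
    assert_eq!(pos, s.len());
}
```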

And finally, the benefit of slicing a ready-made string is that the individual slices represent one contiguous area of memory, which I would expect to be more cache-friendly on the CPU, especially because I think it's fair to assume that most characters will be in the ASCII range, and thus the traversed memory is about a quarter of the size compared to the old approach.
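The "quarter of the size" figure for ASCII-heavy data can be sanity-checked directly (illustrative only):

```rust
fn main() {
    let text = "mostly ascii text".repeat(10);
    let chars: Vec<char> = text.chars().collect();

    // All-ASCII input: one UTF-8 byte per char in the String...
    assert_eq!(text.len(), chars.len());

    // ...versus four bytes per element in the Vec<char>, so the
    // String's buffer is a quarter the size of the Vec's.
    let vec_bytes = chars.len() * std::mem::size_of::<char>();
    assert_eq!(vec_bytes, 4 * text.len());
}
```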

All in all, it's absolutely possible that there are trade-offs here, but it's not at all obvious to me that the new approach would be worse.

