use rust-native char instead of custom representation for string data#7021
use rust-native char instead of custom representation for string data#7021benjamin-stacks wants to merge 14 commits intostacks-network:developfrom
char instead of custom representation for string data#7021Conversation
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
… into chore/utf8data-fixed-array-repr
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
… into chore/utf8data-fixed-array-repr
… into chore/utf8data-fixed-array-repr
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
This builds on top of Jacinta's excellent work in stacks-network#6948 (and sits on top of that PR's branch), but it changes the representation of characters in `UTF8Data::data` to be native `char`s instead four-byte arrays with pre-encoded UTF8. The memory footprint is exactly the same; both the now-removed `Utf8Char` and the built-in UTF-32 `char` have four bytes. The advantages are: - It requires less custom code for things that are essentially part of the Rust standard library - It requires less defensive checking -- a `Utf8Char` could in theory contain invalid data, which required extra checks, while a Rust `char` is guaranteed to be valid unicode - It's easier to read and understand - Postponing UTF-8 encoding until it's actually needed might give a slight performance improvement, although that'll likely be negligible in practice, especially in light of the major perf improvements from Jacinta's original PR (re-running the benchmarks was inconclusive)
federico-stacks left a comment
There was a problem hiding this comment.
Here are my thoughts on this proposal:
Using Vec<char> is definitely more ergonomic and simplifies construction and parsing. However, it also introduces additional complexity and allocations in all paths that require UTF-8 bytes (e.g., serialization, serde, hashing).
For example:
- Serialization: both
to_utf8_bytes()andto_utf8_string()now require allocating aStringand then converting it into aVec<u8>. This is also likely to be a hot path. - Serde: encoding requires first converting all chars into a
String(allocating), then re-parsing it withchar_indicesto determine UTF-8 boundaries, and finally slicing based on those boundaries. This adds extra work compared to the original approach.
That said, I’m open to discussing this further, especially in light of your note that you didn’t observe any benchmark impact.
While that's true, the original approach allocated a vector and filled it iteratively via
The parsing and slicing happened in the old approach as well, via And finally, the benefit of slicing a ready-made string is that the individual slices represent one contiguous area of memory, which I would expect to be more cache-friendly on the CPU, especially because I think it's fair to assume that most characters will be in the ASCII range, and thus the traversed memory is about a quarter of the size compared to the old approach. All in all, it's absolutely possible that there are trade-offs here, but it's not at all obvious to me that the new approach would be worse. |
This builds on top of Jacinta's excellent work in #6948 (and sits on top of that PR's branch), but it changes the representation of characters in
UTF8Data::datato be nativechars instead four-byte arrays with pre-encoded UTF8.The memory footprint is exactly the same; both the now-removed
Utf8Charand the built-in UTF-32charhave four bytes.The advantages are:
Utf8Charcould in theory contain invalid data, which required extra checks, while a Rustcharis guaranteed to be valid unicodeDiff between this PR and Jacinta's