Length of a C++ std::string in bytes

Question

I'm having some trouble figuring out the exact semantics of std::string.length(). The documentation explicitly points out that length() returns the number of characters in the string and not the number of bytes. I was wondering in which cases this actually makes a difference.

In particular, is this only relevant to non-char instantiations of std::basic_string<> or can I also get into trouble when storing UTF-8 strings with multi-byte characters? Does the standard allow for length() to be UTF8-aware?

There is wstring for UTF, and there it makes sense that length returns the number of characters, since the character's size could vary.
– AndersK, Commented Oct 12, 2011 at 16:33
@AndersK.: No, wchar_t has a fixed size like any other type. It can't magically vary.
– Lightness Races in Orbit, Commented Oct 12, 2011 at 16:33
Also check this lovely thread about std::string vs. std::wstring and some stuff about Unicode: stackoverflow.com/questions/402283/stdwstring-vs-stdstring
– wkl, Commented Oct 12, 2011 at 16:35
@AndersK.: wstring has nothing to do with UTF. Perhaps you were thinking of u16string or u32string?
– Kerrek SB, Commented Oct 12, 2011 at 17:27

Lightness Races in Orbit · Accepted Answer · 2011-10-12 16:33:21Z

When dealing with non-char instantiations of std::basic_string<>, sure, length may not equal number of bytes. This is particularly evident with std::wstring:

std::wstring ws = L"hi";
std::cout << ws.length(); // <-- 2, not 2 * sizeof(wchar_t) (typically 4 or 8)

But std::string is about char characters; there is no such thing as a multi-byte character as far as std::string is concerned, whether you crammed one in at a high level or not. So, std::string.length() is always the number of bytes represented by the string. Note that if you're cramming multibyte "characters" into an std::string, then your definition of "character" suddenly becomes at odds with that of the container and of the standard.
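To make this concrete, here is a minimal sketch, assuming the string holds the UTF-8 encoding of "naïve". The bytes are written as explicit escapes so the result does not depend on the source file's encoding, and the helper names utf8_sample and check_utf8_length are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the UTF-8 bytes of "naïve".
// The 'ï' (U+00EF) is encoded as the two bytes 0xC3 0xAF.
inline std::string utf8_sample() {
    return "na" "\xC3\xAF" "ve";
}

// length() counts char elements, i.e. bytes: 6 here,
// although a human reader sees 5 characters.
inline void check_utf8_length() {
    const std::string s = utf8_sample();
    assert(s.length() == 6);
    assert(s.length() == s.size());
}
```

Calling check_utf8_length() from main should pass silently: the container has no notion of the two-byte sequence being "one character".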

That makes perfect sense. I simply got confused by the wording in the documentation here. Thanks for clearing things up.
"But std::string is about char characters": so the definition of "character" in C++ is "element of some string type", rather than "what a human sees, encoded" or "a Unicode code point, encoded somehow". This sounds believable, but can anyone quote chapter-and-verse on this?
Well, I guess that's what I get for asking for chapter-and-verse.
@AdrianRatnapala: Yes, when asking for chapter-and-verse, you get chapter-and-verse. Anything else I can help you with? :)

Community · Accepted Answer · 2017-05-23 12:00:16Z

If we are talking specifically about std::string, then length() does return the number of bytes.

This is because a std::string is a basic_string of chars, and the C++ Standard defines the size of one char to be exactly one byte.

Note that the Standard doesn't say how many bits are in a byte, but that's another story entirely and you probably don't care.

EDIT: The Standard does say that an implementation shall provide a definition for CHAR_BIT which says how many bits are in a byte.
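The two guarantees can be checked at compile time. This is only a sketch: bit_length and check_bit_length are hypothetical helpers, and the lower bound of 8 for CHAR_BIT comes from the C limits that C++ inherits rather than from anything stated in this answer.

```cpp
#include <cassert>
#include <climits>   // CHAR_BIT
#include <cstddef>
#include <string>

// sizeof(char) is 1 by definition, so one char is one byte.
static_assert(sizeof(char) == 1, "char is exactly one byte");
// CHAR_BIT is implementation-defined, but never smaller than 8.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: number of bits occupied by the chars of s.
inline std::size_t bit_length(const std::string& s) {
    return s.length() * CHAR_BIT;
}

inline void check_bit_length() {
    assert(bit_length("abc") == 3 * static_cast<std::size_t>(CHAR_BIT));
}
```

On the usual platforms CHAR_BIT is 8, but the code above deliberately avoids assuming that.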

By the way, if you go down a road where you do care how many bits are in a byte, you might consider reading this.

The standard does define CHAR_BIT, the number of bits in a byte.
@Mike: True, but the Standard doesn't say what it's defined as. When I said "doesn't say how many bits are in a byte" I meant in a precise, unambiguous sense. But I'll clarify my post with an edit, thanks for pointing this out.

NuSkooler · Accepted Answer · 2015-01-23 17:00:22Z

A std::string is std::basic_string<char>, so s.length() * sizeof(char) = byte length. Also, std::string knows nothing of UTF-8, so you're going to get the byte size even if that's not really what you're after.
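The same arithmetic generalizes to any basic_string instantiation. A rough sketch, where byte_length and check_byte_length are names invented for this example, and the result covers only the character data (not the terminator or any allocator overhead):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the characters of any basic_string.
template <class CharT, class Traits, class Alloc>
std::size_t byte_length(const std::basic_string<CharT, Traits, Alloc>& s) {
    return s.length() * sizeof(CharT);
}

inline void check_byte_length() {
    // For std::string, sizeof(char) == 1, so bytes == length().
    assert(byte_length(std::string("hi")) == 2);
    // For wider character types the two numbers differ.
    assert(byte_length(std::wstring(L"hi")) == 2 * sizeof(wchar_t));
    assert(byte_length(std::u32string(U"hi")) == 2 * sizeof(char32_t));
}
```

The wstring result is platform-dependent because sizeof(wchar_t) is, which is why the check compares against sizeof rather than a fixed number.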

If you have UTF-8 data in a std::string, you'll need to use something else such as ICU to get the "real" length.
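For simple cases, a hand-rolled count of code points is possible without a library. The sketch below assumes the input is already valid UTF-8 and does no validation. It counts code points, not user-perceived characters (combining marks and similar still need something like ICU), and utf8_code_points is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points in s, assuming valid input.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

inline void check_utf8_code_points() {
    // "naïve" in UTF-8: the 'ï' takes two bytes (0xC3 0xAF).
    const std::string s = "na" "\xC3\xAF" "ve";
    assert(s.length() == 6);           // bytes
    assert(utf8_code_points(s) == 5);  // code points
}
```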

Jonathan Wakely · Accepted Answer · 2013-05-28 16:17:09Z

cplusplus.com is not "the documentation" for std::string; it's a poor-quality site full of poor-quality information. The C++ standard defines it very clearly:

21.1 [strings.general] ¶1

This Clause describes components for manipulating sequences of any non-array POD (3.9) type. In this Clause such types are called char-like types, and objects of char-like types are called char-like objects or simply characters.

21.4.4 [string.capacity] ¶1

size_type size() const noexcept;
Returns: A count of the number of char-like objects currently in the string.
Complexity: constant time.

size_type length() const noexcept;
Returns: size()
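The "count of char-like objects" wording applies to wider character types as well. As an illustrative sketch (emoji_utf16 and check_char_like_count are made-up names), a single code point outside the Basic Multilingual Plane stored in a std::u16string still reports a size of 2, because it occupies two char16_t objects:

```cpp
#include <cassert>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so its UTF-16
// encoding is a surrogate pair: two char16_t objects for one code point.
inline std::u16string emoji_utf16() {
    return u"\U0001F600";
}

inline void check_char_like_count() {
    const std::u16string s = emoji_utf16();
    assert(s.size() == 2);           // char-like objects, not code points
    assert(s.length() == s.size());  // length() is defined as size()
}
```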
