23

I'm having some trouble figuring out the exact semantics of std::string.length(). The documentation explicitly points out that length() returns the number of characters in the string and not the number of bytes. I was wondering in which cases this actually makes a difference.

In particular, is this only relevant to non-char instantiations of std::basic_string<> or can I also get into trouble when storing UTF-8 strings with multi-byte characters? Does the standard allow for length() to be UTF8-aware?

4
  • there is wstring for UTF and there it makes sense that length returns the number of characters since the character's size could vary. Commented Oct 12, 2011 at 16:33
  • 8
    @AndersK.: No, wchar_t has a fixed size like any other type. It can't magically vary. Commented Oct 12, 2011 at 16:33
  • Also check this lovely thread about std::string vs. std::wstring and some stuff about Unicode: stackoverflow.com/questions/402283/stdwstring-vs-stdstring Commented Oct 12, 2011 at 16:35
  • 2
    @AndersK.: wstring has nothing to do with UTF. Perhaps you were thinking of u16string or u32string? Commented Oct 12, 2011 at 17:27

4 Answers

31

When dealing with non-char instantiations of std::basic_string<>, sure, length may not equal number of bytes. This is particularly evident with std::wstring:

std::wstring ws = L"hi";
std::cout << ws.length(); // <-- 2 (elements), not 4 or 8 (bytes, depending on sizeof(wchar_t))

But std::string is about char characters; there is no such thing as a multi-byte character as far as std::string is concerned, whether you crammed one in at a high level or not. So, std::string.length() is always the number of bytes represented by the string. Note that if you're cramming multibyte "characters" into an std::string, then your definition of "character" suddenly becomes at odds with that of the container and of the standard.
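To make the point above concrete, here is a minimal sketch. The helper `reported_length` and the constant `e_acute` are hypothetical names introduced only for illustration, and the UTF-8 bytes are written as explicit escapes so the result does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns what std::string itself reports, i.e. the
// number of char elements, which for std::string is the number of bytes.
std::size_t reported_length(const std::string& s) {
    return s.length();
}

// "é" (U+00E9) spelled out as its two UTF-8 bytes.
const std::string e_acute = "\xC3\xA9";

// Small self-check showing that length() counts bytes, not displayed characters.
void check_reported_length() {
    assert(reported_length("hi") == 2);     // two ASCII chars, two bytes
    assert(reported_length(e_acute) == 2);  // one visible character, two bytes
}
```

As far as the container is concerned, `e_acute` simply holds two `char` objects; the fact that a UTF-8 decoder would read them as one code point is invisible to it.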


6 Comments

That makes perfect sense. I simply got confused by the wording in the documentation here. Thanks for clearing things up.
@ComicSansMS: Not a problem :)
"But std::string is about char characters", so the definition of "character" in C++ is "element of some string type", rather than "what a human sees, encoded" or "a unicode codepoint, encoded somehow". This sounds believable, but can anyone quote chapter-and-verse on this?
Well, I guess that's what I get for asking for chapter-and-verse.
@AdrianRatnapala: Yes, when asking for chapter-and-verse, you get chapter-and-verse. Anything else I can help you with? :)
14

If we are talking specifically about std::string, then length() does return the number of bytes.

This is because a std::string is a basic_string of chars, and the C++ Standard defines the size of one char to be exactly one byte.

Note that the Standard doesn't say how many bits are in a byte, but that's another story entirely and you probably don't care.

EDIT: The Standard does say that an implementation shall provide a definition for CHAR_BIT which says how many bits are in a byte.
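The two guarantees mentioned here can be checked at compile time. The following is only a sketch; `bits_in` is a hypothetical helper, not something the standard library provides.

```cpp
#include <cassert>
#include <climits>  // CHAR_BIT
#include <cstddef>
#include <string>

// sizeof(char) is 1 by definition, so one char always occupies one byte.
static_assert(sizeof(char) == 1, "guaranteed by the language");

// A byte has at least 8 bits; the exact number is implementation-defined
// and is reported by CHAR_BIT.
static_assert(CHAR_BIT >= 8, "guaranteed by the language");

// Hypothetical helper: number of bits occupied by the chars stored in s.
std::size_t bits_in(const std::string& s) {
    return s.length() * CHAR_BIT;
}

// Small self-check: two chars occupy exactly two bytes' worth of bits.
void check_bits_in() {
    assert(bits_in("hi") == 2 * CHAR_BIT);
}
```

On most mainstream platforms `CHAR_BIT` is 8, but the code above deliberately avoids assuming that.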

By the way, if you go down a road where you do care how many bits are in a byte, you might consider reading this.

3 Comments

Indeed, "byte" is not necessarily synonymous with "octet".
The standard does define CHAR_BIT, the number of bits in a byte.
@Mike: True, but the Standard doesn't say what that's defined to. When I said "doesn't say how many bits are in a byte" I meant in a precise, unambiguous sense. But I'll clarify my post with an edit, thanks for pointing this out.
4

A std::string is std::basic_string<char>, so s.length() * sizeof(char) = byte length. Also, std::string knows nothing of UTF-8, so you're going to get the byte size even if that's not really what you're after.

If you have UTF-8 data in a std::string, you'll need to use something else such as ICU to get the "real" length.
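If pulling in ICU is too heavy and all you need is a code point count, one common trick is to count every byte that is not a UTF-8 continuation byte. This is a minimal sketch, not ICU's approach: `utf8_code_points` is a hypothetical helper, it assumes the input is already valid UTF-8, and it counts code points rather than user-perceived characters (a base letter followed by a combining accent counts as two).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in s, assuming s holds
// valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx; every
// other byte starts a new code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Small self-check: "é" is two bytes in UTF-8 but a single code point.
void check_utf8_code_points() {
    const std::string e_acute = "\xC3\xA9";
    assert(e_acute.length() == 2);
    assert(utf8_code_points(e_acute) == 1);
}
```

For anything involving grapheme clusters, normalization, or malformed input, a real Unicode library is still the safer choice.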


0

cplusplus.com is not "the documentation" for std::string, it's a poor quality site full of poor quality information. The C++ standard defines it very clearly:

  • 21.1 [strings.general] ¶1

    This Clause describes components for manipulating sequences of any non-array POD (3.9) type. In this Clause such types are called char-like types, and objects of char-like types are called char-like objects or simply characters.

  • 21.4.4 [string.capacity] ¶1

    size_type size() const noexcept;
    Returns: A count of the number of char-like objects currently in the string.
    Complexity: constant time.

    size_type length() const noexcept;
    Returns: size()
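Read together, the two quoted paragraphs say that `length()` counts elements of the string's character type, whatever that type is. A rough sketch of what that means for byte counts follows; `byte_size` is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the char-like objects in s,
// for any std::basic_string instantiation.
template <class CharT, class Traits, class Alloc>
std::size_t byte_size(const std::basic_string<CharT, Traits, Alloc>& s) {
    return s.size() * sizeof(CharT);
}

// Small self-check: length() counts elements; bytes depend on the element type.
void check_byte_size() {
    const std::string narrow = "hi";
    const std::u32string wide32 = U"hi";
    assert(narrow.length() == 2 && byte_size(narrow) == 2);
    assert(wide32.length() == 2 && byte_size(wide32) == 2 * sizeof(char32_t));
}
```

Only for `std::string`, where `sizeof(char)` is 1, do the element count and the byte count coincide.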

