Why null-terminated strings? Or: null-terminated vs. characters + length storage

Question

I'm writing a language interpreter in C, and my string type contains a length attribute, like so:

struct String { char* characters; size_t length; };

Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:

Bounds-checking is built-in if you use "length" instead of looking for a null.

You have to traverse the entire string to find its length.

You have to do extra stuff to handle a null character in the middle of a null-terminated string.

Null-terminated strings deal poorly with Unicode.

Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.

String slice (note: strings are immutable in my language). Obviously the second is slower (and more error-prone: think about adding error-checking of begin and end to both functions).

struct String slice(struct String in, size_t begin, size_t end) { struct String out; out.characters = in.characters + begin; out.length = end - begin; return out; } char* slice(char* in, size_t begin, size_t end) { char* out = malloc(end - begin + 1); for(int i = 0; i < end - begin; i++) out[i] = in[i + begin]; out[end - begin] = '\0'; return out; }

After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!

So my question is: are there any benefits to null-termination that I'm missing?

Since malloc() is so expensive in C, I suggest to use this structure: struct String { size_t length; char[1] characters; } Just allocate strlen(s)+1+sizeof(size_t) or strlen(s)+sizeof(String) bytes and copy the string to the address &characters. — Aaron Digulla
– Aaron Digulla, Commented Aug 10, 2009 at 8:07
Does this answer your question? What's the rationale for null terminated strings? — Jan Schultke
– Jan Schultke, Commented Sep 13, 2023 at 9:05

weiqure · Accepted Answer · 2020-06-20 09:12:55Z

From Joel's Back to Basics:

Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.

Denis Ritchie opinion is somewhat different. BCPL had a length+content representation, with the length contained in one byte. B switched to terminated string "partially to avoid the limitation on the length of a string caused by holding the count in an 8 or 9 bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator." (The Development of the C Language, cm.bell-labs.com/cm/cs/who/dmr/chist.pdf)

Daniel Earwicker · Accepted Answer · 2009-08-10 06:12:14Z

21

The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.

Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.

answered Aug 10, 2009 at 6:12

Daniel Earwicker

117k38 gold badges209 silver badges289 bronze badges

7 Comments

RBerteig Over a year ago

This is what Lua does. It makes interfacing to C for normal use cases very clean, and still supports arbitrary length binary buffers.

Daniel Earwicker Over a year ago

It's what most things do! You don't even have to maintain the null terminator all the time - just do str[len] = '\0' whenever you need to. This is what `std::string::c_str`` usually does in C++.

Daniel Earwicker Over a year ago

By most things, I mean most string classes, and most interpreter string representations. One widely-used example on Windows is the BSTR type.

Imagist Over a year ago

This is exactly why I asked this question; I thought I might be missing some solution. It seems obvious now but I didn't think of this!

Daniel Earwicker Over a year ago

:) It means more if you have to wait for something, I think!

|

Jason Williams · Accepted Answer · 2016-10-23 13:41:57Z

Lengths have their problems too.

The length takes extra storage (not such an issue now, but a big factor 30 years ago).
Every time you alter a string you have to update the length, so you get reduced performance across the board.
With a NUL-terminated string you can still use a length or store a pointer to the last character, so if you are doing lots of string manipulations, you can still equal the performance of string-with-length.
NUL-terminated strings are much simpler - The NUL terminator is just a convention used by methods like strcat to determine the end of the string. So you can store them in a regular char array rather than having to use a struct.

Extra storage can still be a big issue for embedded systems, which is one reason to stress keeping the language light-weight.
@Jimmy My issue there is: on such embedded systems, why are you using strings? I don't think I even ever used a char back when I was doing robotics programming. The only example I can think of is if you're programming for an LED display (like those scrolling text things or the ones on soft drink machines) but the functionality there is so simple that I still have a difficult time imagining the extra 3 bytes being a problem (4 byte int - 1 byte since you don't have to store the null character).
What you suggest is not that trivial btw. Who will take ownership of the buffer? Will the newly created temporary string do that? I doubt you want this and then you need a way to have such non-owning strings to avoid copying.

caf · Accepted Answer · 2009-08-10 06:45:34Z

One advantage of nul-terminated strings is that if you are walking through a string character-by-character, you only need to keep a single pointer to address the string:

while (*s) { *s = toupper(*s); s++; }

whereas for strings without sentinels, you need to keep two bits of state around: either a pointer and index:

while (i < s.length) { s.data[i] = toupper(s.data[i]); i++; }

...or a current pointer and a limit:

s_end = s + length; while (s < s_end) { *s = toupper(*s); s++; }

When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.

"When CPU registers were a scarce resource" - registers are still a scarce resource on x86 and x64.
I don't get it; if I'm storing string in the struct example I gave, why can't I use that as the limit?
The point is that during a processing loop like the above, length-based strings like yours end up using two registers for string book-keeping whereas sentinel-based strings like idiomatic C strings only use one (the other one is obtained "for free" because the character values are being loaded in order to process them anyway).

sharptooth · Accepted Answer · 2009-08-10 06:09:01Z

7

One benefit is that with null-termination any tail of a null-terminated string is also a null-terminated string. If you need to pass a substring starting with Nth character (provided there's no buffer overrun) into some string-handling function - no problem, just pass the offseeted address there. When storing size in some other way you would need to construct a new string.

answered Aug 10, 2009 at 6:09

sharptooth

171k110 gold badges545 silver badges1k bronze badges

5 Comments

weiqure Over a year ago

Can you give an example of a string of that you might want to print the end of the string?

sharptooth Over a year ago

That can be used when concatenating strings - you could want to append not the entire string, but only the tail of it. Then you call strcat( target, source + offset); - and it's done.

Dan McGrath Over a year ago

Take a front trim of white space. You can determine the first non-white space char and instead of actually changing the string, you can just pass the starting offset in, saving you either allocating new memory, or copying data.

Imagist Over a year ago

That's not all that different for what I do with my struct: struct String new; new.characters = old.characters + offset; new.length = old.length - offset; It's a bit of bookkeeping but comes out to what, 5 instructions? This seems trivial compared to the difference if you needed to do something to the beginning of the string instead of the end.

Mike Dunlavey Over a year ago

It makes it really easy to do things like recursive string matching, spelling correction, etc., if you can treat the string like you would a list in Lisp.

Nick Johnson · Accepted Answer · 2009-08-20 20:13:46Z

Slightly offtopic, but there's a more efficient way to do length-prefixed strings than the way you describe. Create a struct like this (valid in C99 and up):

struct String { size_t length; char characters[0]; }

This creates a struct that has the length at the start, with the 'characters' element usable as a char* just as you would with your current struct. The difference, however, is that you can allocate only a single item on the heap for each string, instead of two. Allocate your strings like this:

mystr = malloc(sizeof(String) + strlen(cstring))

Eg - the length of the struct (which is just the size_t) plus enough space to put the actual string after it.

If you don't want to use C99, you can also do this with "char characters[1]" and subtract 1 from the length of the string to allocate.

char characters[0]; is not "valid in C99 and up):". char characters[]; is valid in C99 and up.

Jimmy · Accepted Answer · 2009-08-10 06:14:38Z

Just throwing out some hypotheticals:

there's no way to get a "wrong" implementation of null terminated strings. A standardized struct however could have vendor-specific implementations.
no structs are required. Null terminated strings are "built-in" so to speak, by virtue of being a special case of a char*.

Dan McGrath · Accepted Answer · 2009-08-10 06:14:04Z

2

Although I prefer the array + len method in most cases, there are valid reasons for using null-terminated.

Take a 32-bit system.

To store a 7 byte string
char * + size_t + 8 bytes = 19 bytes

To store a 7 byte null-term string
char * + 8 = 16 bytes.

null-term arrays don't need to be immutable like your strings do. I can happily truncate the c-string by simply places a null char. If you code, you would need to create a new string, which involves allocating memory.

Depending on the usage of the strings, your strings will never be able to match the performance possible with c-strings as opposed to your strings.

answered Aug 10, 2009 at 6:14

Dan McGrath

42.1k11 gold badges104 silver badges130 bronze badges

5 Comments

Jason Williams Over a year ago

You can truncate a string-with-length by just reducing the length. Usually this means you have two lengths - the current length of the string plus the amount of memory you have currently allocated for the string. This allows it to be dynamically resized without having to realloc on every modification.

Dan McGrath Over a year ago

Indeed, that is the way to go; I had based my answer on the op's string struct, which would enable you to reduce the length, but never to utilise that space again. Ropes are another interesting way to handle strings. en.wikipedia.org/wiki/Rope_(computer_science)

Imagist Over a year ago

A few questions: assuming a byte is 8 bits, shouldn't a 32-bit system have sizeof(size_t) == 4 and sizeof(char*) == 4? And with my method you don't have to use 8 characters for the first method. So I'm getting: 4 + 4 + 7 = 15 for my method, and 4 + 8 = 12 for the null-terminating method. I'm not contesting your point, just the math that leads to your point.

Imagist Over a year ago

@Dan McG (comment) My strings will be stored in a garbage-collected system, which will allow me to reclaim the space. And my interpreter uses ropes internally; the garbage collector will use idle cycles to flatten ropes into strings.

chux Feb 16 at 21:15

@DanMcGrath "char * + size_t + 8 bytes = 19 bytes" addition is unclear as an even sum expected.

Adriaan · Accepted Answer · 2009-08-10 09:53:19Z

You're absolutely right that 0-termination is a method which is poor with respect to type checking and performance for part of the operations. The answers on this page already summarize the origins and uses for it.

I liked the way Delphi stored strings. I believe it maintains a length/maxlength in before the (variable length) string. This way the strings can be null-terminated for compatibility.

My concerns with your mechanism: - additional pointer - immutability si in the core parts of your language; normally string types are not immutable so if you ever reconsider than it'll be tough. You'd need to implement a 'create copy on change' mechanism - use of malloc (hardly efficient, but may be included here just for ease?)

Good luck; writing your own interpreter can be very educational in understanding mainly the grammar and syntax of programming languages! (at least, it ws for me)

@NickJohnson most low level languages have immutable strings. and most high level languages have mutable strings.

Kirill V. Lyadvinsky · Accepted Answer · 2009-08-10 06:07:18Z

0

I think main reason is that standard says nothing concrete about size of any type other than char. But sizeof(char) = 1 and that is definitely not enough for string size.

answered Aug 10, 2009 at 6:07

community wiki

Kirill V. Lyadvinsky

8 Comments

Daniel Earwicker Over a year ago

It's enough for 2^CHAR_BIT characters per string.

Kirill V. Lyadvinsky Over a year ago

It is only 255 characters. Too little.

caf Over a year ago

No, it's enough for 2^CHAR_BIT - 1 characters per string, and if you've never had to deal with strings longer than 255 characters then you've only had to program for a very limited problem domain. However C does say concrete things about other integral types - for example int has to have a range of at least -32767 to +32767. In particular C says that size_t must be able to hold the size of any object, so that would be fine as a standardised string length.

Daniel Earwicker Over a year ago

@caf, you aren't thinking creatively enough if you can't see how to represent that extra unit of character length. :)

Imagist Over a year ago

@Jla3ep Did you look at my code sample? I used size_t to store the length for a reason.

|

Collectives™ on Stack Overflow

Why null-terminated strings? Or: null-terminated vs. characters + length storage

10 Answers 10

1 Comment

7 Comments

3 Comments

3 Comments

5 Comments

1 Comment

Comments

5 Comments

3 Comments

8 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

10 Answers 10

1 Comment

7 Comments

3 Comments

3 Comments

5 Comments

1 Comment

Comments

5 Comments

3 Comments

8 Comments

Linked

Related