Odd behavior removing duplicate characters in a C string

Question

I am using the following method in a program used for simple substitution-based encryption. This method is specifically used for removing duplicate characters in the encryption/decryption key.

The method is functional, as is the rest of the program, and it works for 99% of the keys I've tried. However, when I pass it the key "goodmorning" or any key consisting of the same letters in any order (e.g. "dggimnnooor"), it fails. Further, keys containing more characters than "goodmorning" work, as well as keys with less characters.

I ran the executable through lldb with the same arguments and it works. I've cloned my repository on a machine running CentOS, and it works as is.

But I get no warnings or errors on compile.

//setting the key in main method char * key; key = removeDuplicates(argv[2]); //return 1 if char in word int targetFound(char * charArr, int num, char target){ int found = 0; if(strchr(charArr,target)) found = 1; return found; } //remove duplicate chars char * removeDuplicates(char * word){ char * result; int len = strlen(word); result = malloc (len * sizeof(char)); if (result == NULL) errorHandler(2); char ch; int i; int j; for( i = 0, j = 0; i < len; i++){ ch = word[i]; if(!targetFound(result, i, ch)){ result[j] = ch; j++; } } return result; }

Per request: if "feather" was passed in to this function the resulting string would be "feathr".

Show the context, how you call it etc. Make a minimal reproducible example. Explain what happens, in which undesired way. Please. — Yunnosch
– Yunnosch, Commented Feb 16, 2018 at 6:58
And what if there are no duplicates? How will you fit space for the string terminator in the resulting string? Where do you even add the terminator? — Some programmer dude
– Some programmer dude, Commented Feb 16, 2018 at 6:58
Related: stackoverflow.com/q/2221304/694576 stackoverflow.com/q/13904149/694576 — alk
– alk, Commented Feb 16, 2018 at 7:02
That's the problem with undefined behavior, sometimes, maybe even most of the time, it seems to work fine. And then it just breaks without warning. Not putting a terminator in a string will lead to undefined behavior. — Some programmer dude
– Some programmer dude, Commented Feb 16, 2018 at 7:20

Marco Bonelli · Accepted Answer · 2018-02-16 08:02:26Z

As R Sahu already said, you are not terminating your string with a NUL character. Now I'm not going to explain why you need to do this, but you always need to terminate your strings with a NUL character, which is '\0'. If you want to know why, head over here for a good explanation. However this is not the only problem with your code.

The main problem is that the function strchr that you are calling to find out if your result already contains some character expects you to pass a NUL terminated string, but your variable is not NUL terminated, because you keep appending characters to it.

To solve your problem, I would suggest you to use a map instead. Map all the characters you already used and if they aren't in the map add them both to the map and the result. This is simpler (no need to call strchr or any other function), faster (no need to scan all the string every time), and most importantly correct.

Here's a simple solution:

char *removeDuplicates(char *word){ char *result, *map, ch; int i, j; map = calloc(256, 1); if (map == NULL) // Maybe you want some other number here? errorHandler(2); // Add one char for the NUL terminator: result = malloc(strlen(word) + 1); if (result == NULL) errorHandler(2); for(i = 0, j = 0; word[i] != '\0'; i++) { ch = word[i]; // Check if you already saw this character: if(map[(size_t)ch] == 0) { // If not, add it to the map: map[(size_t)ch] = 1; // And to your result string: result[j] = ch; j++; } } // Correctly NUL terminate the new string; result[j] = '\0'; return result; }

Why does this work on other machines, but not on your machine?

You are being a victim of undefined behavior. Different compilers on different systems treat undefined behavior differently. For example, GCC may decide to not do anything in this particular case and make strchr just keep searching in the memory until it founds a '\0' character, and this is exactly what happens. Your program keeps searching for the NUL terminator and never stops because who knows where a '\0' could be in memory after your string? This is both dangerous and incorrect, because the program is not reading inside the memory reserved for it, so for example, another compiler could decide to stop the search there, and give you a correct result. This however is not something to take for granted, and you should always avoid undefined behavior.

You can also simply use 0 (zero or octal-constant zero) as the nul-terminating character. The ASCII value of '\0' is zero, there is just less typing with 0...
@DavidC.Rankin yes, less typing, but more confusion. Seeing 0 one might think that result is an integer array or something else. Since there's no difference between the two, I prefer clarity over saving three characters in my source code. Same goes for == NULL vs == 0, etc.

R Sahu · Accepted Answer · 2018-02-16 07:42:01Z

I see couple of problems in your code:

You are not terminating the output with the null character.
You are not allocating enough memory to hold the null character when there are no duplicate characters in the input.

As a consequence, your program has undefined behavior.

Change

result = malloc (len * sizeof(char));

to

result = malloc (len+1); // No need for sizeof(char)

Add the following before the function returns.

result[j] = '\0';

The other problem, the main one, is that you are using strchr on result, which is not a null terminated string when you call targetFound. That also caused undefined behavior. You need to use:

char * removeDuplicates(char * word){ char * result; int len = strlen(word); result = malloc (len+1); if (result == NULL) { errorHandler(2); } char ch; int i; int j; // Make result an empty string. result[0] = '\0'; for( i = 0, j = 0; i < len; i++){ ch = word[i]; if(!targetFound(result, i, ch)){ result[j] = ch; j++; // Null terminate again so that next call to targetFound() // will work. result[j] = '\0'; } } return result; }

A second option is to not use strchr in targetFound. Use num instead and implement the equivalent functionality.

int targetFound(char * charArr, int num, char target) { for ( int i = 0; i < num; ++i ) { if ( charArr[i] == target ) { return 1; } } return 0; }

That will allow you to avoid assigning the null character to result so many times. You will need to null terminate result only at the end.

char * removeDuplicates(char * word){ char * result; int len = strlen(word); result = malloc (len+1); if (result == NULL) { errorHandler(2); } char ch; int i; int j; for( i = 0, j = 0; i < len; i++){ ch = word[i]; if(!targetFound(result, i, ch)){ result[j] = ch; j++; } } result[j] = '\0'; return result; }

So, why does it work for all other keys every single time? And why does it work "as is" on a machine running CentOS time and time again?
@wanderbread, trying to make sense of undefined behavior is futile. Seemingly sane behavior is included in that too.
thank you. I initially had mis-read the placement of result[j]='\0'

Collectives™ on Stack Overflow

Odd behavior removing duplicate characters in a C string

2 Answers 2

3 Comments

3 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

3 Comments

Linked

Related