I wrote a header file to write German umlauts in a textfile properly

Question

This function exists because another .cpp file uses a std::wstring to read strings with German umlauts from the console. Since it is difficult to get wstrings into a text file when a std::ofstream is already accessing that file, the wstring is converted into a normal std::string using utf8.h. Each 16-bit character that represented an umlaut then becomes 2 cryptic characters (which is logical, I know): a ß becomes ÃŸ, an ü becomes Ã¼, as you often see in everyday life. These .h and .cpp files correct that.

By utf8.h, I mean this:

My question is: Could you please review this function and say what you think of the code? I'm asking because the code is copying one vector to a second one a lot, and, as you see in the last lines, I need to get rid of ‘left-over’ null characters. I want to include the header and cpp file more often, so I want the two to be good.

handle_German_umlauts.cpp

    #include "handle_German_umlauts.h"

    Umlaute_korrigieren::Umlaute_korrigieren()
    {
    }

    Umlaute_korrigieren::~Umlaute_korrigieren()
    {
    }

    std::vector<char> Umlaute_korrigieren::_std__String_to_std__vectorChar_for_ANSI(std::string stdstring)
    {
        std::vector<char> CString(stdstring.c_str(), stdstring.c_str() + stdstring.size() + 1);
        std::vector<char> copy(stdstring.c_str(), stdstring.c_str() + stdstring.size() + 1);

        for (size_t i = (size_t)0; i < CString.size() - (size_t)1; i++)
        {
            if (CString[i] == -61 && CString[i + 1] == -97) // pseudo-ß found
            {
                copy[i] = '\xDF'; // ß is DF (hex) in ANSI
                for (size_t j = copy.size() - (size_t)1; j > (i+(size_t)1); j--) // copy over
                {
                    copy[j - 1] = CString[j];
                }
                CString = copy;
            }
            if (CString[i] == -61 && CString[i + 1] == -68) // pseudo-ü found
            {
                copy[i] = '\xFC'; // ü is FC (hex) in ANSI
                for (size_t j = copy.size() - (size_t)1; j > (i + (size_t)1); j--) // copy over
                {
                    copy[j - 1] = CString[j];
                }
                CString = copy;
            }
            if (CString[i] == -61 && CString[i + 1] == -92) // pseudo-ä found
            {
                copy[i] = '\xE4'; // ä is E4 (hex) in ANSI
                for (size_t j = copy.size() - (size_t)1; j > (i + (size_t)1); j--) // copy over
                {
                    copy[j - 1] = CString[j];
                }
                CString = copy;
            }
            if (CString[i] == -61 && CString[i + 1] == -74) // pseudo-ö found
            {
                copy[i] = '\xF6'; // ö is F6 (hex) in ANSI
                for (size_t j = copy.size() - (size_t)1; j > (i + (size_t)1); j--) // copy over
                {
                    copy[j - 1] = CString[j];
                }
                CString = copy;
            }
            if (CString[i] == -61 && CString[i + 1] == -124) // pseudo-Ä found
            {
                copy[i] = '\xC4'; // Ä is C4 (hex) in ANSI
                for (size_t j = copy.size() - (size_t)1; j > (i + (size_t)1); j--) // copy over
                {
                    copy[j - 1] = CString[j];
                }
                CString = copy;
            }
            if (CString[i] == -61 && CString[i + 1] == -106) // pseudo-Ö found
            {
                copy[i] = '\xD6'; // Ö is D6 (hex) in ANSI
                for (size_t j = copy.size() - (size_t)1; j > (i + (size_t)1); j--) // copy over
                {
                    copy[j - 1] = CString[j];
                }
                CString = copy;
            }
            if (CString[i] == -61 && CString[i + 1] == -100) // pseudo-Ü found
            {
                copy[i] = '\xDC'; // Ü is DC (hex) in ANSI
                for (size_t j = copy.size() - (size_t)1; j > (i + (size_t)1); j--) // copy over
                {
                    copy[j - 1] = CString[j];
                }
                CString = copy;
            }
        }

        // crop unnecessary '\0's
        size_t _0Counter = 0;
        for (size_t i = (size_t)0; i < CString.size(); i++)
        {
            if (CString[i] == '\0')
            {
                _0Counter += (size_t)1;
            }
        }
        size_t original = CString.size() - (size_t)1; // because the vector gets smaller due to the deletion and the for loop is always reevaluating
        size_t wie_weit = CString.size() - _0Counter;
        for (size_t i = original; i > wie_weit; i--)
        {
            CString.erase(CString.begin() + i - 1);
        }
        return CString;
    }

The handle_German_umlauts.h

    #ifndef HANDLE_GERMAN_UMLAUTS_H_
    #define HANDLE_GERMAN_UMLAUTS_H_

    #include <vector>
    #include <string>

    class Umlaute_korrigieren
    {
    public:
        Umlaute_korrigieren();
        ~Umlaute_korrigieren();
        std::vector<char> _std__String_to_std__vectorChar_for_ANSI(std::string);

    private:
    };

    #endif // !HANDLE_GERMAN_UMLAUTS_H_

The function is called as follows:

    std::string Strasse_als_stdstring;
    utf8::utf16to8(physical_address.street.begin(), physical_address.street.end(), back_inserter(Strasse_als_stdstring));
    std::vector<char> korrigierte_Strasse = Uk._std__String_to_std__vectorChar_for_ANSI(Strasse_als_stdstring);

    for (size_t h = (size_t)0; h < korrigierte_Strasse.size() - (size_t)3; h++) // write to txt. -3, so that \r\n\0 aren't printed.
    {
        fs8 << korrigierte_Strasse[h];
    }
    fs8 << " " << physical_address.house_number << std::endl;

where physical_address.street is the std::wstring (mentioned above), and the for loop serves to write the chars in the textfile (std::ofstream fs8).

I'm confused because this converts a wstring to utf8, but then changes some utf8 things to CP-1252, so the result is kind of a mix between CP-1252 and utf8. It's odd. It indicates that whatever is reading the data is really using CP-1252 after all, otherwise there would be no need to do this.
– user555045, Commented Jul 16, 2021 at 20:08
@harold The utf8 library does not convert the wstring correctly if it contains umlauts. The result will be wrong. So today I started writing my own header file.
– Daniel, Commented Jul 16, 2021 at 20:12
If the utf8 library converts them incorrectly, you shouldn't be using it. However, it's very unlikely that this is the case. Modern software generally supports umlauts (and other special characters) very well, so the things you are doing here shouldn't be necessary and will just make the situation worse. It is more likely that you are doing something wrong (that is, you are using the library or something else incorrectly).
– RoToRa, Commented Jul 16, 2021 at 21:53

aghast · Accepted Answer · 2021-07-16 21:14:10Z

I see a few problems.

Comparisons and counting

At the lowest level, you are being repetitive and inefficient with your comparisons. And you are "rediscovering" information you already know -- specifically, the amount of shrinkage in the string.

I suggest you code your tests for characters in a cascading style, since they all depend on the same prefix character.

And keep a count of the amount of shrinkage so you don't have to recount it. If you're turning two characters into one, then shrinkage += 2 - 1; etc.
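A cascading test with a running shrinkage count might look roughly like the sketch below. It is only an illustration, not the asker's code: the names `cp1252_for_c3_tail` and `count_shrinkage` are made up, the input is assumed to be UTF-8 in a std::string, and bytes are compared as unsigned char so the checks do not depend on whether char is signed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Second level of the cascade (hypothetical helper): given the byte that
// follows the shared prefix 0xC3, return the CP-1252 byte, or 0 if the
// sequence is not one of the seven German letters handled here.
constexpr unsigned char cp1252_for_c3_tail(unsigned char tail) {
    switch (tail) {
        case 0x9F: return 0xDF; // ß
        case 0xBC: return 0xFC; // ü
        case 0xA4: return 0xE4; // ä
        case 0xB6: return 0xF6; // ö
        case 0x84: return 0xC4; // Ä
        case 0x96: return 0xD6; // Ö
        case 0x9C: return 0xDC; // Ü
        default:   return 0;
    }
}

// Counts how many bytes the string loses during translation (hypothetical helper).
std::size_t count_shrinkage(const std::string& utf8) {
    std::size_t shrinkage = 0;
    for (std::size_t i = 0; i + 1 < utf8.size(); ++i) {
        // First level: test the shared prefix byte once.
        if (static_cast<unsigned char>(utf8[i]) != 0xC3) {
            continue;
        }
        // Second level: only now look at the following byte.
        if (cp1252_for_c3_tail(static_cast<unsigned char>(utf8[i + 1])) != 0) {
            shrinkage += 2 - 1; // two bytes in, one byte out
            ++i;                // skip the tail byte of the pair
        }
    }
    assert(shrinkage <= utf8.size() / 2); // each hit consumes two input bytes
    return shrinkage;
}
```

The prefix byte is tested once per position, and the count is accumulated while scanning, so nothing has to be recounted afterwards.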

Multiple copies

At a higher level, you are being very inefficient with your handling of CString and copy. Instead of "moving down" the characters for each special character you find, you should only copy them once, into their exactly right target location.

You are doing this:

    for i in ...
        if special character at position i
            fix special character
            move all subsequent characters down by 1

That means if your string starts with 2 special characters, you will process the first one, move all the following characters down by 1, then process the second one, and move all the following characters down by 1 AGAIN.

Instead, I think you could simply make a duplicate and copy characters into it one by one:

    j = 0
    for i in 0 ...
        if special character at position i:
            copy[j] = translated character
            shrinkage += ...
            j += 1
            i += 1    (skip the second byte of the special character)
        else
            copy[j] = original[i]
            j += 1

This way you keep two separate indexes but you only ever copy the characters one time, into their final position.
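In C++ the two-index approach could be sketched as follows. This is one possible shape under stated assumptions, not the code the asker ended up with: the function name `utf8_umlauts_to_cp1252` is invented, the input is assumed to be UTF-8, and the result is returned as a std::string rather than a std::vector<char>.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical function. Translates the seven German UTF-8 sequences
// (all prefixed by 0xC3) to single CP-1252 bytes; other bytes pass through.
std::string utf8_umlauts_to_cp1252(const std::string& original) {
    std::string copy(original.size(), '\0'); // the result can never be longer
    std::size_t j = 0;                        // write index into copy

    for (std::size_t i = 0; i < original.size(); ++i) { // i is the read index
        unsigned char translated = 0;
        if (static_cast<unsigned char>(original[i]) == 0xC3 && i + 1 < original.size()) {
            switch (static_cast<unsigned char>(original[i + 1])) {
                case 0x9F: translated = 0xDF; break; // ß
                case 0xBC: translated = 0xFC; break; // ü
                case 0xA4: translated = 0xE4; break; // ä
                case 0xB6: translated = 0xF6; break; // ö
                case 0x84: translated = 0xC4; break; // Ä
                case 0x96: translated = 0xD6; break; // Ö
                case 0x9C: translated = 0xDC; break; // Ü
                default: break;                      // not a handled pair
            }
        }
        if (translated != 0) {
            copy[j] = static_cast<char>(translated);
            ++i; // the pair used two input bytes
        } else {
            copy[j] = original[i];
        }
        ++j;
    }

    assert(j <= copy.size());
    copy.resize(j); // drop the unused tail in one step
    return copy;
}
```

Because the result has exactly the final length, it could be written with a single `fs8 << result`, with no left-over '\0' bytes to erase.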

Magic Numbers

There are few better ways to piss off the poor idiot that has to maintain your code than by throwing in a bunch of magic numbers! Magic numbers are one of those things that every single programming language provides a way to fix (usually several ways), and you always learn them early because it's soooo easy to do. Why do you have all these magic numbers in your code?

What do -61, -97, '\xDF', and (size_t)1 mean?
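One way to give those values names, as a small sketch with invented identifiers. Writing them as char literals also sidesteps a subtle trap: comparing a char against -61 can only match on platforms where char is signed.

```cpp
// Hypothetical names for the byte values used in the question.
constexpr char utf8_lead_c3      = '\xC3'; // first UTF-8 byte of ß, ä, ö, ü, Ä, Ö, Ü (-61 as signed char)
constexpr char utf8_sharp_s_tail = '\x9F'; // second UTF-8 byte of ß (-97 as signed char)
constexpr char cp1252_sharp_s    = '\xDF'; // ß in Windows-1252 ("ANSI")

// The named values are the same bytes as the original magic numbers.
static_assert(utf8_lead_c3 == static_cast<char>(-61), "lead byte");
static_assert(utf8_sharp_s_tail == static_cast<char>(-97), "sharp s tail byte");
```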

Integer promotions

As an aside, (size_t)1 is unnecessary. If you have warnings at some ridiculous level, consider using 1U, but if there's a warning level that nags at you when you subtract a literal 1 from a variable that is already declared size_t, you should just turn it off -- it's making you pollute your code even worse than C++ usually does.
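A quick compile-time check of that claim, assuming a mainstream platform where size_t is at least as wide as int:

```cpp
#include <cstddef>
#include <type_traits>

// With a size_t operand on the left, the literal is converted to size_t,
// so the result type is size_t with or without the U suffix or a cast.
static_assert(std::is_same<decltype(std::size_t{5} - 1), std::size_t>::value,
              "size_t minus int literal yields size_t");
static_assert(std::is_same<decltype(std::size_t{5} - 1U), std::size_t>::value,
              "size_t minus unsigned literal yields size_t");
```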

Get rid of comments

I've already suggested that you restructure your code to get rid of all the copying and moving. And I've suggested that you restructure your if statements into a 2-level hierarchy of first-character, second-character.

But I'll point out that having a comment explaining something is a good indicator that you can replace whatever it is with either a named object or a function call.

Consider this:

if (CString[i] == -61 && CString[i + 1] == -97) // Pseudo-ß gefunden

Now consider this:

    #define SHARP_S -61, -97
    #define IS_SPECIAL_CHAR(c1, c2) (CString[i]==(c1) && CString[i+1]==(c2))

    if (IS_SPECIAL_CHAR(SHARP_S))

Notice that I don't need a comment?

If you've got a stupid coding standard, or a quasi-religious belief about getting rid of the C preprocessor, you can recode that macro as 200 lines of template class or something, or just rewrite it as an inline function (constexpr, even!).
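For the constexpr function route, a minimal sketch could look like this. The type `Utf8Pair` and the names `sharp_s`, `u_umlaut` and `is_special_char` are hypothetical, chosen only to mirror the macro above.

```cpp
// Hypothetical type: the two bytes of one UTF-8 encoded character.
struct Utf8Pair {
    char first;
    char second;
};

constexpr Utf8Pair sharp_s{'\xC3', '\x9F'};  // ß
constexpr Utf8Pair u_umlaut{'\xC3', '\xBC'}; // ü

// constexpr stand-in for the macro: true if (c1, c2) is exactly `pair`.
constexpr bool is_special_char(char c1, char c2, Utf8Pair pair) {
    return c1 == pair.first && c2 == pair.second;
}

static_assert(is_special_char('\xC3', '\x9F', sharp_s), "sharp s is recognized");
```

A call site would then read `if (is_special_char(CString[i], CString[i + 1], sharp_s))`. The function form also avoids a preprocessor pitfall: a two-parameter macro invoked as `IS_SPECIAL_CHAR(SHARP_S)` receives only one argument, because arguments are counted before `SHARP_S` is expanded.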

Names

Be consistent. You have CString and copy and they are both locals with the same scope? How about cstring and copy? Or CString and CString2?

Who thinks _std__String_to_std__vectorChar_for_ANSI is a good idea? Underscores, double underscores, and capitalization? No, man. How about fix_wstring_umlaut_conversions instead?

Hello @aghast, thank you for your detailed and understandable answer. I implemented everything last night. The code looks much clearer.
– Daniel, Commented Jul 17, 2021 at 18:29
