The OP didn't but I find it worth to mention: Speaking about non-ASCII characters, the encoding should be considered as well.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Characters like À, Á, and  are not part of the 7 bit ASCII but were considered in a variety of 8 bit encodings like e.g. Windows 1252. Thereby, it's not granted that a certain character (which is not part of ASCII) has the same code point (i.e. number) in any encoding. (Most of the characters have no number in most encodings.)
However, a unique encoding table is provided by the Unicode containing all characters of any other encoding (I believe). There are implementations as
- UTF-8 where code points are represented by 1 or more 8 bit values (storage with
char) - UTF-16 where code points are represented with 1 or 2 16 bit values (storage with
std::char16_t or, maybe, wchar_t) - UTF-32 where code points are represented with 1 32 bit value (storage with
std::char32_t or, maybe, wchar_t if it has sufficient size).
Concerning the size of wchar_t: Character types.
Having that said, I used wchar_t and std::wstring in my sample to make the usage of umlauts locale and platform independent.
The order used in std::sort() to sort a range of T elements is defined by default with
bool < operator(const T&, const T&) the < operator for T.
However, there are flavors of std::sort() to define a custom predicate instead.
The custom predicate must match the signature and must provide a strict weak ordering relation.
Hence, my recommendation to use a std::map which maps the charactes to an index which results in the intended order.
This is the predicate, I used in my sample:
// sort words auto charIndex = [&mapChars](wchar_t chr) { const CharMap::const_iterator iter = mapChars.find(chr); return iter != mapChars.end() ? iter->second : (CharMap::mapped_type)mapChars.size(); }; auto pred = [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2) { const size_t len = std::min(word1.size(), word2.size()); // + 1 to include zero terminator for (size_t i = 0; i < len; ++i) { const wchar_t chr1 = word1[i], chr2 = word2[i]; const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2); if (i1 != i2) return i1 < i2; } return word1.size() < word2.size(); }; std::sort(words.begin(), words.end(), pred);
From bottom to top:
std::sort(words.begin(), words.end(), pred); is called with a third parameter which provides the predicate pred for my customized order. - The lambda
pred(), compares two std::wstrings character by character. Thereby, the comparison is done using a std::map mapChars which maps wchar_t to unsigned i.e. a character to its rank in my order. - The
mapChars stores only a selection of all character values. Hence, the character in quest might not be found in the mapChars. To handle this, a helper lambda charIndex() is used which returns mapChars.size() in this case – which is granted to be higher than all occurring indices.
The type CharMap is simply a typedef:
typedef std::map<wchar_t, unsigned> CharMap;
To initialize a CharMap, a function is used:
CharMap makeCharMap(const wchar_t *table[], size_t size) { CharMap mapChars; unsigned rank = 0; for (const wchar_t **chars = table; chars != table + size; ++chars) { for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank; ++rank; } return mapChars; }
It has to be called with an array of strings which contains all groups of characters in the intended order:
const wchar_t *table[] = { L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN", L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ" };
The complete sample:
#include <string> #include <sstream> #include <vector> static const wchar_t *table[] = { L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN", L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ" }; static const wchar_t *tableGerman[] = { L"aAäÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN", L"oOöÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uUüÜ", L"vV", L"wW", L"xX", L"yY", L"zZ" }; typedef std::map<wchar_t, unsigned> CharMap; // fill a look-up table to map characters to the corresponding rank CharMap makeCharMap(const wchar_t *table[], size_t size) { CharMap mapChars; unsigned rank = 0; for (const wchar_t **chars = table; chars != table + size; ++chars) { for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank; ++rank; } return mapChars; } // conversion to UTF-8 found in https://stackoverflow.com/a/7561991/7478597 // needed to print to console // Please, note: std::codecvt_utf8() is deprecated in C++17. :-( std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv; // collect words and sort accoring to table void printWordsSorted( const std::wstring &text, const wchar_t *table[], const size_t size) { // make look-up table const CharMap mapChars = makeCharMap(table, size); // strip punctuation and other noise std::wstring textClean; for (const wchar_t chr : text) { if (chr == ' ' || mapChars.find(chr) != mapChars.end()) { textClean += chr; } } // fill word list with sample text std::vector<std::wstring> words; for (std::wistringstream in(textClean);;) { std::wstring word; if (!(in >> word)) break; // bail out // store word words.push_back(word); } // sort words auto charIndex = [&mapChars](wchar_t chr) { const CharMap::const_iterator iter = mapChars.find(chr); return iter != mapChars.end() ? iter->second : (CharMap::mapped_type)mapChars.size(); }; auto pred = [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2) { const size_t len = std::min(word1.size(), word2.size()); // + 1 to include zero terminator for (size_t i = 0; i < len; ++i) { const wchar_t chr1 = word1[i], chr2 = word2[i]; const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2); if (i1 != i2) return i1 < i2; } return word1.size() < word2.size(); }; std::sort(words.begin(), words.end(), pred); // remove duplicates std::vector<std::wstring>::iterator last = std::unique(words.begin(), words.end()); words.erase(last, words.end()); // print result for (const std::wstring &word : words) { std::cout << utf8_conv.to_bytes(word) << '\n'; } } template<typename T, size_t N> size_t size(const T (&arr)[N]) { return sizeof arr / sizeof *arr; } int main() { // a sample string std::wstring sampleText = L"In the German language the ä (a umlaut), ö (o umlaut) and ü (u umlaut)" L" have the same lexicographical rank as their counterparts a, o, and u.\n"; std::cout << "Sample text:\n" << utf8_conv.to_bytes(sampleText) << '\n'; // sort like requested by OP std::cout << "Words of text sorted as requested by OP:\n"; printWordsSorted(sampleText, table, size(table)); // sort like correct in German std::cout << "Words of text sorted as usual in German language:\n"; printWordsSorted(sampleText, tableGerman, size(tableGerman)); }
Output:
Words of text sorted as requested by OP: a and as ä counterparts German have In language lexicographical o ö rank same the their u umlaut ü Words of text sorted as usual in German language: ä a and as counterparts German have In language lexicographical o ö rank same the their u ü umlaut
Live Demo on coliru
Note:
My original intention was to do the output with std::wcout. This didn't work correctly for ä, ö, ü. Hence, I looked up a simple way to convert wstrings to UTF-8. I already knew that UTF-8 is supported in coliru.
@Phil1970 reminded me that I forgot to mention something else:
Sorting of strings (according to “human dictionary” order) is usually provided by std::locale. std::collate provides a locale dependent lexicographical ordering of strings.
The locale plays a role because the order of characters might vary with distinct locales. The std::collate doc. has a nice example for this:
Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp English locale collation order: ängel ar år förnamn ögrupp zebra Zebra Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp
Conversion of UTF-16 ⇔ UTF-32 ⇔ UTF-8 can be achieved by mere bit-arithmetics. For conversion to/from any other encoding (ASCII excluded which is a subset of Unicode), I would recommend a library like e.g. libiconv.
std::mapcould be your solution (mapping the characters to the order index you intend). Thenstd::sortcan be used with a custom predicate considering the order index for comparison. However, "`A" is not a character, these are two. Or did you mean "À"? Please, edit your question to clarify. (Give an example of the order, you prefer, involving non-ASCIIs.)