
Java has a convenient split method:

String str = "The quick brown fox";
String[] results = str.split(" ");

Is there an easy way to do this in C++?


38 Answers


I just read all the answers and can't find a solution with the following preconditions:

  1. no dynamic memory allocations
  2. no use of boost
  3. no use of regex
  4. c++17 standard only

So here is my solution:

#include <iomanip>
#include <iostream>
#include <iterator>
#include <string_view>
#include <utility>

struct split_by_spaces
{
    std::string_view text;
    static constexpr char delim = ' ';

    struct iterator
    {
        const std::string_view& text;
        std::size_t cur_pos;
        std::size_t end_pos;

        std::string_view operator*() const
        {
            return { &text[cur_pos], end_pos - cur_pos };
        }
        bool operator==(const iterator& other) const
        {
            return cur_pos == other.cur_pos && end_pos == other.end_pos;
        }
        bool operator!=(const iterator& other) const
        {
            return !(*this == other);
        }
        iterator& operator++()
        {
            cur_pos = text.find_first_not_of(delim, end_pos);
            if (cur_pos == std::string_view::npos) {
                cur_pos = text.size();
                end_pos = cur_pos;
                return *this;
            }
            end_pos = text.find(delim, cur_pos);
            if (end_pos == std::string_view::npos) {
                end_pos = text.size();
            }
            return *this;
        }
    };

    [[nodiscard]] iterator begin() const
    {
        auto start = text.find_first_not_of(delim);
        if (start == std::string_view::npos) {
            return iterator{ text, text.size(), text.size() };
        }
        auto end_word = text.find(delim, start);
        if (end_word == std::string_view::npos) {
            end_word = text.size();
        }
        return iterator{ text, start, end_word };
    }
    [[nodiscard]] iterator end() const
    {
        return iterator{ text, text.size(), text.size() };
    }
};

int main()
{
    using namespace std::literals;
    auto str = " there should be no memory allocation during parsing"
               " into words this line and you shouldn't create any"
               " container for intermediate words "sv;
    auto comma = "";
    for (std::string_view word : split_by_spaces{ str }) {
        std::cout << std::exchange(comma, ",") << std::quoted(word);
    }
    auto only_spaces = " "sv;
    for (std::string_view word : split_by_spaces{ only_spaces }) {
        (void)word; // never reached: the string contains only delimiters
        std::cout << "you will not see this line in output" << std::endl;
    }
}

1 Comment

In operator++, the second `if (cur_pos == std::string_view::npos)` should be `if (end_pos == ...)`.

If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I sketch the basic idea below, inspired by both strtok() and the "suffix array" data structure described in Jon Bentley's "Programming Pearls", 2nd edition, chapter 15. The C++ class in this case only adds some organization and convenience of use. The implementation shown can easily be extended to remove leading and trailing whitespace characters from the tokens (a sketch of that extension follows the implementation file below).

Basically, one replaces the separator characters with string-terminating '\0' characters and sets pointers to the tokens within the modified string. In the extreme case, when the string consists only of separators, one gets string-length plus one resulting empty tokens. It is practical to duplicate the string before modifying it.
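Before the class version, here is a minimal standalone sketch of that core idea (the hard-coded ',' delimiter and the fixed-size pointer array are illustrative assumptions, not part of the class below):

#include <cstdio>
#include <cstddef>

int main()
{
    // duplicate of the input; the original string stays untouched
    char buff[] = "Item1,,Item2,Item3";
    const char *tokens[8];
    std::size_t num_tokens = 0;

    tokens[num_tokens++] = buff;          // first token starts at the beginning
    for (char *p = buff; *p != '\0'; ++p) {
        if (*p == ',') {
            *p = '\0';                    // overwrite the separator in place
            tokens[num_tokens++] = p + 1; // the next token starts right after it
        }
    }

    for (std::size_t i = 0; i < num_tokens; ++i)
        printf("%s\n", tokens[i]);        // prints Item1, an empty line, Item2, Item3
    return 0;
}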

Header file:

#include <cassert>
#include <cstddef>
#include <cstring>

class TextLineSplitter
{
public:

    TextLineSplitter( const size_t max_line_len );

    ~TextLineSplitter();

    void SplitLine( const char *line,
                    const char sep_char = ',' );

    inline size_t NumTokens( void ) const
    {
        return mNumTokens;
    }

    const char * GetToken( const size_t token_idx ) const
    {
        assert( token_idx < mNumTokens );
        return mTokens[ token_idx ];
    }

private:
    const size_t mStorageSize; // = max_line_len + 1

    char   *mBuff;
    char  **mTokens;
    size_t  mNumTokens;

    inline void ResetContent( void )
    {
        memset( mBuff, 0, mStorageSize );
        // mark all items as empty:
        memset( mTokens, 0, mStorageSize * sizeof( char* ) );
        // reset counter for found items:
        mNumTokens = 0L;
    }
};

Implementation file:

TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
    mStorageSize ( max_line_len + 1L )
{
    // allocate memory
    mBuff   = new char  [ mStorageSize ];
    mTokens = new char* [ mStorageSize ];

    ResetContent();
}

TextLineSplitter::~TextLineSplitter()
{
    delete [] mBuff;
    delete [] mTokens;
}

void TextLineSplitter::SplitLine( const char *line,
                                  const char sep_char /* = ',' */ )
{
    assert( sep_char != '\0' );

    ResetContent();
    strncpy( mBuff, line, mStorageSize - 1 ); // keep room for the final '\0'

    size_t idx = 0L; // running index for characters

    do
    {
        assert( idx < mStorageSize );

        const char chr = line[ idx ]; // retrieve current character

        if( mTokens[ mNumTokens ] == NULL )
        {
            mTokens[ mNumTokens ] = &mBuff[ idx ];
        } // if

        if( chr == sep_char || chr == '\0' )
        { // item or line finished
            // overwrite separator with a 0-terminating character:
            mBuff[ idx ] = '\0';
            // count-up items:
            mNumTokens ++;
        } // if

    } while( line[ idx++ ] );
}
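As a sketch of the whitespace-trimming extension mentioned above (TrimToken is a hypothetical helper, not part of the class as posted), one could post-process each entry of mTokens after SplitLine():

#include <cctype>
#include <cstring>

// hypothetical helper: advance the token pointer past leading whitespace and
// cut trailing whitespace by writing '\0' into the owned buffer in place;
// called as TrimToken( mTokens[ i ] ) for each i < mNumTokens
void TrimToken( char *&token )
{
    while( *token != '\0' && isspace( (unsigned char)*token ) )
        token ++;
    size_t len = strlen( token );
    while( len > 0 && isspace( (unsigned char)token[ len - 1 ] ) )
        token[ --len ] = '\0';
}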

A scenario of usage would be:

// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );

for( size_t i = 0; i < spl.NumTokens(); i++ )
{
    printf( "%s\n", spl.GetToken( i ) );
}

output:

Item1

Item2
Item3



boost::tokenizer is your friend, but consider making your code portable with respect to internationalization (i18n) issues by using wstring/wchar_t instead of the legacy string/char types.

#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

using namespace std;
using namespace boost;

typedef tokenizer<char_separator<wchar_t>,
                  wstring::const_iterator, wstring> Tok;

int main()
{
    wstring s;
    while (getline(wcin, s)) {
        char_separator<wchar_t> sep(L" "); // list of separator characters
        Tok tok(s, sep);
        for (Tok::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
            wcout << *beg << L"\t"; // output (or store in vector)
        }
        wcout << L"\n";
    }
    return 0;
}

2 Comments

"legacy" is definitely not correct and wchar_t is a horrible implementation dependent type that nobody should use unless absolutely necessary.
Use of wchar_t doesn't somehow automatically solve any i18n issues. You use encodings to solve that problem. If you're splitting a string by a delimiter, it is implied that the delimiter doesn't collide with the encoded contents of any token inside the string. Escaping may be needed, etc. wchar_t isn't a magical solution to this.

Simple C++ code (standard C++98), accepts multiple delimiters (specified in a std::string), uses only vectors, strings and iterators.

#include <iostream>
#include <vector>
#include <string>
#include <stdexcept>

std::vector<std::string> split(const std::string& str, const std::string& delim)
{
    std::vector<std::string> result;
    if (str.empty())
        throw std::runtime_error("Can not tokenize an empty string!");
    std::string::const_iterator begin, str_it;
    begin = str_it = str.begin();
    do {
        // find the position of the first delimiter in str
        // (test for end() first, so the end iterator is never dereferenced)
        while (str_it != str.end() && delim.find(*str_it) == std::string::npos)
            str_it++;
        std::string token = std::string(begin, str_it); // grab the token
        if (!token.empty()) // empty token only when str starts with a delimiter
            result.push_back(token); // push the token into a vector<string>
        // ignore the additional consecutive delimiters
        while (str_it != str.end() && delim.find(*str_it) != std::string::npos)
            str_it++;
        begin = str_it; // process the remaining tokens
    } while (str_it != str.end());
    return result;
}

int main()
{
    std::string test_string = ".this is.a.../.simple;;test;;;END";
    std::string delim = "; ./"; // string containing the delimiters
    std::vector<std::string> tokens = split(test_string, delim);
    for (std::vector<std::string>::const_iterator it = tokens.begin();
         it != tokens.end(); it++)
        std::cout << *it << std::endl;
}


#include <algorithm>
#include <string>
#include <vector>

using namespace std;

/// split a string into multiple sub strings, based on a separator string
/// for example, if separator="::",
///
/// s = "abc"              -> "abc"
/// s = "abc::def xy::st:" -> "abc", "def xy" and "st:"
/// s = "::abc::"          -> "abc"
/// s = "::"               -> NO sub strings found
/// s = ""                 -> NO sub strings found
///
/// then append the sub-strings to the end of the vector v.
///
/// the idea comes from the findUrls() function of "Accelerated C++",
/// chapter 7, findurls.cpp
///
void split(const string& s, const string& sep, vector<string>& v)
{
    typedef string::const_iterator iter;
    iter b = s.begin(), e = s.end(), i;
    iter sep_b = sep.begin(), sep_e = sep.end();

    // search through s
    while (b != e) {
        i = search(b, e, sep_b, sep_e);

        // no more separator found
        if (i == e) {
            // it's not an empty string
            if (b != e)
                v.push_back(string(b, e));
            break;
        }
        else if (i == b) {
            // the separator is found and right at the beginning
            // in this case, we need to move on and search for the
            // next separator
            b = i + sep.length();
        }
        else {
            // found the separator
            v.push_back(string(b, i));
            b = i;
        }
    }
}

The boost library is good, but it is not always available. Doing this sort of thing by hand is also a good brain exercise. Here we just use the std::search() algorithm from the standard library; see the code above.
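For completeness, a minimal usage sketch, assuming the split() function above is in scope (the example string mirrors the comment block in the code):

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> v;
    split("abc::def xy::st:", "::", v);
    for (std::vector<std::string>::const_iterator it = v.begin(); it != v.end(); ++it)
        std::cout << *it << std::endl; // prints "abc", "def xy", "st:"
    return 0;
}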

Comments

0

I made a lexer/tokenizer before using only the standard libraries. Here's the code:

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib>

using namespace std;

// insert '|' between every pair of adjacent characters, e.g. "abc" -> "a|b|c"
string seps(string& s)
{
    if (!s.size())
        return "";
    stringstream ss;
    ss << s[0];
    for (size_t i = 1; i < s.size(); i++) {
        ss << '|' << s[i];
    }
    return ss.str();
}

void Tokenize(string& str, vector<string>& tokens, const string& delimiters = " ")
{
    str = seps(str); // the return value was originally discarded; use it
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first "non-delimiter".
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters. Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next "non-delimiter"
        pos = str.find_first_of(delimiters, lastPos);
    }
}

int main(int argc, char *argv[])
{
    vector<string> t;
    string s = "Tokens for everyone!";

    Tokenize(s, t, "|");

    for (auto c : t)
        cout << c << endl;

    system("pause");
    return 0;
}

Comments


Here is my answer to this. I ran into a similar problem and used the following solution. It works well for log parsing and can be used similarly to other languages' split functions that return a list.

#include <iostream>
#include <string_view>
#include <utility>

std::pair<size_t, size_t> parse_print_string(const std::string_view* log, size_t index)
{
    size_t offset = 0x0;
    size_t pos = 0x0;
    size_t count = 0x0;
    while ((pos = (log->substr(offset)).find(" ")) != std::string_view::npos) {
        if (count == index) {
            return std::pair<size_t, size_t>(offset, pos);
        }
        offset = offset + pos + 1;
        count++;
    }
    return std::pair<size_t, size_t>(offset, pos);
}

int main()
{
    std::string_view log = "nice apple orange";
    std::pair<size_t, size_t> indices = parse_print_string(&log, 2);
    std::cout << log.substr(indices.first, indices.second);
}
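To actually get back a list of tokens, a small sketch built on the same std::string_view approach (the split_view helper name is my own, not from the answer above):

#include <iostream>
#include <string_view>
#include <vector>

// hypothetical helper: collect every space-separated token into a vector;
// each element is a view into the original string, so nothing is copied
std::vector<std::string_view> split_view(std::string_view log)
{
    std::vector<std::string_view> tokens;
    std::size_t offset = 0;
    while (offset < log.size()) {
        std::size_t pos = log.find(' ', offset);
        if (pos == std::string_view::npos)
            pos = log.size();
        if (pos > offset) // skip runs of consecutive spaces
            tokens.push_back(log.substr(offset, pos - offset));
        offset = pos + 1;
    }
    return tokens;
}

int main()
{
    for (std::string_view token : split_view("nice apple orange"))
        std::cout << token << '\n';
}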



This is a simple loop to tokenize using only standard library headers:

#include <iostream>
#include <cstring>

class word
{
public:
    char w[20];
    word()
    {
        for (int j = 0; j < 20; j++)
            w[j] = '\0';
    }
};

int main()
{
    char input[100];
    word ww[100];
    int j = 0, k = 0;

    std::cin.getline(input, 100); // gets() is unsafe and was removed from the standard
    int n = std::strlen(input);

    for (int i = 0; i < n; i++)
    {
        if (input[i] != ' ') // copy characters into the current word
        {
            ww[k].w[j] = input[i];
            j++;
        }
        else                 // a space ends the current word
        {
            k++;
            j = 0;
        }
    }
}

