
Java has a convenient split method:

String str = "The quick brown fox";
String[] results = str.split(" ");

Is there an easy way to do this in C++?


38 Answers


I just read all the answers and can't find a solution with the following preconditions:

  1. no dynamic memory allocations
  2. no use of boost
  3. no use of regex
  4. c++17 standard only

So here is my solution:

#include <iomanip>
#include <iostream>
#include <iterator>
#include <string_view>
#include <utility>

struct split_by_spaces
{
    std::string_view text;
    static constexpr char delim = ' ';

    struct iterator
    {
        const std::string_view& text;
        std::size_t cur_pos;
        std::size_t end_pos;

        std::string_view operator*() const
        {
            return { &text[cur_pos], end_pos - cur_pos };
        }
        bool operator==(const iterator& other) const
        {
            return cur_pos == other.cur_pos && end_pos == other.end_pos;
        }
        bool operator!=(const iterator& other) const
        {
            return !(*this == other);
        }
        iterator& operator++()
        {
            cur_pos = text.find_first_not_of(delim, end_pos);
            if (cur_pos == std::string_view::npos) {
                cur_pos = text.size();
                end_pos = cur_pos;
                return *this;
            }
            end_pos = text.find(delim, cur_pos);
            if (end_pos == std::string_view::npos) {
                end_pos = text.size();
            }
            return *this;
        }
    };

    [[nodiscard]] iterator begin() const
    {
        auto start = text.find_first_not_of(delim);
        if (start == std::string_view::npos) {
            return iterator{ text, text.size(), text.size() };
        }
        auto end_word = text.find(delim, start);
        if (end_word == std::string_view::npos) {
            end_word = text.size();
        }
        return iterator{ text, start, end_word };
    }
    [[nodiscard]] iterator end() const
    {
        return iterator{ text, text.size(), text.size() };
    }
};

int main()
{
    using namespace std::literals;
    auto str = " there should be no memory allocation during parsing"
               " into words this line and you shouldn't create any"
               " container for intermediate words "sv;
    auto comma = "";
    for (std::string_view word : split_by_spaces{ str }) {
        std::cout << std::exchange(comma, ",") << std::quoted(word);
    }
    auto only_spaces = " "sv;
    for (std::string_view word : split_by_spaces{ only_spaces }) {
        (void)word; // never reached: the string contains only delimiters
        std::cout << "you will not see this line in output" << std::endl;
    }
}

1 Comment

In operator++, the second `if (cur_pos == std::string_view::npos)` should be `if (end_pos == ...)`.

If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I sketch the basic idea below, inspired by both strtok() and the "suffix array" data structure described in Jon Bentley's "Programming Pearls", 2nd edition, chapter 15. The C++ class in this case only adds some organization and convenience of use. The implementation shown can easily be extended to remove leading and trailing whitespace characters from the tokens (a sketch of that extension follows the implementation file below).

Basically, one replaces the separator characters with string-terminating '\0' characters and sets pointers to the tokens within the modified string. In the extreme case, when the string consists only of separators, one gets string-length plus one resulting empty tokens. It is practical to duplicate the string before modifying it.
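Before the class version, here is a minimal standalone sketch of that core idea (the hard-coded ',' delimiter and the fixed-size pointer array are illustrative assumptions, not part of the class below):

#include <cstdio>
#include <cstddef>

int main()
{
    // duplicate of the input; the original string stays untouched
    char buff[] = "Item1,,Item2,Item3";
    const char *tokens[8];
    std::size_t num_tokens = 0;

    tokens[num_tokens++] = buff;          // first token starts at the beginning
    for (char *p = buff; *p != '\0'; ++p) {
        if (*p == ',') {
            *p = '\0';                    // overwrite the separator in place
            tokens[num_tokens++] = p + 1; // the next token starts right after it
        }
    }

    for (std::size_t i = 0; i < num_tokens; ++i)
        printf("%s\n", tokens[i]);        // prints Item1, an empty line, Item2, Item3
    return 0;
}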

Header file:

#include <cassert>
#include <cstddef>
#include <cstring>

class TextLineSplitter
{
public:

    TextLineSplitter( const size_t max_line_len );

    ~TextLineSplitter();

    void SplitLine( const char *line,
                    const char sep_char = ',' );

    inline size_t NumTokens( void ) const
    {
        return mNumTokens;
    }

    const char * GetToken( const size_t token_idx ) const
    {
        assert( token_idx < mNumTokens );
        return mTokens[ token_idx ];
    }

private:
    const size_t mStorageSize; // = max_line_len + 1

    char   *mBuff;
    char  **mTokens;
    size_t  mNumTokens;

    inline void ResetContent( void )
    {
        memset( mBuff, 0, mStorageSize );
        // mark all items as empty:
        memset( mTokens, 0, mStorageSize * sizeof( char* ) );
        // reset counter for found items:
        mNumTokens = 0L;
    }
};

Implementation file:

TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
    mStorageSize ( max_line_len + 1L )
{
    // allocate memory
    mBuff   = new char  [ mStorageSize ];
    mTokens = new char* [ mStorageSize ];

    ResetContent();
}

TextLineSplitter::~TextLineSplitter()
{
    delete [] mBuff;
    delete [] mTokens;
}

void TextLineSplitter::SplitLine( const char *line,
                                  const char sep_char /* = ',' */ )
{
    assert( sep_char != '\0' );

    ResetContent();
    strncpy( mBuff, line, mStorageSize - 1 ); // keep room for the final '\0'

    size_t idx = 0L; // running index for characters

    do
    {
        assert( idx < mStorageSize );

        const char chr = line[ idx ]; // retrieve current character

        if( mTokens[ mNumTokens ] == NULL )
        {
            mTokens[ mNumTokens ] = &mBuff[ idx ];
        } // if

        if( chr == sep_char || chr == '\0' )
        { // item or line finished
            // overwrite separator with a 0-terminating character:
            mBuff[ idx ] = '\0';
            // count-up items:
            mNumTokens ++;
        } // if

    } while( line[ idx++ ] );
}
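As a sketch of the whitespace-trimming extension mentioned above (TrimToken is a hypothetical helper, not part of the class as posted), one could post-process each entry of mTokens after SplitLine():

#include <cctype>
#include <cstring>

// hypothetical helper: advance the token pointer past leading whitespace and
// cut trailing whitespace by writing '\0' into the owned buffer in place;
// called as TrimToken( mTokens[ i ] ) for each i < mNumTokens
void TrimToken( char *&token )
{
    while( *token != '\0' && isspace( (unsigned char)*token ) )
        token ++;
    size_t len = strlen( token );
    while( len > 0 && isspace( (unsigned char)token[ len - 1 ] ) )
        token[ --len ] = '\0';
}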

A scenario of usage would be:

// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );

for( size_t i = 0; i < spl.NumTokens(); i++ )
{
    printf( "%s\n", spl.GetToken( i ) );
}

output:

Item1

Item2
Item3



boost::tokenizer is your friend, but consider making your code portable with respect to internationalization (i18n) issues by using wstring/wchar_t instead of the legacy string/char types.

#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

using namespace std;
using namespace boost;

typedef tokenizer<char_separator<wchar_t>,
                  wstring::const_iterator, wstring> Tok;

int main()
{
    wstring s;
    while (getline(wcin, s)) {
        char_separator<wchar_t> sep(L" "); // list of separator characters
        Tok tok(s, sep);
        for (Tok::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
            wcout << *beg << L"\t"; // output (or store in vector)
        }
        wcout << L"\n";
    }
    return 0;
}

2 Comments

"legacy" is definitely not correct and wchar_t is a horrible implementation dependent type that nobody should use unless absolutely necessary.
Use of wchar_t doesn't somehow automatically solve any i18n issues. You use encodings to solve that problem. If you're splitting a string by a delimiter, it is implied that the delimiter doesn't collide with the encoded contents of any token inside the string. Escaping may be needed, etc. wchar_t isn't a magical solution to this.

Simple C++ code (standard C++98), accepts multiple delimiters (specified in a std::string), uses only vectors, strings and iterators.

#include <iostream>
#include <vector>
#include <string>
#include <stdexcept>

std::vector<std::string> split(const std::string& str, const std::string& delim)
{
    std::vector<std::string> result;
    if (str.empty())
        throw std::runtime_error("Can not tokenize an empty string!");
    std::string::const_iterator begin, str_it;
    begin = str_it = str.begin();
    do {
        // find the position of the first delimiter in str
        // (test for end() first, so the end iterator is never dereferenced)
        while (str_it != str.end() && delim.find(*str_it) == std::string::npos)
            str_it++;
        std::string token = std::string(begin, str_it); // grab the token
        if (!token.empty()) // empty token only when str starts with a delimiter
            result.push_back(token); // push the token into a vector<string>
        // ignore the additional consecutive delimiters
        while (str_it != str.end() && delim.find(*str_it) != std::string::npos)
            str_it++;
        begin = str_it; // process the remaining tokens
    } while (str_it != str.end());
    return result;
}

int main()
{
    std::string test_string = ".this is.a.../.simple;;test;;;END";
    std::string delim = "; ./"; // string containing the delimiters
    std::vector<std::string> tokens = split(test_string, delim);
    for (std::vector<std::string>::const_iterator it = tokens.begin();
         it != tokens.end(); it++)
        std::cout << *it << std::endl;
}


#include <algorithm>
#include <string>
#include <vector>

using namespace std;

/// split a string into multiple sub strings, based on a separator string
/// for example, if separator="::",
///
/// s = "abc"              -> "abc"
/// s = "abc::def xy::st:" -> "abc", "def xy" and "st:"
/// s = "::abc::"          -> "abc"
/// s = "::"               -> NO sub strings found
/// s = ""                 -> NO sub strings found
///
/// then append the sub-strings to the end of the vector v.
///
/// the idea comes from the findUrls() function of "Accelerated C++",
/// chapter 7, findurls.cpp
///
void split(const string& s, const string& sep, vector<string>& v)
{
    typedef string::const_iterator iter;
    iter b = s.begin(), e = s.end(), i;
    iter sep_b = sep.begin(), sep_e = sep.end();

    // search through s
    while (b != e) {
        i = search(b, e, sep_b, sep_e);

        // no more separator found
        if (i == e) {
            // it's not an empty string
            if (b != e)
                v.push_back(string(b, e));
            break;
        }
        else if (i == b) {
            // the separator is found and right at the beginning
            // in this case, we need to move on and search for the
            // next separator
            b = i + sep.length();
        }
        else {
            // found the separator
            v.push_back(string(b, i));
            b = i;
        }
    }
}

The boost library is good, but it is not always available. Doing this sort of thing by hand is also a good brain exercise. Here we just use the std::search() algorithm from the standard library; see the code above.
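For completeness, a minimal usage sketch, assuming the split() function above is in scope (the example string mirrors the comment block in the code):

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> v;
    split("abc::def xy::st:", "::", v);
    for (std::vector<std::string>::const_iterator it = v.begin(); it != v.end(); ++it)
        std::cout << *it << std::endl; // prints "abc", "def xy", "st:"
    return 0;
}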

Comments

0

I made a lexer/tokenizer before using only the standard libraries. Here's the code:

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib>

using namespace std;

// insert '|' between every pair of adjacent characters, e.g. "abc" -> "a|b|c"
string seps(string& s)
{
    if (!s.size())
        return "";
    stringstream ss;
    ss << s[0];
    for (size_t i = 1; i < s.size(); i++) {
        ss << '|' << s[i];
    }
    return ss.str();
}

void Tokenize(string& str, vector<string>& tokens, const string& delimiters = " ")
{
    str = seps(str); // the return value was originally discarded; use it
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first "non-delimiter".
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters. Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next "non-delimiter"
        pos = str.find_first_of(delimiters, lastPos);
    }
}

int main(int argc, char *argv[])
{
    vector<string> t;
    string s = "Tokens for everyone!";

    Tokenize(s, t, "|");

    for (auto c : t)
        cout << c << endl;

    system("pause");
    return 0;
}

Comments


Here is my answer to this. I ran into a similar problem and used the following solution. It works well for log parsing and can be used similarly to other languages' split functions that return a list.

#include <iostream>
#include <string_view>
#include <utility>

std::pair<size_t, size_t> parse_print_string(const std::string_view* log, size_t index)
{
    size_t offset = 0x0;
    size_t pos = 0x0;
    size_t count = 0x0;
    while ((pos = (log->substr(offset)).find(" ")) != std::string_view::npos) {
        if (count == index) {
            return std::pair<size_t, size_t>(offset, pos);
        }
        offset = offset + pos + 1;
        count++;
    }
    return std::pair<size_t, size_t>(offset, pos);
}

int main()
{
    std::string_view log = "nice apple orange";
    std::pair<size_t, size_t> indices = parse_print_string(&log, 2);
    std::cout << log.substr(indices.first, indices.second);
}
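To actually get back a list of tokens, a small sketch built on the same std::string_view approach (the split_view helper name is my own, not from the answer above):

#include <iostream>
#include <string_view>
#include <vector>

// hypothetical helper: collect every space-separated token into a vector;
// each element is a view into the original string, so nothing is copied
std::vector<std::string_view> split_view(std::string_view log)
{
    std::vector<std::string_view> tokens;
    std::size_t offset = 0;
    while (offset < log.size()) {
        std::size_t pos = log.find(' ', offset);
        if (pos == std::string_view::npos)
            pos = log.size();
        if (pos > offset) // skip runs of consecutive spaces
            tokens.push_back(log.substr(offset, pos - offset));
        offset = pos + 1;
    }
    return tokens;
}

int main()
{
    for (std::string_view token : split_view("nice apple orange"))
        std::cout << token << '\n';
}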



This is a simple loop to tokenize using only standard library headers:

#include <iostream>
#include <cstring>

class word
{
public:
    char w[20];
    word()
    {
        for (int j = 0; j < 20; j++)
            w[j] = '\0';
    }
};

int main()
{
    char input[100];
    word ww[100];
    int j = 0, k = 0;

    std::cin.getline(input, 100); // gets() is unsafe and was removed from the standard
    int n = std::strlen(input);

    for (int i = 0; i < n; i++)
    {
        if (input[i] != ' ') // copy characters into the current word
        {
            ww[k].w[j] = input[i];
            j++;
        }
        else                 // a space ends the current word
        {
            k++;
            j = 0;
        }
    }
}

