How to split a string in C++?

8

I was given this simple (or so I thought!) challenge of writing a "tokenizer". I had to split the string " O rato roeu a roupa do rei de roma " on spaces. After a long time, I came up with the following algorithm, using vectors, strings, and the <algorithm> header.

#ifndef STR_PARSE_HPP
#define STR_PARSE_HPP
#include <algorithm>
#include <string>
#include <vector>
using std::reverse;
using std::vector;
using std::string;
    vector<string> split(string str, char delimiter = ' ')
    {
        vector<string> ret;
        if((str.find(delimiter) == string::npos) && (str.find_first_not_of(delimiter) == string::npos)) throw nullptr;
        else if ((str.find(delimiter) == string::npos)) ret.push_back(str);
        else if(str.find_first_not_of(delimiter) == string::npos) ret.push_back(string(""));
        else
        {
            unsigned i = 0;
            string strstack;
            while(str[0] == delimiter) {str.erase(0,1);}
            reverse(str.begin(), str.end());
            while(str[0] == delimiter) {str.erase(0,1);}
            reverse(str.begin(), str.end());
            while(!str.empty())
            {
                ret.push_back(str.substr(i, str.find(delimiter)));
                str.erase(0,str.find(delimiter));
                while(str[0] == delimiter) {str.erase(0,1);}
            }
        }
        return ret;
    }
#endif // STR_PARSE_HPP

The test:

#include <iostream>
#include "str_parse.hpp"
using std::string;
using std::cout;


int main()
{
    string a = "        O    rato roeu a roupa do rei de roma             ";
    for(int i = 0; i < split(a).size(); i++)
    cout << split(a)[i];
}

Output was as expected:

O
rato
roeu
a
roupa
do
rei
de
roma

Then, since I had already lost a "bit" of time on it, I decided to test with other delimiters. The crash is instant, and the debugger here is "broken" (execution runs straight past the breakpoints). What’s wrong with my code?

  • Maybe the strtok function meets your needs: http://www.cplusplus.com/reference/cstring/strtok/

  • Note that the usual expected result would be for the empty "items" to be returned as well; that is the most common behavior in the implementations I know. Not that this is a mistake, I’m just making an observation.

  • @Hwapx Really? Thank you!

4 answers

7

It is worth noting that the standard library often already has very similar algorithms that you can use for your purposes. If the delimiter is always a space (or other whitespace), you can rely on reading strings from streams. Like this:

vector<string> split(const string& str) {
    stringstream ss(str);
    vector<string> vec {istream_iterator<string>{ss}, istream_iterator<string>{}};
    return vec;
}

Or, as @pepper_chico suggested:

vector<string> split(const string& str) {
    stringstream ss(str);
    return {istream_iterator<string>{ss}, istream_iterator<string>{}};
}

Example: coliru.
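
For anyone who wants to paste and compile it, here is a minimal self-contained sketch of the same idea with the headers it needs. The name split_ws is made up for this example and is not part of the answer.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// split_ws is a hypothetical name used only for this sketch.
std::vector<std::string> split_ws(const std::string& str) {
    std::istringstream ss(str);
    // operator>> skips leading whitespace and stops at the next whitespace,
    // so every step of the iterator yields exactly one word.
    return {std::istream_iterator<std::string>{ss},
            std::istream_iterator<std::string>{}};
}
```

With this sketch, split_ws("  O   rato\nroeu ") should give the three words O, rato and roeu, and an empty or all-whitespace string should give an empty vector.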

If you have other delimiters, you can use another ready-made function: getline. Although the name doesn’t suggest it, it does exactly what you want: break up a string. The catch is that the default delimiter is the line break, hence the name. Use it like this:

vector<string> split(const string& str, char delim=' ') {
    stringstream ss(str);
    string tok;
    vector<string> vec;
    while (getline(ss, tok, delim)) {
        if (!tok.empty())
            vec.push_back(tok);
    }
    return vec;
}

Example: coliru.
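
To make the role of the !tok.empty() check visible, here is a hedged variation of the getline loop with an extra switch for keeping empty tokens. The names split_on and keep_empty are assumptions made for this sketch only.

```cpp
#include <sstream>
#include <string>
#include <vector>

// split_on and keep_empty are hypothetical names used only for this sketch.
std::vector<std::string> split_on(const std::string& str, char delim,
                                  bool keep_empty = false) {
    std::istringstream ss(str);
    std::string tok;
    std::vector<std::string> vec;
    // Each call reads up to the next delim (or the end of the input) into tok
    // and discards the delim itself.
    while (std::getline(ss, tok, delim)) {
        // Two delimiters in a row produce an empty tok.
        if (keep_empty || !tok.empty())
            vec.push_back(tok);
    }
    return vec;
}
```

With this sketch, split_on("a,,b", ',') should give a and b, while split_on("a,,b", ',', true) should also give the empty item between them. Note that getline reports nothing after a trailing delimiter, so "a,b," gives two items either way.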

  • William, I have always written my own algorithms, so I end up not knowing much of the standard library. Could you explain to me how iterators work?

  • 2

    @Lucashenrique Iterators work as a generic sequence of something. When I use an istream_iterator, I’m creating a sequence of reads of a particular type from a stream. Since strings are being read, the breaks occur at spaces and line breaks, just as with cin >> str;. I used one of the vector constructors that takes a pair of iterators, so I don’t have to write the loop myself. The code ends up pretty clean.

  • 2

    @Lucashenrique In the second case I used getline, which is somewhat similar to strtok but works with strings. Writing algorithms with standard functions usually gives you a good combination of performance and clarity. It is generally safe to assume that these functions are fast when used correctly. This version of the code is likely to be more efficient.

  • @downvoter Can you explain why you think this answer is not good? Is there something wrong with it?

  • William, then, what is the logic of getline(cin,str)?

  • 1

    @Lucashenrique getline reads from the first argument (a stream) and places the result in the second argument (a string), stopping when it finds the delimiter (the third argument). Since the third argument defaults to the line break, getline(cin,str) reads an entire line from the standard input cin and puts it in the string str.

  • So, if I wanted to, could I pass any istream or ostream object? Or only input streams like ifstream? And what about fstream, which is both?

  • @Lucas An ostream would not work, because getline must be able to read. An fstream is no problem, since it can be read. Note that I had to turn the string into a stringstream before doing anything in my examples.

  • And when the delimiter is not ' '? It crashes here.

  • @Lucas Works perfectly with other delimiters: http://coliru.stacked-crooked.com/a/8d6fa1ec1d060c15 How did you test? There must be a small mistake somewhere else.

  • Haha, sorry, the error was in my main code :P

  • @Lucashenrique just a note: for "could you please explain this to me" requests, it is better to start a chat on Stack Overflow instead of lengthening the discussion in the comments.

  • 1

    return { istream_iterator<string>{ss}, istream_iterator<string>{} } also works ;-)

5


Looking at your code, the first thing I noticed is that your main function cannot be exactly as you posted it; that does not even compile, so I believe something is missing from your post. Supposing you did something similar to this:

int main()
{
    string a = "        O    rato roeu a roupa do rei de roma             ";
    vector<string> split_vector = split(a);
    for(unsigned int i = 0; i < split_vector.size(); i++) {
    cout << split_vector[i];
    }
}

The problem with your split function is in the following section:

while(!str.empty())
 {
    ret.push_back(str.substr(i, str.find(delimiter)));
    str.erase(0,str.find(delimiter));
     while(str[0] == delimiter) {str.erase(0,1);}
 }

This while:

while(str[0] == delimiter) {str.erase(0,1);}

It does not test whether the string has already reached its end, which is why the crash occurs. It should only run while the string is not empty:

while(!(str.empty()) && (str[0] == delimiter)) {str.erase(0,1);}

In fact, you should make this check wherever you have similar code. I rewrote your function the way I believe is correct and removed some tests I found unnecessary:

vector<string> split(string str, char delimiter = ' ')
{
    vector<string> ret;
    if(str.empty()) 
    {
        ret.push_back(string(""));
        return ret;
    }

    unsigned i = 0;
    string strstack;
    while(!(str.empty()) && (str[0] == delimiter)) {str.erase(0,1);}
    reverse(str.begin(), str.end());
    while(!(str.empty()) && (str[0] == delimiter)) {str.erase(0,1);}
    reverse(str.begin(), str.end());
    while(!str.empty())
    {
        ret.push_back(str.substr(i, str.find(delimiter)));
        str.erase(0,str.find(delimiter));
        while(!(str.empty()) && (str[0] == delimiter)) {str.erase(0,1);}
    }

    return ret;
}
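
As a side note, the same trimming and splitting can be sketched without erase or reverse, using find_first_not_of and find so that the string is never indexed directly. This is an alternative under my own assumptions, not the asker's or the answer's code: the name split_find is made up, and it returns an empty vector for an empty or all-delimiter input, which differs from the functions above.

```cpp
#include <string>
#include <vector>

// split_find is a hypothetical name used only for this sketch.
std::vector<std::string> split_find(const std::string& str, char delimiter = ' ') {
    std::vector<std::string> ret;
    // Start of the first token: first character that is not a delimiter.
    std::string::size_type start = str.find_first_not_of(delimiter);
    while (start != std::string::npos) {
        // End of the token: next delimiter, or npos if the token runs to the end.
        std::string::size_type end = str.find(delimiter, start);
        // substr clamps the count, so end == npos takes the rest of the string.
        ret.push_back(str.substr(start, end - start));
        // Skip the run of delimiters that follows; yields npos when none is left.
        start = str.find_first_not_of(delimiter, end);
    }
    return ret;
}
```

Because the input is taken by const reference and only searched, nothing is copied or erased along the way, and the loop condition is the only place that decides when to stop.
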
  • I don’t understand! If that were the case, the main while(!str.empty()) should exit before the bug occurs. Why doesn’t it?

  • Because when you get to the last word, roma in your example, the line str.erase(0,str.find(delimiter)); makes the string empty, and while(str[0] == delimiter) {str.erase(0,1);} runs before the main while checks its condition again.

  • Haha, now I get it. Thank you @Selma!

  • Selma, your answer is wrong. The while with erase will erase the entire string (at least it should), and other delimiters do not work either.

  • Lucas, when you run the function I posted with the string from your example, the first and second while loops remove the spaces at the beginning and the end, leaving O rato roeu a roupa do rei de roma. Inside the outer loop while(!str.empty()), there comes an iteration that starts with the string holding only the last word, in this case roma. push_back inserts it into the vector, str.erase(0,str.find(delimiter)); empties the string, and the inner while then runs on an empty string, which in the old code made the expression (str[0] == delimiter) invalid.

2

@Selma has already explained what the problem with your code is, so I’ll just share an alternative implementation.

#include <string>
#include <vector>

using namespace std;

vector<string> split(string str, char delimiter = ' ')
{
    vector<string> ret;

    int start = 0;

    for(int i = 0; i < str.length(); ++i) {
        if(str[i] == delimiter) {
            ret.push_back(str.substr(start, i-start));
            start = i+1;
        }
    }

    // Last item: everything after the final delimiter.
    ret.push_back(str.substr(start));

    return ret;
}

This implementation returns the empty "items" as well.
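
To see what that means for a concrete input, here is a small sketch of the same loop written with std::string::size_type for the indices. The name split_all is hypothetical and only used here.

```cpp
#include <string>
#include <vector>

// split_all is a hypothetical name used only for this sketch.
std::vector<std::string> split_all(const std::string& str, char delimiter = ' ') {
    std::vector<std::string> ret;
    std::string::size_type start = 0;
    for (std::string::size_type i = 0; i < str.length(); ++i) {
        if (str[i] == delimiter) {
            // May push an empty string when two delimiters are adjacent.
            ret.push_back(str.substr(start, i - start));
            start = i + 1;
        }
    }
    // Whatever follows the last delimiter, possibly empty.
    ret.push_back(str.substr(start));
    return ret;
}
```

With this sketch, split_all("a,,b,", ',') should give four items: a, an empty string, b, and a final empty string for the text after the trailing comma.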

#include <string>
#include <vector>

using namespace std;

vector<string> split(string str, char delimiter = ' ')
{
    vector<string> ret;

    int start = 0;

    for(int i = 0; i < str.length(); ++i) {
        if(str[i] == delimiter) {
            if(i - start != 0)
                ret.push_back(str.substr(start, i-start));
            start = i+1;
        }
    }

    if(str.length() - start != 0)
        ret.push_back(str.substr(start));

    return ret;
}

In this one I added two ifs to ignore empty "items".

0

I didn’t want to revive this dead question, but I rewrote the algorithm now that I know more about iterators.

#include <iostream>
#include <string>
#include <vector>
std::vector<std::string> tokenize(std::string str, char delimiter = ' ')
{
    std::vector<char> string_ret;
    std::vector<std::string> ret;
    for(auto a : str)
    {
        if(a == delimiter)
        {
            std::cout << "Delimiter found >>" << a << "<<" <<  std::endl;
            if(!string_ret.empty())
            {
                std::cout << "Pushing string to ret!\n";
                std::string push;
                for(auto b : string_ret)
                {
                    push.push_back(b);
                }
                string_ret.clear();
                ret.push_back(push);
            }
            else std::cout << "Delimiter found, but string return is empty!" << std::endl;
        }
        else
        {
            std::cout << "char which is not delimiter found! >>" << a << "<<" << std::endl;
            string_ret.push_back(a);
        }
    }
    if(!string_ret.empty())
    {
        std::string push;
        for(auto b : string_ret)
        {
            push.push_back(b);
        }
        ret.push_back(push);
    }
    std::cout << "\n\n\n\n\n";
    return ret;
}

int main()
{
    for(auto a : tokenize("O rato roeu a roupa do rei de roma", 'r'))
    std::cout << a << std::endl;
}
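
For comparison, the same tokenizing can be sketched with iterators and std::find, without the intermediate vector<char> or the debug output. This is only one possible way to write it, and the name tokenize_iter is made up for the example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// tokenize_iter is a hypothetical name used only for this sketch.
std::vector<std::string> tokenize_iter(const std::string& str, char delimiter = ' ') {
    std::vector<std::string> ret;
    auto first = str.begin();
    while (first != str.end()) {
        // Find the end of the current run of non-delimiter characters.
        auto last = std::find(first, str.end(), delimiter);
        if (first != last) {
            // std::string can be constructed from a pair of iterators.
            ret.emplace_back(first, last);
        }
        if (last == str.end()) break;
        first = last + 1;  // step over the delimiter
    }
    return ret;
}
```

With this sketch, tokenize_iter("O rato roeu a roupa do rei de roma", 'r') should give six tokens, the first being "O " (with its trailing space) and the last being "oma".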
