Manipulating a CSV file in C++

Viewed 1,610 times

1

I need to read a CSV file with the following fields:

Id,OwnerUserId,CreationDate,Score,Title,Body    

Id              integer
OwnerUserId     integer
CreationDate    will be stored as char
Score           integer
Title           text
Body            text

Example of a line (the file has many lines):

469, 147, 2008-08-02T15:11:16Z, 21, How can I find the full path to a font from its   display name on a Mac?, "Iam using the Photoshop's.......</ul> "  

The dots in the Body field are only there to shorten the example; the actual text is much longer.

I need to store each record in an array of structs of this type:

struct Questions {
    int id;
    int ownerUsedId;
    char creationDate[30];
    int score;
    char title[100];
    char body[200];
};

For that I wrote the following function:

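#include <cstdlib>   // atoi
#include <cstring>   // strcpy, strtok
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

void loadQuestions(fstream &file, Questions *questions)
{
    string registro;

    getline(file, registro); // discard the first line (the header row)

    getline(file, registro); // read the first data record

    // strtok modifies its input, so copy the line into a writable buffer
    // sized to fit it, since the Body text can be long
    char *buffer = new char[registro.size() + 1];
    char *ptr;
    strcpy(buffer, registro.c_str());

    ptr = strtok(buffer, ",");
    cout << atoi(ptr) << "   "; // id field

    ptr = strtok(NULL, ",");
    cout << atoi(ptr) << "    "; // ownerUsedId field

    ptr = strtok(NULL, ","); // creation date field
    cout << ptr << "    ";

    ptr = strtok(NULL, ","); // score field
    cout << atoi(ptr) << "    ";

    delete[] buffer; // release the temporary buffer
} // printed to the screen only to check the parsing; nothing is stored in the struct yet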

Everything works up to the fourth comma. I am splitting on commas, and that is where the problem arises: a comma can appear in the middle of the text in either the Title or the Body field, which makes strtok split at that point and throws off the rest of the parsing.

Question: how do I store each field correctly in my struct, given that the Body and Title fields can contain several commas? One thing I noticed is that the Body field starts and ends with quotation marks (" "), so I could use the quotation marks as delimiters to copy this field. However, the Body is free text, so it can contain quotation marks of its own.
How do I copy each of these fields correctly?

  • If what you want to capture is quite specific, it becomes easier to use a regex. In your particular case, the simplest fix is probably to change the strtok call for the title so that it splits on " instead of ,

1 answer

1

You can use a combined approach. Up to the fourth field you use strtok; from there on, you run a for loop looking for the first occurrence of a quotation mark (the beginning of Body). Everything collected up to that point is the Title, and everything from there to the end is the Body.
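
A minimal sketch of this combined approach, assuming each record sits on a single line, the first four fields never contain commas, and the first quotation mark after the fourth comma opens the Body. It uses std::string::find in place of strtok for the first four fields. The names parseLineComposite, copyField and trim are hypothetical helpers, and the Questions struct is repeated from the question so that the snippet compiles on its own. As the comments point out, it misreads any line whose Title itself contains a quotation mark.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

struct Questions {
    int id;
    int ownerUsedId;
    char creationDate[30];
    int score;
    char title[100];
    char body[200];
};

// Copy src into a fixed-size char array, truncating if it does not fit
// and always NUL-terminating the result.
template <std::size_t N>
void copyField(char (&dst)[N], const std::string &src) {
    std::strncpy(dst, src.c_str(), N - 1);
    dst[N - 1] = '\0';
}

// Remove leading and trailing spaces.
std::string trim(const std::string &s) {
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: returns false if the line lacks four commas
// followed by a quotation mark.
bool parseLineComposite(const std::string &line, Questions &q) {
    // Step 1: the first four fields are assumed comma-free,
    // so the first four commas delimit them.
    std::size_t pos = 0;
    std::string head[4];
    for (int i = 0; i < 4; ++i) {
        const std::size_t comma = line.find(',', pos);
        if (comma == std::string::npos) return false;
        head[i] = trim(line.substr(pos, comma - pos));
        pos = comma + 1;
    }

    // Step 2: the first quotation mark after that is taken as the start of Body.
    const std::size_t quote = line.find('"', pos);
    if (quote == std::string::npos) return false;

    // Title: everything before that quote, minus the separating comma.
    std::string title = trim(line.substr(pos, quote - pos));
    if (!title.empty() && title.back() == ',') title.pop_back();
    title = trim(title);

    // Body: from the opening quote up to the last quote on the line.
    const std::size_t lastQuote = line.rfind('"');
    const std::string body = (lastQuote > quote)
        ? line.substr(quote + 1, lastQuote - quote - 1)
        : line.substr(quote + 1);

    q.id = std::atoi(head[0].c_str());
    q.ownerUsedId = std::atoi(head[1].c_str());
    copyField(q.creationDate, head[2]);
    q.score = std::atoi(head[3].c_str());
    copyField(q.title, title);   // truncated to 99 characters
    copyField(q.body, body);     // truncated to 199 characters
    return true;
}
```

Calling parseLineComposite(registro, questions[i]) after each getline would fill one array element per line. Note that title[100] and body[200] silently truncate longer text; std::string members would avoid that if the struct layout can be changed.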

  • The problem is that a quotation mark can also appear in the title field. There are 600,000 lines, with the most varied text in the title field.

  • You didn't mention quotes in the title either :) Do you have control over how the file is generated?

  • I forgot to mention the possibility of quotation marks in the title. No, I have no control over it; this is a dataset from the site https://www.kaggle.com/stackoverflow/pythonquestions. It is a file with 600,000 records, so I can't check them one by one to see whether there are quotation marks in the title field.
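
Given that quotation marks can also appear in the title, one option is a quote-aware splitter instead of a plain comma split. The sketch below assumes the file follows the common CSV quoting convention (RFC 4180): a field that contains commas or quotes is wrapped in double quotes, and a literal quote inside it is written as two quotes (""). Whether this dataset follows that convention is an assumption that should be checked against the file. The name splitCsvRecord is hypothetical, and skipping spaces around fields is a leniency added to match the sample line in the question, not part of RFC 4180.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split one CSV record into its fields.
// Assumes RFC 4180-style quoting ("" inside a quoted field means one quote).
std::vector<std::string> splitCsvRecord(const std::string &record) {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;   // currently inside a "..." field
    bool wasQuoted = false;  // current field was quoted and is already closed

    for (std::size_t i = 0; i < record.size(); ++i) {
        const char c = record[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < record.size() && record[i + 1] == '"') {
                    current += '"';    // doubled quote: one literal quote
                    ++i;
                } else {
                    inQuotes = false;  // closing quote
                    wasQuoted = true;
                }
            } else {
                current += c;          // commas are plain text in here
            }
        } else if (c == ',') {
            fields.push_back(current); // field separator
            current.clear();
            wasQuoted = false;
        } else if (wasQuoted) {
            // ignore stray characters between a closing quote and the comma
        } else if (c == '"' && current.empty()) {
            inQuotes = true;           // opening quote at the start of a field
        } else if (c == ' ' && current.empty()) {
            // skip leading spaces (leniency, not part of RFC 4180)
        } else {
            current += c;
        }
    }
    fields.push_back(current);         // last field has no trailing comma
    return fields;
}
```

Each element of the returned vector can then be converted with atoi or copied into the struct. If a quoted Body can span several physical lines, the record would have to be assembled from multiple getline calls before splitting, which is not shown here.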
