Remove stopwords listed in a txt file from another txt file

1

Good evening, everyone. I need some help. I am doing text pre-processing, and for that I need to remove from a book in .txt format all the stopwords listed in another text file, "stopwords_br.txt". I found a program that I think does something close to what I'm looking for. However, it is in C++ and I do not understand the commands.

Help me if possible. Thank you.

string line, deleteline;
ifstream stopword;
stopword.open("example.txt");
if (stopword.is_open())
{
    while ( getline (stopword,line) )
    {
        cout << line << endl;
    }
    stopword.close();
}    
else cout << "Unable to open file";

ofstream temp;
temp.open("temp.txt");

cout << "Please input the stop-word you want to delete..\n ";
cin >> deleteline;

while (getline(stopword,line))
{
    if (line != deleteline)
    {
        temp << line << endl;
    }
}
temp.close();
stopword.close();
remove("example.txt");
rename("temp.txt","example.txt");
cout <<endl<<endl<<endl;
system("pause");
return 0;

2 answers

1

What is the format of the file "stopwords_br.txt"?

The code below, based on the one you posted, reads the file and removes the given word from it. It saves the result to a new file and then replaces the original file with it.

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string line, stopword;
    ifstream text_file;
    text_file.open("c:\\temp\\exemplo.txt");

    // Show the current contents of the file.
    if (text_file.is_open()) {
        while (getline(text_file, line)) {
            cout << line << endl;
        }
        text_file.close();
    } else {
        cout << "Unable to open file";
    }

    cout << "\nPlease input the stop-word you want to delete." << endl;
    cin >> stopword;

    // Reopen the input and write the filtered lines to a temporary file.
    text_file.clear();  // reset the EOF state left by the first pass
    text_file.open("c:\\temp\\exemplo.txt");
    ofstream temp;
    temp.open("c:\\temp\\temp.txt");

    if (text_file.is_open()) {
        while (getline(text_file, line)) {
            // Erase every occurrence of the stop-word in this line.
            string::size_type pos;
            while ((pos = line.find(stopword)) != string::npos) {
                line.erase(pos, stopword.length());
            }
            temp << line << endl;
        }
    }
    temp.close();
    text_file.close();

    // Replace the original file with the filtered copy.
    remove("c:\\temp\\exemplo.txt");
    rename("c:\\temp\\temp.txt", "c:\\temp\\exemplo.txt");

    cout << endl << endl << endl;
    system("pause");
    return 0;
}

  • The general idea is good, but (1) it only handles a single stopword; (2) if "a" is a stopword, for example, wouldn't it remove every "a" from every word in the text?

0

If you are on Linux and can use sed and Perl, I would propose:

sed -rf <(perl -00nE 'say "s/\\<(",join("|",split),")\\>//g"' stopw.txt) l.txt

Example:

$ cat stopwords 
a
de
que
para
em
é

$ cat livro 
a minha tia de Braga é que em breve me vem visitar.

$ sed -rf <(perl -00nE 'say "s/\\<(",join("|",split),")\\>//g"' stopwords) livro 
 minha tia  Braga    breve me vem visitar.

where:

  • perl -00nE 'say "s/\\<(",join("|",split),")\\>//g"' stopwords produces s/\<(a|de|que|para|em|é)\>//g, that is, it generates a sed substitution command,
  • which is then applied to the book (sed -rf prog livro).
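For those who would rather stay in C++, as in the question, the same two steps (build one alternation out of the stopword list, then apply it to the text) can be sketched with std::regex. This is only an approximation of the sed command under some assumptions: build_pattern and strip_stopwords are hypothetical names, the stopwords are assumed to contain no regex metacharacters, and \b is used in place of \< and \>.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: turn a whitespace-separated stopword list into the
// pattern \b(w1|w2|...)\b, mirroring what the perl one-liner generates.
// Assumes the words contain no regex metacharacters.
std::string build_pattern(const std::string& stopword_list) {
    std::istringstream in(stopword_list);
    std::string word;
    std::string alternation;
    while (in >> word) {
        if (!alternation.empty()) {
            alternation += '|';
        }
        alternation += word;
    }
    return "\\b(" + alternation + ")\\b";
}

// Hypothetical helper: delete every whole-word match of `pattern` in `text`,
// like the s/...//g substitution that sed applies to the book.
std::string strip_stopwords(const std::string& text,
                            const std::string& pattern) {
    return std::regex_replace(text, std::regex(pattern), "");
}
```

One caveat: std::regex works on individual char values here, so a UTF-8 stopword such as "é" may not be recognized as a word by \b and may therefore be left in the text. For Portuguese text, the sed command in a UTF-8 locale is probably the more dependable choice.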
