Delete words from a file

I have a CSV file and this list of words: https://gist.github.com/alopes/5358189

The CSV file has 3 columns (Text, user.name and Class) and about 100K rows. I need to delete every word that appears in the list from the first column of the CSV.

Can you help me?

  • Do you want to delete only the word, or remove the whole line?

  • Just the word itself.

  • Could you post a few records from your CSV so we can see what the format looks like?

  • Sorry it took me so long to reply, I've been extremely busy these days. I'll try to post a snippet of the file today. Thanks!

1 answer

Using Perl (sorry...), but it is easy to translate to awk.

$ cat stoplist.txt 
de
a
o ....

$ cat ex.csv
meu caro amigo;jjoao;classe a
eu ando a aprender weka;Thyago;classe b
mas a sua sintaxe dá-me algumas dores de cabeça;Thyago;classe a

Let rmstopwords be the following Perl file:

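BEGIN {
  $patt = "que";                 ## build one regexp alternation from the stop words
  open(G, "<", "stoplist.txt") or die "Cannot open stoplist.txt: $!";
  while (<G>) {
    chomp;
    $patt .= "|\Q$_\E" if length;  ## patt="que|de|a|o|..." (\Q..\E keeps each word literal)
  }
  close(G);
}

$F[0] =~ s/\b($patt)\b//g;       ## in the first field, replace every match of patt with ""
print join(";", @F);             ## the last field still ends with the newline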

Applied to our file ex.csv, it gives:

$ perl -naC -F';' rmstopwords ex.csv
 caro amigo;jjoao;classe a
 ando  aprender weka;Thyago;classe b
    sintaxe dá- algumas dores  cabeça;Thyago;classe a
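
For anyone who would rather not use Perl, the same idea (one regex alternation built from the stop list, applied only to the first field) can be sketched in Python. This is a translation of the approach, not the script above. The names build_pattern, strip_stop_words, STOP_WORDS and SAMPLE are made up for illustration, and the short word list is a hypothetical sample standing in for stoplist.txt. It assumes the ";" delimiter used in the example file.

```python
import csv
import io
import re


def build_pattern(stop_words):
    """Compile one regex that matches any stop word as a whole word."""
    words = [w.strip() for w in stop_words if w.strip()]
    # re.escape keeps a stop word containing regex metacharacters literal.
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")


def strip_stop_words(csv_text, pattern, delimiter=";"):
    """Remove stop words from the first column only; other columns are untouched."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if row:  # skip blank lines
            row[0] = pattern.sub("", row[0])
            writer.writerow(row)
    return out.getvalue()


# Hypothetical sample; the real list would be read from stoplist.txt, one word per line.
STOP_WORDS = ["que", "de", "a", "o", "meu", "eu", "mas", "sua", "me"]

# First two rows of the example file shown above.
SAMPLE = (
    "meu caro amigo;jjoao;classe a\n"
    "eu ando a aprender weka;Thyago;classe b\n"
)

if __name__ == "__main__":
    print(strip_stop_words(SAMPLE, build_pattern(STOP_WORDS)), end="")
```

For the real 100K-row file you would pass open file objects to csv.reader and csv.writer instead of strings. As in the Perl version, the removed words leave their surrounding spaces behind, so you may want to collapse repeated spaces afterwards.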
  • Great, thanks a lot! I'll try it and let you know if it works.
