How do I pre-process text for use with the Weka classification algorithms in Java?

Viewed 2,008 times

4

I am working on my TCC (final undergraduate project). The idea, roughly, is to collect tweets and train a machine learning algorithm to classify them.

How should I pre-process these tweets? The plan is to train a machine learning algorithm on labelled inputs: tweets that indicate an intention to buy and tweets that do not. Later, I want to feed a new tweet to the trained algorithm and have it tell me whether or not the tweet refers to a purchase.

I already have the database of collected tweets, and I have already incorporated the Weka API into my project.

  • Just rewrite your dataset in ARFF format, possibly with a small script.

  • I understand, but I'm trying to picture what this ARFF file would look like. Do you have an idea of what a sketch of this file would be?

  • Post a small sample of your tweet dataset (even a single tweet) and I'll reply.

  • "To e morto Galaxy S5 por R$ 2,600" -- "Preciso de um Galaxy S5" ("I need a Galaxy S5")

  • In the examples above I have 2 tweets: one showing that the user would not buy, and one showing potential interest.

2 answers

2

WEKA reads files in ARFF format.

To create an ARFF file, you must define the following sections:

Relation declaration

A name for the relation, defined on the first line of the file. It is declared as:

@relation <relation name>

If the relation name contains spaces, it must be enclosed in quotation marks.

Attribute declarations

Attributes are declared through an ordered sequence of @attribute statements. Each attribute in the dataset must have its own @attribute statement, which uniquely identifies the attribute's name and its data type. The order in which attributes are declared is the order in which their values appear in the data section.

It is declared as:

@attribute <attribute name> <data type>

The attribute name must start with a letter and, if it contains spaces, must be enclosed in quotes.

The data types supported by WEKA are:

  • Numbers (real or integer): numeric
  • Free text: string
  • Nominal attributes (a predefined list of values)
  • Date: date [<date-format>]
  • Relational attributes

Numeric attributes

Suitable for both integers and reals. Declared as:

@attribute idade numeric

Nominal attributes

Nominal attributes are defined by providing a list of the possible values. For example:

@attribute classe {comprador, possivel-comprador, nao-comprador}

String attributes

Used for arbitrary text. Declared as:

@attribute tweet string

Note: a string value must be enclosed in quotes if it contains spaces.

Data declaration

The start of the data section is declared on a single line:

@data

It marks where the instance data actually begins.

Instance data

Each instance is declared on its own line, and its attribute values must be separated by commas.


To answer your question directly, a possible ARFF file for your problem would look like this:

% Everything after the % is ignored. It can be used to insert comments
@relation compradores

@attribute tweet string
@attribute classe {compraria, nao-compraria}

@data
"To e morto Galaxy S5 por R$ 2,600", nao-compraria
"Preciso de um galaxy s5", compraria
"Configurando meu Galaxy s5", compraria
"Prefiro um iphone do que um galaxy s5", nao-compraria
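
The "small script" mentioned in the comments could look roughly like the sketch below. It uses only the Java standard library and produces the same layout as the file above. The class name `TweetArffSketch`, the helper names, and the escaping rules (backslash-escaping quotes, flattening line breaks) are assumptions made for illustration, not part of the Weka API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns labelled tweets into ARFF text.
class TweetArffSketch {

    // Quote a tweet for the @data section: flatten line breaks so the
    // instance stays on one line, then escape backslashes and double quotes.
    static String quote(String text) {
        String flat = text.replaceAll("[\\r\\n]+", " ");
        String escaped = flat.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Build the whole ARFF document; the map goes from tweet text to label
    // (assumed to be either "compraria" or "nao-compraria").
    static String toArff(Map<String, String> labelledTweets) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation compradores\n\n");
        sb.append("@attribute tweet string\n");
        sb.append("@attribute classe {compraria, nao-compraria}\n\n");
        sb.append("@data\n");
        for (Map.Entry<String, String> e : labelledTweets.entrySet()) {
            sb.append(quote(e.getKey())).append(", ")
              .append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tweets = new LinkedHashMap<>();
        tweets.put("Preciso de um galaxy s5", "compraria");
        tweets.put("Prefiro um iphone do que um galaxy s5", "nao-compraria");
        System.out.print(toArff(tweets));
    }
}
```

Writing the returned string to a `.arff` file gives something Weka should be able to open. Using a map means identical tweets collapse into one entry, which seemed acceptable for a sketch.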

  • I see, thank you very much @Beet. I'll see what I can do here. Thanks in advance!

  • @Maiconfunke If the answer solves your problem, do not forget to mark it as accepted ;-)

  • It did not work exactly this way. I created an ARFF file following this pattern, but I had to use the StringToWordVector filter from the Weka API, which generated the correct file. However, I do not know how to add new instances to test the algorithm. If you can help me...

1

So, man, I'm doing something similar and I ran into the same problem. I collected the tweets with Python and saved them in a JSON file, but when I tried to read the JSON in Weka it was not recognized. I solved it in the following way:

I converted the JSON to CSV, removed all line breaks, commas, and single and double quotes, stripped the accents from the words, and then tried to open it in Weka, and it worked.

After opening it in Weka you can save the file in ARFF format. I then had to open the file and change one line, because Weka was not recognizing the text field as a string. The line, near the top of the file right after the @relation, ended up like this:
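
The cleaning described above can be sketched in Java with the standard library alone. The class and method names are hypothetical, and the exact set of characters removed is an assumption based on the description (line breaks, commas, quotes, accents).

```java
import java.text.Normalizer;

// Hypothetical helper mirroring the manual clean-up described above.
class TweetCleanerSketch {

    static String clean(String tweet) {
        // Decompose accented letters into base letter + combining mark,
        // then drop the combining marks to strip the accents.
        String noAccents = Normalizer.normalize(tweet, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Turn line breaks into spaces and remove commas and quotes.
        String flat = noAccents.replaceAll("[\\r\\n]+", " ")
                .replaceAll("[,'\"]", "");
        // Collapse repeated whitespace left behind by the removals.
        return flat.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("To e morto Galaxy S5 por R$ 2,600"));
    }
}
```

Note that removing commas also changes amounts such as "2,600", which may or may not matter for the classification.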

@attribute text string

You can apply Weka filters to the file, such as RemoveDuplicates to remove duplicate instances. After you have done the procedure above, you can apply StringToWordVector, which will help you do sentiment analysis.
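
For intuition about what a word-vector filter produces, here is a simplified bag-of-words sketch in plain Java. It is not Weka's StringToWordVector implementation: the tokenization, the lowercase normalization, and all names are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Conceptual sketch only: each tweet becomes a 0/1 vector over a vocabulary.
class BagOfWordsSketch {

    // Split on anything that is not a letter or digit, lowercasing first.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // The vocabulary is the sorted set of all words seen in the training tweets.
    static List<String> vocabulary(List<String> tweets) {
        Set<String> vocab = new TreeSet<>();
        for (String tweet : tweets) {
            vocab.addAll(tokenize(tweet));
        }
        return new ArrayList<>(vocab);
    }

    // One position per vocabulary word: 1 if the tweet contains it, else 0.
    // Words outside the vocabulary are simply ignored.
    static int[] vectorize(String tweet, List<String> vocab) {
        Set<String> present = new HashSet<>(tokenize(tweet));
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            vector[i] = present.contains(vocab.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> training = List.of(
                "Preciso de um galaxy s5", "Configurando meu Galaxy s5");
        List<String> vocab = vocabulary(training);
        System.out.println(vocab);
        System.out.println(java.util.Arrays.toString(
                vectorize("Quero um galaxy", vocab)));
    }
}
```

The point of the sketch is that a new tweet has to be vectorized with the same vocabulary that was built from the training tweets, otherwise its attributes will not line up with what the classifier was trained on.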
