Python NLTK does not correctly classify my text even though show_most_informative_features looks okay


Hey guys, I’m learning Python’s NLTK for a course. Using the teacher’s example dataset everything works fine, but when I use a dataset of my own something strange happens that I can’t figure out.

I’m trying to classify a small dataset (1000 rows) from an internal system here. The output of show_most_informative_features is what I expected. Look:

              execut = True             1114 : 1790   =     91.2 : 1.0
                  r$ = True             1114 : 1790   =     56.2 : 1.0
                pens = True             1114 : 1790   =     40.1 : 1.0
                 val = True             1114 : 1790   =     39.8 : 1.0
                 pag = True             1114 : 1790   =     28.5 : 1.0
              parcel = True             1114 : 1790   =     24.1 : 1.0
                real = True             1114 : 1790   =     24.1 : 1.0
            municípi = True             2049 : 1790   =     23.6 : 1.0
             aliment = True             1114 : 203    =     22.2 : 1.0
              veícul = True             2049 : 1790   =     22.1 : 1.0
                   é = True             1114 : 1790   =     21.5 : 1.0
               efetu = True             1114 : 1790   =     21.2 : 1.0
          palmas/to, = True             1114 : 1790   =     21.2 : 1.0
                rend = True             2049 : 1790   =     20.5 : 1.0
                cont = True              178 : 1790   =     19.9 : 1.0

This is the list of stems I use to test the classifier:

['execut', 'r$', 'pen', 'val', 'pag']

Based on this table of features, I expected NLTK to classify it as 1114 (I used the top 5 stems most strongly associated with that class), but my code insists on classifying it as "1790".
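One thing worth double-checking: NLTK’s NaiveBayesClassifier expects each input as a featureset dict, not a bare list of tokens. A minimal sketch of the conversion, assuming the classifier was trained on bag-of-words dicts of the form {stem: True} (`monta_features` is a hypothetical helper, not from the original code):

```python
# Hypothetical helper: turn a list of stems into the {stem: True}
# featureset dict that NaiveBayesClassifier.classify / prob_classify
# expect, assuming training used the same bag-of-words encoding.
def monta_features(tokens):
    return {token: True for token in tokens}

linha_teste_frase = monta_features(['execut', 'r$', 'pen', 'val', 'pag'])
# → {'execut': True, 'r$': True, 'pen': True, 'val': True, 'pag': True}
```

If the training featuresets were built differently (e.g. with `contains(...)` keys or with explicit `False` entries for absent words), the test input has to match that encoding exactly.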

Looking at the classification probabilities (printing the result of prob_classify()), it gets even weirder for me:

CODE:

def show_distribuicao():
    # linha_teste_frase: featureset built from the test tokens
    distribuicao = classificador.prob_classify(linha_teste_frase)

    for classe in distribuicao.samples():
        print("%s: %f" % (classe, distribuicao.prob(classe)))

RESULT:

178: 0.000000
1790: 1.000000
203: 0.000000
1114: 0.000000
2049: 0.000000
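For what it’s worth, an all-or-nothing printout like this does not necessarily mean the model is literally 100% certain: Naive Bayes multiplies many likelihood ratios, and once the winning label’s log-probability is a few dozen nats above the rest, the normalized probabilities round to 1.000000 and 0.000000 under "%f". A small illustration in plain Python (the log scores below are made up for the example, not taken from NLTK):

```python
import math

# Made-up log-probabilities with gaps similar in size to what a
# Naive Bayes model produces after multiplying many ~20:1 ratios.
log_scores = {'1790': -50.0, '1114': -90.0, '2049': -95.0}

# Normalize the way a probability distribution would:
# subtract the max for numerical stability, exponentiate, divide.
maior = max(log_scores.values())
pesos = {c: math.exp(lp - maior) for c, lp in log_scores.items()}
total = sum(pesos.values())
probs = {c: p / total for c, p in pesos.items()}

for classe, p in probs.items():
    print("%s: %f" % (classe, p))
# The 40-nat gap makes this print 1.000000 for '1790'
# and 0.000000 for the others.
```

So the distribution may just be extremely lopsided rather than exactly 0/1; printing more decimal places (or the log-probabilities themselves) would show the real gap.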

A 100% chance of classifying as 1790?? How? And it can’t just be that the classes are wildly imbalanced; look at the value counts (qualificacao_id.value_counts()):

1790    323
203     237
178     152
1114    147
2049    141
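If the imbalance does turn out to matter, one simple check is undersampling every class down to the size of the rarest one before training. A sketch in plain Python, assuming the training base is a list of (featureset, label) pairs as fed to NaiveBayesClassifier.train (`balanceia` is a hypothetical helper, not from the original code):

```python
import random
from collections import defaultdict

# Hypothetical helper: undersample every class so each one has as
# many examples as the rarest class, before training the classifier.
def balanceia(base, seed=42):
    por_classe = defaultdict(list)
    for exemplo, classe in base:
        por_classe[classe].append((exemplo, classe))

    menor = min(len(exemplos) for exemplos in por_classe.values())
    rng = random.Random(seed)

    balanceada = []
    for exemplos in por_classe.values():
        balanceada.extend(rng.sample(exemplos, menor))
    return balanceada
```

With the counts above, this would trim every class to 141 examples; whether that helps or just throws away signal is something to measure on a held-out set.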

Here is a preview (head(10)) of the dataset I’m working with:

Unnamed: 0                                          historico qualificacao_id
0         648   A assistida chegou de Portugal para a audiênc...             178
1         889   AP 0025466-03.2018.827.2729   Não tem condena...            1790
2        4315   o assisitido retornou dizendo que sua colação...             203
3        4512   Atendida: Lucélia Santos de Sousa (9255-0805)...            1790
4        7287   PEDI TESTEMUNHAS PARA RESPOSTA. ACUSADO DE RO...            1790
5        7422   O  seu advogado  particular renunci...             203
6        7526   (Juliana Dias)      A assistida comparec...            1114
7        8272     Tel (63) 9213-2485 Ranilson Martins da Silv...            1790
8        8576                                       Orientação.             2049
9        9438   Compareceu a Assistida para apresentar recibo...             178

Does anyone have any idea what’s going on? PS: sorry for the noobishness with these technologies, I really am a noob :P

  • From what you’ve presented, the base seems quite unbalanced: many 1790 items in the list. Try balancing the training base.

  • Thank you Paul. I’ll try to do that.

No answers
