Hi everyone, I'm learning to work with Python's NLTK for a course. With the teacher's example dataset everything works fine, but with a dataset of my own something strange happens, and since I don't know the library well, I can't figure it out.
I'm trying to analyze a small dataset (1000 lines) of data from an internal system here. The output of show_most_informative_features is what I expected. Look:
execut = True 1114 : 1790 = 91.2 : 1.0
r$ = True 1114 : 1790 = 56.2 : 1.0
pens = True 1114 : 1790 = 40.1 : 1.0
val = True 1114 : 1790 = 39.8 : 1.0
pag = True 1114 : 1790 = 28.5 : 1.0
parcel = True 1114 : 1790 = 24.1 : 1.0
real = True 1114 : 1790 = 24.1 : 1.0
municípi = True 2049 : 1790 = 23.6 : 1.0
aliment = True 1114 : 203 = 22.2 : 1.0
veícul = True 2049 : 1790 = 22.1 : 1.0
é = True 1114 : 1790 = 21.5 : 1.0
efetu = True 1114 : 1790 = 21.2 : 1.0
palmas/to, = True 1114 : 1790 = 21.2 : 1.0
rend = True 2049 : 1790 = 20.5 : 1.0
cont = True 178 : 1790 = 19.9 : 1.0
This is the token list I use to test the algorithm:
['execut', 'r$', 'pen', 'val', 'pag']
Based on this list of features, I expected NLTK to classify it as 1114 (I used the five stems most likely to lead to that class), but my code insists on classifying it as 1790.
Looking at the classification probabilities (printing the result of prob_classify()), it gets even weirder to me:
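One thing that can be checked quickly is whether every test token matches a feature name exactly. The sketch below only compares the test list against the 15 names printed above; the real vocabulary is larger, so a token missing here may still exist among the classifier's other features. The variable names are mine, for illustration only.

```python
# Feature names copied from the show_most_informative_features output above.
# This is only the top of the list, not the classifier's full vocabulary.
listed_features = {
    "execut", "r$", "pens", "val", "pag", "parcel", "real", "municípi",
    "aliment", "veícul", "é", "efetu", "palmas/to,", "rend", "cont",
}

# The token list used for the test.
test_tokens = ["execut", "r$", "pen", "val", "pag"]

# Tokens that do not appear verbatim among the listed feature names.
not_listed = [token for token in test_tokens if token not in listed_features]
print(not_listed)  # ['pen']
```

Here 'pen' is in the test list while the printed feature is 'pens', so it may be worth confirming that the test sentence goes through the same stemming step as the training data.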
CODE:
def show_distribuicao():
    distribuicao = classificador.prob_classify(linha_teste_frase)
    for classe in distribuicao.samples():
        print("%s: %f" % (classe, distribuicao.prob(classe)))
RESULT:
178: 0.000000
1790: 1.000000
203: 0.000000
1114: 0.000000
2049: 0.000000
A 100% chance of classifying as 1790? How can that be? And it is not because of a very large disparity between the classes; look at the class counts (qualificacao_id.value_counts()):
1790 323
203 237
178 152
1114 147
2049 141
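On the 1.000000 itself: a Naive Bayes classifier adds one log-likelihood term per feature to each class score and then normalizes, so even a moderate gap between class scores can print as exactly 1 and 0 at six decimal places. The sketch below uses made-up log scores (they are NOT taken from my classifier) and a hypothetical helper name, just to show the effect of the normalization step.

```python
import math

def normalize_log_scores(log_scores):
    """Convert per-class log scores into probabilities that sum to 1."""
    # Subtract the largest score first so exp() cannot overflow.
    peak = max(log_scores.values())
    exp_scores = {label: math.exp(score - peak)
                  for label, score in log_scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}

# Hypothetical scores (log prior + sum of per-feature log likelihoods).
hypothetical_log_scores = {"1790": -40.0, "1114": -75.0, "203": -90.0}

probs = normalize_log_scores(hypothetical_log_scores)
for label, prob in probs.items():
    print("%s: %f" % (label, prob))
```

With these numbers the winning class prints as 1.000000 and the others as 0.000000, even though the other probabilities are tiny rather than truly zero. So the 100% probably says "1790 scored far ahead once all features were combined", not that the other classes are impossible.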
Here is a preview (head(10)) of the dataset I'm dealing with:
Unnamed: 0 historico qualificacao_id
0 648 A assistida chegou de Portugal para a audiênc... 178
1 889 AP 0025466-03.2018.827.2729 Não tem condena... 1790
2 4315 o assisitido retornou dizendo que sua colação... 203
3 4512 Atendida: Lucélia Santos de Sousa (9255-0805)... 1790
4 7287 PEDI TESTEMUNHAS PARA RESPOSTA. ACUSADO DE RO... 1790
5 7422 O seu advogado particular renunci... 203
6 7526 (Juliana Dias) A assistida comparec... 1114
7 8272 Tel (63) 9213-2485 Ranilson Martins da Silv... 1790
8 8576 Orientação. 2049
9 9438 Compareceu a Assistida para apresentar recibo... 178
Does anyone have any idea what's going on? PS: Sorry for any beginner mistakes with these technologies; I really am a beginner :P
From what you've shown, the dataset seems quite unbalanced: there are many 1790 items in the list. Try balancing the training set.
– Paulo Marques
Thank you, Paulo. I'll try that.
– Luiz Carvalho
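A minimal sketch of the balancing suggested above, assuming the training data is a list of (text, label) pairs and that downsampling every class to the size of the smallest one is acceptable. The function name and the placeholder rows are hypothetical; only the class counts come from the value_counts() output in the question.

```python
import random
from collections import Counter, defaultdict

def downsample_to_smallest_class(rows, seed=0):
    """Return a shuffled copy of rows in which every label appears equally often."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    # Every class is cut down to the size of the rarest class.
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Placeholder rows with the class counts reported in the question.
class_counts = {1790: 323, 203: 237, 178: 152, 1114: 147, 2049: 141}
rows = [("text %d" % i, label)
        for label, count in class_counts.items()
        for i in range(count)]

balanced = downsample_to_smallest_class(rows)
print(Counter(label for _, label in balanced))
```

After this step each class has 141 rows. Downsampling throws data away, so with only 1000 lines it is worth comparing the results against the unbalanced run before settling on it.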