I am new to Python and I am having trouble with my algorithm. Its purpose is to read the words in a set of PDF files and count how often each word occurs, then use those counts to plot a graph of Zipf's law (the second most frequent word occurs about as many times as the square root of the count of the most frequent word), in order. I am using the matplotlib library to plot bar charts, but the number of words shown on the x axis is very large and the labels overlap one another.
PS: all criticism is welcome; thank you, everyone. The algorithm and the generated graph follow.
    #!/usr/bin/env python3.6
    import os
    import re
    from operator import itemgetter
    import matplotlib.pyplot as plt
    import numpy as np
    import math
    from tkinter import *

    def io_pasta():
        def on_press():
            if not (os.path.exists(entrada.get())):
                lb["fg"]="red"
                lb["text"] = "Pasta inexistente/inacessivél"
                lb["font"]= "Bold"
            else:
                zipf(entrada.get(),janela)
        janela = Tk()
        lb=Label(janela, text = "Onde estão os asquivos?", font = "arial")
        lb.pack()
        entrada = Entry(janela, width = 40)
        entrada.place(x=40,y=40)
        b = Button(janela,text="OK",width = 10, command=on_press)
        b.place(x=150,y=75)
        janela.geometry("400x120")
        janela.title("Distribuição ZIPF")
        janela.mainloop()

    def zipf(pasta,win):
        win.destroy()
        if not pasta[-1]=="/":
            pasta+="/"
        palavra=[]
        repetic=[]
        for nome in os.listdir(pasta):
            os.system("pdftotext -enc UTF-8 "+pasta+""+str(nome)+" "+pasta+""+str(nome)+".txt")
        print("arquivos convertidos ......................ok!")
        os.system("mkdir "+pasta+"arquivos_originais && mv "+pasta+"*pdf "+pasta+"arquivos_originais")
        os.system("mkdir "+pasta+"convertidos_txt && mv "+pasta+"*txt "+pasta+"convertidos_txt/")
        os.system("mkdir "+pasta+"zipf")
        print("pasta ARQUIVOS_MOVIDOS criada .................ok!")
        print("Arquivos Movidos.............................ok!")
        frequency = {}
        for arq in os.listdir(""+pasta+"convertidos_txt/"):
            open_file = open(""+pasta+"convertidos_txt/"+str(arq)+"", "r", encoding='latin-1')
            file_to_string = open_file.read()
            w1 = re.findall(r'(\b[A-Za-z][a-z]{4,20}\b)', file_to_string)
            control = True
            for word in w1:
                count = frequency.get(word,0)
                frequency[word] = count + 1
            for key, value in reversed(sorted(frequency.items(), key = itemgetter(1))):
                if control == True:
                    j=value
                    control=False
                else:
                    if abs(math.sqrt(j)-value)<4:
                        palavra.append(key)
                        repetic.append(value)
            plt.title("Distribuição zipf")
            plt.grid(True)
            plt.xticks(repetic,palavra,rotation=90,size='small')
            pos = np.arange(len(palavra)) + .5
            plt.bar(pos,repetic,align='center',color='#b8ff5c')
            plt.savefig(''+pasta+'zipf/grafico_'+str(arq)+'.png')

    io_pasta()
Could you put together a [mcve]?
– Woss
I'll put one together and post it soon.
– Roger Amaro
I took a look at your code, but I didn't run it because it reads, creates, and moves files and folders (my files). That part is not directly relevant to the problem described (the bars, the numbers, and the words). Even if I did decide to run this code, I'm not sure I would reproduce your problem, since I would be using my own PDFs (which goes back to the earlier comment about reproducing the problem). It would be good to have just the list of words and counts, plus the plotting part of the code that reproduces your problem.
– Guto
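A reduced version along these lines could start from plain text instead of PDFs. The following is only a sketch: the helper name `top_words`, the sample sentence, and the idea of keeping just the `n` most frequent words are assumptions for illustration, not part of the original code. Only the regular expression is taken from the question.

```python
import re
from collections import Counter

# Same pattern as in the question: one letter followed by 4 to 20 lowercase letters.
WORD_RE = re.compile(r"\b[A-Za-z][a-z]{4,20}\b")


def top_words(text, n=20):
    """Return the n most frequent words and their counts as two parallel lists."""
    counts = Counter(WORD_RE.findall(text))
    most_common = counts.most_common(n)  # sorted by count, highest first
    words = [word for word, _ in most_common]
    freqs = [count for _, count in most_common]
    return words, freqs


# Hypothetical sample text; "cat" is too short to match the pattern.
sample = "zebra zebra zebra tiger tiger horse cat"
words, freqs = top_words(sample, n=2)
print(words, freqs)  # ['zebra', 'tiger'] [3, 2]
```

Two short lists like `words` and `freqs` are enough for someone else to reproduce the plotting problem, and capping `n` also keeps the number of labels on the x axis manageable.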
Hey, I managed to solve it: my problem was that I was not passing the "words" parameter to plt.bar.
– Roger Amaro
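The thread does not show the corrected code, so the following is only a guess at what such a fix might look like. In the question's code, `plt.xticks(repetic, palavra, ...)` places the labels at the count values rather than at the bar positions. One way to attach each word to its own bar is the `tick_label` argument of `bar`. The function name `plot_word_counts`, the sample data, and the figure-width rule are assumptions made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt


def plot_word_counts(words, counts, path):
    """Save a bar chart with one bar per word, each bar labeled with its word."""
    positions = range(len(words))
    # Widen the figure as the number of words grows (0.3 inch per bar is arbitrary).
    fig, ax = plt.subplots(figsize=(max(6, 0.3 * len(words)), 4))
    # tick_label puts each word under its own bar.
    ax.bar(positions, counts, align="center", color="#b8ff5c", tick_label=words)
    ax.tick_params(axis="x", labelrotation=90, labelsize="small")
    ax.set_title("Zipf distribution")
    ax.grid(True, axis="y")
    fig.tight_layout()  # leave room for the rotated labels
    fig.savefig(path)
    plt.close(fig)


# Hypothetical data standing in for the word and count lists.
plot_word_counts(["zebra", "tiger", "horse"], [3, 2, 1], "zipf_example.png")
```

If the labels still collide with many words, plotting fewer words or increasing the figure width further are the obvious knobs to try.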
In that case, you can answer your own question and accept the answer. Including a picture of the problem and of the result would be useful too. That way there is an answer for other users who run into a similar problem.
– Guto