Error generating matplotlib graph

I am new to Python and I am having trouble with my algorithm. It scans a set of PDF files for words, counts how often each word occurs, and uses that information to plot a graph of Zipf's law (as I understand it, the second most frequent word occurs about as many times as the square root of the count of the most frequent word), ordered by frequency. I am using the matplotlib library to plot bar charts, but the number of words shown on the x axis is very large and the labels overlap one another.

P.S.: All criticism is welcome; thank you all. The algorithm and the generated graph follow.

#!/usr/bin/env python3.6
import os
import re
from operator import itemgetter
import matplotlib.pyplot as plt
import numpy as np
import math

from tkinter import *
def io_pasta():
        def on_press():
            if not (os.path.exists(entrada.get())):
                lb["fg"]="red"
                lb["text"] = "Pasta inexistente/inacessivél"
                lb["font"]= "Bold"
            else:
                zipf(entrada.get(),janela)

        janela = Tk()
        lb=Label(janela, text = "Onde estão os asquivos?", font = "arial")
        lb.pack()
        entrada = Entry(janela, width = 40)
        entrada.place(x=40,y=40)
        b = Button(janela,text="OK",width = 10, command=on_press)
        b.place(x=150,y=75)

        janela.geometry("400x120")
        janela.title("Distribuição ZIPF")

        janela.mainloop()


def zipf(pasta,win):
    win.destroy()
    if not pasta[-1]=="/":
          pasta+="/"
    palavra=[]
    repetic=[]
    for nome in os.listdir(pasta):
        os.system("pdftotext -enc UTF-8   "+pasta+""+str(nome)+"  "+pasta+""+str(nome)+".txt")
    print("arquivos convertidos ......................ok!")
    os.system("mkdir "+pasta+"arquivos_originais && mv "+pasta+"*pdf "+pasta+"arquivos_originais")
    os.system("mkdir "+pasta+"convertidos_txt && mv "+pasta+"*txt "+pasta+"convertidos_txt/")
    os.system("mkdir "+pasta+"zipf")
    print("pasta ARQUIVOS_MOVIDOS criada .................ok!")
    print("Arquivos Movidos.............................ok!")
    frequency = {}
    for arq in os.listdir(""+pasta+"convertidos_txt/"):
        open_file = open(""+pasta+"convertidos_txt/"+str(arq)+"", "r", encoding='latin-1')
        file_to_string = open_file.read()
        w1 = re.findall(r'(\b[A-Za-z][a-z]{4,20}\b)', file_to_string)
        control = True
        for word in w1:
            count = frequency.get(word,0)
            frequency[word] = count + 1

        for key, value in reversed(sorted(frequency.items(), key = itemgetter(1))):
            if control == True:
                    j=value
                    control=False
            else:
                if abs(math.sqrt(j)-value)<4:
                        palavra.append(key)
                        repetic.append(value)

        plt.title("Distribuição zipf")
        plt.grid(True)
        plt.xticks(repetic,palavra,rotation=90,size='small')
        pos = np.arange(len(palavra)) + .5 
        plt.bar(pos,repetic,align='center',color='#b8ff5c')
        plt.savefig(''+pasta+'zipf/grafico_'+str(arq)+'.png')      

io_pasta()

  • Can you provide a [mcve]?

  • I'll put one together and post it soon.

  • I took a look at your code, but I didn't run it because it reads, creates, and moves files and folders (my files). That part is not directly relevant to the problem described (the bars, numbers, and words). Even if I did decide to run this code, I'm not sure I would reproduce your problem, since the input would be my own PDFs (which goes back to the first comment about reproducing the problem). It would be good to have just the list of words and repetition counts, plus the plotting part of the code that reproduces your problem.

  • Hey, I managed to solve it. My problem was that I was not passing the "words" parameter in plt.bar.

  • In that case, you can answer your own question and accept the answer. Including a picture of the problem and the result would be helpful too. That way there is an answer for other users who have a similar problem.
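
For reference, here is a minimal sketch of the kind of example the comments ask for: just a list of words, a list of counts, and the plotting code. The data is made up, and the output path and the Agg backend are assumptions for illustration. It places one tick at each bar position (`pos`) and labels it with the matching word, which is one way to keep words aligned with their bars. It is not necessarily the exact change the asker made. In the question's code, `plt.xticks(repetic, palavra, ...)` uses the counts as tick positions, so the labels do not line up with the bars.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data standing in for the words and counts taken from the PDFs.
palavra = ["python", "graph", "words", "files", "chart"]
repetic = [12, 9, 7, 4, 2]

pos = np.arange(len(palavra))  # one x position per bar: 0, 1, 2, ...

fig, ax = plt.subplots()
ax.bar(pos, repetic, align="center", color="#b8ff5c")
# Put one tick under each bar and label it with the matching word.
# (Passing repetic as tick positions would place ticks at the count values.)
ax.set_xticks(pos)
ax.set_xticklabels(palavra, rotation=90, size="small")
ax.set_title("Zipf distribution")

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "zipf_minimal_example.png")
fig.savefig(out_path)
plt.close(fig)
```

An example this small can be run by anyone without touching their own files, which is what makes it useful for reproducing the problem.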

1 answer

Tip 1: use the tight_layout function

Call tight_layout before saving your image:

plt.bar(pos,repetic,align='center',color='#b8ff5c')
plt.tight_layout()
plt.savefig(''+pasta+'zipf/grafico_'+str(arq)+'.png')
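
To try this tip in isolation, a self-contained sketch might look like the following. The words, counts, output path, and the Agg backend are all assumptions chosen for illustration. The long labels are there only to show why the margins need adjusting when tick labels are rotated.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical long labels, chosen so that rotated text needs extra room.
palavra = ["internationalization", "characterization", "misunderstanding"]
repetic = [5, 3, 2]
pos = np.arange(len(palavra))

plt.bar(pos, repetic, align="center", color="#b8ff5c")
plt.xticks(pos, palavra, rotation=90, size="small")
plt.tight_layout()  # recompute margins so the rotated labels fit in the figure

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "zipf_tight_layout.png")
plt.savefig(out_path)
plt.close()
```

Note that tight_layout only adjusts the margins around the axes. It does not thin out labels, so with a very large number of words the labels can still collide with each other.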

Tip 2: Create a figure and adjust its size

  1. Process your text so that, at the end, you know the number of words x.
  2. Create a figure with the figure function (or an analogous function) and set its size according to your word count x:

    import matplotlib.pyplot as plt
    fig = plt.figure(figsize=(0.5 * x, 10))
    

(adjust the 0.5 factor up or down to make the figure wider or narrower).
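
Put together, the two steps could be sketched as below. The helper `figure_for_words`, its `min_width` floor, and the placeholder data are hypothetical and not part of matplotlib or of the original code. Only `plt.figure(figsize=...)` comes from the tip itself.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def figure_for_words(n_words, factor=0.5, height=10, min_width=4.0):
    """Create a figure whose width grows with the word count (hypothetical helper)."""
    # min_width is an assumed floor so very short word lists still get a usable figure.
    width = max(min_width, factor * n_words)
    return plt.figure(figsize=(width, height))


# Hypothetical data: 40 placeholder words with decreasing counts.
palavra = [f"word{i}" for i in range(40)]
repetic = list(range(40, 0, -1))
pos = np.arange(len(palavra))

fig = figure_for_words(len(palavra))  # 0.5 * 40 = 20 inches wide
ax = fig.add_subplot(111)
ax.bar(pos, repetic, align="center", color="#b8ff5c")
ax.set_xticks(pos)
ax.set_xticklabels(palavra, rotation=90, size="small")
fig.tight_layout()
plt.close(fig)
```

Scaling the width linearly keeps roughly the same horizontal space per label no matter how many words are plotted. For very long word lists the resulting image gets large, so plotting only the top N words may be the more practical choice.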
