Bar graph generated with Python has become unreadable. How to improve it? How to work with a dataset of more than 1 million lines?

Viewed 452 times

-2

Friends,

The following bar chart was generated (the first column of each dataset is UNIX time):

[image: unreadable chart]

The Python code (version 3.5) used was the following:

# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
import matplotlib.dates as dates
from datetime import datetime, timedelta

x = []
y = []
with open("/Radhe/LabAbril2017Capturas/slices_calculos/winTime_10Abril_SemAtaques.csv") as f:
    for l in f:
        X,Y = l.split(",") # the separator is a comma
        x.append(float(X))
        y.append(float(Y))

x1 = [datetime.fromtimestamp(int(d)) for d in x]
y_pos = [idx for idx, i in enumerate(y)]

plt.gca().xaxis.set_major_formatter(dates.DateFormatter('%m/%d/%Y %H:%M:%S'))

y1 = []
v = 0
y_sorted = sorted(y)
for i in y_sorted:
    # keep a y tick only if it is more than 50 away from the previous one
    if abs(i - v) > 50:
        y1.append(i)
        v = i

plt.bar(y_pos, y, align='edge', color="blue", alpha=0.5, width=0.5) 

plt.title("Tamanho da janela TCP durante período sem ataques")
plt.ylabel("Tamanho da janela TCP")
plt.xlabel('Tempo')
plt.xticks(y_pos, x1, size='small',rotation=35, ha="right")
plt.yticks(y1)
plt.ylim(bottom=y_sorted[0]-200) # minimum value of the y axis

plt.show()

Using the winTime_10Abril_slowloris.csv dataset, the chart was also unreadable:

[image: chart using winTime_10Abril_slowloris.csv]

winTime_10Abril_SemAtaques.csv dataset is available here: https://ufile.io/l2ejn

winTime_10Abril_slowloris.csv dataset is available here: https://ufile.io/8mbc0

How can I make the chart more readable? Is there a more efficient way to do this? My next dataset has about 1 million lines, and with this approach it will take too long.

1 million line dataset (winTime_10Abril_sockstress.csv): https://ufile.io/qolsg

  • A fairly simple solution for the labels is to add an if so that a label is written only when the index is a multiple of 10 (or another value); that way they will at least be readable.

  • @Bacco: I did not quite understand. Could you give an example or write an answer, please?

  • @Bacco: and the program is very slow. I tried to run it on the dataset with over 1 million lines, on a machine with 16 GB of RAM; it has been running for more than an hour with no result, and it is using 99.9% of the RAM.

  • 1

    Difficult question. With this amount of data (which is not absurd, but is already hard to visualize item by item), the ideal is to use grouping methods or summary statistics. That is, a bar chart is not suitable for plotting every single item. If you don’t want to plot averages or something similar instead, take a look at the plots in the Seaborn package.

  • 1

    Ah, and for large datasets, your best friend is Pandas.
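The grouping idea from the comments above could look roughly like the following with Pandas. This is only a sketch: it assumes the CSV has two comma-separated columns and no header (UNIX time in seconds, then window size), as in the question. The column names unix_time and window, the helper summarize_by_interval, and the 1-minute interval are illustrative choices, not part of the original post.

```python
import io

import pandas as pd


def summarize_by_interval(csv_source, rule="1min"):
    """Read (unix_time, window) rows and return the mean window size per time bin."""
    # Column names are hypothetical; the files in the question have no header.
    df = pd.read_csv(csv_source, header=None, names=["unix_time", "window"])
    # Convert UNIX seconds to timestamps and use them as the index.
    df["time"] = pd.to_datetime(df["unix_time"], unit="s")
    series = df.set_index("time")["window"]
    # One value per interval instead of one bar per row; empty intervals are dropped.
    return series.resample(rule).mean().dropna()


# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO("1491800000,100\n1491800010,200\n1491800070,400\n")
summary = summarize_by_interval(sample)
print(summary)
```

Passing the real file path instead of the in-memory sample and then calling summary.plot() should draw one point per interval, which keeps both the number of plotted items and the number of tick labels small.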

1 answer

2

Half answer, half comment.

Take a look at this question/answer:

https://stackoverflow.com/questions/45855794/plotting-too-many-lines-in-matplotlib-out-of-memory

It is based on matplotlib’s LineCollection feature.
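As a minimal sketch of that idea applied to this kind of data (not the code from the linked answer): each bar is replaced by a thin vertical line, and all the lines are drawn as a single LineCollection instead of one rectangle per bar. The helper names bars_as_segments and plot_segments and the sample values are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection


def bars_as_segments(x, y, baseline=0.0):
    """Return an (n, 2, 2) array: one vertical segment (x, baseline) -> (x, y) per point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    start = np.column_stack([x, np.full_like(y, baseline)])
    end = np.column_stack([x, y])
    return np.stack([start, end], axis=1)


def plot_segments(x, y):
    """Draw every segment with a single artist instead of one patch per bar."""
    fig, ax = plt.subplots()
    collection = LineCollection(bars_as_segments(x, y), colors="blue",
                                alpha=0.5, linewidths=0.5)
    ax.add_collection(collection)
    ax.autoscale()  # make sure the view covers the segments
    return fig, ax


# Made-up values standing in for the real dataset.
segments = bars_as_segments([0, 1, 2], [10, 20, 15])
print(segments.shape)  # (3, 2, 2)
```

Calling plot_segments(y_pos, y) followed by plt.show() would be the rough equivalent of the plt.bar call in the question.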

Maybe you can use the same technique. Another option is to reduce the number of lines and points, as suggested in the comments. In general you don’t need 1 million points; about 1000 are enough. The hard part is selecting and plotting only what is needed (for example, drawing just a few ticks already lightens the rendering).
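One possible way to do that reduction, sketched with NumPy: keep the maximum of each chunk so that spikes survive, and compute only a handful of tick positions. The helpers downsample_max and sparse_ticks, the choice of the maximum as the summary, and the default sizes are assumptions made for illustration; a mean or median per chunk may suit the data better.

```python
import numpy as np


def downsample_max(values, target=1000):
    """Reduce values to at most `target` points by keeping the maximum of each chunk."""
    values = np.asarray(values, dtype=float)
    if len(values) <= target:
        return values
    chunk = int(np.ceil(len(values) / target))
    # Pad with NaN so the array splits evenly, then take the max of each chunk.
    pad = (-len(values)) % chunk
    padded = np.concatenate([values, np.full(pad, np.nan)])
    return np.nanmax(padded.reshape(-1, chunk), axis=1)


def sparse_ticks(n_points, n_ticks=10):
    """Return at most n_ticks evenly spaced integer positions in [0, n_points - 1]."""
    return np.linspace(0, n_points - 1, num=min(n_ticks, n_points), dtype=int)


# Made-up values standing in for the real dataset.
reduced = downsample_max(range(10), target=3)
print(reduced.tolist())                       # [3.0, 7.0, 9.0]
print(sparse_ticks(101, n_ticks=5).tolist())  # [0, 25, 50, 75, 100]
```

The reduced array can then be plotted as usual, and the positions from sparse_ticks passed to plt.xticks together with the matching time labels, so that only a few labels are drawn.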

  • 3

    This link may be a good suggestion, but your answer will no longer be valid if one day the link breaks. In addition, it is important for the community to have the content right here on the site. It would be better to include more details in your answer. A summary of the linked content would be helpful enough! Learn more about it in this item of our Community FAQ: Do we want answers that contain only links?

  • @Guto: the other problem is that the graph has become unreadable. Could you help me with that?

  • I edited the answer; it is a little clearer now. I’ll keep that in mind, even though the original link is perfect.
