How to generate a statistical chart from a huge data set
A few days ago, I had a big task to solve: generate a scatter plot from a huge text file containing ~42 million data points. It was a big deal for me, because I had never done anything like this before (at this scale).
Try #1: Highcharts
A few minutes after I started, my laptop slowed down dramatically. I couldn't even move the cursor smoothly, and finally the OS froze. This library probably can't handle this amount of data.
Try #2: JFreeChart
JFreeChart was my next choice. It is easy to implement and has a nice interface. By this point, I had access to a powerful server for the computation, with 64 GB of RAM and a 4-core CPU. I built a jar file and started the program in the evening (with an increased JVM heap size).
The next day, the process was still running.
Finally, Try #3: matplotlib (pyplot)
Python is well known as a solid language for machine learning and big data. That's why I tried a Python library called matplotlib (pyplot). It generated a scatter plot for this large text file (~42 million data points) in under 12 minutes!
This library can either display the diagram in a GUI window or save it as a PNG file to disk.
To help you implement this as well, I want to share my code:
#!/usr/bin/env python
import os
from os.path import basename

import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render to a file, no display needed
from matplotlib import pyplot as plt

# input.txt -> FileFormat: X Y
filename = "./input.txt"

# Read the first two space-separated columns (x and y values)
data = pd.read_csv(
    filename,
    header=None,
    sep=' ',
    usecols=[0, 1],
    names=["ROW0", "ROW1"]
)

plt.scatter(data["ROW0"], data["ROW1"], linewidths=7, color='r')
# plt.title("title")
# plt.xlabel('x')
# plt.ylabel('y')
plt.grid(True)

# splitext returns a (name, extension) tuple, so take the name part: "input"
filename = os.path.splitext(basename(filename))[0]

fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
fig.savefig(filename + ".png", dpi=300)
Example content of input.txt:
1 8
5 9
6 8
2 7
4 3
6 7
4 8
6 1
5 3
8 4
1 2
1 4
3 4
5 7
plt.scatter is just one way to generate a plot. A few more plot types are available: https://matplotlib.org/users/pyplot_tutorial.html
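As one illustration of another plot type, here is a minimal sketch that swaps plt.scatter for matplotlib's hexbin, which counts points per hexagonal cell instead of drawing every point. It was not part of my original run, so treat it as an assumption-laden example: the function name save_hexbin, the gridsize value and the synthetic demo data are all made up for illustration, and only the column names ROW0 and ROW1 come from the script above.

```python
#!/usr/bin/env python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen, as in the scatter script
from matplotlib import pyplot as plt


def save_hexbin(data, out_path, gridsize=50):
    """Save a hexagonal-bin density plot of ROW0 vs. ROW1 (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(18.5, 10.5))
    # hexbin counts the points that fall into each hexagonal cell;
    # bins='log' colors the cells on a logarithmic scale
    hb = ax.hexbin(data["ROW0"], data["ROW1"], gridsize=gridsize, bins='log')
    fig.colorbar(hb, ax=ax, label='count (log scale)')
    ax.grid(True)
    fig.savefig(out_path, dpi=100)
    plt.close(fig)
    return out_path


if __name__ == "__main__":
    # Synthetic stand-in data; replace with the DataFrame read from input.txt
    rng = np.random.RandomState(0)
    demo = pd.DataFrame({
        "ROW0": rng.normal(size=10000),
        "ROW1": rng.normal(size=10000),
    })
    save_hexbin(demo, "hexbin_demo.png")
```

With tens of millions of points, many markers end up on top of each other in a scatter plot, so a density-style plot like this may show the distribution more clearly. Whether it fits depends on what you want to read from the chart.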
This Python script is a basic implementation for generating a scatter plot from a large file. It can also be adapted to generate various other types of charts.
Note: If you have a weak computer, you should run this script on a (powerful) server. On a system with little memory, it is hard to keep the OS responsive during execution.
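If no bigger machine is available, one possible workaround (which I did not use for the run described here) is to read the file in chunks and accumulate a 2D histogram, so the full data set never has to sit in memory at once. The sketch below assumes the same "X Y" file format and that you know the value ranges in advance. The names chunked_histogram and save_density, the bin count and the chunk size are hypothetical choices for illustration.

```python
#!/usr/bin/env python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
from matplotlib import pyplot as plt


def chunked_histogram(path, x_range, y_range, bins=500, chunksize=1000000):
    """Accumulate a 2D histogram of an 'X Y' text file, one chunk at a time."""
    counts = np.zeros((bins, bins))
    # With chunksize set, read_csv returns an iterator of DataFrames
    reader = pd.read_csv(path, header=None, sep=' ', usecols=[0, 1],
                         names=["ROW0", "ROW1"], chunksize=chunksize)
    for chunk in reader:
        # Points outside x_range / y_range are ignored by histogram2d
        h, _, _ = np.histogram2d(chunk["ROW0"], chunk["ROW1"],
                                 bins=bins, range=[x_range, y_range])
        counts += h
    return counts


def save_density(counts, x_range, y_range, out_path):
    """Save the accumulated counts as an image."""
    fig, ax = plt.subplots(figsize=(18.5, 10.5))
    # histogram2d puts x along the first axis, so transpose for imshow
    ax.imshow(counts.T, origin='lower', aspect='auto',
              extent=[x_range[0], x_range[1], y_range[0], y_range[1]])
    fig.savefig(out_path, dpi=100)
    plt.close(fig)
    return out_path


if __name__ == "__main__":
    # Assumed value ranges; adjust them to your own data
    counts = chunked_histogram("./input.txt", (0, 10), (0, 10), bins=100)
    save_density(counts, (0, 10), (0, 10), "input_density.png")
```

The trade-off is that the result is a density image rather than a true scatter plot, and the ranges must be chosen up front (or found in a separate first pass over the file).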
Please comment below if you have any questions.
- OS: CentOS 7
- Python: 3.6.1
- Pandas: 0.20.2
- Matplotlib: 2.0.2