How do you generate a statistical chart from a huge data set?

A few days ago, I had a big task: generating a scatter plot from a huge text file containing about 42 million data points. It was a big deal for me, because I had never done anything like this before (at this scale).

Try #1: Highcharts

So I started looking for a visualization library that could handle a data set this large. My first choice was Highcharts, which is written in JavaScript. To run it on my computer, I set up a local web server. It started out pretty well, but after a while Firefox showed this message: "A script on this page may be busy, or it may have stopped responding. You can stop the script now, or you can continue to see if the script will complete." I just pressed continue.

A few minutes later, my laptop slowed down dramatically. I couldn't even move the cursor smoothly. Finally, the OS froze. This library probably can't handle this amount of data.

Try #2: JFreeChart

JavaScript is probably not the best programming language for this scatter plot. I switched to Java, hoping for a better result.

I chose JFreeChart. It is easy to use and has a nice interface. By this point, I had access to a powerful server for the computation, with 64 GB of RAM and a 4-core CPU. I built a jar file, increased the JVM heap size, and started the program in the evening.

The next day, the process was still running.

Finally, Try #3: matplotlib (pyplot)

Python is known as a solid language for machine learning and big data. That's why I used a Python library called matplotlib (pyplot). It generated a scatter plot for this large text file (~42 million data points) in under 12 minutes!

This library can either display the diagram in a GUI window or save it as a PNG file on your disk.

To help you implement this as well, I want to share my code:

#!/usr/bin/env python
import os

import matplotlib
import pandas as pd

# Select the non-interactive Agg backend before importing pyplot,
# so the script also runs on a server without a display.
matplotlib.use('Agg')
from matplotlib import pyplot as plt

# Input file format: one "X Y" pair per line, separated by a single space.
filename = "./input.txt"

# The file has no header row; read only the first two columns.
data = pd.read_csv(
    filename,
    header=None,
    sep=' ',
    usecols=[0, 1],
    names=["ROW0", "ROW1"]
)

plt.scatter(data["ROW0"], data["ROW1"], linewidths=7, color='r')

# plt.title("title")
# plt.xlabel('x')
# plt.ylabel('y')
plt.grid(True)

# Derive the output name from the input name: ./input.txt -> input.png
output_name = os.path.splitext(os.path.basename(filename))[0] + ".png"

fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
fig.savefig(output_name, dpi=300)

input.txt

1 8
5 9
6 8
2 7
4 3
6 7
4 8
6 1
5 3
8 4
1 2
1 4
3 4
5 7

result/image:

[Image: generated plot diagram]

plt.scatter is just one way to generate a plot. pyplot offers several other plot types: https://matplotlib.org/users/pyplot_tutorial.html

This Python script is a basic implementation for generating a scatter plot from a large file. It can also be adapted to generate various other types of charts.
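As an illustration of switching to another chart type, here is a minimal sketch that draws a hexagonal-binning density plot with pyplot's hexbin instead of a scatter plot. This is not part of my original script: the function name save_hexbin, the gridsize value and the lower dpi are assumptions chosen for the example, and I have not timed it on the 42-million-point file.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen, so no display is needed
from matplotlib import pyplot as plt


def save_hexbin(data, out_path, gridsize=50):
    """Save a hexbin density plot of the first two columns of a DataFrame.

    Hypothetical helper for illustration; gridsize is an assumed default.
    """
    fig, ax = plt.subplots(figsize=(18.5, 10.5))
    # Each hexagon is colored by the number of points that fall into it;
    # mincnt=1 leaves hexagons without any points blank.
    hexes = ax.hexbin(data.iloc[:, 0], data.iloc[:, 1],
                      gridsize=gridsize, mincnt=1)
    fig.colorbar(hexes, ax=ax, label="points per bin")
    ax.grid(True)
    fig.savefig(out_path, dpi=100)
    plt.close(fig)  # release the figure's memory
    return out_path
```

With the DataFrame from the script, a call such as save_hexbin(data, "input_hexbin.png") should write the image next to the script. For very dense data, a density plot like this can be easier to read than millions of overlapping markers.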

Note: If you have a weak computer, you should run this script on a (powerful) server. On a system with little memory, it's hard to keep the OS responsive while the script runs.
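If a server is not available, one option that may reduce memory use is to read the file in chunks and accumulate a 2D histogram instead of keeping every point in memory. The sketch below is an assumption on my part, not what I ran for this post: it produces a density map rather than a true scatter plot, it reads the file twice (once for the value range, once for the counts), and the names read_chunks, density_counts and save_density as well as the bins and chunksize defaults are made up for the example.

```python
import numpy as np
import pandas as pd
import matplotlib

matplotlib.use("Agg")  # render off-screen, so no display is needed
from matplotlib import pyplot as plt


def read_chunks(path, chunksize):
    """Iterate over the 'X Y' file in DataFrames of at most chunksize rows."""
    return pd.read_csv(path, header=None, sep=" ", usecols=[0, 1],
                       names=["x", "y"], chunksize=chunksize)


def density_counts(path, bins=500, chunksize=1_000_000):
    """Return (counts, (xmin, xmax, ymin, ymax)) for the whole file."""
    # Pass 1: find the value range, one chunk at a time.
    xmin = ymin = np.inf
    xmax = ymax = -np.inf
    for chunk in read_chunks(path, chunksize):
        xmin, xmax = min(xmin, chunk["x"].min()), max(xmax, chunk["x"].max())
        ymin, ymax = min(ymin, chunk["y"].min()), max(ymax, chunk["y"].max())
    extent = (float(xmin), float(xmax), float(ymin), float(ymax))

    # Pass 2: add each chunk's 2D histogram to a fixed-size grid.
    counts = np.zeros((bins, bins))
    for chunk in read_chunks(path, chunksize):
        hist, _, _ = np.histogram2d(
            chunk["x"], chunk["y"], bins=bins,
            range=[[extent[0], extent[1]], [extent[2], extent[3]]])
        counts += hist
    return counts, extent


def save_density(path, out_path, bins=500, chunksize=1_000_000):
    """Render the accumulated counts as an image and save it as a PNG."""
    counts, extent = density_counts(path, bins, chunksize)
    fig, ax = plt.subplots(figsize=(18.5, 10.5))
    # histogram2d indexes counts as [x, y]; imshow expects rows to be y.
    image = ax.imshow(counts.T, origin="lower", extent=list(extent),
                      aspect="auto")
    fig.colorbar(image, ax=ax, label="points per bin")
    fig.savefig(out_path, dpi=100)
    plt.close(fig)
    return out_path
```

A call such as save_density("./input.txt", "input_density.png") would write the image. Only one chunk and the bins x bins grid are held in memory at a time, at the cost of reading the file twice.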


Please comment below if you have any questions.

Tested on:

  • OS: CentOS 7
  • Python: 3.6.1
  • Pandas: 0.20.2
  • Matplotlib: 2.0.2

Credits: