I am currently trying to reduce the file size of a scatter plot. My code looks like:

plt.scatter(a1, b1)
plt.savefig('test.ps')

where a1,b1 are arrays of size 400,000 or so, and it gives a file size of 7.8MB.

I have tried adding

plt.rcParams['path.simplify'] = True 

before this chunk of code, but the file is still 7.8MB. Is this an issue with how it saves as a ".ps" file or another issue?

2 Answers

One approach is to use plot instead of scatter (you can still produce scatter plots with plot by passing the 'o' format string), and to pass the rasterized keyword argument, like so:

import numpy as np
import matplotlib.pyplot as plt
a1, b1 = np.random.randn(400000, 2).T  # mock data of similar size to yours
plt.plot(a1, b1, 'o', rasterized=True)
plt.savefig("test.ps")

This should significantly reduce the size of the output file. The text and line art remain vector and only the points are rasterized, so it is a nice compromise.
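As far as I know, scatter also accepts rasterized=True, so switching to plot may not be strictly necessary. Here is a minimal sketch for checking the effect on your own data. The helper saved_size and its parameters (n_points, dpi, fmt, seed) are made up for illustration, and it writes a PDF to an in-memory buffer rather than a .ps file, purely for convenience. The dpi passed to savefig sets the resolution of the rasterized points, so it trades sharpness against size.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def saved_size(rasterized, n_points=100_000, dpi=72, fmt="pdf", seed=0):
    """Return the size in bytes of a scatter plot saved in a vector format.

    Hypothetical helper for illustration only; not part of matplotlib.
    """
    rng = np.random.default_rng(seed)
    a1, b1 = rng.standard_normal((2, n_points))  # mock data
    fig, ax = plt.subplots()
    # Only the markers are rasterized; axes, ticks and labels stay vector.
    ax.scatter(a1, b1, rasterized=rasterized)
    buf = io.BytesIO()
    # dpi controls the resolution of the embedded raster image.
    fig.savefig(buf, format=fmt, dpi=dpi)
    plt.close(fig)
    return buf.getbuffer().nbytes
```

Calling saved_size(False) and saved_size(True) and printing both numbers shows how much rasterizing saves for a given number of points and dpi. The exact figures depend on the output format and the matplotlib version.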

Depending on what you're looking to achieve, however, it might be better to histogram your data and plot that instead (e.g. pyplot.hist2d or pyplot.hexbin).
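A rough sketch of the histogram route with hist2d, assuming mock Gaussian data in place of a1 and b1. The bin count of 100 and the colorbar label are arbitrary choices of mine. The output now depends on the number of bins rather than on the number of points.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
a1, b1 = rng.standard_normal((2, 400_000))  # mock data of similar size

fig, ax = plt.subplots()
# hist2d bins the points on a regular grid and draws the counts as colors.
counts, xedges, yedges, image = ax.hist2d(a1, b1, bins=100)
fig.colorbar(image, ax=ax, label="points per bin")

# Saved to an in-memory buffer here; a filename such as "test.ps" also works.
buf = io.BytesIO()
fig.savefig(buf, format="pdf")
plt.close(fig)
```

The counts array has one entry per bin, so it can also be inspected directly if you only need the density and not the picture.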

4 Comments

But this solution loses all advantages of using a vector-based format like PostScript in the first place, doesn't it?
This approach only rasterizes the points; the text and line art stay vector. It's true that you can't arbitrarily zoom, but I guess there has to be some trade-off between the information content of the graphic and its size in memory. For me, this is often a useful compromise.
Thank you. This got my PDF size from 15MB to 4MB
I love this approach for scatter plots! It's indeed a great compromise

You could consider using e.g. hexbin -- I particularly like this when you have a dense collection of points, since it better indicates where your data is concentrated. For example:

import numpy as np
import matplotlib.pylab as pl
x = np.random.normal(size=40000)
y = np.random.normal(size=40000)
pl.figure()
pl.subplot(121)
pl.scatter(x, y)
pl.xlim(-4, 4)
pl.ylim(-4, 4)
pl.subplot(122)
pl.hexbin(x, y, gridsize=40)
pl.xlim(-4, 4)
pl.ylim(-4, 4)

[Figure: scatter plot of the data (left) and hexbin plot of the same data (right)]

From the left figure, I would have concluded that the points are distributed roughly evenly for x and y between -3 and 3, which clearly is not the case.

(http://matplotlib.org/examples/pylab_examples/hexbin_demo.html)
