When plotting scatter plots in matplotlib and saving to a vector format (PDF in this case), the generated file size scales with the number of points.
Since I have lots of points, many of which overlap, I set alpha=.2 to see how densely the points are distributed. In the densely populated central regions, however, the translucent points stack up until the displayed color looks the same as alpha=1.
Is there any way to "collapse" these regions when saving the figure to a vector file (e.g. by combining overlapping points within a specified distance), so that some kind of filled area is stored instead of each single point?
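To make the idea concrete, this is roughly the kind of reduction I have in mind (a minimal sketch; the function name and the grid-snapping approach are just illustrative, and it of course throws away the density information that alpha conveys):

```python
import numpy as np

def combine_overlapping(x, y, distance):
    # Hypothetical helper: snap the points onto a grid with the given
    # spacing and keep each occupied cell only once, so thousands of
    # overlapping points collapse into a single representative point.
    cells = np.round(np.column_stack((x, y)) / distance).astype(int)
    unique_cells = np.unique(cells, axis=0)
    return unique_cells[:, 0] * distance, unique_cells[:, 1] * distance
```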
What I forgot to mention: since I need to plot the correlations of multiple variables, I need an (n x n) scatter plot matrix, where n is the number of variables. This makes hexbin and similar methods awkward, since I'd have to build the full grid of plots myself.
For example:

```python
import matplotlib.pyplot as plt
import numpy as np

fig_sc = plt.figure(figsize=(5, 5))
ax_sc = fig_sc.gca()
ax_sc.scatter(
    np.random.normal(size=100000),
    np.random.normal(size=100000),
    s=10, marker='o', facecolors='none', edgecolors='black', alpha=.3)
fig_sc.savefig('test.pdf', format='pdf')
```

This results in a file size of approximately 1.5 MB, since every single point is stored. Can I somehow "reduce" this figure by combining overlapping points?
I tried several options, such as setting dpi=300 and transparent=False, but since PDF stores the figure as a vector image, these naturally didn't change anything.
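A minimal version of what I tried (the exact call shown is just illustrative):

```python
# Neither option reduces the size: the PDF backend still writes
# one vector marker per data point.
fig_sc.savefig('test.pdf', format='pdf', dpi=300, transparent=False)
```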
Things that might work, but have drawbacks:
- hexbin plots: works for a single scatter plot if the resolution and cmap are adjusted correctly, but I want a scatter matrix with (n x n) scatter plots. There is, AFAIK, no built-in hexbin-matrix plot, so I'd have to assemble the grid myself (see the sketch after this list).
- saving to a rasterized format: the plots are for a journal which requests vector graphics wherever possible, so I'd like to avoid storing the figure as a raster image.
- randomly subsampling the data: might work, but will alter the appearance of the plots.
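For reference, assembling the hexbin matrix by hand would look roughly like this (a minimal sketch with made-up dimensions; this is exactly the boilerplate I'd like to avoid):

```python
import matplotlib.pyplot as plt
import numpy as np

n = 3  # number of variables (illustrative)
data = np.random.normal(size=(100000, n))

# One hexbin panel per variable pair, built manually.
fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n))
for i in range(n):
    for j in range(n):
        axes[i, j].hexbin(data[:, j], data[:, i], gridsize=50, cmap='Greys')
fig.savefig('hexbin_matrix.pdf', format='pdf')
```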
Any ideas?
Thanks in advance!

