When plotting scatter plots in matplotlib and saving to a vector format (PDF in this case), the generated file size scales with the number of points.
Since I have lots of points, many of which overlap, I set alpha=.2 to see how densely the points are distributed. In the densely populated central regions, however, the translucent points stack up until the displayed color looks the same as alpha=1.
Is there any way to "collapse" these regions when saving the figure to a vector file (e.g. by combining overlapping points within a specified distance), so that some kind of filled area is stored instead of each single point?
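To make the idea concrete, this is roughly the kind of reduction I have in mind (a minimal sketch; the function name and the grid-snapping approach are just illustrative, and it of course throws away the density information that alpha conveys):

```python
import numpy as np

def combine_overlapping(x, y, distance):
    # Hypothetical helper: snap the points onto a grid with the given
    # spacing and keep each occupied cell only once, so thousands of
    # overlapping points collapse into a single representative point.
    cells = np.round(np.column_stack((x, y)) / distance).astype(int)
    unique_cells = np.unique(cells, axis=0)
    return unique_cells[:, 0] * distance, unique_cells[:, 1] * distance
```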
What I forgot to mention: since I need to plot the correlations of multiple variables, I need an (n x n) scatter plot matrix, where n is the number of variables. This makes hexbin and similar methods awkward, since I'd have to build the full grid of plots myself.
For example:

```python
import matplotlib.pyplot as plt
import numpy as np

fig_sc = plt.figure(figsize=(5, 5))
ax_sc = fig_sc.gca()
ax_sc.scatter(
    np.random.normal(size=100000),
    np.random.normal(size=100000),
    s=10, marker='o', facecolors='none', edgecolors='black', alpha=.3)
fig_sc.savefig('test.pdf', format='pdf')
```

This results in a file size of approximately 1.5 MB, since every single point is stored. Can I somehow "reduce" this figure by combining overlapping points?
I tried several options, such as setting dpi=300 and transparent=False, but since PDF stores the figure as a vector image, these naturally didn't change anything.
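A minimal version of what I tried (the exact call shown is just illustrative):

```python
# Neither option reduces the size: the PDF backend still writes
# one vector marker per data point.
fig_sc.savefig('test.pdf', format='pdf', dpi=300, transparent=False)
```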
Things that might work, but have drawbacks:
- hexbin plots: works for a single scatter plot if the resolution and cmap are adjusted correctly, but I want a scatter matrix with (n x n) scatter plots. There is, AFAIK, no built-in hexbin-matrix plot, so I'd have to assemble the grid myself (see the sketch after this list).
- saving to a rasterized format: the plots are for a journal which requests vector graphics wherever possible, so I'd like to avoid storing the figure as a raster image.
- randomly subsampling the data: might work, but will alter the appearance of the plots.
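For reference, assembling the hexbin matrix by hand would look roughly like this (a minimal sketch with made-up dimensions; this is exactly the boilerplate I'd like to avoid):

```python
import matplotlib.pyplot as plt
import numpy as np

n = 3  # number of variables (illustrative)
data = np.random.normal(size=(100000, n))

# One hexbin panel per variable pair, built manually.
fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n))
for i in range(n):
    for j in range(n):
        axes[i, j].hexbin(data[:, j], data[:, i], gridsize=50, cmap='Greys')
fig.savefig('hexbin_matrix.pdf', format='pdf')
```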
Any ideas?
Thanks in advance!

