
I am trying to create a scatterplot with matplotlib that consists of ca. 20 million data points. Even after setting the alpha value as low as it can go before the data disappears entirely, the result is just a completely black plot.

plt.scatter(timedPlotData, plotData, alpha=0.01, marker='.') 

The x-axis is a continuous timeline of about 2 months and the y-axis consists of 150k consecutive integer values.

Is there any way to plot all the points so that their distribution over time is still visible?

Thank you for your help.

  • What are you trying to show? I would think a density map (2D histogram) would be more informative. Commented Sep 18, 2013 at 18:15
  • A very high res monitor is 2500 x 1400 pixels -- 3.6 million pixels. If you are doing a scatterplot of 20 million points (i.e., many more points than a high-res monitor has pixels) and getting nothing but black, maybe you need to filter your data? Commented Sep 18, 2013 at 18:19
  • @tcaswell I want to show how access points are used over the course of the 2 months. I looked at numpy's histogram2d but wouldn't I lose the date information? Commented Sep 18, 2013 at 18:26
  • no, just use a time-bin that is sized for the scale you care about (ie 1 day). Can you provide some more information about what you are trying to plot? What you are trying to show will feed into how you show it. Commented Sep 18, 2013 at 18:28
  • 2
    @FrozenSUSHI: Ok, 2560 x 1440. My mistake. That is still only 3,686,400 pixels. Let's say you could fill A4 paper at 300 PPI. That is 2480 x 3508 or 8,699,840 points. Paper or monitor - You still have a too much data to see if the points are evenly distributed roughly. You need to filter more specific data or display it differently. Commented Sep 18, 2013 at 18:49
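Following the time-bin suggestion in the comments, here is a minimal sketch of a 2D histogram with one bin per day, so the date information is kept. The arrays are synthetic stand-ins for the real data, not the asker's actual values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the real data: float day numbers spanning
# ~2 months, and integer values in the asker's 150k range.
rng = np.random.default_rng(0)
times = rng.uniform(0, 60, 200_000)
values = rng.integers(0, 150_000, 200_000)

# One bin per day on the time axis keeps the date information;
# coarse bins on the value axis keep the plot legible.
counts, tedges, vedges = np.histogram2d(times, values, bins=[60, 150])

plt.pcolormesh(tedges, vedges, counts.T)
plt.xlabel('day')
plt.ylabel('value')
plt.colorbar(label='count')
plt.show()
```

Computing the histogram is cheap even for 20 million points, since only the bin counts are kept in memory rather than one artist per point.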

4 Answers


There's more than one way to do this. A lot of folks have suggested a heatmap/kernel-density-estimate/2d-histogram. @Bucky suggested using a moving average. In addition, you can fill between a moving min and moving max, and plot the moving mean over the top. I often call this a "chunkplot", but that's a terrible name. The implementation below assumes that your time (x) values are monotonically increasing. If they're not, it's simple enough to sort y by x before "chunking" in the chunkplot function.

Here are a couple of different ideas. Which is best will depend on what you want to emphasize in the plot. Note that this will be rather slow to run, but that's mostly due to the scatterplot. The other plotting styles are much faster.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
np.random.seed(1977)

def main():
    x, y = generate_data()
    fig, axes = plt.subplots(nrows=3, sharex=True)
    for ax in axes.flat:
        ax.xaxis_date()
    fig.autofmt_xdate()

    axes[0].set_title('Scatterplot of all data')
    axes[0].scatter(x, y, marker='.')

    axes[1].set_title('"Chunk" plot of data')
    chunkplot(x, y, chunksize=1000, ax=axes[1],
              edgecolor='none', alpha=0.5, color='gray')

    axes[2].set_title('Hexbin plot of data')
    axes[2].hexbin(x, y)

    plt.show()

def generate_data():
    # Generate a very noisy but interesting timeseries
    x = mdates.drange(dt.datetime(2010, 1, 1), dt.datetime(2013, 9, 1),
                      dt.timedelta(minutes=10))
    num = x.size
    y = np.random.random(num) - 0.5
    y.cumsum(out=y)
    y += 0.5 * y.max() * np.random.random(num)
    return x, y

def chunkplot(x, y, chunksize, ax=None, line_kwargs=None, **kwargs):
    if ax is None:
        ax = plt.gca()
    if line_kwargs is None:
        line_kwargs = {}

    # Wrap the array into a 2D array of chunks, truncating the last chunk if
    # chunksize isn't an even divisor of the total size.
    # (This part won't use _any_ additional memory)
    numchunks = y.size // chunksize
    ychunks = y[:chunksize*numchunks].reshape((-1, chunksize))
    xchunks = x[:chunksize*numchunks].reshape((-1, chunksize))

    # Calculate the max, min, and means of chunksize-element chunks...
    max_env = ychunks.max(axis=1)
    min_env = ychunks.min(axis=1)
    ycenters = ychunks.mean(axis=1)
    xcenters = xchunks.mean(axis=1)

    # Now plot the bounds and the mean...
    fill = ax.fill_between(xcenters, min_env, max_env, **kwargs)
    line = ax.plot(xcenters, ycenters, **line_kwargs)[0]
    return fill, line

main()

[Image: the three resulting panels -- scatterplot of all data, "chunk" plot, and hexbin plot of the same timeseries]


3 Comments

Or just do your scatter plot in light grey, and overlay with a line of daily average values in a dark blue or red. Then any isolated outliers won't artificially expand the width of the chunkplot's data band, they will just be lone points away from the main group.
@PaulMcGuire - The problem with that is that the scatter plot takes a very long time to render for a large number of points. A big advantage of the "chunkplot" is that it's much faster to draw. That having been said, you're making a good point. The min and max often aren't the best statistics to use. You could do all kinds of other things, including plotting two filled intervals, say one for the 25th and 75th quantiles, and then the min and max behind that. Or, if you don't mind the long rendering time, just do the scatterplot as you suggested.
I ended up with calculating the sum for each 10 minutes and filtering out a lot of unnecessary data. Applying your "Hexbin" plot did the trick for me in the end.
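The quantile variant mentioned in the comments can be sketched by reusing the answer's chunking trick and swapping the min/max envelope for percentiles. The data here is synthetic and the names are illustrative:

```python
import numpy as np

# Synthetic stand-in for a long noisy series.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=1_000_000))
chunksize = 1_000

# Chunk exactly as in the answer, then take quantiles instead of min/max.
numchunks = y.size // chunksize
chunks = y[:chunksize * numchunks].reshape(-1, chunksize)
q25 = np.percentile(chunks, 25, axis=1)
q75 = np.percentile(chunks, 75, axis=1)
lo, hi = chunks.min(axis=1), chunks.max(axis=1)

# Layer the bands: ax.fill_between(xcenters, lo, hi, ...) behind
# ax.fill_between(xcenters, q25, q75, ...) gives the two-interval plot,
# so a lone outlier widens only the faint outer band.
```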

For each day, tally up the frequency of each value (a collections.Counter will do this nicely), then plot a heatmap of the values, one per day. For publication, use a grayscale for the heatmap colors.
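A minimal sketch of this idea, with hypothetical stand-in data (coarse value bands rather than all 150k raw values, which keeps the heatmap a manageable size):

```python
from collections import Counter
import numpy as np

# Hypothetical stand-ins: an integer day index per observation, and
# values already grouped into 150 coarse bands.
rng = np.random.default_rng(0)
days = rng.integers(0, 60, 100_000)
values = rng.integers(0, 150, 100_000)

# One Counter per day: frequency of each value band on that day.
per_day = [Counter() for _ in range(60)]
for d, v in zip(days, values):
    per_day[d][v] += 1

# Pack the counters into a 2D array: rows are value bands, columns are days.
heat = np.zeros((150, 60))
for d, counter in enumerate(per_day):
    for v, n in counter.items():
        heat[v, d] = n

# plt.imshow(heat, aspect='auto', cmap='gray_r') would then draw the
# grayscale heatmap suggested for publication.
```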

1 Comment

I think this will still leave you with a 60x150k heat map, but yes.

My recommendation would be to use a sorting and moving average algorithm on the raw data before you plot it. This should leave the mean and trend intact over the time period of interest while providing you with a reduction in clutter on the plot.
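One way to sketch this with numpy, using a synthetic series as a stand-in for the raw data (the window size is illustrative):

```python
import numpy as np

# Synthetic noisy series standing in for the asker's 20 million points.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=1_000_000))

# If x isn't already monotonic, sort y by x first. Then smooth with a
# uniform moving average; mode='valid' avoids edge artifacts.
window = 10_000
kernel = np.ones(window) / window
smooth = np.convolve(y, kernel, mode='valid')

# smooth has len(y) - window + 1 points -- plot this instead of the
# raw series to keep the mean and trend while dropping the clutter.
```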

Comments


Group values into bands on each day and use a 3d histogram of count, value band, day.

That way you can clearly see the number of occurrences in a given band on each day.
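A sketch of this band/day binning with synthetic stand-in data (the bin counts and array names are illustrative): the 2D count array holds the bar heights of the 3d histogram.

```python
import numpy as np

# Hypothetical stand-ins: float day numbers over ~2 months, integer values.
rng = np.random.default_rng(0)
days = rng.uniform(0, 60, 500_000)
values = rng.integers(0, 150_000, 500_000)

# 60 daily bins x 100 value bands; counts[i, j] is the height of the
# 3d-histogram bar for value band j on day i.
counts, day_edges, band_edges = np.histogram2d(days, values, bins=[60, 100])

# mpl_toolkits.mplot3d's Axes3D.bar3d can draw these heights as 3D bars,
# though a flat heatmap of the same counts is often easier to read.
```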

Comments
