Rob Emanuele @lossyrob ANALYZING LARGE RASTER DATA IN A JUPYTER NOTEBOOK WITH GEOPYSPARK ON AWS
Connect to the WIFI Network: Harvard University
http://getonline.harvard.edu
Click “I am a guest”
Credentials: U: foss4g2017@gmail.com P: 7RFQU3rm
FIRST: Find your Jupyter Notebook URL
https://git.io/v77lh (lowercase L)
Visit the URL next to your name
Log in to the Jupyter Hub: U: hadoop P: hadoop
OUTLINE
8:00 - 8:30 Intro and Background
8:30 - 9:10 Section 1: Land Cover data
9:10 - 10:00 Section 2: Landsat 8 data
10:00 - 10:10 BREAK
10:10 - 10:30 Deployment and Ingestion
10:30 - 11:10 Section 3: Combining data layers
11:10 - 12:00 Section 4: Making Cool Maps
NOW: A MOTIVATING EXAMPLE
rdd.map(lambda x: x + 1) Source: http://silverpond.com.au/2016/10/06/balancing-spark.ht
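As a rough illustration of the slide above, here is a plain-Python sketch of what `rdd.map(lambda x: x + 1)` does conceptually: the function is applied to every record, one partition at a time, and the partitioning itself is unchanged. The `partitions` list and the `map_partitions` helper are hypothetical stand-ins for an RDD, not PySpark API; in PySpark the same lambda would be passed to `map` on a real RDD.

```python
# A plain-Python stand-in for an RDD: a list of partitions, each a list of records.
# (Hypothetical structure for illustration only; this is not how PySpark stores data.)
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]


def map_partitions(parts, fn):
    """Apply fn to every record, partition by partition, as independent tasks would."""
    return [[fn(x) for x in part] for part in parts]


# Same lambda as on the slide.
mapped = map_partitions(partitions, lambda x: x + 1)

# "Collecting" flattens the partitions back into one local list.
collected = [x for part in mapped for x in part]
print(collected)
```

Because `map` touches each record independently, no data has to move between partitions, which is why it is cheap compared with operations that shuffle.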
(0, 0) (1, 0) (2, 0)
(0, 1) (1, 1) (2, 1)
(0, 2) (1, 2) (2, 2)
(0, 0) (1, 0) (2, 0)
(0, 1) (1, 1) (2, 1)
(0, 2) (1, 2) (2, 2)
Node 1 Node 2 Node 3
(0, 1) (1, 1) (2, 1)
Node 1 Node 2 Node 3
(0, 1) (1, 1) (2, 1)
Node 1 Node 2 Node 3
rdd.bufferTiles(…)
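The point of the slides above is that buffering a tile needs pixels from its neighbors, and those neighbors may live on other nodes. The sketch below counts, for a 3x3 grid of (col, row) keys, how many neighboring tiles sit on a different node. The partitioning rule (`node_for`, one grid row per node) is an assumption made purely for illustration; the slides do not say how keys are assigned to nodes, and none of these helpers are GeoTrellis or GeoPySpark API.

```python
from itertools import product

COLS, ROWS, NODES = 3, 3, 3

# All (col, row) keys of the 3x3 layout shown on the slides.
keys = [(col, row) for row in range(ROWS) for col in range(COLS)]


def node_for(key):
    """Assumed partitioner: every grid row goes to its own node (Node 1..3)."""
    _, row = key
    return row % NODES + 1


def neighbors(key):
    """Yield the up-to-8 keys adjacent to `key` that fall inside the grid."""
    col, row = key
    for dc, dr in product((-1, 0, 1), repeat=2):
        n = (col + dc, row + dr)
        if (dc, dr) != (0, 0) and 0 <= n[0] < COLS and 0 <= n[1] < ROWS:
            yield n


def remote_neighbors(key):
    """Neighbors whose tiles would have to be fetched from another node."""
    return [n for n in neighbors(key) if node_for(n) != node_for(key)]


for key in [(0, 1), (1, 1), (2, 1)]:
    print(key, "on node", node_for(key), "needs", len(remote_neighbors(key)), "remote tiles")
```

Under this assumed layout the center tile needs data from six tiles on other nodes, which is the kind of cross-node traffic a buffering operation implies.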
Interactive and Batch Processing of large raster data
Web-Speed Processing of small to medium sized raster data
GeoTrellis Ecosystem
Raster Foundry
Raster Frames: Spark SQL and Spark ML support
GeoPySpark: Python bindings
Vector Pipes: Vector Tiles on Spark
PDAL integration: Point Clouds on Spark
GeoPySpark
Started December 2016
Follows PySpark’s model of communication between the Java Virtual Machine and Python
Access GeoTrellis functionality through Python, and integrate with your favorite Python raster tools (numpy + friends)
0.2 is released!
EXERCISE 1: ANALYZING LAND COVER DATA
EXERCISE 2: WORKING WITH LANDSAT IMAGERY AND NDVI THROUGH TIME
(SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) …
SpaceTimeKey ≈ (col, row, instant)
(SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) …
lambda lambda lambda
(SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) …
(SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) …
(Shuffle)
(SpatialKey, [(DateTime, Tile), (DateTime, Tile)]) (SpatialKey, [(DateTime, Tile)]) …
(SpatialKey, [(DateTime, Tile), (DateTime, Tile)]) (SpatialKey, [(DateTime, Tile)]) …
mosaic mosaic
(SpatialKey, Tile) (SpatialKey, Tile) …
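The sequence of slides above (re-key by space, group, then mosaic) can be sketched in plain Python. This is a minimal sketch under stated assumptions: tiles are nested lists with `None` as NoData, `group_by_key` stands in for the shuffle, and `mosaic` lets the most recent non-empty pixel win. The slides do not specify the mosaic rule, and all names here (`to_spatial`, `group_by_key`, `mosaic`) are hypothetical, not GeoPySpark API.

```python
from collections import defaultdict, namedtuple
from datetime import datetime

SpaceTimeKey = namedtuple("SpaceTimeKey", "col row instant")
SpatialKey = namedtuple("SpatialKey", "col row")

# Toy (SpaceTimeKey, Tile) records; None marks NoData pixels.
records = [
    (SpaceTimeKey(0, 0, datetime(2017, 6, 1)), [[1, None], [None, None]]),
    (SpaceTimeKey(0, 0, datetime(2017, 7, 1)), [[None, 2], [None, 5]]),
    (SpaceTimeKey(1, 0, datetime(2017, 6, 1)), [[7, 8], [9, None]]),
]


def to_spatial(record):
    """(SpaceTimeKey, Tile) -> (SpatialKey, (DateTime, Tile)), the 'lambda' step."""
    key, tile = record
    return SpatialKey(key.col, key.row), (key.instant, tile)


def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def mosaic(dated_tiles):
    """Assumed rule: for each pixel, take the newest tile that has data there."""
    ordered = [tile for _, tile in sorted(dated_tiles, key=lambda dt: dt[0], reverse=True)]
    rows, cols = len(ordered[0]), len(ordered[0][0])
    return [
        [next((t[r][c] for t in ordered if t[r][c] is not None), None) for c in range(cols)]
        for r in range(rows)
    ]


grouped = group_by_key(map(to_spatial, records))
mosaicked = {key: mosaic(tiles) for key, tiles in grouped.items()}
print(mosaicked[SpatialKey(0, 0)])
```

The grouping step is the only one that moves data between partitions; the re-keying and the mosaic are both per-record operations.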
BREAK!
WHERE AND HOW ARE THESE NOTEBOOKS RUNNING?
WHERE’S THIS DATA COMING FROM?
Supported Backends
EXERCISE 3: COMBINING LAND COVER AND NDVI TO DETECT CROP CYCLES
(SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) …
(SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) …
map_to_spatial map_to_spatial map_to_spatial
(SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) …
STK = SpaceTimeKey
ndwi_rdd: (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) …
nlcd_layer.to_numpy_rdd(): (SpatialKey, Tile) (SpatialKey, Tile) …
(Shuffle)
(SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile), Tile)) …
(SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile), Tile)) …
mask_ndwi mask_ndwi mask_ndwi
(SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) …
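The join-then-mask flow in the slides above can be sketched in plain Python. This is only an illustration: the `join` helper stands in for the shuffle-backed RDD join, tiles are nested lists with `None` as NoData, and this `mask_ndwi` keeps pixels whose land cover class is 82 (NLCD "Cultivated Crops"). That masking rule is an assumption chosen to fit the crop-cycle theme; the workshop notebook's actual `mask_ndwi` is not shown in the slides.

```python
from collections import namedtuple

SpaceTimeKey = namedtuple("SpaceTimeKey", "col row instant")
SpatialKey = namedtuple("SpatialKey", "col row")

CROPS = 82  # NLCD "Cultivated Crops"; assumed here to be the class of interest

# (SpatialKey, (STK, Tile)) records, as produced by the map_to_spatial step.
ndwi_rdd = [
    (SpatialKey(0, 0), (SpaceTimeKey(0, 0, "2017-06-01"), [[0.1, 0.4], [0.3, 0.2]])),
    (SpatialKey(0, 0), (SpaceTimeKey(0, 0, "2017-07-01"), [[0.5, 0.6], [0.7, 0.8]])),
]

# (SpatialKey, Tile) records for the land cover layer.
nlcd_rdd = [(SpatialKey(0, 0), [[82, 41], [82, 11]])]


def join(left, right):
    """Inner join on key: (K, V) and (K, W) -> (K, (V, W))."""
    right_by_key = dict(right)
    return [(k, (v, right_by_key[k])) for k, v in left if k in right_by_key]


def mask_ndwi(record):
    """(SpatialKey, ((STK, Tile), Tile)) -> (SpaceTimeKey, masked Tile)."""
    _, ((stk, index_tile), nlcd_tile) = record
    masked = [
        [value if cls == CROPS else None for value, cls in zip(value_row, class_row)]
        for value_row, class_row in zip(index_tile, nlcd_tile)
    ]
    return stk, masked


masked_rdd = [mask_ndwi(record) for record in join(ndwi_rdd, nlcd_rdd)]
for stk, tile in masked_rdd:
    print(stk.instant, tile)
```

Carrying the SpaceTimeKey along inside the value is what lets the result be re-keyed by space and time after the spatial join.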
EXERCISE 4: COMBINING IMAGERY, ELEVATION AND LAND COVER DATA TO MAKE A COOL LOOKING MAP
TWEET YOUR SWEET MAP SCREENSHOTS WITH #GEOPYSPARK #FOSS4G!
FINAL QUESTIONS?
Thank you!

Analyzing Larger Raster Data in a Jupyter Notebook with GeoPySpark on AWS - FOSS4G 2017 Workshop