WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics
AXS - Astronomical Data Processing on the LSST Scale with Apache Spark
Petar Zečević, SV Group, University of Zagreb
Mario Jurić, DIRAC Institute, University of Washington
#UnifiedDataAnalytics #SparkAISummit
About us
Mario Jurić
• Prof. of Astronomy at the University of Washington
• Founding faculty of DIRAC & eScience Institute Fellow
• Fmr. lead of LSST Data Management
Petar Zečević
• CTO at SV Group, Croatia
• CS PhD student at University of Zagreb
• Visiting Fellow at DiRAC institute @ UW
• Author of “Spark in Action”
Context: The Large Survey Revolution in Astronomy
Hipparchus of Rhodes (180-125 BC) In 129 BC, he constructed one of the first star catalogs, containing about 850 stars.
Galileo Galilei (1564-1642) Researched a variety of topics in physics, but is called out here for introducing the Galilean telescope, which allowed us for the first time to zoom in on the cosmos and study individual objects in great detail.
The Astrophysics Two-Step • Surveys – Construct catalogs and maps of objects in the sky. Focus on coarse classification and discovering targets for further follow-up. • Large telescopes – Acquire detailed observations of a few representative objects. Understand the details of astrophysical processes that govern them, and extrapolate that understanding to the entire class.
The Story of Astronomy: 2000 Years of Being Data Poor
Sloan Digital Sky Survey
• 2.5m telescope
• >14,500 deg2
• 0.1” astrometry
• r < 22.5 flux limit
• 5-band, 1% photometry for over 900M stars
• Over 3M R=2000 spectra
• 10 years of ops: ~10 TB of imaging
1,231,051,050 rows (SDSS DR10, PhotoObjAll table), ~500 columns. SDSS facilitated the development of large databases, data-driven discovery, and the move toward what we recognize as Data Science today.
Panoramic Survey Telescope and Rapid Response System (Pan-STARRS)
• 1.8m telescope
• 30,000 deg2
• 50 mas astrometry
• r < 23 flux limit
• 5-band, better than 1% photometry (goal)
• ~700 GB/night
Gaia DR2: 1.7 billion stars (https://sci.esa.int/s/wV6oG5w)
The Large Synoptic Survey Telescope: A Public, Deep, Wide and Fast, Optical Sky Survey
• First Light: 2020; Operations: 2022
• Deep (24th mag), Wide (60% of the sky), Fast (every 15 seconds)
• Largest astronomical camera in the world
• Will repeatedly observe the night sky over 10 years
• 10 million alerts each night (within 60 seconds)
• 37 billion astronomical sources, with time series
• 30 trillion measurements
Overview LSST’s mission is to build a well-understood system that provides a vast astronomical dataset for unprecedented discovery of the deep and dynamic universe.
The Scale of Things to Come
Metric | Amount
Number of detections | 7 trillion rows
Number of objects | 37 billion rows
Nightly alert rate | 10 million
Nightly data rate | >15 TB
Alert latency | 60 seconds
Total images after 10 yrs | 50 PB
Total data after 10 yrs | 83 PB
Objects are detected, measured, and stored in queryable catalogs (tables).
Catalog-driven Science
• Once a catalog is available, astronomers “ask” all kinds of questions
• The traditional paradigm:
– Subset (filter data using a catalog SQL interface online)
– Download data locally
– Analyze (usually Python)
Challenges (part 0)
Dataset Size (keeping ~PBs of data in RDBMSes is not easy, or cheap)
What do you do when the dataset subset is a few ~TBs?
Challenges (part 1)
• I Want it All (interesting science w. whole-dataset operations)
• Better Together (joining datasets is powerful)
• Dataset Size (keeping ~TBs of data in RDBMSes is not easy)
Challenges (part 2)
• Scalability (how do I write analysis code that will scale to petabytes of data?)
• Resources (where are the resources to run this code?)
How do you scale exploratory data analysis to ~PB-sized datasets and thousands of simultaneous users?
Enter Spark, AXS
• AXS: Astronomy eXtensions for Spark
• The main idea:
– Spark is a proven, scalable, cloud-ready and widely-supported analytics framework with full SQL support (legacy support).
– Extend it to exploratory data analysis.
– Add a scalable positional cross-match operator.
– Add a domain-specific Python API layer to PySpark.
– Couple to the S3 API for storage, Kubernetes for orchestration…
• The result: a scalable platform supporting an arbitrarily sized dataset and a large number of users, deployable on either public or private cloud.
Key Issue: Scalable Cross-matching
A cross-match pairs sources by their RA and Dec coordinates: two sources match when one falls within a small search perimeter around the other (a similarity measure can also be used).
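As an illustrative sketch (not AXS code), the match test boils down to an angular separation check. The 2" search radius used below is an assumption for the example:

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions
    (RA/Dec in degrees), via the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    dra, ddec = ra2 - ra1, dec2 - dec1
    a = (math.sin(ddec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(dra / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def is_match(ra1, dec1, ra2, dec2, radius_arcsec=2.0):
    """True if the two positions lie within the search radius."""
    return angular_sep_deg(ra1, dec1, ra2, dec2) * 3600.0 <= radius_arcsec

# two detections 1 arcsec apart in Dec match within a 2" radius
print(is_match(150.0, 20.0, 150.0, 20.0 + 1.0 / 3600))
```

The hard part at scale is not this per-pair test but avoiding the all-pairs comparison, which is what the zone partitioning on the next slides addresses.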
AXS data partitioning
• Data partitioning is at the root of AXS' efficient cross-matching
• Based on the late Jim Gray's “zones algorithm” (Microsoft Research)
• The sky is divided into horizontal “zones” of a certain height
• Adapted for distributed architectures
• Data stored in Parquet files:
– bucketed by zone
– sorted by zone and ra columns
– data from zone borders duplicated to the zone below
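The zone assignment and border duplication can be sketched in a few lines of Python. This illustrates the idea rather than AXS internals; the 1-arcminute zone height and 2" border margin are assumptions for the example:

```python
import math

ZONE_HEIGHT = 1.0 / 60.0  # zone height in degrees (assumed: 1 arcminute)

def zone_of(dec, zone_height=ZONE_HEIGHT):
    """Map a declination (-90..+90 deg) to an integer zone index."""
    return int(math.floor((dec + 90.0) / zone_height))

def partition_rows(rows, eps=2.0 / 3600, zone_height=ZONE_HEIGHT):
    """Assign each (ra, dec) row to its zone. Rows within eps of a
    zone's lower border are duplicated into the zone below, so a
    zone-equality join never misses a match straddling the border."""
    out = []
    for ra, dec in rows:
        z = zone_of(dec, zone_height)
        out.append((z, ra, dec))
        # distance above this zone's lower border
        if (dec + 90.0) - z * zone_height < eps and z > 0:
            out.append((z - 1, ra, dec))
    return out
```

With this layout, a cross-match only ever has to compare rows carrying the same zone index, which is exactly what Parquet bucketing by zone exploits.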
AXS - optimal joins
Epsilon join

SELECT ... FROM TA, TB
WHERE TA.zone = TB.zone
  AND TA.ra BETWEEN TB.ra - e AND TB.ra + e

SPARK-24020: sort-merge join “inner range optimization”
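The reason this query stays cheap is that both tables are sorted by (zone, ra), so the join reduces to a single merge pass with a sliding window over the second table. A pure-Python sketch of that inner-range merge (illustrative, not the Spark implementation):

```python
def epsilon_join(ta, tb, eps):
    """Merge-join two lists of (zone, ra) rows, each sorted by (zone, ra).
    Returns pairs satisfying the SQL predicate
    TA.zone = TB.zone AND TA.ra BETWEEN TB.ra - e AND TB.ra + e.
    The window start only moves forward, so this is a single pass."""
    out = []
    start = 0
    for za, ra_a in ta:
        # skip TB rows that sort before any possible match for (za, ra_a)
        while start < len(tb) and tb[start] < (za, ra_a - eps):
            start += 1
        i = start
        # scan the window of candidate matches in the same zone
        while i < len(tb) and tb[i][0] == za and tb[i][1] <= ra_a + eps:
            out.append(((za, ra_a), tb[i]))
            i += 1
    return out
```

Because `start` never moves backward, the cost is linear in the input sizes plus the number of matches, instead of the quadratic cost of a naive theta-join.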
Other approaches: other systems use HEALPix or Hierarchical Triangular Mesh (HTM).
AXS performance results
• Gaia (1.7 B) x SDSS (800 M): 37 s warm (148 s cold)
• Gaia (1.7 B) x ZTF (2.9 B): 39 s warm (315 s cold)
These tests were run on a single large machine. An AWS deployment scales out nearly linearly, as long as there are sufficient partitions in the dataset.
AXS API
AXS - other functionalities • crossmatch (return all or the first crossmatch candidate) • region queries • cone queries • histogram • histogram2d • Spark array functions for handling lightcurve data • All other Spark functions
Astronomy Example: Computing Light Curve Features with Python UDFs
This works on arbitrarily large datasets!
Cesium (Naul, 2016); Astronomy eXtensions for Spark (Zečević+ 2018)
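As a minimal stand-in for the cesium feature set, here is the kind of per-object function one would wrap in a (pandas) UDF and apply to each object's magnitude array; the specific features below are chosen for illustration, not taken from the talk:

```python
import statistics

def lightcurve_features(mags):
    """Toy per-object light-curve features over a list of magnitudes:
    half-amplitude, mean magnitude, and population standard deviation.
    In AXS this body would run inside a UDF, once per object."""
    return {
        "amplitude": (max(mags) - min(mags)) / 2.0,
        "mean_mag": statistics.fmean(mags),
        "std_mag": statistics.pstdev(mags),
    }
```

The point of the slide is that because the UDF is applied per object, the same function scales from one light curve to the 37 billion objects LSST will produce, with Arrow handling the JVM-to-Python data exchange.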
Observations and experiences • Spark scales really well! • SQL support is fantastic for supporting legacy code • Efficient data exchange with Python is key to having reasonable performance (Arrow and friends) • The language barrier is non-trivial: astronomy is in Python, little experience with JVM/Scala • Pushing Spark into exploratory data analysis – the challenge of converting a batch system to support more dynamic workflows.
“Astronomy 2025” Towards a scalable astronomical analysis platform
DIRAC: Data Intensive Research in Astrophysics and Cosmology
DIRAC Data Engineering Group: we're a collaborative incubator that supports people and communities researching and building the next generation of software technologies for astronomy. We emphasize cross-pollination with other fields and industry, and delivering usable, community-supported projects.
Backups
http://astro.washington.edu
LSST Science Drivers (EPSC-DPS Meeting 2019, Geneva, Switzerland, September 16, 2019)
Cataloging the Solar System
• Potentially Hazardous Asteroids
• Main Belt Asteroids
• Census of small bodies in the Solar System
Exploring the Transient Sky
• Variable stars, supernovae
• Fill in the variability phase-space
• Discovery of new classes of transients
Dark Matter, Dark Energy
• Weak lensing
• Baryon acoustic oscillations
• Supernovae, quasars
Milky Way Structure & Formation
• Structure and evolutionary history
• Spatial maps of stellar characteristics
• Reach well into the halo
Solar System Science with LSST
Animation: SDSS asteroids (Alex Parker, SwRI)
About 0.7 million asteroids are known today; this will grow to >5 million in the next 5 years (estimates: Lynne Jones et al.)
Whole Dataset Operations
• Galactic structure: density/proper motion maps of the Galaxy
– => forall stars: compute distance, bin, create 5D map
• Galactic structure: dust distribution
– => forall stars: compute g-r color, bin, find blue-tip edge, infer dust distribution
• Near-field cosmology: MW satellite searches
– => forall stars: compute colors, convolve with spatial filters, report any satellite-like peaks
• Variability: Bayesian classification of transients and discovery of variables
– => forall stars: get light curves, compute likelihoods, alert if interesting
• …
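Each bullet above follows the same "forall stars: compute, then aggregate" shape. A toy Python version of the first step of the dust-distribution example (compute g-r, bin into a histogram); the bin width is an assumption, and in AXS the same shape becomes a column expression plus a groupBy over billions of rows:

```python
def gr_color_histogram(stars, bin_width=0.05):
    """Compute each star's g-r color and count stars per color bin.
    `stars` is an iterable of (g_mag, r_mag) pairs; the result maps
    integer bin index -> count. Finding the 'blue tip' edge of this
    histogram is the next step of the dust analysis."""
    hist = {}
    for g, r in stars:
        b = int((g - r) // bin_width)
        hist[b] = hist.get(b, 0) + 1
    return hist
```

The key property is that the per-star computation is embarrassingly parallel and the aggregation is a reduce, which is exactly the workload shape Spark is built for.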
Astronomical catalogs
• Just (big!) databases
• Each row corresponds to a detection or an object (star/galaxy/asteroid)
• Producing catalogs from images is not trivial - a non-exhaustive list of problems (for software to solve):
– background estimation
– PSF estimation
– object detection
– image co-addition
– deblending
AXS history: LSD by Mario Jurić
• Tool for querying, cross-matching and analysis of positionally or temporally indexed datasets
• Inspired by Google's BigTable and MapReduce papers
• However, it has some shortcomings:
– fixed data partitioning (significant data skew)
– time-partitioning is problematic (most queries do not slice by time)
– not resilient to worker failures
– contains many custom solutions for functionalities that are common today
Enter Spark and AXS
• Astronomy eXtensions for Spark
• The DiRAC institute @ UW saw the need for a next-generation astronomical analysis tool
• Efficient cross-matching
• Based on industry standards (Apache Spark)
• Provides simple (but powerful) astronomical API extensions
• Easy to use on-premises or in the cloud
Scaling with Spark https://www.toptal.com/spark/introduction-to-apache-spark
+ government-sponsored private clouds (e.g., JetStream)
Meeting the Challenges: Resources, Dataset Storage, Scalable Analysis Code, Interface