Continuous Applications at the Scale of 100 Teams with Databricks Delta and Structured Streaming
Viacheslav Inozemtsev, Max Schultze
April 25, 2019
OUTLINE
● Introduction of Zalando
● Zalando's Processing Platform
● Databricks Use Cases
● Lessons Learned
Who we are
Viacheslav Inozemtsev
● Data Engineer
● Degrees in Applied Math and in Computer Science
● Working with Spark since 0.9.2
Max Schultze
● Data Engineer
● MSc in Computer Science
● Took part in the early development of Apache Flink
Introduction of Zalando
Zalando's Data Lake
[Architecture diagram, built up across several slides: the DWH, the Data Center, and the Event Bus feed Ingestion, which fills Storage and powers Serving; a Metastore, Ad-Hoc Querying, and the Processing Platform complete the picture.]
Zalando's Databricks Processing Platform
Zalando's Databricks Processing Platform - Technical Setup
[Diagram-only slides: the technical setup is built up step by step; no further text to extract.]
Zalando's Databricks Processing Platform - Organizational Setup
Introduction to Databricks
● RSA (Resident Solutions Architect)
● Office Hours
Initial Setup
● Inner Source Configuration
Development Phase
● Office Hours
● Guest Developer
Productionizing
● 24/7 Support
Databricks Use Cases
Batch Ingestion from Data Warehouse
[The data lake diagram again, locating this use case.]
Batch Ingestion from Data Warehouse
● Problem 1: extraction from databases via JDBC can be slow
● Solution:
○ use the parallelism of the Spark JDBC reader
○ for partitioned tables, create a view exposing a PARTITION_ID column and partition the read on it (see the sketch below)
○ this works especially well for tables partitioned across multiple machines
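As a rough illustration, a PySpark read of this kind might look as follows; the JDBC URL, credentials, view name, and bounds are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark opens numPartitions parallel JDBC connections, each reading one
# slice of the PARTITION_ID range, instead of a single serial cursor.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dwh.example.com:5432/dwh")  # hypothetical
      .option("dbtable", "analytics.orders_with_partition_id")      # hypothetical view exposing PARTITION_ID
      .option("user", "loader")                                     # hypothetical
      .option("password", "<password>")
      .option("partitionColumn", "PARTITION_ID")
      .option("lowerBound", "0")    # min/max of PARTITION_ID; rows outside are still read
      .option("upperBound", "127")
      .option("numPartitions", "16")
      .load())
```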
Batch Ingestion from Data Warehouse
● Problem 2: the data warehouse is often still on premises
● Solution:
○ resolve this early!
○ obtain a dedicated VPN or a direct fiber connection
○ move your databases to the cloud
Batch Ingestion from Data Warehouse
● Problem 3: you can overload the source database
● Solution:
○ run one cluster and one Databricks job per database
○ control the number of concurrent connections via the environment variable SPARK_WORKER_CORES (see the sketch below)
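A sketch of what such a per-database cluster definition could look like, submitted through the standard Databricks clusters REST API; the workspace URL, token, runtime version, and sizes are assumptions:

```python
import requests

# With SPARK_WORKER_CORES=4 each worker exposes only 4 cores to Spark,
# which caps the number of concurrent JDBC connections the job can open.
cluster_spec = {
    "cluster_name": "dwh-ingestion-sales",   # hypothetical: one cluster per database
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "spark_env_vars": {"SPARK_WORKER_CORES": "4"},
}

resp = requests.post(
    "https://<databricks-host>/api/2.0/clusters/create",  # hypothetical workspace
    headers={"Authorization": "Bearer <token>"},
    json=cluster_spec,
)
resp.raise_for_status()
```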
Continuous Ingestion of Event Data
[The data lake diagram again, locating this use case.]
Continuous Ingestion of Event Data
● Problem 1: the JSON schemas of the topics evolve
● Solution:
○ make the pipeline customizable
○ ask consumers to provide the schema (see the sketch below)
○ use inner source
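A minimal Structured Streaming sketch of applying a consumer-provided schema; the Kafka source, topic, and fields are assumptions for illustration:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# `spark` is the SparkSession provided by the Databricks runtime.

# Hypothetical schema contributed by the consuming team (e.g. via inner source)
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("occurred_at", TimestampType()),
    StructField("status", StringType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "eventbus:9092")  # hypothetical
       .option("subscribe", "order-events")                 # hypothetical topic
       .load())

# Records that do not match the schema yield nulls instead of failing the job.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), order_schema).alias("event"))
          .select("event.*"))
```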
Continuous Ingestion of Event Data
● Problem 2: there are 3000 topics to archive
● Solution:
○ group the Spark Structured Streaming jobs into families
○ create one Databricks job per family
○ run each family on a separate cluster (see the sketch below)
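A sketch of one family job that starts one archiving query per topic on a shared cluster; topic names, paths, and the trigger interval are made up:

```python
# `spark` is the SparkSession provided by the Databricks runtime.

def archive_topic(topic):
    """Start one Structured Streaming query that archives a single topic to Delta."""
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "eventbus:9092")  # hypothetical
           .option("subscribe", topic)
           .load())
    return (raw.writeStream.format("delta")
            .option("checkpointLocation", f"s3://data-lake/checkpoints/{topic}")  # hypothetical
            .trigger(processingTime="1 minute")
            .start(f"s3://data-lake/raw/{topic}"))  # hypothetical

# One Databricks job per family; each family runs on its own cluster.
family = ["order-events", "payment-events", "shipment-events"]  # hypothetical grouping
queries = [archive_topic(t) for t in family]

# Block until any query fails, so the Databricks job can restart the family.
spark.streams.awaitAnyTermination()
```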
Continuous Ingestion of Event Data
● Problem 3: the small-files problem
● Solution:
○ run OPTIMIZE every night (see the sketch below)
○ use a Databricks job cluster for it
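The nightly compaction itself can be a tiny job running on an ephemeral Databricks job cluster; the table paths below are hypothetical:

```python
# `spark` is the SparkSession provided by the Databricks runtime.
# OPTIMIZE compacts the many small files written by the streaming queries
# into larger ones; running it nightly on a job cluster keeps the cost bounded.
for topic in ["order-events", "payment-events", "shipment-events"]:  # hypothetical
    spark.sql(f"OPTIMIZE delta.`s3://data-lake/raw/{topic}`")
```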
Continuous Serving of Data
[The data lake diagram again, locating this use case.]
Continuous Serving of Data
● Problem 1: data governance and access control
● Solution:
○ centralised management of IAM roles
○ an approved data access request has to be presented first
Continuous Serving of Data
● Problem 2: the teams' level of Spark expertise varies
● Solution:
○ inner source for cluster creation and editing
○ office hours to consult teams
○ guest developers
Continuous Serving of Data
● Problem 3: access from the cluster to the team's own resources
● Solution:
○ S3 bucket: a policy in the IAM role plus a policy on the bucket (see the sketch below)
○ Redshift: create a private link
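A sketch of the bucket half of this pattern using boto3; account ID, role, and bucket names are invented, and the role's own IAM policy must contain a matching allow statement:

```python
import json
import boto3

# Bucket policy granting the team's cluster role read access; the role's own
# policy must allow the same actions ("policy in the role, policy in the bucket").
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/team-cluster-role"},  # hypothetical
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::team-bucket",      # hypothetical
            "arn:aws:s3:::team-bucket/*",
        ],
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="team-bucket", Policy=json.dumps(bucket_policy)
)
```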
Lessons Learned
1. Understand the capabilities of Spark and Databricks early
● Involve Databricks solutions architects
● Build POCs of different approaches
● Learn Spark
2. Delta tables are great
● Delta + Structured Streaming enable continuous applications
● Transactional guarantees for readers
● Deletion of data for the right to be forgotten (GDPR), as sketched below
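A minimal sketch of such a right-to-be-forgotten deletion on a Databricks Delta table; the table path, predicate, and retention window are assumptions:

```python
# `spark` is the SparkSession provided by the Databricks runtime.

# DELETE rewrites only the data files that contain matching rows.
spark.sql("""
    DELETE FROM delta.`s3://data-lake/customers`  -- hypothetical table path
    WHERE customer_id = 'c-42'                    -- hypothetical id
""")

# VACUUM later removes the superseded files physically,
# once they are past the retention window (7 days here).
spark.sql("VACUUM delta.`s3://data-lake/customers` RETAIN 168 HOURS")
```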
3. Automate the management of infrastructure
● Cluster definitions as JSON files (see the sketch below)
● Easy tooling
● Automated roll-outs of updates
● Disposable infrastructure
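Continuing the earlier assumption of the Databricks REST API, a small roll-out script over a directory of JSON cluster definitions might look like this:

```python
import json
import pathlib
import requests

HOST = "https://<databricks-host>"          # hypothetical workspace
HEADERS = {"Authorization": "Bearer <token>"}

# Every *.json file in the repo is one cluster definition; rolling out an
# update means re-applying the files, which keeps the clusters disposable.
for spec_file in pathlib.Path("clusters").glob("*.json"):
    spec = json.loads(spec_file.read_text())
    if "cluster_id" in spec:
        resp = requests.post(f"{HOST}/api/2.0/clusters/edit", headers=HEADERS, json=spec)
    else:
        resp = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=spec)
    resp.raise_for_status()
```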
4. Externalise responsibilities
● Use inner source to provide schemas, transformations, and layout
● Educate your users
● Build a community
Continuous Applications at the Scale of 100 Teams with Databricks Delta and Structured Streaming
Viacheslav Inozemtsev: viacheslav.inozemtsev@zalando.de
Max Schultze: max.schultze@zalando.de, @mcs1408
www.zalando.com | jobs.zalando.com/tech