Continuous Applications at the Scale of 100 Teams with Databricks Delta and Structured Streaming
Viacheslav Inozemtsev, Max Schultze
April 25, 2019
OUTLINE
● Introduction of Zalando
● Zalando's Processing Platform
● Databricks Use Cases
● Lessons Learned
Who we are
Viacheslav Inozemtsev
● Data Engineer
● Degrees in Applied Math and in Computer Science
● Working with Spark since 0.9.2
Max Schultze
● Data Engineer
● MSc in Computer Science
● Took part in the early development of Apache Flink
Introduction of Zalando
Zalando's Data Lake
[Architecture diagram, built up across several slides: the DWH, the Data Center, and the Event Bus feed Ingestion, which fills Storage and powers Serving; a Metastore, Ad-Hoc Querying, and the Processing Platform complete the picture.]
Zalando's Databricks Processing Platform
Zalando's Databricks Processing Platform - Technical Setup
[Diagram-only slides: the technical setup is built up step by step; no further text to extract.]
Zalando's Databricks Processing Platform - Organizational Setup
Introduction to Databricks
● RSA (Resident Solutions Architect)
● Office Hours
Initial Setup
● Inner Source Configuration
Development Phase
● Office Hours
● Guest Developer
Productionizing
● 24/7 Support
Databricks Use Cases
Batch Ingestion from Data Warehouse
[The data lake diagram again, locating this use case.]
Batch Ingestion from Data Warehouse
● Problem 1: extraction from databases via JDBC can be slow
● Solution:
○ use the parallelism of the Spark JDBC reader
○ for partitioned tables, create a view exposing a PARTITION_ID column and partition the read on it (see the sketch below)
○ this works especially well for tables partitioned across multiple machines
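As a rough illustration, a PySpark read of this kind might look as follows; the JDBC URL, credentials, view name, and bounds are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark opens numPartitions parallel JDBC connections, each reading one
# slice of the PARTITION_ID range, instead of a single serial cursor.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dwh.example.com:5432/dwh")  # hypothetical
      .option("dbtable", "analytics.orders_with_partition_id")      # hypothetical view exposing PARTITION_ID
      .option("user", "loader")                                     # hypothetical
      .option("password", "<password>")
      .option("partitionColumn", "PARTITION_ID")
      .option("lowerBound", "0")    # min/max of PARTITION_ID; rows outside are still read
      .option("upperBound", "127")
      .option("numPartitions", "16")
      .load())
```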
Batch Ingestion from Data Warehouse
● Problem 2: the data warehouse is often still on premises
● Solution:
○ resolve this early!
○ obtain a dedicated VPN or a direct fiber connection
○ move your databases to the cloud
Batch Ingestion from Data Warehouse
● Problem 3: you can overload the source database
● Solution:
○ run one cluster and one Databricks job per database
○ control the number of concurrent connections via the environment variable SPARK_WORKER_CORES (see the sketch below)
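A sketch of what such a per-database cluster definition could look like, submitted through the standard Databricks clusters REST API; the workspace URL, token, runtime version, and sizes are assumptions:

```python
import requests

# With SPARK_WORKER_CORES=4 each worker exposes only 4 cores to Spark,
# which caps the number of concurrent JDBC connections the job can open.
cluster_spec = {
    "cluster_name": "dwh-ingestion-sales",   # hypothetical: one cluster per database
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "spark_env_vars": {"SPARK_WORKER_CORES": "4"},
}

resp = requests.post(
    "https://<databricks-host>/api/2.0/clusters/create",  # hypothetical workspace
    headers={"Authorization": "Bearer <token>"},
    json=cluster_spec,
)
resp.raise_for_status()
```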
Continuous Ingestion of Event Data
[The data lake diagram again, locating this use case.]
Continuous Ingestion of Event Data
● Problem 1: the JSON schemas of the topics evolve
● Solution:
○ make the pipeline customizable
○ ask consumers to provide the schema (see the sketch below)
○ use inner source
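A minimal Structured Streaming sketch of applying a consumer-provided schema; the Kafka source, topic, and fields are assumptions for illustration:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# `spark` is the SparkSession provided by the Databricks runtime.

# Hypothetical schema contributed by the consuming team (e.g. via inner source)
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("occurred_at", TimestampType()),
    StructField("status", StringType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "eventbus:9092")  # hypothetical
       .option("subscribe", "order-events")                 # hypothetical topic
       .load())

# Records that do not match the schema yield nulls instead of failing the job.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), order_schema).alias("event"))
          .select("event.*"))
```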
Continuous Ingestion of Event Data
● Problem 2: there are 3000 topics to archive
● Solution:
○ group the Spark Structured Streaming jobs into families
○ create one Databricks job per family
○ run each family on a separate cluster (see the sketch below)
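A sketch of one family job that starts one archiving query per topic on a shared cluster; topic names, paths, and the trigger interval are made up:

```python
# `spark` is the SparkSession provided by the Databricks runtime.

def archive_topic(topic):
    """Start one Structured Streaming query that archives a single topic to Delta."""
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "eventbus:9092")  # hypothetical
           .option("subscribe", topic)
           .load())
    return (raw.writeStream.format("delta")
            .option("checkpointLocation", f"s3://data-lake/checkpoints/{topic}")  # hypothetical
            .trigger(processingTime="1 minute")
            .start(f"s3://data-lake/raw/{topic}"))  # hypothetical

# One Databricks job per family; each family runs on its own cluster.
family = ["order-events", "payment-events", "shipment-events"]  # hypothetical grouping
queries = [archive_topic(t) for t in family]

# Block until any query fails, so the Databricks job can restart the family.
spark.streams.awaitAnyTermination()
```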
Continuous Ingestion of Event Data
● Problem 3: the small-files problem
● Solution:
○ run OPTIMIZE every night (see the sketch below)
○ use a Databricks job cluster for it
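The nightly compaction itself can be a tiny job running on an ephemeral Databricks job cluster; the table paths below are hypothetical:

```python
# `spark` is the SparkSession provided by the Databricks runtime.
# OPTIMIZE compacts the many small files written by the streaming queries
# into larger ones; running it nightly on a job cluster keeps the cost bounded.
for topic in ["order-events", "payment-events", "shipment-events"]:  # hypothetical
    spark.sql(f"OPTIMIZE delta.`s3://data-lake/raw/{topic}`")
```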
Continuous Serving of Data
[The data lake diagram again, locating this use case.]
Continuous Serving of Data
● Problem 1: data governance and access control
● Solution:
○ centralised management of IAM roles
○ an approved data access request has to be presented first
Continuous Serving of Data
● Problem 2: the teams' level of Spark expertise varies
● Solution:
○ inner source for cluster creation and editing
○ office hours to consult teams
○ guest developers
Continuous Serving of Data
● Problem 3: access from the cluster to the team's own resources
● Solution:
○ S3 bucket: a policy in the IAM role plus a policy on the bucket (see the sketch below)
○ Redshift: create a private link
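A sketch of the bucket half of this pattern using boto3; account ID, role, and bucket names are invented, and the role's own IAM policy must contain a matching allow statement:

```python
import json
import boto3

# Bucket policy granting the team's cluster role read access; the role's own
# policy must allow the same actions ("policy in the role, policy in the bucket").
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/team-cluster-role"},  # hypothetical
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::team-bucket",      # hypothetical
            "arn:aws:s3:::team-bucket/*",
        ],
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="team-bucket", Policy=json.dumps(bucket_policy)
)
```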
Lessons Learned
1. Understand the capabilities of Spark and Databricks early
● Involve Databricks solutions architects
● Build POCs of different approaches
● Learn Spark
2. Delta tables are great
● Delta + Structured Streaming enable continuous applications
● Transactional guarantees for readers
● Deletion of data for the right to be forgotten (GDPR), as sketched below
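A minimal sketch of such a right-to-be-forgotten deletion on a Databricks Delta table; the table path, predicate, and retention window are assumptions:

```python
# `spark` is the SparkSession provided by the Databricks runtime.

# DELETE rewrites only the data files that contain matching rows.
spark.sql("""
    DELETE FROM delta.`s3://data-lake/customers`  -- hypothetical table path
    WHERE customer_id = 'c-42'                    -- hypothetical id
""")

# VACUUM later removes the superseded files physically,
# once they are past the retention window (7 days here).
spark.sql("VACUUM delta.`s3://data-lake/customers` RETAIN 168 HOURS")
```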
3. Automate the management of infrastructure
● Cluster definitions as JSON files (see the sketch below)
● Easy tooling
● Automated roll-outs of updates
● Disposable infrastructure
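Continuing the earlier assumption of the Databricks REST API, a small roll-out script over a directory of JSON cluster definitions might look like this:

```python
import json
import pathlib
import requests

HOST = "https://<databricks-host>"          # hypothetical workspace
HEADERS = {"Authorization": "Bearer <token>"}

# Every *.json file in the repo is one cluster definition; rolling out an
# update means re-applying the files, which keeps the clusters disposable.
for spec_file in pathlib.Path("clusters").glob("*.json"):
    spec = json.loads(spec_file.read_text())
    if "cluster_id" in spec:
        resp = requests.post(f"{HOST}/api/2.0/clusters/edit", headers=HEADERS, json=spec)
    else:
        resp = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=spec)
    resp.raise_for_status()
```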
4. Externalise responsibilities
● Use inner source to provide schemas, transformations, and layout
● Educate your users
● Build a community
Continuous Applications at the Scale of 100 Teams with Databricks Delta and Structured Streaming
Viacheslav Inozemtsev: viacheslav.inozemtsev@zalando.de
Max Schultze: max.schultze@zalando.de, @mcs1408
www.zalando.com | jobs.zalando.com/tech