Tangram: Distributed Scheduling Framework for Apache Spark at Facebook

WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

Rui Jian, Hao Lin, Facebook Inc. rjian@fb.com, hlin@fb.com Tangram: Distributed Scheduling Framework for Apache Spark at Facebook #UnifiedAnalytics #SparkAISummit

About Us • Rui Jian – Software Engineer at Facebook (Data Warehouse & Graph Indexing) – Master of Computer Science (Shanghai Jiao Tong university) • Hao Lin – Research scientist at Facebook (Data Warehouse Batch Scheduling) – PhD in Parallel Computing (Purdue ECE) 3#UnifiedAnalytics #SparkAISummit

Agenda • Overview • Tangram Architecture • Scheduling Policies & Resource Allocation • Future work 4#UnifiedAnalytics #SparkAISummit

What is Tangram? The scheduling platform for • reliably running various batch workloads • with efficient heterogenous resource management • at scale 5#UnifiedAnalytics #SparkAISummit

Tangram Scheduling Targets • Single jobs: adhoc/periodic • Batch jobs: adhoc/periodic, malleable • Gang jobs: adhoc/periodic, rigid • Long-running jobs: steady and regular; e.g. online training 6#UnifiedAnalytics #SparkAISummit

Why Tangram? • Various workload characteristics – ML – Apache Spark – Apache Giraph – Single jobs • Customized scheduling policies • Scalability – Fleet size: hundreds of thousands worker nodes – Job scheduling throughput: hundreds of millions jobs per day 7#UnifiedAnalytics #SparkAISummit

Overview • What is Tangram? 8#UnifiedAnalytics #SparkAISummit Admin Job Manager DB ML Resource Manager Master Agent AgentAgent Single Job Gang Job ML Elastic Scheduler 1 2 3 4 5 6 SQL query Giraph Spark

Client Library 9#UnifiedAnalytics #SparkAISummit • Job management • Request/Release resources • Resource grant • Preemption notification • Launch containers • Container status change event Tangram client Resource Manager Agent Application 1 2 3 4 5 6

Agent • Report schedulable resources and runtime usage • Health check reports • Detect labels • Launch/Kill Containers • Container recovery • Resource isolation with cgroup v2 10#UnifiedAnalytics #SparkAISummit

Failure Recovery • Agent failure – Scan the recovery directory and recover the running containers • RM failure – Both agent and client hold off communication to the RM until the new master shows up – Client sync session info to the new master to help it build the states – Agents add them to the new master 11#UnifiedAnalytics #SparkAISummit

Scheduling Policies • Hierarchical queue structure • Jobs to be queued on leaves • Queue configs: – min/max resources – Policy: • FIFO • Dominant Resource Fairness (DRF) • User fairness • Global • … 12#UnifiedAnalytics #SparkAISummit / ads feed pipelines interactive Job DRF DRF DRF User FairnessFIFO 20%80% 50% 50% user1 user2 50% 50% FIFO FIFO Job Job Job

Scheduling Policies • Jobs ordered by priority, submission time within queue • Gang job as first class in scheduling and resource allocation • Lookahead scheduling for better throughput and utilization • Job starvation prevention 13#UnifiedAnalytics #SparkAISummit Gang 200 Gang 20 Single Gang 4 Single

Resource Allocation • Fine-grained resource specification: – {cpuMilliCores: 3000, memoryBytes: 200GB} • Constraints: – “dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10” • Job Affinity: – inSameDatacenter 14#UnifiedAnalytics #SparkAISummit

Resource Allocation 15#UnifiedAnalytics #SparkAISummit Prefetched Host Cache • Bypass the steps of host filtering and scoring • Speedup allocation process Host Filtering • Hard & Soft constraints • Resource constraint • Label constraint • Job affinity Host Scoring and Ordering • Packing efficiency • Host healthiness • Data locality Commit Allocation • Book keeping resources • Update cluster & queue parameters

Constraint-based Scheduling • Machine type • Datacenter • Region • CPU architecture • Host prefix • … 16#UnifiedAnalytics #SparkAISummit Merged host pool - type 1 & 2 Job Job Job Host 1 Host 2 Host 3 Host 4 Host 5 Labeled with {”type”:”2”} Labeled with {”type”:”1”} Job Job Job constraint: type=2 Job constraint: type=1 Queue

Preemption • Guarantee resource availability SLO within and across queues • Identify the starving jobs and overallocated jobs • Minimize preemption cost: two-phase protocol – Only candidates appearing in both phases will be preempted – Resource Manager notifies client with preemption intent s.t. necessary action can be taken, e.g. checkpointing 17#UnifiedAnalytics #SparkAISummit

Cross Datacenter Scheduling • The growing demand of computation and storage for Hive tables spans across data centers • Stranded capacity with imbalanced load • Poor data locality and waste of network bandwidth • Slow reaction to recover from crisis and disaster 18#UnifiedAnalytics #SparkAISummit

Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 19#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job

Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 20#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job Job constraint: datacenter=1

Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 21#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job Job constraint: datacenter=1 Table DataTable Data

Future Work • Mix workloads managed by one resource manager • Run batch workloads with off-peak resources from online services • Automatic resource tuning for high utilization • We’re hiring! Contact: rjian@fb.com 22#UnifiedAnalytics #SparkAISummit

DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT

Tangram: Distributed Scheduling Framework for Apache Spark at Facebook

In this document

More Related Content

What's hot

Similar to Tangram: Distributed Scheduling Framework for Apache Spark at Facebook

More from Databricks

Recently uploaded

Tangram: Distributed Scheduling Framework for Apache Spark at Facebook