WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Rui Jian, Hao Lin, Facebook Inc. rjian@fb.com, hlin@fb.com Tangram: Distributed Scheduling Framework for Apache Spark at Facebook #UnifiedAnalytics #SparkAISummit
About Us • Rui Jian – Software Engineer at Facebook (Data Warehouse & Graph Indexing) – Master of Computer Science (Shanghai Jiao Tong university) • Hao Lin – Research scientist at Facebook (Data Warehouse Batch Scheduling) – PhD in Parallel Computing (Purdue ECE) 3#UnifiedAnalytics #SparkAISummit
Agenda • Overview • Tangram Architecture • Scheduling Policies & Resource Allocation • Future work 4#UnifiedAnalytics #SparkAISummit
What is Tangram? The scheduling platform for • reliably running various batch workloads • with efficient heterogenous resource management • at scale 5#UnifiedAnalytics #SparkAISummit
Tangram Scheduling Targets • Single jobs: adhoc/periodic • Batch jobs: adhoc/periodic, malleable • Gang jobs: adhoc/periodic, rigid • Long-running jobs: steady and regular; e.g. online training 6#UnifiedAnalytics #SparkAISummit
Why Tangram? • Various workload characteristics – ML – Apache Spark – Apache Giraph – Single jobs • Customized scheduling policies • Scalability – Fleet size: hundreds of thousands worker nodes – Job scheduling throughput: hundreds of millions jobs per day 7#UnifiedAnalytics #SparkAISummit
Overview • What is Tangram? 8#UnifiedAnalytics #SparkAISummit Admin Job Manager DB ML Resource Manager Master Agent AgentAgent Single Job Gang Job ML Elastic Scheduler 1 2 3 4 5 6 SQL query Giraph Spark
Client Library 9#UnifiedAnalytics #SparkAISummit • Job management • Request/Release resources • Resource grant • Preemption notification • Launch containers • Container status change event Tangram client Resource Manager Agent Application 1 2 3 4 5 6
Agent • Report schedulable resources and runtime usage • Health check reports • Detect labels • Launch/Kill Containers • Container recovery • Resource isolation with cgroup v2 10#UnifiedAnalytics #SparkAISummit
Failure Recovery • Agent failure – Scan the recovery directory and recover the running containers • RM failure – Both agent and client hold off communication to the RM until the new master shows up – Client sync session info to the new master to help it build the states – Agents add them to the new master 11#UnifiedAnalytics #SparkAISummit
Scheduling Policies • Hierarchical queue structure • Jobs to be queued on leaves • Queue configs: – min/max resources – Policy: • FIFO • Dominant Resource Fairness (DRF) • User fairness • Global • … 12#UnifiedAnalytics #SparkAISummit / ads feed pipelines interactive Job DRF DRF DRF User FairnessFIFO 20%80% 50% 50% user1 user2 50% 50% FIFO FIFO Job Job Job
Scheduling Policies • Jobs ordered by priority, submission time within queue • Gang job as first class in scheduling and resource allocation • Lookahead scheduling for better throughput and utilization • Job starvation prevention 13#UnifiedAnalytics #SparkAISummit Gang 200 Gang 20 Single Gang 4 Single
Resource Allocation • Fine-grained resource specification: – {cpuMilliCores: 3000, memoryBytes: 200GB} • Constraints: – “dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10” • Job Affinity: – inSameDatacenter 14#UnifiedAnalytics #SparkAISummit
Resource Allocation 15#UnifiedAnalytics #SparkAISummit Prefetched Host Cache • Bypass the steps of host filtering and scoring • Speedup allocation process Host Filtering • Hard & Soft constraints • Resource constraint • Label constraint • Job affinity Host Scoring and Ordering • Packing efficiency • Host healthiness • Data locality Commit Allocation • Book keeping resources • Update cluster & queue parameters
Constraint-based Scheduling • Machine type • Datacenter • Region • CPU architecture • Host prefix • … 16#UnifiedAnalytics #SparkAISummit Merged host pool - type 1 & 2 Job Job Job Host 1 Host 2 Host 3 Host 4 Host 5 Labeled with {”type”:”2”} Labeled with {”type”:”1”} Job Job Job constraint: type=2 Job constraint: type=1 Queue
Preemption • Guarantee resource availability SLO within and across queues • Identify the starving jobs and overallocated jobs • Minimize preemption cost: two-phase protocol – Only candidates appearing in both phases will be preempted – Resource Manager notifies client with preemption intent s.t. necessary action can be taken, e.g. checkpointing 17#UnifiedAnalytics #SparkAISummit
Cross Datacenter Scheduling • The growing demand of computation and storage for Hive tables spans across data centers • Stranded capacity with imbalanced load • Poor data locality and waste of network bandwidth • Slow reaction to recover from crisis and disaster 18#UnifiedAnalytics #SparkAISummit
Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 19#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job
Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 20#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job Job constraint: datacenter=1
Cross Datacenter Scheduling • Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 21#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job Job constraint: datacenter=1 Table DataTable Data
Future Work • Mix workloads managed by one resource manager • Run batch workloads with off-peak resources from online services • Automatic resource tuning for high utilization • We’re hiring! Contact: rjian@fb.com 22#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT

Tangram: Distributed Scheduling Framework for Apache Spark at Facebook

  • 1.
    WIFI SSID:SparkAISummit |Password: UnifiedAnalytics
  • 2.
    Rui Jian, HaoLin, Facebook Inc. rjian@fb.com, hlin@fb.com Tangram: Distributed Scheduling Framework for Apache Spark at Facebook #UnifiedAnalytics #SparkAISummit
  • 3.
    About Us • RuiJian – Software Engineer at Facebook (Data Warehouse & Graph Indexing) – Master of Computer Science (Shanghai Jiao Tong university) • Hao Lin – Research scientist at Facebook (Data Warehouse Batch Scheduling) – PhD in Parallel Computing (Purdue ECE) 3#UnifiedAnalytics #SparkAISummit
  • 4.
    Agenda • Overview • TangramArchitecture • Scheduling Policies & Resource Allocation • Future work 4#UnifiedAnalytics #SparkAISummit
  • 5.
    What is Tangram? Thescheduling platform for • reliably running various batch workloads • with efficient heterogenous resource management • at scale 5#UnifiedAnalytics #SparkAISummit
  • 6.
    Tangram Scheduling Targets •Single jobs: adhoc/periodic • Batch jobs: adhoc/periodic, malleable • Gang jobs: adhoc/periodic, rigid • Long-running jobs: steady and regular; e.g. online training 6#UnifiedAnalytics #SparkAISummit
  • 7.
    Why Tangram? • Variousworkload characteristics – ML – Apache Spark – Apache Giraph – Single jobs • Customized scheduling policies • Scalability – Fleet size: hundreds of thousands worker nodes – Job scheduling throughput: hundreds of millions jobs per day 7#UnifiedAnalytics #SparkAISummit
  • 8.
    Overview • What isTangram? 8#UnifiedAnalytics #SparkAISummit Admin Job Manager DB ML Resource Manager Master Agent AgentAgent Single Job Gang Job ML Elastic Scheduler 1 2 3 4 5 6 SQL query Giraph Spark
  • 9.
    Client Library 9#UnifiedAnalytics #SparkAISummit •Job management • Request/Release resources • Resource grant • Preemption notification • Launch containers • Container status change event Tangram client Resource Manager Agent Application 1 2 3 4 5 6
  • 10.
    Agent • Report schedulableresources and runtime usage • Health check reports • Detect labels • Launch/Kill Containers • Container recovery • Resource isolation with cgroup v2 10#UnifiedAnalytics #SparkAISummit
  • 11.
    Failure Recovery • Agentfailure – Scan the recovery directory and recover the running containers • RM failure – Both agent and client hold off communication to the RM until the new master shows up – Client sync session info to the new master to help it build the states – Agents add them to the new master 11#UnifiedAnalytics #SparkAISummit
  • 12.
    Scheduling Policies • Hierarchicalqueue structure • Jobs to be queued on leaves • Queue configs: – min/max resources – Policy: • FIFO • Dominant Resource Fairness (DRF) • User fairness • Global • … 12#UnifiedAnalytics #SparkAISummit / ads feed pipelines interactive Job DRF DRF DRF User FairnessFIFO 20%80% 50% 50% user1 user2 50% 50% FIFO FIFO Job Job Job
  • 13.
    Scheduling Policies • Jobsordered by priority, submission time within queue • Gang job as first class in scheduling and resource allocation • Lookahead scheduling for better throughput and utilization • Job starvation prevention 13#UnifiedAnalytics #SparkAISummit Gang 200 Gang 20 Single Gang 4 Single
  • 14.
    Resource Allocation • Fine-grainedresource specification: – {cpuMilliCores: 3000, memoryBytes: 200GB} • Constraints: – “dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10” • Job Affinity: – inSameDatacenter 14#UnifiedAnalytics #SparkAISummit
  • 15.
    Resource Allocation 15#UnifiedAnalytics #SparkAISummit Prefetched HostCache • Bypass the steps of host filtering and scoring • Speedup allocation process Host Filtering • Hard & Soft constraints • Resource constraint • Label constraint • Job affinity Host Scoring and Ordering • Packing efficiency • Host healthiness • Data locality Commit Allocation • Book keeping resources • Update cluster & queue parameters
  • 16.
    Constraint-based Scheduling • Machinetype • Datacenter • Region • CPU architecture • Host prefix • … 16#UnifiedAnalytics #SparkAISummit Merged host pool - type 1 & 2 Job Job Job Host 1 Host 2 Host 3 Host 4 Host 5 Labeled with {”type”:”2”} Labeled with {”type”:”1”} Job Job Job constraint: type=2 Job constraint: type=1 Queue
  • 17.
    Preemption • Guarantee resourceavailability SLO within and across queues • Identify the starving jobs and overallocated jobs • Minimize preemption cost: two-phase protocol – Only candidates appearing in both phases will be preempted – Resource Manager notifies client with preemption intent s.t. necessary action can be taken, e.g. checkpointing 17#UnifiedAnalytics #SparkAISummit
  • 18.
    Cross Datacenter Scheduling •The growing demand of computation and storage for Hive tables spans across data centers • Stranded capacity with imbalanced load • Poor data locality and waste of network bandwidth • Slow reaction to recover from crisis and disaster 18#UnifiedAnalytics #SparkAISummit
  • 19.
    Cross Datacenter Scheduling •Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 19#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job
  • 20.
    Cross Datacenter Scheduling •Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 20#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job Job constraint: datacenter=1
  • 21.
    Cross Datacenter Scheduling •Dispatcher Proxy – Monitors resource consumption across data centers – Decides the Resource Manager for scheduling jobs – Provides location hints to the Resource Manager for enforcement • Planner – Decides where the data will be replaced based on utilization and available resources 21#UnifiedAnalytics #SparkAISummit Datacenter 1 Datacenter 2 Datacenter 3 Resource Manager 1 Resource Manager 2 Dispatcher Job Job constraint: datacenter=1 Table DataTable Data
  • 22.
    Future Work • Mixworkloads managed by one resource manager • Run batch workloads with off-peak resources from online services • Automatic resource tuning for high utilization • We’re hiring! Contact: rjian@fb.com 22#UnifiedAnalytics #SparkAISummit
  • 23.
    DON’T FORGET TORATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT