Container-based Microservices DevOps in AWS
How Perfecto Did It and What We Learned So Far
1/10/2018
© 2018, Perfecto Mobile Ltd. All Rights Reserved.

About Perfecto

How We Started
• We started 11 years ago, developing monolithic servers in our own data centers (DCs)
• We were moving slowly…

But Then We Heard Some Buzzwords
• And we decided we wanted to move faster and have more impact on the company’s business

Big Change
• Waterfall → Agile
• Monolith servers → Microservices
• Deployment in DC → Cloud
• Dependencies → Autonomous teams

The 3 Components of the Change
• Technology
• Methodology
• Culture

Autonomous Teams
• DevOps, Dev, QA
• Development, continuous integration, continuous deployment, monitoring, budget control

Technologies We Use (partial list…)

Why We Chose ECS for Container Orchestration
• We were new to the containers world, but we understood that container orchestration is a key decision
• We looked at ECS, Kubernetes, Swarm and other alternatives
• ECS seemed best in terms of time to value

Our First Microservice
• Deployed in ECS
• ELB + ECS tasks = ECS service
• EC2 instances are managed in an Auto Scaling Group
• Service discovery using Route53
• One task per EC2 instance (ELB static port limitation)

Decisions We Made (1)
• Deploying in a single availability zone
  • One of those decisions you regret: the overhead of changing it grows with time

Decisions We Made (2)
• Single VPC for all teams
  • Seems natural – it’s the network, right?
• Pros
  • Less work for teams
  • Simpler to move services between teams
• Cons
  • Dependency between teams. Who owns the VPC?
  • Simpler to take shortcuts (e.g. use VPN to DC)
  • Budget control is more difficult: no option for an account per team (need to tag all services)

Decisions We Made (3)
• ECS cluster per… what?
• Options
  • One cluster to rule them all
  • Cluster per service (group of microservices)
  • Cluster per team
• We let our teams decide between the last two options
  • No dependencies between teams
  • Better budget control
  • Reduces the blast radius of ECS cluster issues (more on that soon)

Infrastructure as Code Is the Only Way to Go
• We (try to) do everything (except for very small and initial POCs) with CloudFormation
• Every time you make a change in the UI, CLI or API without CloudFormation – think again
• CloudFormation templates are stored in Git
• CloudFormation is invoked by Jenkins
• We maintain shared CloudFormation templates used by all teams to create ECS clusters, services and more

Working with CloudFormation
• There is a learning curve
• Templates can become long and unreadable
  • Split into sub-templates
  • Consider generating templates
• CloudFormation behavior can be surprising, but it is consistent
  • Practice in production-like environments (dev/staging)
• Using the UI is dangerous
  • Automate all CloudFormation invocations
  • Read-only access to the UI
  • Protect your stacks using stack policies

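As an illustration of the "consider generating templates" and "protect your stacks using stack policies" points, here is a minimal Python sketch that emits a CloudFormation template and a stack policy as JSON. It is not Perfecto's shared template: the logical ID `EcsCluster`, the cluster name and both function names are hypothetical, chosen only for this example.

```python
import json


def ecs_cluster_template(cluster_name: str) -> dict:
    """Build a minimal CloudFormation template containing one ECS cluster."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"ECS cluster {cluster_name} (generated)",
        "Resources": {
            # "EcsCluster" is a hypothetical logical ID used in this sketch.
            "EcsCluster": {
                "Type": "AWS::ECS::Cluster",
                "Properties": {"ClusterName": cluster_name},
            }
        },
        # Ref on an AWS::ECS::Cluster resolves to the cluster name.
        "Outputs": {"ClusterName": {"Value": {"Ref": "EcsCluster"}}},
    }


def protective_stack_policy(logical_id: str) -> dict:
    """Allow stack updates in general, but deny replacing or deleting one resource."""
    return {
        "Statement": [
            {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"},
            {
                "Effect": "Deny",
                "Action": ["Update:Replace", "Update:Delete"],
                "Principal": "*",
                "Resource": f"LogicalResourceId/{logical_id}",
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(ecs_cluster_template("team-a-cluster"), indent=2))
    print(json.dumps(protective_stack_policy("EcsCluster"), indent=2))
```

Generating the JSON from code keeps each team's template short, and the output can still be committed to Git and deployed by the CI job like any hand-written template.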
Moving to ALBs
• ALB – Application Load Balancer, can (should) replace ELB
• Why
  • Cost – 1 ALB can replace X ELBs; less expensive for clusters with a large number of services
  • Dynamic port management – allows deploying multiple services on one EC2 instance, more flexibility
  • ELB is (kind of) legacy – e.g. not supported in Fargate
  • Routes requests to backend containers based on request path rules
• Challenge with ALBs – no URL rewrite in rules. If you have no control over the request path in the deployed services, you will need a reverse proxy.

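The two ALB features mentioned above, dynamic ports and path-based routing, can be sketched as CloudFormation fragments built in Python. This is only an illustration under assumptions: the service name, image, memory value and the logical IDs passed to `path_rule` are hypothetical, and the task is assumed to use bridge networking so that a host port of 0 lets ECS pick a free port.

```python
def container_definition(name: str, image: str, container_port: int) -> dict:
    """Container definition whose host port is assigned dynamically by ECS."""
    return {
        "Name": name,
        "Image": image,
        "Memory": 256,  # arbitrary value for the sketch
        # HostPort 0 asks ECS to choose a free host port, so several tasks
        # can share one EC2 instance behind the same ALB target group.
        "PortMappings": [{"ContainerPort": container_port, "HostPort": 0}],
    }


def path_rule(listener_id: str, target_group_id: str, path: str, priority: int) -> dict:
    """ALB listener rule that forwards requests matching a path pattern."""
    return {
        "Type": "AWS::ElasticLoadBalancingV2::ListenerRule",
        "Properties": {
            "ListenerArn": {"Ref": listener_id},
            "Priority": priority,
            "Conditions": [{"Field": "path-pattern", "Values": [path]}],
            "Actions": [
                {"Type": "forward", "TargetGroupArn": {"Ref": target_group_id}}
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical service: requests under /orders/* go to the orders target group.
    print(container_definition("orders", "example/orders:1.0", 8080))
    print(path_rule("AlbListener", "OrdersTargetGroup", "/orders/*", 10))
```

Note that the rule only selects a target by path; it forwards the request path unchanged, which is why a service that cannot be configured with a path prefix needs a reverse proxy in front of it.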
ELB vs ALB

What About Logs?
• We’re using CloudWatch Logs
  • Not perfect, but very simple to integrate with anything in AWS
• Container logs
  • Standard container logs can be sent to CloudWatch – that is easy, supported natively in Docker
  • To take application log files to CloudWatch, we’re using a “satellite container” (AKA sidecar) per task – https://github.com/moshebs/docker-awslogs
• ELB/ALB access logs
  • Sent to S3, natively supported by ELB/ALB
  • CloudWatch event from S3 → Lambda that parses the logs and pushes them to CloudWatch Logs

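The parsing step of such a Lambda might look roughly like the sketch below. It is not the code used at Perfecto: the function names are hypothetical, only the leading fields of the ALB access log format are extracted, and the sample line is made up for illustration. The S3 download and the CloudWatch Logs upload are left out.

```python
import json
import re
from datetime import datetime, timezone

# Leading fields of an ALB access log entry, in order.
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
]

# A token is either a double-quoted string or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def parse_alb_line(line: str) -> dict:
    """Split one access log line into named fields (extra fields are ignored)."""
    tokens = [tok.strip('"') for tok in TOKEN.findall(line)]
    return dict(zip(FIELDS, tokens))


def to_log_event(record: dict) -> dict:
    """Shape a parsed record as a CloudWatch Logs event (timestamp in ms, message)."""
    when = datetime.strptime(record["time"], "%Y-%m-%dT%H:%M:%S.%fZ")
    millis = int(when.replace(tzinfo=timezone.utc).timestamp() * 1000)
    return {"timestamp": millis, "message": json.dumps(record)}


if __name__ == "__main__":
    sample = (
        'http 2018-01-10T12:00:00.186641Z app/my-alb/50dc6c495c0c9188 '
        '192.168.131.39:2817 10.0.0.1:32768 0.000 0.001 0.000 200 200 34 366 '
        '"GET http://www.example.com:80/orders/1 HTTP/1.1" "curl/7.46.0" - -'
    )
    print(to_log_event(parse_alb_line(sample)))
```

Keeping the parser free of AWS calls makes it easy to unit test; the Lambda handler would only need to read the object named in the S3 event and pass each line through these two functions.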
Logs

Monitoring with Prometheus

Dashboards with Grafana

Monitoring in Perfecto
• Each team owns their own monitoring system
  • Deployment
  • Maintenance
  • Building dashboards
  • Getting alerts, usually in Slack
• Deployed using CloudFormation
  • All teams use the same templates
  • Configuration using sidecar containers

What We Monitor
• EC2 instance metrics – by deploying Prometheus node_exporter on the EC2 instances
• Application metrics
  • If metrics are shared between the microservice nodes – scrape through the LB
  • Otherwise – scrape each microservice task (how do you find them? Next slide…)
• 3rd party (e.g. Mongo, RabbitMQ, Redis) – standard open-source exporters
• CloudWatch metrics – using cloudwatch_exporter (but be careful, pulling metrics from CloudWatch is expensive!)

Monitoring Architecture

Scraping ECS Tasks
• The challenge:
  • Prometheus needs to know where each task runs, and what port to use for scraping
  • But Prometheus supports filtering EC2 instances by tags only
  • ECS decides which task goes where
• The solution – a container that dynamically tags EC2 instances according to the ECS tasks running on them

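The core of such a tagging container can be sketched as a pure function that turns a list of running tasks into the tags each EC2 instance should carry. The tag naming scheme (`prom_<service>` holding the host port), the `RunningTask` type and both function names are assumptions made for this example, not the scheme used at Perfecto. The sketch also assumes at most one task per service on each instance. A real container would read the task list from the ECS API and write the tags through the EC2 API; those calls are omitted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class RunningTask(NamedTuple):
    instance_id: str  # EC2 instance hosting the task
    service: str      # service name, used in the tag key (assumed scheme)
    host_port: int    # host port Prometheus should scrape


def desired_tags(tasks: Iterable[RunningTask], prefix: str = "prom_") -> Dict[str, Dict[str, str]]:
    """Map each instance ID to the scrape tags it should carry."""
    tags: Dict[str, Dict[str, str]] = defaultdict(dict)
    for task in tasks:
        tags[task.instance_id][prefix + task.service] = str(task.host_port)
    return dict(tags)


def stale_tag_keys(current: Dict[str, str], desired: Dict[str, str], prefix: str = "prom_") -> Set[str]:
    """Scrape tags present on an instance that no running task justifies anymore."""
    return {key for key in current if key.startswith(prefix) and key not in desired}


if __name__ == "__main__":
    tasks = [
        RunningTask("i-aaa", "orders", 32768),
        RunningTask("i-aaa", "users", 32769),
        RunningTask("i-bbb", "orders", 32770),
    ]
    print(desired_tags(tasks))
```

With tags like these in place, Prometheus EC2 service discovery exposes each tag as a `__meta_ec2_tag_<key>` label, which relabeling rules can use to select instances and set the scrape port.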
Scraping ECS Tasks

ECS Biggest Challenge
• The integration between ECS and Auto Scaling Group is not perfect
• ASG changes ignore ECS tasks
• Let’s look at 2 examples

Auto Scaling Group Downscale
[Diagram: VM1 to VM7]

Upgrade of ECS-Optimized AMI
[Diagram: Auto Scaling Group with VM1 to VM5]

Simple Workaround
• You can control when the EC2 instance sends the “I’m ready” signal to CloudFormation (in fact you must send it in the userdata)
• Add a sleep, to allow the ECS task to start
• Helps with the AMI upgrade scenario only
• Upgrades are slower, but a bit safer
• In practice – this really helped us

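A sketch of the workaround, written as a Python function that renders the instance user data: the script joins the ECS cluster, sleeps, and only then signals CloudFormation. The function name, all argument values and the 120-second delay are hypothetical. The sketch assumes an Amazon Linux based ECS-optimized AMI where `aws-cfn-bootstrap` can be installed with yum, and an Auto Scaling Group resource whose update policy waits for resource signals.

```python
def render_user_data(cluster: str, stack: str, resource: str,
                     region: str, warmup_seconds: int) -> str:
    """Render a user-data script that delays the CloudFormation ready signal."""
    lines = [
        "#!/bin/bash",
        # Register this instance with the ECS cluster.
        f"echo ECS_CLUSTER={cluster} >> /etc/ecs/ecs.config",
        # cfn-signal ships with aws-cfn-bootstrap (assumed not preinstalled).
        "yum install -y aws-cfn-bootstrap",
        # Give ECS time to place and start tasks on the new instance.
        f"sleep {warmup_seconds}",
        # Only now tell CloudFormation that the instance is ready.
        f"/opt/aws/bin/cfn-signal -e 0 --stack {stack} "
        f"--resource {resource} --region {region}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_user_data("team-a-cluster", "team-a-stack",
                           "EcsAutoScalingGroup", "us-east-1", 120))
```

Because CloudFormation waits for the signal before it replaces the next instance during a rolling update, the sleep gives tasks a chance to start on the new instance first. A fixed delay is a guess, though, which is why this is slower and only "a bit" safer.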
Better Solution
• Auto Scaling Group has lifecycle hooks
• We can add a hook to prevent VM shutdown until the task in the new VM is ready
[Diagram: Auto Scaling Group (VM1 to VM5) → shutdown hook → SNS → deregister VM2 from ECS → wait for task → complete lifecycle action]

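The flow in the diagram could be handled by a Lambda subscribed to the SNS topic, roughly as sketched below. This is an assumption-heavy illustration rather than the implementation from the talk: the function names, the single `service` argument and the polling limits are hypothetical. The `ecs` and `autoscaling` arguments are expected to be boto3 clients, passed in so the logic can be tested with stand-ins. Pagination of container instances, the Lambda timeout, and the short delay before ECS updates its task counts are all ignored.

```python
import json
import time


def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance ARN for an EC2 instance, or None."""
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return None
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for instance in described["containerInstances"]:
        if instance["ec2InstanceId"] == ec2_instance_id:
            return instance["containerInstanceArn"]
    return None


def service_is_steady(ecs, cluster, service):
    """True when the service runs at least as many tasks as it wants."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    return svc["runningCount"] >= svc["desiredCount"]


def handle_termination(event, ecs, autoscaling, cluster, service,
                       poll_seconds=10, max_polls=30, sleep=time.sleep):
    """Deregister the terminating instance, wait for tasks, then release the hook."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return "ignored"  # e.g. the test notification sent when the hook is created

    arn = find_container_instance(ecs, cluster, message["EC2InstanceId"])
    if arn:
        ecs.deregister_container_instance(cluster=cluster, containerInstance=arn, force=True)

    # Wait (bounded) until replacement tasks are running elsewhere.
    for _ in range(max_polls):
        if service_is_steady(ecs, cluster, service):
            break
        sleep(poll_seconds)

    autoscaling.complete_lifecycle_action(
        LifecycleHookName=message["LifecycleHookName"],
        AutoScalingGroupName=message["AutoScalingGroupName"],
        LifecycleActionToken=message["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=message["EC2InstanceId"],
    )
    return "completed"
```

Passing the clients in as parameters keeps the handler free of global state; in a Lambda entry point they would be created once with `boto3.client("ecs")` and `boto3.client("autoscaling")`.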
But the Truth Is…
• We don’t want to manage VMs at all
• We just want to deploy containers over CPU and memory
• Enter Fargate – serverless containers
• We plan to try it soon, but we’re still missing
  • Storage attachment
  • Availability outside of us-east-1

Thank You!
moshe_benshoham
mosheb@perfectomobile.com

Editor's Notes

  • #12 One VPC for all
    Here we will show a diagram of a set of microservices:
      • Deployed in ECS
      • ECS cluster running on top of ASG (we started in a single AZ – don’t do that!)
      • Using ELB (one task per EC2 instance)
      • Service discovery using DNS in Route53
    Network level:
      • One VPC for all
      • Each cluster in a separate subnet
      • Using security groups to control access between machines
      • Expose service to the Internet – using CheckPoint