Timeline for Using Fargate (serverless compute) to replace leadership election
Current License: CC BY-SA 4.0
14 events
| when | what | | by | license | comment |
|---|---|---|---|---|---|
| Jan 26, 2024 at 18:01 | comment | added | JimmyJames | @simonalexander2005 To elaborate, if we assume 24/7 operation and no scheduled downtime, that SLA means the system can be down around 20 minutes during an entire year. I would expect that your SLA would be evaluated over a shorter period e.g., a week. That means less than a minute of downtime. The question is: how fast can Fargate detect a failure and start a new instance of this system? | |
| Jan 26, 2024 at 15:57 | comment | added | JimmyJames | @simonalexander2005 That would seem to be the challenge with this approach i.e., it presupposes some downtime in order to work, unless I am missing something. I think you will be hard-pressed to guarantee that SLA with this approach. | |
| Jan 26, 2024 at 8:48 | comment | added | simonalexander2005 | @JimmyJames for the PoC at the moment, not much - but eventually 99.996% | |
| Jan 26, 2024 at 3:08 | answer | added | RibaldEddie | timeline score: 2 | |
| Jan 25, 2024 at 19:29 | comment | added | JimmyJames | What are your uptime requirements? | |
| Jan 25, 2024 at 18:41 | answer | added | Jon Raynor | timeline score: 3 | |
| Jan 25, 2024 at 17:07 | comment | added | simonalexander2005 | @JonRaynor leadership election means that only one instance is active at a time - the other instances are running as a standby, so that if the leader instance stops for some reason another instance can pick it up. Therefore effectively only one instance is doing the work | |
| Jan 25, 2024 at 16:22 | comment | added | Jon Raynor | Can you explain why you mention not running multiple instances in the first paragraph, while the 3rd paragraph says you are currently running multiple instances with leadership election? This seems to be a contradiction. | |
| Jan 25, 2024 at 14:09 | comment | added | Philip Kendall | @simonalexander2005 because distributed computing is hard, and particularly hard over an unreliable network. And all networks are unreliable. AWS can hide a lot of the complexity, but underneath it all you can't magically solve the Two Generals Problem. | |
| Jan 25, 2024 at 14:07 | comment | added | PMah | During deployments, depending on configuration, Fargate might start up the new task and wait for it to be healthy before stopping the old one. You can configure it to stop the old one first though, if that's what's needed. However, you'd also need to make sure your container health checks are rock solid; I've seen (admittedly rare) situations where a task lost network connectivity due to some issue with the underlying host, but because the health check didn't check for that, the task was still considered healthy, and it wasn't replaced. | |
| Jan 25, 2024 at 13:47 | review | Close votes | | | Jan 30, 2024 at 3:08 |
| Jan 25, 2024 at 13:45 | comment | added | simonalexander2005 | Well, yeah, if they give you the option to choose the max number of running containers to be 1, why wouldn't you assume that? | |
| Jan 25, 2024 at 13:26 | comment | added | Philip Kendall | Do you regard "blindly trusting that Fargate will never actually run 2 copies of a container when the instance count is set to 1" as "missing something"? For the avoidance of doubt, I have no particular knowledge of what Fargate does here, but it's possible it's been engineered to behave differently under some circumstances than you seem to be assuming. | |
| Jan 25, 2024 at 13:05 | history | asked | simonalexander2005 | CC BY-SA 4.0 |
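
The downtime figures in the Jan 26 comments follow directly from the 99.996% target. A minimal sketch of that arithmetic, assuming 24/7 operation, no scheduled downtime, and a 365-day year (the function name and the evaluation windows are illustrative, not taken from the question):

```python
def downtime_budget_seconds(availability_pct: float, window_seconds: float) -> float:
    """Return the downtime allowed within a window at a given availability target."""
    return window_seconds * (1 - availability_pct / 100)


# Assumed evaluation windows: a 365-day year and a 7-day week.
YEAR_SECONDS = 365 * 24 * 3600
WEEK_SECONDS = 7 * 24 * 3600

yearly = downtime_budget_seconds(99.996, YEAR_SECONDS)
weekly = downtime_budget_seconds(99.996, WEEK_SECONDS)

print(f"per year: {yearly / 60:.1f} minutes")  # roughly 21 minutes
print(f"per week: {weekly:.1f} seconds")       # roughly 24 seconds
```

If the SLA is evaluated weekly, failure detection plus the start of a replacement task would have to fit inside that budget of roughly 24 seconds, which is the concern raised in the comments.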