Timeline for Using Fargate (serverless compute) to replace leadership election

Current License: CC BY-SA 4.0

14 events

when	what		by	license	comment
Jan 26, 2024 at 18:01	comment	added	JimmyJames		@simonalexander2005 To elaborate, if we assume 24/7 operation and no scheduled downtime, that SLA means the system can be down around 20 minutes during an entire year. I would expect that your SLA would be evaluated over a shorter period e.g., a week. That means less than a minute of downtime. The question is: how fast can Fargate detect a failure and start a new instance of this system?
Jan 26, 2024 at 15:57	comment	added	JimmyJames		@simonalexander2005 That would seem to be the challenge with this approach i.e., it presupposes some downtime in order to work, unless I am missing something. I think you will be hard-pressed to guarantee that SLA with this approach.
Jan 26, 2024 at 8:48	comment	added	simonalexander2005		@JimmyJames for the PoC at the moment, not much - but eventually 99.996%
Jan 26, 2024 at 3:08	answer	added	RibaldEddie		timeline score: 2
Jan 25, 2024 at 19:29	comment	added	JimmyJames		What are your uptime requirements?
Jan 25, 2024 at 18:41	answer	added	Jon Raynor		timeline score: 3
Jan 25, 2024 at 17:07	comment	added	simonalexander2005		@JonRaynor leadership election means that only one instance is active at a time - the other instances are running as a standby, so that if the leader instance stops for some reason another instance can pick it up. Therefore effectively only one instance is doing the work
Jan 25, 2024 at 16:22	comment	added	Jon Raynor		Can you explain why you mention not running multiple instances in the first paragraph, but you are currently running multiple instances with the leadership election in the 3rd paragraph. This seems to be a contradiction.
Jan 25, 2024 at 14:09	comment	added	Philip Kendall		@simonalexander2005 because distributed computing is hard, and particularly hard over an unreliable network. And all networks are unreliable. AWS can hide a lot of the complexity, but underneath it all you can't magically solve the Two Generals Problem.
Jan 25, 2024 at 14:07	comment	added	PMah		During deployments, depending on configuration, Fargate might start up the new task and wait for it to be healthy before stopping the old one. You can configure it to stop the old one first though, if that's what's needed. However, you'd also need to make sure your container health checks are rock solid; I've seen (admittedly rare) situations where a task lost network connectivity due to some issue with the underlying host, but because the health check didn't check for that, the task was still considered healthy, and it wasn't replaced.
Jan 25, 2024 at 13:47	review	Close votes		Jan 30, 2024 at 3:08
Jan 25, 2024 at 13:45	comment	added	simonalexander2005		Well, yeah, if they give you the option to choose the max number of running containers to be 1, why wouldn't you assume that?
Jan 25, 2024 at 13:26	comment	added	Philip Kendall		Do you regard "blindly trusting that Fargate will never actually run 2 copies of a container when the instance count is set to 1" as "missing something"? For avoidance of doubt, I have no particular knowledge of what Fargate does here, but it's possible it's been engineered to behave differently under some circumstances than you seem to be assuming.
Jan 25, 2024 at 13:05	history	asked	simonalexander2005	CC BY-SA 4.0
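
The downtime figures in JimmyJames's Jan 26 comment follow from simple arithmetic on the 99.996% target. A minimal sketch of that calculation, assuming 24/7 operation with no scheduled downtime and a 365-day year; the function and constant names are illustrative and do not come from the thread:

```python
# Downtime budget implied by an availability target over a given window.
# Assumes 24/7 operation and no scheduled downtime, as in the comment.

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes (non-leap year)
MINUTES_PER_WEEK = 7 * 24 * 60     # 10,080 minutes


def downtime_budget_minutes(availability_percent: float, window_minutes: float) -> float:
    """Return the maximum downtime, in minutes, allowed within the window."""
    return window_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    target = 99.996  # the eventual uptime target mentioned in the thread
    yearly = downtime_budget_minutes(target, MINUTES_PER_YEAR)
    weekly = downtime_budget_minutes(target, MINUTES_PER_WEEK)
    print(f"Per year: {yearly:.1f} minutes")
    print(f"Per week: {weekly * 60:.1f} seconds")
```

Under these assumptions the yearly budget is about 21 minutes and the weekly budget is under half a minute, so failure detection plus replacement-task startup would both have to fit inside that weekly window if the SLA is evaluated per week.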