
Questions tagged [system-reliability]

2 votes
2 answers
2k views

We have Grafana on-premises with Prometheus. Some anomalies can be detected by viewing a set of charts (slow requests, retries, pending transactions, etc.). SRE operators need to have the opportunity to ...
Vasin Yuriy
5 votes
3 answers
676 views

I am confused by the following terms: Reliability and Fault Tolerance. According to the book Designing Data-Intensive Applications, the definition of Reliability is: The system should continue to work ...
Raghav Goyal
2 votes
2 answers
276 views

I am having difficulty understanding when I should be worried about TOCTOU vulnerabilities and how to avoid them because, yes, we can use database transactions, but there are different levels of ...
Alessandro
3 votes
1 answer
541 views

All, we've just started on our SRE journey and are trying to define SLIs / SLOs for our application. It is an ETL application where 1. feeds (e.g. start-of-day and end-of-day data feeds) come from various ...
Ravi Parekh
1 vote
1 answer
306 views

Source: https://cs.ccsu.edu/~stan/classes/CS410/Notes16/11-ReliabilityEngineering.html This is a self-monitoring architecture. Here, computations are carried out across 2 channels; if they both provide the same ...
cuajiu
0 votes
2 answers
419 views

We all know that when we delete a file, the operating system recycles it but doesn't actually delete it. It just removes it from the directory indexes, and until the data is needed and overwritten, ...
VJZ
-2 votes
1 answer
195 views

I have been developing an app that will require a cron task every minute. We are handling our cron tasks with Spring Boot Scheduling. However, I am a little worried about the following question: ...
Juan
2 votes
0 answers
685 views

Backstory: I am unable to use RDS, as I need to install cartridges in my PostgreSQL instances. I have been trying to pin down an architecture for PostgreSQL running on EC2 instances for a few days. ...
tjwoon
-1 votes
2 answers
4k views

I hear and see this statement in almost every academic book related to software engineering: "Testing can detect the presence of errors and not the absence of errors." But I do not understand it clearly. What ...
Deepam Gupta
3 votes
3 answers
283 views

We have a system that allows our clients to coordinate people (shoppers) so that they can deliver groceries within 45 minutes of order creation. Each client has a set of stores where the ...
Mauricio Rondon
-2 votes
2 answers
158 views

I am working on a product that will not be able to be updated once released. Furthermore, if the product malfunctions, the results may include death, serious bodily harm, or major financial setbacks. ...
Demi
0 votes
0 answers
67 views

I used to work for teams that built software-as-a-service applications. Our requirements regarding production were often the same: a complex service (web application, database, daemons, typically) ...
Diane M
0 votes
1 answer
158 views

In our applications we traditionally log events locally to log files. As our applications are distributed across multiple server instances, searching for particular events is complicated and ...
Tomazz
0 votes
3 answers
1k views

I'm learning about system design for the first time and am really intrigued by reliability. Given a setup where you have a master that replicates and writes data through to a slave, how do you persist/...
jonnyd42
7 votes
3 answers
10k views

In the past, our system had an external data provider (call it the source) sending regular heartbeats to a Java application (call it the client). If the heartbeat failed, the system shut itself down (to ...
senseiwu
