Preparation Guide for AWS Certified Solutions Engineer – Associate (Released February 2018, Republished February 2018)
April 21, 2023
In May 2017, British Airways grounded all its flights at London’s busiest airports. An IT failure left 75,000 of the airline’s passengers stranded that day. According to the airline’s investigation, the failure was due to poor resilience and inadequate disaster recovery following a power surge at a UK data center.
The British Airways CEO later spoke out about how this one IT failure cost the company 80 million pounds.
Failures will happen in all businesses.
“Many unknown factors cannot be anticipated in the regular test scenario for application failure. This raises the ultimate question: Is regular testing enough? What if your system is accidentally rebooted? These types of problems are not easily solved by regular testing.”
Suratip Banerjee (Solutions Architect at Principal Global Services), speaking at a webinar on Chaos Engineering and AWS Fault Injection Simulator.
These failures can cause costly outages. During an outage, customers cannot shop, transact business, or get work done. Even a short outage can hurt a company’s bottom line, so many engineering teams now focus on the cost of downtime.
In 2017, 98% of organizations stated that one hour of downtime would cost them almost a million dollars. That is a very high risk.
Companies need to find a solution to this problem. Waiting for the next outage is no way to face the challenges head-on.
Hence, you need Chaos Engineering.
Whizlabs hosted a webinar about Chaos Engineering and AWS FIS on 12 September 2021.
Suratip Banerjee was the featured speaker and explained in detail all aspects of Chaos Engineering, AWS FIS and their benefits.
What is Chaos Engineering?
Chaos engineering is the practice of deliberately stressing an application in a production or test environment.
It involves creating disruptive events such as server outages, API throttling, or added latency. The system’s response to these events is then observed. Finally, we implement improvements and prove or disprove the system’s ability to handle these disruptive events.
These exercises have the added benefit of helping teams build muscle memory for resolving outages. It’s akin to a fire drill. By intentionally breaking things, we uncover unknown issues that could affect our systems or customers.
Instead of letting these events happen at 3 a.m. or on a weekend, we create them in a controlled environment during work hours, when all our engineers and teams are available to address the issue.
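As a rough illustration of the disruptive events described above, the sketch below wraps an ordinary function so that each call can be delayed or made to fail, then observes how the caller copes. It is a minimal sketch, not how AWS FIS works: `with_chaos`, `ThrottledError`, `get_price`, and `get_price_with_fallback` are hypothetical names invented for this example.

```python
import random
import time


class ThrottledError(Exception):
    """Raised by the wrapper to simulate an API throttling response."""


def with_chaos(func, latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap func so each call may be delayed or fail (hypothetical helper)."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # injected latency
        if rng.random() < failure_rate:
            raise ThrottledError("injected throttling")  # injected failure
        return func(*args, **kwargs)

    return wrapper


def get_price(item):
    # Stand-in for a real downstream call (assumption for this sketch).
    return {"book": 12.5}.get(item, 0.0)


def get_price_with_fallback(call, item, default=0.0):
    """The behavior under observation: fall back to a default when throttled."""
    try:
        return call(item)
    except ThrottledError:
        return default
```

Wrapping `get_price` with `failure_rate=1.0` forces every call to fail, which makes it easy to check that the fallback path really works before a real outage tests it for you.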
Chaos Engineering has many benefits
Customer: Service outages are minimal because of the increased availability and durability.
Business: Chaos Engineering can prevent revenue losses and maintenance costs, make engineers happier and more engaged, improve on-call training, and improve the SEV Management Program for the whole company.
Technical: The results of chaos experiments can lead to fewer incidents, a lighter on-call burden, a better understanding of system failure modes and system design, faster detection of SEVs, and fewer repeated SEVs. They also expose blind spots in monitoring, alarms, and observability. This improves recovery time and operational scaling, among other things.
47% of businesses experienced increased availability and 45% experienced a decrease in Mean Time to Resolution (MTTR) after using chaos engineering!
Principles of Chaos Engineering
Understanding its principles is key to practicing Chaos Engineering. Once you understand them, plan a process that follows these principles.
Steady State
Steady state refers to the system’s behavior under normal conditions.
In the beginning, you need to look for measurable results that connect operating metrics and customer experience. The output should be stable and predictable. However, it should vary greatly if there is failure.
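One way to make the steady state concrete is to pick a single business-facing metric and define a tolerance band around its normal value. The sketch below assumes a hypothetical "orders per minute" metric and a 10% tolerance; both are illustrative choices, not values from the webinar.

```python
from statistics import mean


def within_steady_state(samples, baseline, tolerance=0.10):
    """Return True if the mean of samples is within tolerance of baseline.

    samples:   recent measurements of the chosen metric
    baseline:  the value observed during normal operation
    tolerance: allowed relative deviation (0.10 means 10%)
    """
    if not samples:
        raise ValueError("no samples to evaluate")
    return abs(mean(samples) - baseline) <= tolerance * baseline


# Hypothetical readings of orders per minute, against a baseline of 100.
normal_readings = [98, 103, 101]
failure_readings = [40, 35, 50]

print(within_steady_state(normal_readings, baseline=100))   # True
print(within_steady_state(failure_readings, baseline=100))  # False
```

The metric is stable in normal operation and moves sharply during a failure, which is what makes it usable as a pass/fail signal for an experiment.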
Hypothesis
“What if this Load Balancer fails?”
“What if this database stops?”
“What if latency rises by 300ms?”
After brainstorming, sit down with your team and pick the scenario that is most likely to occur or that should be prioritized. The hypothesis should not be complicated. It should target a part of your system that you believe to be resilient.
Design the Experiment
These are the best ways to start the experiment phase:
Start small
Run it as close to production as possible
Reduce the blast radius
Have an emergency stop ready!
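The guidelines above can be sketched in a few lines of code: limit the fault to a small share of hosts (the blast radius), check health after every step, and roll everything back as soon as health breaks (the emergency stop). This is only a sketch under stated assumptions; `pick_targets`, `run_experiment`, and the `inject`, `rollback`, and `is_healthy` callbacks are hypothetical and stand in for whatever tooling you actually use.

```python
import random


def pick_targets(hosts, blast_radius_pct, rng=None):
    """Select a small share of hosts (at least one) to receive the fault."""
    rng = rng or random.Random()
    count = max(1, len(hosts) * blast_radius_pct // 100)
    return rng.sample(hosts, count)


def run_experiment(targets, inject, rollback, is_healthy):
    """Inject a fault one target at a time; abort and roll back if health breaks."""
    touched = []
    for host in targets:
        inject(host)
        touched.append(host)
        if not is_healthy():
            # Emergency stop: undo everything done so far, newest first.
            for h in reversed(touched):
                rollback(h)
            return "aborted"
    # Normal end of the experiment: restore all targets.
    for h in reversed(touched):
        rollback(h)
    return "completed"
```

With 20 hosts and `blast_radius_pct=10`, only two hosts are ever touched, and a failing health check after any injection restores them immediately.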
Verify and learn
This stage is where you analyze the results of the experiment. The following measurements can be used to evaluate it:
Time to detect
Time for notification and escalation
Time to public notification
Time for graceful degradation to begin
Time for self-healing
Time to recovery, partial and full
Time to all clear
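These measurements are all durations from the moment the fault was injected. A minimal sketch of computing them from recorded timestamps follows; the event names and the timestamps are made up for illustration and would come from your own monitoring and incident tooling.

```python
from datetime import datetime


def experiment_timings(events):
    """Return seconds elapsed from fault injection to each recorded milestone."""
    start = events["fault_injected"]
    return {
        name: (timestamp - start).total_seconds()
        for name, timestamp in events.items()
        if name != "fault_injected"
    }


# Illustrative timestamps only, not real experiment data.
events = {
    "fault_injected": datetime(2024, 1, 1, 10, 0, 0),
    "detected": datetime(2024, 1, 1, 10, 0, 45),
    "escalated": datetime(2024, 1, 1, 10, 2, 0),
    "fully_recovered": datetime(2024, 1, 1, 10, 9, 30),
}

timings = experiment_timings(events)
print(timings)  # {'detected': 45.0, 'escalated': 120.0, 'fully_recovered': 570.0}
```

Tracking the same durations across repeated experiments shows whether detection and recovery are actually getting faster.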