From the course: Site Reliability Engineering Essential Training

Course summary

Welcome to the course summary. You have learned a lot, so let's summarize the key points in this lesson. First of all, congratulations. Well done. This was not an easy class, I know, but you went through it and you have come this far.

Now let's cover the basics. SRE leverages software engineering to improve and streamline production operations. The focus is on observability, incident management, release management, and reliable architecture. SRE aims to reduce toil by minimizing repetitive, manual tasks. SRE is essentially a practical implementation of DevOps with a strong emphasis on reliability. While platform engineering prioritizes developer productivity and experience, SRE centers on application availability and performance. These are the basics of Site Reliability Engineering.

Observability. As you know, observability is one of the foundational capabilities of SRE. It involves generating, collecting, and centralizing telemetry data. Key telemetry signals include logs, metrics, traces, and synthetics. SREs should focus on telemetry signals that correspond to end-user experience. The four golden signals are latency, error rate, throughput, and saturation. If you can monitor only four signals, these are the four you should be monitoring. Finally, telemetry data must be retained for a reasonable duration: three months is recommended for logs and metrics, and a week or two is enough for traces.

Next: SLIs, SLOs, and SLAs. An SLI, or Service Level Indicator, is a measurement of a specific aspect of system performance, for example, error rate. An SLO, or Service Level Objective, is a target for an SLI that defines the desired reliability of a system. An SLA, or Service Level Agreement, is a formal contract with stakeholders that outlines the agreed-upon service reliability and the penalties for unmet targets.

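To make the difference between an SLI and an SLO concrete, here is a minimal Python sketch. It is not from the course: the `RequestCounts` type, the function names, the request numbers, and the 0.1% target are all hypothetical, chosen only to show a measured error rate (the SLI) being compared with a target (the SLO).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Hypothetical request totals for one service over a measurement window."""
    total: int
    failed: int


def error_rate_sli(counts: RequestCounts) -> float:
    """SLI: the measured fraction of requests that failed in the window."""
    if counts.total == 0:
        return 0.0  # assumption: a window with no traffic counts as no errors
    return counts.failed / counts.total


def meets_slo(sli: float, slo_max_error_rate: float) -> bool:
    """SLO: the target that the measured SLI is compared against."""
    return sli <= slo_max_error_rate


# Illustrative numbers only: 150 failures out of 200,000 requests,
# checked against an assumed SLO of at most 0.1% errors.
window = RequestCounts(total=200_000, failed=150)
sli = error_rate_sli(window)
print(f"error rate SLI: {sli:.4%}")
print("SLO met:", meets_slo(sli, slo_max_error_rate=0.001))
```

An SLA would sit on top of this as a contractual commitment, typically with a looser target than the internal SLO and with penalties attached; that part is a business agreement rather than code.
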
Now, in my experience, SLIs, SLOs, and SLAs are not widely practiced yet. If you're serious about SRE, I highly encourage you to pick at least one service, set up at least one SLO for it, and start monitoring it. One important point about SLIs, SLOs, and SLAs: you have to combine them with other business metrics, such as Customer Satisfaction Score (CSAT) or Net Promoter Score (NPS), for a holistic view of service performance.

Next, incident management. SRE should foster a safe on-call environment, encourage open communication, and prevent burnout. Significant incidents should be reviewed through blameless postmortems. We looked at several blameless postmortem templates; please make use of them. You must monitor patterns among incidents and develop engineering solutions to address recurring issues. It's important to do that in incident management.

Change management. Implement automated CI/CD pipelines for consistent and reliable deployments. There should be no manual deployments. Use canary releases and progressive rollouts to minimize risk during updates. Finally, ensure fast and safe rollback procedures to quickly recover from issues. Change management is critical. As I alluded to earlier, the majority of incidents and outages happen because of changes. By following these SRE best practices, you can keep that risk under control.

Reliable architecture. Implement effective load balancing to ensure high availability. Load balancing is key in SRE. If there is one technology that you need to master when it comes to SRE architecture, it's load balancing. Configure auto-scaling to handle traffic surges. Apply graceful degradation and circuit breaker techniques to maintain stability during issues. And finally, use advanced load-balancing algorithms such as slow-start to optimize performance. Remember, both fault-tolerant infrastructure architecture and fault-tolerant application architecture are critical for reliable architecture.

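The circuit breaker and graceful degradation ideas mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not the course's implementation or any library's API: the `CircuitBreaker` class, its parameters (`failure_threshold`, `reset_timeout`, `clock`), and the `flaky_backend` function are all hypothetical names. After a number of consecutive failures the breaker "opens" and serves a fallback without calling the dependency; once the timeout has passed, it lets one trial call through.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal circuit breaker with a fallback for degraded mode."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: degrade gracefully and skip the failing dependency.
                return fallback()
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky_backend() -> str:
    """Stand-in for a dependency that is currently failing."""
    raise ConnectionError("backend down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for _ in range(3):
    # The third call never reaches flaky_backend: the circuit is already open.
    print(breaker.call(flaky_backend, fallback=lambda: "cached response"))
```

The fallback is where graceful degradation happens: the caller still gets a usable, if reduced, answer while the failing dependency is given time to recover.
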
That, I believe, summarizes what we have learned in a concise manner. Again, SRE is a very big field, but these points will remind you of what is important and what you need to focus on. In the next lesson, let's look at next steps: what can you do to grow further in your SRE journey?