From the course: Site Reliability Engineering Essential Training
Core tenets of SRE
What are the core tenets of SRE? First, operations is a software problem. That is a loaded statement: it means we want to use software to solve operations issues. Second, observability, a foundational capability required by Site Reliability Engineering. This is one of my favorite topics as well. Third, incident management. Outages are bound to happen; you cannot avoid them. But SRE insists on managing incidents in a way that reduces the mean time to repair. Fourth, release management. We already established that the majority of outages happen because of releases, so SRE insists that releases are managed in a better way. Finally, reliable architecture. You need to make sure applications are architected in such a way that they withstand failures.
Now let's dive into the details of these tenets.
First, operations is a software problem. As an SRE, you should automate everything you can. It is important that manual tasks are removed through thorough automation.
Infrastructure as code. With the advent of cloud, we are more or less forced to use infrastructure as code, with Terraform, Ansible, Chef, and similar tools. This is critical in Site Reliability Engineering. You do not want to go to the cloud console and change this or create that by hand. You should automate those operations using infrastructure as code.
Finally, software projects should be given preference over repetitive operational tasks. This is one key difference between the way SRE operates and the way a traditional operations team operates. You actually need to have software development projects if you are running SRE.
Observability. It is a foundational capability of Site Reliability Engineering, and it also happens to be one of my favorite topics.
You need to make sure end-to-end monitoring is deployed for your applications. By end-to-end monitoring, I mean all the way from the end-user experience to the backend database servers. It is not always possible, but you should try your best to monitor end to end.
Logs, metrics, and traces. In the observability world, you'll hear these terms a lot. They are the basic telemetry signals that you should be collecting, transporting, and indexing into a central location for better observability.
Blackbox and whitebox monitoring. Blackbox monitoring means monitoring your applications from the outside using synthetic transactions, test probes, and so on. Whitebox monitoring, on the other hand, refers to collecting telemetry signals from the application or system itself. You need both for complete observability.
SLI-first monitoring. I know SLI is a new term I'm introducing here; we will dive into the details later in this course. SLI stands for Service Level Indicator: for example, the error rate that users see in the application from the client side. You need to monitor the SLIs regardless of what else you are trying to monitor. Your observability strategy should be SLI-first.
Finally, page on actionable alerts. Observability is no good if you do not have a good alerting system integrated into your observability platform, and you should be paging only for actionable alerts. Nobody wants to be woken up at 2 a.m. for something they cannot do anything about, or something they do not have to do anything about. So make sure the alerts you configure are actually actionable.
Next, incident management. It goes without saying that things are bound to fail at some point. When incidents happen, the SRE approach is to manage them in such a way that you reduce the mean time to repair.
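To make "mean time to repair" concrete, here is a minimal sketch of how it could be computed from incident records. The timestamps are made-up sample data, and `mean_time_to_repair` is a hypothetical helper, not something taken from the course; it assumes repair time is measured from detection to resolution.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 2, 45)),      # 45 minutes
    (datetime(2024, 2, 11, 14, 30), datetime(2024, 2, 11, 16, 0)),  # 90 minutes
    (datetime(2024, 3, 2, 9, 15), datetime(2024, 3, 2, 9, 30)),     # 15 minutes
]


def mean_time_to_repair(records):
    """Average time from detection to resolution across incidents."""
    if not records:
        raise ValueError("need at least one incident")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in records), timedelta())
    return total / len(records)


print(mean_time_to_repair(incidents))  # prints 0:50:00
```

Tracking this number over time is one way to see whether changes to on-call, alerting, or rollback practices are actually shortening outages.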
A big portion of incident management is your emergency response: how well you are prepared for emergencies and how well you respond when they occur.
Being on-call. Nobody likes to be on-call, including myself. I hated it when I was on-call, but that's part of our job. As an SRE, you will be on-call. There is no way around it, but there are many ways to make the job easier, and I have a separate section on making on-call safer and easier.
Finally, blameless postmortems. A critical portion of incident management is doing postmortems when incidents occur. With postmortems, you learn what went wrong and, more importantly, what you can do to prevent it in the future.
Release management. As an SRE, you need to balance velocity and reliability. What do I mean by velocity? How often you release. And why is reliability in this equation? Again, we already established that changes can cause outages. You need to make sure you are at the right velocity. You don't want to be releasing every five minutes. At the same time, you don't want to push releases out to once a quarter. That is also bad. So a big portion of SRE is making sure there is a balance between velocity and reliability. The goal, of course, is to increase velocity while maintaining reliability.
Canary release. Whenever we talk about releases in the SRE world, we absolutely need to talk about canary releases. A canary release means exposing the new version to a small slice of traffic first, as a smoke test if you will, before you release the software to everyone. As part of your release management, make sure you have a good plan for canary releases.
Progressive rollout. This is in alignment with the canary release. Progressive rollout refers to releasing in batches. You do not want to release all at once. What happens if there is a bug in your release? You will end up affecting the entire user base of your application.
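As a rough illustration of releasing in batches, the sketch below widens exposure stage by stage and backs out as soon as a stage looks unhealthy. The stage fractions, the `progressive_rollout` function, and the `is_healthy` callback are all assumptions made for this example; a real pipeline would shift traffic, wait, and check its SLIs at each step.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    is_healthy: Callable[[float], bool],
) -> float:
    """Widen exposure stage by stage.

    Returns the fraction of users left on the new release: the last stage
    if every health check passes, or 0.0 if any stage fails (rolled back).
    """
    for fraction in stages:
        # Hypothetical hook: a real system would shift traffic to `fraction`
        # of users here, wait, and then evaluate its health signals.
        if not is_healthy(fraction):
            return 0.0  # roll back: nobody stays on the new release
    return stages[-1] if stages else 0.0


# Assumed batch sizes: a 1% canary, then 10%, 50%, and everyone.
STAGES = [0.01, 0.10, 0.50, 1.00]

print(progressive_rollout(STAGES, lambda fraction: True))            # prints 1.0
print(progressive_rollout(STAGES, lambda fraction: fraction < 0.5))  # prints 0.0
```

In the second call the release fails its check at the 50% stage, so at most half of the users ever saw the bug, rather than the entire user base.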
Progressive rollout refers to a technique where you release in batches, preferably of increasing size. We will learn more about progressive rollouts when we learn about change management.
Safe rollback. You have to have a plan to roll back your changes if you need to. In my experience, there have been many instances where we needed to back out a change but simply did not know how to, or did not have a clear way of rolling back safely. This is a problem. Whenever I oversee a change request, for example, I insist that the requester document a clear rollback plan. It is critical, and as a matter of fact, you should be testing your rollback plans as often as you can.
Reliable architecture. What do I mean by this? What are the tenets associated with reliable architecture?
First, load balancing. If there is one technology that you need to master for reliable architecture, make it load balancing. It shows up across the board when it comes to setting up reliable architectures. Load balancing refers to a technique in which incoming requests are distributed evenly across a pool of backend servers using a load balancer.
Autoscaling. Autoscaling refers to a technique in which you automatically increase or decrease the amount of computing resources serving an application. This is critical especially when the user load is unpredictable. We have all heard stories where, on the day after Thanksgiving or during the Christmas break, user volume to websites spikes like crazy and actually brings down systems and sites. That often happens because autoscaling is not fully configured and implemented.
Handling failures. We will dive into the details later, but handling failures is critical, and your architecture should be able to recover automatically from a number of failures.
Reliability in the cloud.
Now the good news with the public cloud is that there is a lot of reliability built in. For example, the hardware is maintained by the service provider, whether it's AWS, GCP, or Azure. So at least some portion of reliability is the responsibility of the public cloud service provider, and in my experience, it works pretty well. With that said, you still need to focus on the architecture of your application. For example, when we talk about handling failures, there is some help from the public cloud service providers, but you need to make sure your application is architected with techniques like the circuit breaker pattern, which we will see in detail later. Otherwise, even if you run your application in Azure, AWS, or GCP, you are still bound to have reliability issues.
Disaster recovery. Again, it goes without saying that disaster can happen anytime. You need to make sure you have the right high-availability setup, and if you are in the public cloud, you need to make sure you are deployed in at least two different zones or, better, regions, so that you can recover from a disaster as quickly as possible.
Those are the core tenets of SRE. I hope you are starting to get the foundations of SRE here. In the next section, let's look at the benefits of SRE.