Julia Furst Morgado, global technologist at Veeam, discusses Kubernetes edge resilience after a ransomware attack. The challenges discussed include resource limits, network issues, and security risks. A swift recovery underscored the need for specific backup approaches, write-protected storage, and automated, tested recovery for edge environments to limit disruptions.
Key Takeaways
- For Kubernetes at the edge, reliable operation necessitates local resources due to inconsistent network connectivity, making traditional centralized approaches less effective for backup and recovery.
- Automated backup and restore processes are vital for swiftly recovering from disruptions in edge environments, minimizing downtime for end-users, as demonstrated by rapid recovery capabilities after incidents.
- Using immutable backup storage is a critical defense against tampering or encryption during ransomware attacks, helping ensure clean recovery points.
- A robust backup strategy should include multiple storage locations (on-site and off-site) and adhere to established guidelines like the 3-2-1-1-0 rule to bolster data safety and recovery success.
- Implementing Zero-Trust security principles, including limiting user access permissions, is essential for lessening the attack surface and preventing unauthorized access in distributed edge deployments.
Transcript
Olimpiu Pop: Hello, everybody. I'm Olimpiu Pop, an InfoQ editor. Today, we have Julia, who spoke about bringing light into chaos at KubeCon. We were curious to hear more about it. So, Julia, please introduce yourself.
Julia Morgado: Yes, sure. Thank you for having me, it's a pleasure. As you said, my name is Julia, and I currently work as a global technologist on the product strategy team at Veeam. And my background is non-traditional; I worked in law. I'm originally from Brazil, worked in law, then business, and then I found technology, and I transitioned into tech. Nowadays, I work a lot on helping explain tech and break down complex concepts, talk more about the business side of tech, and focus a lot on Cloud Native technologies and data protection strategies.
So, I work with customers and partners to design solutions for resiliency, disaster recovery, and Cloud Native infrastructures. I'm also a CNCF ambassador, an AWS container hero, and an ambassador for other programs. I organise events in New York City, so the KCD, Kubernetes Community Day, the AWS Community Day and a monthly meetup that is part of the CNCF. So I'm very involved with the Cloud Native community. And fun fact, I speak four languages, so I speak English, obviously, Portuguese, French and Spanish.
Olimpiu Pop: That's impressive. Well, today we'll keep it to English, even though I'll be happy to exercise the four words I know in Spanish and Portuguese, but let's keep it to English.
Okay. So what was surprising this year at KubeCon was that this was my fourth European KubeCon, and for me, what was unexpected was that it was the first time the discussions were more high-level. So we heard in the keynotes, people talking about how the platform and infrastructure teams can influence the user experience. And I think that would be the optic that I will try to use today in our discussion. You mentioned chaos, and obviously if your infrastructure goes down, that would be chaos among other stuff, and everybody will just run left and right. Let's start by describing again what was the scenario because you mentioned edge and you mentioned shopping experience, if I remember correctly. Because edge is so broad, let's narrow it down and understand the devices on the edge to better understand how that is framed.
How mini Kubernetes clusters on the edge can help avoid chaos [03:14]
Julia Morgado: Yes, so the talk I gave was called Chaos to Control: Kubernetes Resiliency at the Edge. It was recorded, so the recording will probably be available soon. The scenario we shared at KubeCon was from a global retail customer with over 500 edge locations. Each store is running Kubernetes locally, essentially mini clusters at the edge. And you can think of edge computing as having mini data centres in each store instead of sending everything to one large central location. And each store was processing everything right there: restocking alerts when inventory was running low, personalised promotions based on customers' data, all the operations, so point of sale, everything.
But with edge come many challenges; it's not like running in the cloud or on-prem. There are constraints like resource constraints. Devices running Kubernetes at the edge are usually lightweight, so they have less CPU, RAM and storage. There are network issues because you're not always connected, so the stores can lose connectivity at any time. So everything has to work even without the internet. There are a lot of security risks, which is what we focused on in the talk, because the more edge devices you have, the more doors you leave open for attackers, and that was the case.
So the retail customer was hit by a ransomware attack. It started with a phishing email, which is what we usually see. Someone clicks on an email, and then the attacker gets access to the whole system and encrypts it, and many things happen. And we went over that a little bit: everything spread through the clusters, and when that happens, traditional backup solutions and the traditional things that we like to say keep your system resilient don't work for edge locations. Because usually, like I mentioned, you rely on a centralised infrastructure, and with edge locations, you don't have that centralised failover like with traditional data centres. Also, you are in a disconnected environment, so with those traditional solutions you can't have a fast, local restore. So it's very complicated and you need a backup solution specialized for that. So we went over backup solutions, yeah-
Olimpiu Pop: Let me stop you a bit to consolidate what you had. You're very enthusiastic about this. So that means that everything that you have as a system looks like a spider web. Each of the shops is its own, let's put it like that, box that has everything that theoretically needs to run. And at points, it probably connects to the centre to get more data or to get updates.
Can you share how the information kept flowing? At points, I might imagine that you need to get information from the main repository or something like that. Can you share a bit about how that works?
Julia Morgado: Yes, so it's connected to the cloud and a local data center, but the connectivity is intermittent. It only connects to get the updates, but it's not always connected. That's one of the issues with edge locations.
Olimpiu Pop: Okay, well, that's mainly the definition of it, but there are more aspects to this. And you touched on very interesting aspects, because beyond the traditional concerns, where you only have to worry about connectivity, getting the information right and so on and so forth, you are also probably dealing with personnel that have less digital literacy. I would expect that people working in a supermarket are not exactly digitally literate, so that's probably why the door is open to other stuff.
Okay. So, for me, it's impressive to have the whole system restarted in about 10 minutes. How far along was the system affected? Was it isolated to one shop, or were a couple of shops affected?
Julia Morgado: No, a couple, so several. I don't know the exact number to be precise, but several shops were affected and it was spreading fast, but then we were able to contain it. It's a huge process, and the talk at KubeCon was focused on the recovery part: the backup and recovery and how to build a more resilient environment. However, disaster response is another aspect of what happened. When ransomware hits, what do you have to do? Do you have an incident response plan? So several shops were impacted, but I don't know the exact number.
Olimpiu Pop: That's not that important. How did you handle it? I mean, how do you prepare for such a thing? Because I'm sure that these things were prepared in advance, the system was thought to be recoverable in a decent way. So, how should I start doing that if I'm at this point?
Julia Morgado: So actually, we were brought in after the incident happened. They were using other things before. What we did was build a more resilient environment for them, and for that we used some open source tools that are part of the CNCF, and some non-open source tools as well. We focused on backup and recovery solutions, and also S3-compatible object storage for that. And if I can mention names: they started using Kasten by Veeam, which is a data management platform designed specifically for Kubernetes, and edge locations as well, for backup and restore and disaster recovery tasks. And Kanister, which is a project from the CNCF, is an open-source framework that allows you to define custom data management tasks through blueprints. So you just write YAML files that know how to handle the application, like if you need to flush a database before a backup, or if you need to run custom logic during a restore. And they work with several databases, several applications, and that helps with that resiliency.
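The blueprint idea described above (application-aware steps such as flushing a database before a backup, or running custom logic during a restore) can be illustrated with a minimal Python sketch. This is not the real blueprint schema or API of the tool mentioned; the names `BLUEPRINT`, `run_action`, and the phase functions are hypothetical, and the phases only record what a real implementation would do.

```python
from typing import Callable

# A phase is one step of an action; it receives a shared context dict.
Phase = Callable[[dict], None]


def flush_database(ctx: dict) -> None:
    # Placeholder for an app-specific quiesce step (e.g. flushing writes to disk).
    ctx["log"].append(f"flush {ctx['app']}")


def copy_data(ctx: dict) -> None:
    # Placeholder for exporting the data to backup storage.
    ctx["log"].append(f"copy {ctx['app']} -> {ctx['target']}")


def restore_data(ctx: dict) -> None:
    # Placeholder for pulling the data back from backup storage.
    ctx["log"].append(f"restore {ctx['app']} <- {ctx['target']}")


def rebuild_indexes(ctx: dict) -> None:
    # Placeholder for custom logic that runs after the data is restored.
    ctx["log"].append(f"reindex {ctx['app']}")


# Hypothetical blueprint: each named action is an ordered list of phases.
BLUEPRINT: dict[str, list[Phase]] = {
    "backup": [flush_database, copy_data],
    "restore": [restore_data, rebuild_indexes],
}


def run_action(blueprint: dict[str, list[Phase]], action: str, ctx: dict) -> list[str]:
    """Run every phase of the named action in order and return the log."""
    for phase in blueprint[action]:
        phase(ctx)
    return ctx["log"]
```

The point of the pattern is that the ordering lives in the blueprint rather than in an operator's head, so the same steps run the same way in every store.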
How to prepare for service disruption [10:12]
Olimpiu Pop: Okay. YAML is quite a known language among the Cloud Native Computing Foundation and its users. So what should I have? How should I think about it from an overview? Should I think about frequency? Should I think about particular tables that I want to target, or how would that work?
Julia Morgado: Yes, you should think about what are the most important applications that you have to back up. Ideally, you should back up everything, but you also need to think about whether they are stateful or stateless. You need to back up stateful applications. This is a common controversy in the Cloud Native space. We are seeing stateful applications on Kubernetes and they need to be backed up. And frequency is something important, but you also have incremental backups, that's something that you can do. So instead of backing up everything every time, which can be slow and resource heavy, you only back up what's changed since the last snapshot. So this speeds up the backup window, reduces the storage needs, and it's perfect for bandwidth-constrained edge locations. So this is something that they implemented as well with Kasten by Veeam.
Olimpiu Pop: Okay. So let's assume that we have a solution that does that for us, it's the time shuttle that moves us left and right, depending on what we need. How will it happen? I mean, as you mentioned, it's recording only snapshots. If, I don't know, I lose the whole database, that means that I will have to apply multiple snapshots. I suppose that one is done automatically anyway.
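To make the incremental idea concrete, here is a minimal Python sketch that detects what changed since the last snapshot by comparing content hashes. It is only an illustration under simple assumptions (files on a local directory, hypothetical function names) and says nothing about how any particular backup product implements change tracking.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its content digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_since(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """List files that are new or modified compared with the previous snapshot."""
    return [name for name, digest in current.items() if previous.get(name) != digest]
```

Only the files returned by `changed_since` would need to be uploaded, which is what keeps the backup window and bandwidth use small. The sketch deliberately ignores deleted files; a real tool would have to track those too.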
Julia Morgado: No. So first you do a full image backup, so it's not just snapshots. You have the entire application level backup, and then with the incremental backup, you do only snapshots. But you have a whole backup of your infrastructure, your application, and then you can recover from that. So you have these two types, and the incremental one is only the changes, so you don't have to do a whole image level backup every week, let's say. You can just do the snapshots, if something changes, you have it there. But when you need to do a restore, you still have the full image level backup that you can restore from. So you have several options, yes.
Olimpiu Pop: Okay. So that means that we need to take the things separately. We have stateful systems, and then based on those, we're just ensuring that we have the data. The stateless systems, while they're doing computation and probably they're relying on stateful data, so they're just pulling data out of that and that's mainly it. What else? What else should we bear in mind?
Julia Morgado: So, like I mentioned, the storage. Where are you going to send that data to? In the past, we would usually see people use file systems, et cetera, but now we're seeing a lot of object storage, so you send that to the cloud. And it's great because it's scalable, so it grows with your data automatically. It's off-site, so you can keep the backups safe in case one of the locations is hit. Usually you store it in the cloud, so all these big providers have their own object storage that is very powerful. You can make it immutable, so it protects against ransomware attacks, or even accidental deletions, and those happen a lot. Like you said, people are not very digitally literate, what's the expression that you mentioned?
Olimpiu Pop: Yes, something like that.
Julia Morgado: Yes. There are a lot of accidental deletions, but if you make that data immutable, no one, not even the admin, can change it, encrypt it in case of a malicious actor, or delete it. And it's very cost-effective because you pay as you go.
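The immutability guarantee described here can be sketched as a toy write-once store in Python. This is an in-memory illustration only, with a hypothetical `ImmutableStore` class and a simple retention period; object storage services expose similar behaviour through their own object-lock features, whose APIs are not shown here.

```python
from datetime import datetime, timedelta


class ImmutableStore:
    """Toy write-once store: no overwrite or delete until retention expires."""

    def __init__(self, retention: timedelta) -> None:
        self._retention = retention
        self._objects = {}  # key -> (data, locked_until)

    def _locked(self, key: str, now: datetime) -> bool:
        return key in self._objects and now < self._objects[key][1]

    def put(self, key: str, data: bytes, now: datetime) -> None:
        # Refuse to overwrite (for example, with encrypted data) while locked.
        if self._locked(key, now):
            raise PermissionError(f"{key} is locked; overwrite refused")
        self._objects[key] = (data, now + self._retention)

    def delete(self, key: str, now: datetime) -> None:
        # Refuse deletion while locked, regardless of who is asking.
        if self._locked(key, now):
            raise PermissionError(f"{key} is locked; delete refused")
        self._objects.pop(key, None)

    def get(self, key: str) -> bytes:
        return self._objects[key][0]
```

Note that the store has no "admin override" path at all: the check applies to every caller, which is the property that keeps a recovery point clean even if privileged credentials are compromised.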
Use zero-trust and immutable data to reduce disruption’s blast radius [14:25]
Olimpiu Pop: Let me see if I can translate these things for myself. So what you're saying is there are two different things. One of them, whenever we are building the system, we should use, as much as possible, immutable data, because that allows us to be safe and not alter something that's already existing. And then to put another degree of isolation, like probably zero trust, where you're just making sure that always you have to check it and then that will limit the blast radius.
Julia Morgado: Yes. And part of the zero trust, I would say also, only people that need access, or you should limit the amount of permissions that people have to perform a backup, to be able to manage the infrastructure. Because the more permissions people have, the more accidents can happen. And this is part also of zero trust, etc. You can do that with RBAC, and it's a very famous principle in the Cloud Native space.
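Limiting permissions in the way described above is usually called least privilege. A minimal Python sketch of a deny-by-default, role-based check follows; the role names, verbs, and resources are hypothetical and are not taken from Kubernetes RBAC objects or from any product.

```python
# Hypothetical roles: each role grants an explicit set of (verb, resource) pairs.
ROLES = {
    "backup-operator": {("create", "backups"), ("get", "backups"), ("list", "backups")},
    "restore-operator": {("get", "backups"), ("create", "restores")},
    "viewer": {("get", "backups"), ("list", "backups")},
}


def is_allowed(user_roles: list[str], verb: str, resource: str) -> bool:
    """Allow an action only if one of the user's roles grants it explicitly."""
    return any((verb, resource) in ROLES.get(role, set()) for role in user_roles)
```

Anything not listed is refused, including actions by unknown roles, so nobody in this sketch can delete a backup because no role grants `("delete", "backups")`.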
Olimpiu Pop: Because we discussed that, I was happily surprised by the Cloud Native Computing Foundation's keynotes and the level of the talks. Everybody's talking about the user experience. How did it feel on the user side? I mean, in 10 minutes, they can have a coffee or two if they are really nervous, but theoretically, things should have happened quite fast.
Julia Morgado: Yes. You mean the recovery?
Olimpiu Pop: Yes, the recovery. So now, obviously, during the backup phase, users are not affected, they don't know that's happening. It's behind the scenes, it's happening. But obviously, at the point when you have a problem with the system, you'll probably just have wrong data if the system is not fully affected. Or if, as you mentioned, the system was encrypted, most probably the guys had more problems, they couldn't work, they couldn't use it. So if I were a cashier, or an operator in the warehouse, what would be my experience during a failure, or the ideal experience of recovering from one?
Julia Morgado: So like you said, ideally, the user, not the backup admins or the technical people, the users that are just dealing at each location, they don't really see any disruption. That's the perfect recovery scenario. And it means your edge location, your environment, is pretty resilient and has a good backup and recovery plan. But for that, you need some things. So I didn't even touch on that, but everything is automated. So not only is the recovery fast, but it's all automated.
So when something happens, it already triggers the restore and all the policies are enforced. Also, everything should be tested. So if there is a disaster scenario and you tested it before, you're going to see that you can recover in less than 10 minutes, and then the user won't even have to deal with any downtime. That's the goal. Because even five minutes of your system not working, like you said, cashiers at supermarkets, if the system is not working, it means a lot of money lost in sales, et cetera. So if you have everything automated and tested prior to anything happening, your user won't face any downtime. They will have a smooth experience.
Olimpiu Pop: Okay. So that means that, more than just preparing the system, we need to prepare the users. So probably doing something like drills, different type of scenarios where you're just trying to cover as much as possible. Okay, that's a good idea.
Julia Morgado: But when you say users, users is mostly, I mean, technical people, the infrastructure engineers, everyone that's handling those systems. The user, like we said, the cashier, there isn't much that they can do when something like that happens. They can go and open a ticket or something, but it's not in their hands to recover the system. And that's a big problem at edge locations, because they don't have, usually, IT personnel in each store. So it usually takes even more time. But if you have something that's automated and tested and has, for example, GitOps integrated, then it's going to be much faster. And the cashier or other users, they won't even experience anything, they won't feel any disruption.
Olimpiu Pop: So that means that each store individually, or each edge location individually, had its backup mechanism locally.
Julia Morgado: Yes. So each cluster can handle its own backup and restore without needing a central control plane.
Olimpiu Pop: Okay. And that means that these backups are stored in an isolated location or they are in an air-gapped environment, given that you mentioned that-
Julia Morgado: Object storage, yes.
Olimpiu Pop: Good. And do you have any kind of off-site storage?
Julia Morgado: Yes. So at Veeam, we talk a lot about the 3-2-1-1-0 backup rule. I don't know if you've heard of it. But usually you need three different copies of the data on two different types of media. One of them should be off-site, in another location, just in case something happens with your production copy. And one of these copies should be offline, air-gapped or immutable, which is what I meant. No one can tamper with it, encrypt it, or delete it. And then the 0 at the end is for zero errors. So you should always be testing it and making sure that there are no errors after the backup recoverability verification. So it's an easy rule to remember, 3-2-1-1-0. And if you have that, you are probably in a great place, you have a very resilient system.
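The 3-2-1-1-0 rule as stated above translates almost directly into a checklist. The Python sketch below is one possible reading of it; the `BackupCopy` fields are hypothetical, and it assumes the production copy counts as one of the three copies and that every listed copy has a verification result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str                      # e.g. "disk", "object-storage", "tape"
    offsite: bool
    immutable_or_offline: bool
    verified_errors: Optional[int]  # None means the copy was never verified


def check_3_2_1_1_0(copies: list[BackupCopy]) -> list[str]:
    """Return the rule violations found; an empty list means the rule holds."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no off-site copy")
    if not any(c.immutable_or_offline for c in copies):
        problems.append("no offline, air-gapped or immutable copy")
    if any(c.verified_errors is None or c.verified_errors > 0 for c in copies):
        problems.append("unverified copies or verification errors")
    return problems
```

Running a check like this on a schedule turns the rule from something to remember into something that fails loudly when a location drifts out of compliance.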
Exercise for outages [20:41]
Olimpiu Pop: Okay. That's good to know. And how often should you have drills? Is it related to the size of the data? What will be another choreography rule?
Julia Morgado: Yes, so it depends on the RTO and RPO. If it's some type of data that is extremely critical, you should have drills often. Everyone, all personnel, should know: in case anything happens, who's going to be the point of contact, who will deal with restore operations? You should have all of that documented, because when things happen, people get, how can I explain? Like it's chaos, like the title of the talk, and no one knows what to do. There is also a lot of blaming, "Oh, they should have done that. They didn't do that". So when you have everything documented, it's easier to say, "Yes, let's do one, two, and three. Person A is going to do one, person B is supposed to do two". If someone is not on call, or they are traveling or on vacation, you know who to contact if needed. So this is all part of the incident response plan as well, and it should be tested regularly.
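RTO (recovery time objective) is how long a restore may take, and RPO (recovery point objective) is how much recent data you can afford to lose. A drill produces the numbers to compare against both. The Python sketch below is a simplified illustration with hypothetical names; it assumes the worst-case data loss equals the interval between backups.

```python
from datetime import timedelta


def meets_objectives(
    backup_interval: timedelta,
    measured_restore: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> dict[str, bool]:
    """Compare drill results with the objectives.

    Assumes worst-case data loss is one full backup interval, and that
    measured_restore is the restore time observed during a drill.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": measured_restore <= rto,
    }
```

For example, hourly backups with a 30-minute RPO would fail the first check even if a drill restores in 8 minutes against a 10-minute RTO, which suggests backing up more often rather than restoring faster.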
Olimpiu Pop: Okay, that makes sense. So that's very enlightening, thank you for that. But during KubeCon, there were a lot of things that were spoken. Any takeaways that you would like to implement moving forward for other potential scenarios?
Julia Morgado: Yes. So, still talking a little about resiliency, since it's the topic of the podcast and also my talk. Mostly, resiliency is not optional anymore. So everyone's talking about GitOps, observability, AI, ML, but at the end of the day, if your cluster crashes and your data isn't safe, then none of that matters. And there are a lot of threats like ransomware, and especially at edge locations, which are so fragile, resiliency must be built in.
And so also security and backup are being seen as part of the same ecosystem now and not separate concerns. Security, I'm sure you saw that at KubeCon, is always a big thing. So this time and at the last one, North America, I saw that as well, it's always a core focus. And now they're integrating backup into that, which I think is very important, data is the livelihood of any business. But like you mentioned, I saw that they emphasized the importance of integrating security in each layer of the Cloud Native stack, with particular attention to zero trust principles. And also, they talked a lot about supply chain, which I think is also a very important topic, so security in the supply chain. I also saw, not that related to resiliency anymore, observability was also a big thing at KubeCon. So ensuring and having a strong emphasis on observability to meet compliance requirements, such as those outlined in the EU Cyber Resilience Act; they mentioned that at one of the keynotes, if I'm not mistaken.
Olimpiu Pop: Yes. It was Friday.
Julia Morgado: Yes, exactly. So there was a lot going on at KubeCon. I'm still digesting everything.
Olimpiu Pop: So to conclude, what you're saying is that, now, things are happening in real time, more or less, so observability is the mechanism through which, if we implement it, we'll always know what happens and know exactly where we are. And then more than everything else, that is important, so high availability data is even more than that, because it means that you have servers available, but if you don't have the proper data, your system is not working as expected. And last but not least, even though the things change lately, the Cyber Resilience Act will just ask you to be more aware of what you're doing on the supply chain side. So those things have to be boiled down into the business continuity plan of your company as well.
Okay. That was an interesting deep dive.
Julia Morgado: Yes, thank you for having me. It was a pleasure.
Olimpiu Pop: Same here.
Julia Morgado: I love talking about Cloud Native things, KubeCon experiences. I always encourage and recommend everyone to attend at least one KubeCon if they can. I know not everyone is able to, but for those that don't know, CNCF also provides scholarships for KubeCons. I think you have to apply at least two months in advance, but you should check out their website. So there are scholarships, and usually, if you've been laid off, you can also get a free ticket to attend a KubeCon. So there are a lot of opportunities.
And just for those that are more interested in Cloud Native technologies in general, I would say just start, get involved. Go on the CNCF Slack channel, there are several channels for each of the open source projects or there are like first-time contributors and you can just say, "Hey, I'm new here, but I want to get more involved". There are so many ways people can learn and get involved in that community, and I think it's a very welcoming community. I don't know about you, but I always love going to KubeCons and meeting people and learning more about technology.
Olimpiu Pop: Yes, as mentioned and discussed with several people, I just went there without any plans and I always met somebody, an old friend or a new friend, and that was quite nice. So yes, Kubernetes is more than just a technology, it is a community and everything else that's in its ecosystem.
Okay. Obrigado, Julia.