From the course: Google Cloud Professional Cloud Architect Cert Prep (2025)

Demo: Building a Rust deduplication finder

- Let's talk a little bit about data engineering. In my opinion, data engineering is the classic systems programming problem, and Rust is a systems programming language. For a lot of my career I've built command line tools that perform some kind of data operation, like moving thousands of files somewhere, or using a petabyte-scale file server to build movies, for example when I worked on "Avatar," or Sony movies, or Disney movies. What's so exciting about something like Rust is that you can build things that are multi-threaded, efficient, and low-memory, and also build portable, responsive technology. That's really the key thing you get with Rust versus a scripting language: the ability to build high-performance tools. Now the question is, is it too difficult to do? What I'm going to show you is how to build a deduplication tool. You can replace the deduplication step with whatever else you want, for example archiving or transforming data, but this is really a classic data engineering problem, and I'm going to show you how to do it step by step with Rust. All right, let's go and get started. Here I have a repo called noahgift/rdedupe. If we take a look at this, in my opinion it's a pretty good example of a systems programming tool for data engineering, and if you wanted to build some kind of MLOps tool, you could use a similar pattern. Let's walk through what I would consider the core competencies necessary. First up, I like to use a dev container. You can see I've got this configured so that if I wanted to share this project with someone else, they could just spin it up and do some kind of testing inside of the container. Next up, I also configured GitHub Actions. If we take a look at this, I have a lint workflow right here, and this shows you that you can just do a make lint, and that will lint the code.
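As a rough sketch of that setup, a minimal Makefile for a Rust project like this might look as follows. The target names here are assumptions based on what's shown in the demo; check the actual repo for the real ones:

```makefile
# Hypothetical Makefile sketch for a Rust CLI project
# (target names assumed from the demo, not copied from the repo).
format:
	cargo fmt --quiet

lint:
	cargo clippy --quiet

test:
	cargo test --quiet

build-release:
	cargo build --release

all: format lint test build-release
```

Each target wraps a standard Cargo subcommand, so CI workflows and local development can share the exact same entry points.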
Every time I make a change, I can also do a release, and this is something that's really cool: I can build a high-performance binary and share it with the world. I can also format my code, and check to make sure that my code is formatted properly. And then finally I can run some tests that make sure I'm not introducing business logic problems. Inside of my project, look, I put links to all of this. So this is really the style I would recommend for building binaries that are high performance, where you deliver the binary to other people so they can just download it. This is a huge advantage over Python, where you can't easily distribute standalone binaries and have to give people explicit instructions on how to install software. Next up, let's take a look at this diagram here. Rust, as I mentioned, is a systems programming language for data engineering. One of the key things that's very, very different from Python is that if you had, for example, a petabyte file server, which is very common, or you used Amazon EFS, one of the things you care about is the ability to use memory efficiently and have high-performance code. Rust can be on the order of 70 times faster than Python, and it also has extremely low memory usage because threads share memory when you spawn them. So if I'm running something on my 20-core Mac, I want to use all of the cores; I don't want to use one core and just be thrashing back and forth. I also want some kind of action with a progress bar, I want to distribute my portable binary, and I want it to be very fast and efficient so that I can build high-performance data engineering tools. That's really the goal here. So let's shift gears and take a look at an environment I've got set up locally, where we can walk through the code.
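To make the shared-memory point concrete, here's a std-only sketch of fanning work out across threads (the real tool uses a thread-pool crate, as we'll see; this function and its name are just for illustration). The threads borrow slices of the same buffer, so using every core doesn't copy the data:

```rust
use std::thread;

// Sum a shared slice in parallel: each scoped thread borrows its own
// chunk of the SAME underlying memory, so no copies are made.
// (Illustrative sketch; the demo's tool uses a thread-pool crate.)
fn parallel_sum(data: &[u64], n_threads: usize) -> u64 {
    let chunk = (data.len() / n_threads.max(1)).max(1);
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently...
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect();
        // ...then join and combine the partial sums.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

`std::thread::scope` (Rust 1.63+) is what lets the threads borrow `data` without an `Arc` or a copy, which is exactly the low-memory property being described.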
So first step, inside of this project I have a Makefile. I always like to have Makefiles, and if we go through here and just say make format, I'm constantly reformatting while I work, and I'll do make lint as well. I constantly want to make sure that linting passes, and that's a huge advantage of this ecosystem. The other big one is make build-release: it builds an optimized binary that I can distribute to other people, set up to be a very high-performance tool. Let's go ahead and do that. Let's say make build-release. There we go, it made a high-performance binary. Now how do I use this? Easy. All we have to do is go over to the target directory, navigate to release, and the executable, rdedupe, is right there. Then I can just play around with it, so let's try it out. We'll say help. Ah, look at this, it gives me a menu right here. What's cool about this is that I can bring over a shell and transpose it right here, and we can take a look at this thing in action. So if I run htop, for example, and look at all the cores, you can see things are relatively calm right now, and I can navigate back and forth between the shell environment and what's happening. So I'm going to go ahead and run rdedupe, and we'll say dedupe, and this will look through some path that I set. I'll go ahead and say path, and we'll look at my Documents directory. If I wanted to, I could even put in some kind of a pattern to search for movie files or whatever. But the main takeaway is that I run it, and we've got this really high-performance tool that shows me how many files there are, gives me a nice little progress bar, and then goes through and does a bunch of checksums.
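The session described above boils down to a few commands. The exact flag names are assumptions reconstructed from the narration; run the binary with --help to see the real interface:

```shell
# Build the optimized binary, then run it
# (flag names assumed from the demo; confirm with --help).
make build-release
./target/release/rdedupe --help
./target/release/rdedupe dedupe --path ~/Documents --pattern .mp4
```

Because the release binary is self-contained, the last two commands are all a user who downloaded it would ever need.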
Now if I go back to my environment here, look at this: you can see it's using all of the cores, and we can even see that it's extremely memory efficient at the same time. And there we go. Now if I pipe the output to wc -l, we can even count how many duplicate files I had, but again, the key takeaway is that I'm able to use all of the cores in an extremely efficient way, with extremely low memory, because of how powerful and efficient threads are versus processes. You can see in this particular directory there are 2,500 files that are duplicates, and I could do other things if I cared about manipulating those duplicates. But the big takeaway is that you can build high-performance system tools that are memory efficient. Let's walk through the code real quick, and you can see it's actually very straightforward. I like this pattern of putting all of my logic in a library and then executing it from a command line tool; I think this pattern is ideal for data engineering. So I go through here and load some Cargo dependencies, and if we look at the Cargo file, you can see I have my development dependencies and my production dependencies. This is a command line tool: this crate lets me walk directories, this one does checksums, this is an efficient thread pool, and this is a progress bar that interacts with the thread pool. So you can see this pattern is very reusable for other tools. If we go to the lib file, here's the code that walks the directory. It doesn't look that much different from Python: it accepts a path, which is a string, and returns back a vector of strings. Next up, I wrote a little bit of code for pattern matching. Again, it goes through and does pattern matching in a tiny bit of code, not that much different from Python.
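In the spirit of those two functions, here's a std-only sketch of a directory walk plus a pattern filter. The repo itself uses a directory-walking crate, and these function names and the simple substring match are my assumptions for illustration:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Recursively collect every file path under `path`.
// (Std-only sketch; the demo's repo uses a walking crate instead.)
fn walk(path: &Path, files: &mut Vec<String>) -> io::Result<()> {
    for entry in fs::read_dir(path)? {
        let p = entry?.path();
        if p.is_dir() {
            walk(&p, files)?;
        } else {
            files.push(p.display().to_string());
        }
    }
    Ok(())
}

// Keep only the paths containing `pattern` (plain substring match
// here; a real tool might use a regex crate).
fn find(files: &[String], pattern: &str) -> Vec<String> {
    files.iter().filter(|f| f.contains(pattern)).cloned().collect()
}
```

As the transcript says, this reads much like the Python equivalent: a string path in, a vector of strings out.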
We also want to go through and do a checksum, and this code is a little more complex, but not too bad. Notice that what I do here is build a checksum, but I also have a progress bar, and we see that right there. At the very end we pass all of those into the result, which is a HashMap, Rust's dictionary type, that gets returned. And finally, we look inside of that HashMap, and if there's more than one file with the same checksum, we report it. That's the end result as I tie it all together. So how do I execute this? I go to the main file. I have boilerplate code at the very beginning, and really the only thing that's important is these mappings: I just say here's a search command, here's a dedupe command, and here's a count command, and then over here I map it all together. So I think this pattern is very usable for building extremely powerful tools that are memory efficient. And again, we can see that the dedupe command maps exactly to this. So in a nutshell, it's trivial to build high-performance data engineering tools with Rust, and I hope you give this a try.
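The dedupe logic described here boils down to "group paths by checksum, report any group larger than one." Here's a std-only sketch over in-memory bytes; std's DefaultHasher stands in for whatever checksum crate the repo actually uses, and the function names are mine:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Checksum a file's contents. DefaultHasher is a stand-in for the
// real tool's checksum crate (an assumption for this sketch).
fn checksum_bytes(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Group (path, contents) pairs by checksum, then keep only the
// groups with more than one entry: those are the duplicates.
fn find_duplicates(files: &[(String, Vec<u8>)]) -> Vec<Vec<String>> {
    let mut groups: HashMap<u64, Vec<String>> = HashMap::new();
    for (path, data) in files {
        groups
            .entry(checksum_bytes(data))
            .or_default()
            .push(path.clone());
    }
    groups.into_values().filter(|g| g.len() > 1).collect()
}
```

The HashMap is doing exactly the job of the "dictionary" mentioned above: checksum in, list of matching paths out, with the final filter reporting only the collisions.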
