From the course: Problem-Solving Strategies for Data Engineers
The data engineer
- [Instructor] Let's start with a macro view of data engineering. On the left, you have a data source. This is where your data is coming from. On the right is a person or a system that is actually using the data. Everything in the middle is a pipeline: the part that takes the data from the left and delivers it to the right. Of course, there are multiple steps. This pipeline is not just a single thing. But from a macro view, that's all that data engineering is. What data engineers do is design, build, and maintain these systems. The purpose is to collect the data from the source, store it somewhere, and make it accessible to the people and systems that then analyze the data. Pretty straightforward, right?
What are examples of this? The first example is where I'm actually coming from, what I did a lot in my past: processing data from IoT devices, from the Internet of Things. A device has a sensor and is connected to the internet, it sends you data, and a user might then want to look at that device's data in a user interface.
It could also be that you have an ecommerce store and the analytics team wants to optimize the stock they hold or their sales. For that, they need to analyze the order data, so the data engineer takes the order data and makes it available to the analysts.
A more complicated example is where you want to analyze the web traffic and the sales together to gain insight into how people are actually behaving on the website: what are they doing, at what point are they buying, and what are they buying? The goal is to improve the experience for the user and also to make more sales.
So, building on what we've talked about before, designing, building, and maintaining systems covers both infrastructure and end-to-end pipelines. That's what the data engineer is responsible for.
The infrastructure is the cloud and the systems on the cloud that are needed for processing the data: databases, processing frameworks, scheduling tools, message queues, and more. And we are usually in charge of the end-to-end pipelines. Basically, we use the infrastructure that is there to create complex processes that make the data available: we query the data from sources, we process it in multiple steps, doing some transformations, some data cleaning, and some data modeling, and then we store it in a destination.
So let's look at a pipeline in a bit more detail. This is still a very, very simple ETL pipeline. We have our data on the left in our data source. We have our data integration tool or data integration process that extracts the data out of the source. Then we do 1, 2, 3, n transformation steps, we do data cleaning, and after that we store it in a data store that we've modeled, where the data can later be used by a person. So a person can go into a visualization tool and then request the data from our store.
The skills data engineers very often need: first of all, software engineering, because we're developing a lot of code. So software engineering and coding. We need to collaborate with our colleagues, with the business, and so on, and we also need to do operations for our software, for the pipelines we've developed.
Then we need to do data integration. That was the left part in that image of the data engineering process, right? There are some sources, and the sources can vary a lot, from online tools to databases to message queues to APIs. We need to work out how to integrate these sources into our pipelines so that we can later process the data. This might be streaming data, where the data comes in constantly, or it might be batch processing, where we schedule processes. And then, of course, databases.
Databases are one of the basics for a data engineer, at least relational database knowledge, where we know, okay, this is how we can design a database and this is how we can use the database. Of course, stores are not just relational. We know all kinds of stores: relational databases, and NoSQL databases like key-value stores, wide-column stores, and document stores. We also know data warehouses and data lakes.
Another part is analysis. We also work on analysis, but not in the way that you might think. Not like a data scientist, who analyzes the data to get some business value. We analyze the data to better understand what's happening in our system, so most of the time we use quality data from our systems and processes to analyze what's actually happening.
As you can see, data engineering is critical. A lot of people think, "Let's have some data and let's have a scientist or an analyst and we are going to have a nice system." But that thing in between the source and the person or system that is actually using the data, that pipeline, is super important. It is critical. Without the engineering in between, you will not have automation and you will not have a productive system.
Now, who do we data engineers work with? Very often we're working with analysts, because they sit on the right side and work with the data. The same goes for data scientists. We're also working with the business, because we need to understand what the business needs are, what we are actually solving here, and what our goals are. And fourth, last but not least, we're also going to work together with other engineers.
They might be data engineers, hardware engineers, or system engineers. For instance, in that example, the database on the left side that is the source for our data might be owned by another engineer in the company who is responsible for it, so we need to talk to this engineer and figure out how we can work together.
All right, let's look next at the phases of a data project.
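
The simple ETL pipeline described earlier (extract from a source, run a few transformation steps including cleaning, then store the result in a modeled data store) can be sketched in a few lines of Python. This is only a minimal illustration, not something shown in the course: the order data, the function names, and the `orders` table are all made up for this example, and an in-memory SQLite database stands in for the destination store.

```python
import csv
import io
import sqlite3

# Hypothetical raw order data standing in for a real source system.
RAW_ORDERS = """order_id,product,quantity,unit_price
1,widget,2,9.50
2,gadget,,19.99
3,widget,1,9.50
4,gizmo,3,4.00
"""


def extract(raw_text):
    """Extract: read rows out of the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def clean(rows):
    """Transformation step 1 (cleaning): drop rows with a missing quantity."""
    return [row for row in rows if row["quantity"].strip()]


def convert_types(rows):
    """Transformation step 2: cast text fields to numbers and add a total."""
    return [
        {
            "order_id": int(row["order_id"]),
            "product": row["product"],
            "quantity": int(row["quantity"]),
            "total": round(int(row["quantity"]) * float(row["unit_price"]), 2),
        }
        for row in rows
    ]


def load(rows, connection):
    """Load: store the transformed rows in a modeled destination table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER, total REAL)"
    )
    connection.executemany(
        "INSERT INTO orders VALUES (:order_id, :product, :quantity, :total)", rows
    )
    connection.commit()


def run_pipeline(raw_text, connection):
    """Run extract, then each transformation step in order, then load."""
    rows = extract(raw_text)
    for step in (clean, convert_types):
        rows = step(rows)
    load(rows, connection)
    return len(rows)


# In-memory SQLite is an assumption here; it plays the role of the data store.
connection = sqlite3.connect(":memory:")
loaded = run_pipeline(RAW_ORDERS, connection)

# The "person on the right" querying the store, e.g. from a visualization tool.
revenue_by_product = connection.execute(
    "SELECT product, SUM(total) FROM orders GROUP BY product ORDER BY product"
).fetchall()
print(loaded, revenue_by_product)
```

Keeping each transformation as its own small function mirrors the "1, 2, 3, n transformation steps" idea: adding another step means adding one more function to the tuple in `run_pipeline`. A real pipeline would typically also involve a scheduler or a streaming source, which this sketch leaves out.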