From the course: Apache Spark Essential Training: Big Data Engineering
Spark architecture review - Apache Spark Tutorial
- [Instructor] In order to build an optimal Spark pipeline, it is important to understand how Spark works internally. Design decisions need to be analyzed for how they impact scalability and performance. In this video, I will review how Spark executes a pipeline and optimizes it. I recommend further reading on this topic to master the internals of Spark. Spark programs run on a driver node, which works with a Spark cluster to execute them. A Spark cluster can consist of multiple executor nodes capable of executing the program in parallel. The level of parallelism and performance achieved depends on how the pipeline is designed. Let's review an example pipeline and how it gets executed. First, the source data is read from an external source database into the structure Data 1. Data 1 is then converted to a DataFrame, or its internal representation, resilient distributed datasets (RDDs). During this conversion, the data is partitioned, and individual partitions are assigned and moved to the executor nodes that are available. When a narrow transform operation like map or filter is executed, these operations are pushed down to the executors. The executors execute the code locally on their partitions and create new partitions with the result. There is no movement of data between executors, hence transforms can be executed in parallel. Next, when a wide operation that aggregates across partitions, like reduce or groupBy, is performed, the partitions need to be shuffled and aggregated. This results in movement of data between executors and can create I/O and memory bottlenecks. Finally, when the data is collected back to the driver node, the partitions are merged and sent back to the driver. From here, they are stored into external destination databases. This flow needs to be understood and visualized for any pipeline that we are building, so we can identify bottlenecks and design for scalability.
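The flow described above can be sketched in plain Python (this is a conceptual simulation, not Spark code; the partition lists, key-value data, and the two-output-partition shuffle are illustrative assumptions): partitions live on executors, narrow transforms run locally with no data movement, and a wide operation shuffles rows between partitions before the results are collected back to the driver.

```python
# Plain-Python sketch (NOT Spark itself) of how data moves through a
# Spark pipeline: partitioned input, narrow transforms, shuffle, collect.
from collections import defaultdict

# "Data 1" read from a source, then split into partitions, one per executor.
data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
partitions = [data[0:2], data[2:4], data[4:6]]

# Narrow transform (like map/filter): each executor works only on its own
# partition, so no data crosses executor boundaries and all partitions
# can be processed in parallel.
mapped = [[(k, v * 10) for (k, v) in part] for part in partitions]

# Wide operation (like reduce/groupBy): rows must be shuffled so that all
# rows with the same key land in the same partition. This is the step that
# moves data between executors and can create I/O and memory bottlenecks.
num_out = 2  # number of post-shuffle partitions (illustrative choice)
shuffled = [defaultdict(list) for _ in range(num_out)]
for part in mapped:
    for (k, v) in part:
        shuffled[hash(k) % num_out][k].append(v)

# Each post-shuffle partition aggregates locally, then the partition
# results are merged ("collected") back on the driver.
result = {}
for part in shuffled:
    for k, vs in part.items():
        result[k] = sum(vs)

print(result)  # e.g. {'a': 100, 'b': 70, 'c': 40}
```

Notice that the map step never inspects another partition, while the shuffle step touches every partition; that asymmetry is exactly why narrow transforms scale cheaply and wide operations are where Spark pipelines bottleneck.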