From the course: Apache Spark Essential Training: Big Data Engineering
Parallel processing with Spark
- [Instructor] Big data processing is all about processing data in parallel to achieve high throughput in less time. How does Apache Spark help with this goal? Data processing involves multiple activities, which, in general, can be grouped into the following. First, data is read from data sources, like databases. Then various operations, like transformations, data filtering, and validation checks, are performed. Data may then be aggregated to create summary metrics. Finally, transformed data is written to sinks. When we talk about scaling data processing, we need to scale all of these steps. We need the ability to parallelize every step involved in data processing. Steps that cannot be parallelized become bottlenecks, and the speed of the processing pipeline is limited by how fast those bottlenecks can process data. How does Apache Spark help in parallelizing activities and removing bottlenecks? Let's start with reading data from data sources. Spark supports out-of-the-box…
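To make the four stages concrete, here is a minimal PySpark sketch of such a pipeline. This is not the instructor's code; the file paths, column names, and schema are hypothetical, chosen only to illustrate how each stage maps to Spark operations that run in parallel across partitions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would point at a cluster.
spark = SparkSession.builder.appName("parallel-pipeline").getOrCreate()

# 1. Read from a data source. Spark splits the input into partitions so
#    that executor tasks can read and process them in parallel.
#    (Path and schema are hypothetical.)
orders = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

# 2. Transform, filter, and validate. These narrow operations run
#    independently on each partition, so they parallelize naturally.
valid_orders = (
    orders
    .filter(F.col("amount") > 0)  # validation check
    .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))  # transformation
)

# 3. Aggregate to create summary metrics. Spark shuffles rows by key and
#    computes each group's total in parallel across executors.
summary = (
    valid_orders
    .groupBy("country")
    .agg(F.sum("amount_usd").alias("total_usd"))
)

# 4. Write the result to a sink. Each output partition is written by a
#    separate task, so the write step is parallel as well.
summary.write.mode("overwrite").parquet("/data/order_summary")
```

The key point of the sketch: every stage, from read to write, is expressed in terms of partitions, which is what lets Spark spread the work across many executors instead of funneling it through a single bottleneck.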