Batch vs. real-time options - Apache Spark Tutorial

From the course: Apache Spark Essential Training: Big Data Engineering

Batch vs. real-time options

- [Instructor] When building an Apache Spark pipeline, would you choose a batch or a real-time pipeline? Data engineers and architects tend to build every pipeline as a real-time pipeline whenever possible. The key justification is that a real-time pipeline is fast, generates the required insights immediately, and enables prompt business action. It is also considered cool in the data engineering world. But before jumping into building real-time pipelines, we need to understand the complexities involved. Real-time pipelines deal with unbounded data sets, which makes it difficult to size compute resources such as memory and clusters. When doing time-based aggregations, we need to deal with windowing and watermarks. However long the watermark delay is, some events can still arrive too late and be missed, which leaves the aggregations incorrect. Real-time state management is another challenge: state has to be properly maintained across the Spark cluster. Reprocessing events is complex when issues happen. It is…
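
To make the point about watermarks and missed events concrete, here is a minimal sketch in plain Python. It is a simplified simulation, not Spark's API: the function name `aggregate_with_watermark`, the 60-second tumbling window, and the 30-second watermark delay are all assumptions chosen for illustration. It also advances the watermark on every event, whereas a streaming engine such as Spark would typically advance it between micro-batches.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window length (assumed for illustration)
WATERMARK_DELAY = 30   # how far the watermark trails the latest event time (assumed)


def aggregate_with_watermark(events, window=WINDOW_SECONDS, delay=WATERMARK_DELAY):
    """Count events per tumbling window, dropping events that arrive too late.

    `events` holds event-time timestamps (in seconds) in arrival order.
    Returns (counts keyed by window start, list of dropped timestamps).
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None

    for ts in events:
        # The watermark trails the largest event time seen so far by `delay`.
        if max_event_time is None or ts > max_event_time:
            max_event_time = ts
        watermark = max_event_time - delay

        window_start = (ts // window) * window
        window_end = window_start + window

        # Once the watermark has passed a window's end, that window is treated
        # as final, so a late event for it is dropped instead of counted.
        if window_end <= watermark:
            dropped.append(ts)
            continue
        counts[window_start] += 1

    return dict(counts), dropped


# The event with timestamp 40 arrives after events at 70 and 130.
arrivals = [5, 20, 70, 130, 40, 100]

counts, dropped = aggregate_with_watermark(arrivals)
print(counts)   # {0: 2, 60: 2, 120: 1}
print(dropped)  # [40]

# A longer delay keeps the late event, at the cost of holding state longer.
counts_long, dropped_long = aggregate_with_watermark(arrivals, delay=120)
print(counts_long)   # {0: 3, 60: 2, 120: 1}
print(dropped_long)  # []
```

With the 30-second delay, the first window reports 2 events although 3 actually belong to it. Raising the delay recovers that event here, but any finite delay can still be exceeded by a sufficiently late event, which is the trade-off the instructor describes.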
