From the course: Apache Spark Essential Training: Big Data Engineering

Batch vs. real-time options

- [Instructor] When building an Apache Spark pipeline, would you choose a batch or a real-time pipeline? Data engineers and architects tend to build all pipelines as real-time pipelines whenever possible. The key justification is that a real-time pipeline is fast, generates the required insights instantly, and enables business actions. It is also considered cool in the data engineering world. But before jumping into building real-time pipelines, we need to understand the complexities involved. Real-time pipelines deal with unbounded data sets, which makes it difficult to size compute resources such as memory and clusters. When doing time-based aggregations, we need to deal with windowing and watermarks. Irrespective of how delayed the watermarks are, we can still end up with missed events and incorrect aggregations. State management is another challenge: state must be properly maintained across the Spark cluster. Reprocessing events is complex when issues happen. It is…
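To make the point about watermarks and missed events concrete, here is a minimal sketch in plain Python. It is a simplified model written for illustration, not Spark's API or its exact semantics: the function name `windowed_counts`, its parameters, and the sample timestamps are all hypothetical. It counts events in tumbling windows and drops any event whose window has already closed relative to the watermark, where the watermark is assumed to be the largest event time seen so far minus an allowed lateness.

```python
from collections import defaultdict


def windowed_counts(events, window_size, allowed_lateness):
    """Count events per tumbling window, dropping events behind the watermark.

    events: event timestamps (in seconds) listed in ARRIVAL order.
    Simplified model (an assumption of this sketch): the watermark is
    re-evaluated per event as max_event_time_seen - allowed_lateness.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None

    for ts in events:
        # Watermark based on the events seen BEFORE this one.
        if max_event_time is None:
            watermark = float("-inf")
        else:
            watermark = max_event_time - allowed_lateness

        window_start = (ts // window_size) * window_size
        window_end = window_start + window_size

        if window_end <= watermark:
            # The window is already considered final: the event is missed.
            dropped.append(ts)
        else:
            counts[window_start] += 1

        max_event_time = ts if max_event_time is None else max(max_event_time, ts)

    return dict(counts), dropped


# Timestamps 3 and 8 arrive out of order; 8 arrives after the watermark
# has moved past the end of its window.
arrivals = [1, 4, 12, 3, 25, 8]

counts, dropped = windowed_counts(arrivals, window_size=10, allowed_lateness=5)
print(counts)   # {0: 3, 10: 1, 20: 1}
print(dropped)  # [8]
```

In this run the window starting at 0 reports 3 events although 4 actually belong to it. Raising `allowed_lateness` to 100 recovers the late event, but the window then has to be kept open (and its state held) for much longer, which is the trade-off between correctness and resource use that the transcript alludes to.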
