From the course: Apache Spark Essential Training: Big Data Engineering
Batch vs. real-time options
- [Instructor] When building an Apache Spark pipeline, would you choose a batch or a real-time pipeline? Data engineers and architects tend to build all pipelines as real-time pipelines whenever possible. The key justification is that a real-time pipeline is very fast, generates the required insights instantly, and enables business actions. It is also considered cool in the data engineering world. But before jumping into building real-time pipelines, we need to understand the complexities involved. Real-time pipelines deal with unbounded data sets. This makes it difficult to size compute resources, like memory and clusters. When doing time-based aggregations, we need to deal with windowing and watermarks. Irrespective of how delayed the watermarks are, we do end up with missed events and incorrect aggregations. State management is another challenge in real time: state must be properly maintained across the Spark cluster. Reprocessing events is complex when issues occur. It is…
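To make the point about watermarks and missed events concrete, here is a minimal sketch in plain Python. It is not Spark code and not how Spark implements watermarking; it is a simplified simulation that assumes tumbling windows, integer event times, and a watermark recomputed after every event as the largest event time seen minus an allowed delay. The names `windowed_counts`, `window_size`, and `allowed_delay` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def windowed_counts(events, window_size, allowed_delay):
    """Count events per tumbling window, dropping events that arrive too late.

    events: event times (integers), listed in the order they ARRIVE.
    Simplifying assumption: the watermark is updated after every event as
    (largest event time seen so far) - allowed_delay.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")

    for event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_delay

        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size

        # Once the watermark has passed the end of a window, that window is
        # treated as final, so an event belonging to it is discarded.
        if window_end <= watermark:
            dropped.append(event_time)
            continue

        counts[window_start] += 1

    return dict(counts), dropped


# Event times in arrival order; 3 and 8 arrive out of order.
arrivals = [1, 4, 12, 3, 25, 8]
counts, dropped = windowed_counts(arrivals, window_size=10, allowed_delay=5)
print(counts)   # {0: 3, 10: 1, 20: 1}
print(dropped)  # [8]
```

In this run the event at time 3 arrives late but is still counted, while the event at time 8 arrives after the watermark has moved past its window, so the window starting at 0 reports 3 events instead of the true 4. Raising `allowed_delay` to 20 would recover that event, at the cost of keeping window state around longer.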