From the course: Apache Spark Essential Training: Big Data Engineering
Stateful stream processing
- [Instructor] One of the challenges in stream processing is the need to maintain current state, both for computing aggregations and managing transitions, and for resuming processing after a pipeline is halted. How does Spark help in this regard? Let's start with the feature of checkpointing in Apache Spark. Checkpointing is the ability to save the state of the pipeline to a persistent data store like HDFS or S3. When a job fails and needs to be restarted, the information saved during checkpointing is used to resume processing from where it left off. At periodic intervals, checkpointing stores a number of metadata elements, as well as some RDDs, to the checkpoint store. This includes Kafka offsets, so processing can resume from the last processed record. If state is tracked by key, that information can also be checkpointed. Finally, if there are RDDs that require operations like windowing across multiple batches, they too are stored in the checkpoint store.
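To make this concrete, here is a minimal sketch of checkpointing in Spark Structured Streaming using PySpark. The broker address, topic name, and checkpoint path are hypothetical placeholders, and running it would also require the spark-sql-kafka connector package; it illustrates how the checkpointLocation option plays the role described above, not the instructor's exact example.

# A minimal sketch of checkpointing in Spark Structured Streaming (PySpark).
# The broker address, topic name, and checkpoint path are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("CheckpointingDemo").getOrCreate()

# Read a stream from Kafka. The consumed offsets are part of the metadata
# Spark writes to the checkpoint location, so a restarted job can resume
# from the last processed record.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# A stateful aggregation: event counts per five-minute window. The running
# state for each window is also persisted through checkpointing.
counts = (events
          .selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(window(col("timestamp"), "5 minutes"))
          .count())

# checkpointLocation points at a persistent store such as HDFS or S3;
# on restart, Spark recovers offsets and state from this directory.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "hdfs:///checkpoints/events-agg")
         .start())

query.awaitTermination()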