
Stateful stream processing

- [Instructor] One of the challenges in stream processing is the need to maintain current state for computing aggregations and managing transitions. State is also needed to resume processing after a pipeline is halted. How does Spark help in this regard? Let's start with the feature of checkpointing in Apache Spark. Checkpointing is the ability to save the state of the pipeline to a persistent data store like HDFS or S3. When a job fails and needs to be restarted, the information saved during checkpointing is used to resume processing from where it left off. Checkpointing stores a number of metadata elements, as well as some RDDs, at periodic intervals in the checkpoint store. This includes Kafka offsets, so processing can resume from the last processed record. Also, if state is tracked by keys, that information can be checkpointed as well. Finally, if there are RDDs that require operations like windowing across multiple batches, they will also be stored in the checkpointing…
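As an illustration of the idea described above, here is a minimal PySpark sketch of a stateful Structured Streaming query with checkpointing enabled. The broker address, topic name, and checkpoint path are hypothetical placeholders; the `checkpointLocation` option is where Spark persists offsets and state so the query can resume after a failure.

```python
# A minimal sketch of checkpointing in Spark Structured Streaming.
# Broker, topic, and checkpoint path below are hypothetical examples.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stateful-stream-checkpointing")
         .getOrCreate())

# Read a stream of records from a Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# A stateful aggregation: a running count of records per key.
# Spark tracks this state per key across micro-batches.
counts = events.groupBy("key").count()

# Checkpointing persists Kafka offsets and keyed state to a durable
# store (e.g., HDFS or S3) so processing resumes where it left off.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "hdfs:///checkpoints/stateful-demo")
         .start())

query.awaitTermination()
```

If this query is killed and restarted with the same checkpoint location, Spark reads the saved offsets and state from the checkpoint store rather than starting over from the beginning of the topic.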