
Stateful stream processing

- [Instructor] One of the challenges in stream processing is the need to maintain current state for computing aggregations and managing transitions. State is also needed to resume processing after a pipeline is halted. How does Spark help in this regard? Let's start with the feature of checkpointing in Apache Spark. Checkpointing is the ability to save the state of the pipeline to a persistent data store like HDFS or S3. When a job fails and needs to be restarted, the information saved during checkpointing is used to resume processing from where it left off. Checkpointing stores a number of metadata elements, as well as some RDDs, at periodic intervals in the checkpoint store. This includes Kafka offsets, so processing can resume from the last processed record. Also, if state is tracked by keys, that information can be checkpointed as well. Finally, if there are RDDs that require operations like windowing across multiple batches, they will also be stored in the checkpointing…
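As an illustration of the idea described above, here is a minimal PySpark sketch of a stateful Structured Streaming query with checkpointing enabled. The broker address, topic name, and checkpoint path are hypothetical placeholders; the `checkpointLocation` option is where Spark persists offsets and state so the query can resume after a failure.

```python
# A minimal sketch of checkpointing in Spark Structured Streaming.
# Broker, topic, and checkpoint path below are hypothetical examples.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stateful-stream-checkpointing")
         .getOrCreate())

# Read a stream of records from a Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# A stateful aggregation: a running count of records per key.
# Spark tracks this state per key across micro-batches.
counts = events.groupBy("key").count()

# Checkpointing persists Kafka offsets and keyed state to a durable
# store (e.g., HDFS or S3) so processing resumes where it left off.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "hdfs:///checkpoints/stateful-demo")
         .start())

query.awaitTermination()
```

If this query is killed and restarted with the same checkpoint location, Spark reads the saved offsets and state from the checkpoint store rather than starting over from the beginning of the topic.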