From the course: Apache Spark Essential Training: Big Data Engineering
Batch vs. real-time processing
- [Instructor] When building data pipelines, one of the key decisions to make is whether the pipeline will be batch or real-time. We start with batch processing. In batch processing, we process data in batches. A batch is a collection of records with a well-defined size or window. We can define batch sizes either by the count of records or by a date/time range. In batch processing, the batch of input records is already complete before processing begins. This input does not change while processing is going on. This is a key differentiator between batch and real-time processing. Batch processing works on bounded datasets. The number of records in the batch does not change after the processing has started. Since batch processing usually waits for all the input for that batch to be ready, there is latency before processing begins. Also, it may take some time to process all the records. The results are usually available at the end of processing, but intermediate results may also be available.…
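The two ways of defining a batch mentioned above, by record count or by a date/time range, can be sketched in plain Python. This is only an illustrative sketch and not code from the course: the function names `batches_by_count` and `batches_by_window`, the sample `events` list, and the one-hour window are all assumptions made for the example, and no Spark API is used. The input is a fixed, bounded list, which mirrors the point that a batch does not change once processing has started.

```python
from datetime import datetime, timedelta
from itertools import islice


def batches_by_count(records, size):
    """Yield consecutive lists holding at most `size` records each."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batches_by_window(records, window, key):
    """Group records into fixed time windows, keyed by each window's start."""
    epoch = datetime(1970, 1, 1)
    groups = {}
    for rec in records:
        ts = key(rec)
        # Round the timestamp down to the start of its window.
        start = epoch + ((ts - epoch) // window) * window
        groups.setdefault(start, []).append(rec)
    return dict(sorted(groups.items()))


# Hypothetical bounded input: (timestamp, payload) pairs that are
# complete before any processing starts.
events = [
    (datetime(2024, 1, 1, 0, 5), "a"),
    (datetime(2024, 1, 1, 0, 40), "b"),
    (datetime(2024, 1, 1, 1, 10), "c"),
    (datetime(2024, 1, 1, 1, 50), "d"),
    (datetime(2024, 1, 1, 2, 15), "e"),
]

count_batches = list(batches_by_count(events, 2))
hourly_batches = batches_by_window(events, timedelta(hours=1), key=lambda e: e[0])
```

With this sample input, `count_batches` holds batches of two records (the last batch may be smaller), while `hourly_batches` maps each hour's start time to the records that fall inside that hour.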