From the course: Apache Spark Essential Training: Big Data Engineering

Batch vs. real-time processing

- [Instructor] When building data pipelines, one of the key decisions to make is whether the pipeline should be batch or real-time. We start with batch processing. In batch processing, we process data in batches. A batch is a collection of records with a well-defined size or window. We can define batch sizes either by the count of records or by a date/time range. In batch processing, the batch of input records is ready and complete before processing begins. This input does not change while processing is going on. This is a key differentiator between batch and real-time processing. Batch processing works on bounded datasets. The number of records in the batch does not change after processing has started. Since batch processing usually waits for all the input for that batch to be ready, there is latency before processing begins. Also, it may take some time to process all the records. The results of processing are usually available at the end of processing, but intermediate results may also be available.…
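
The two ways of defining a batch mentioned above, by record count or by a date/time range, can be illustrated with a small sketch. This is plain Python rather than Spark code, and the helper names (`batches_by_count`, `batches_by_window`) and the sample records are hypothetical, chosen only for illustration. The input list is bounded: it is complete before processing starts and does not change while it is being processed.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import islice


def batches_by_count(records, size):
    """Yield consecutive lists of at most `size` records (hypothetical helper)."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batches_by_window(records, window):
    """Group (timestamp, value) records into fixed time windows (hypothetical helper)."""
    epoch = datetime(1970, 1, 1)
    groups = defaultdict(list)
    for ts, value in records:
        # Floor each timestamp to the start of the window it falls in.
        window_index = (ts - epoch) // window
        groups[epoch + window_index * window].append(value)
    return dict(sorted(groups.items()))


# Assumed sample data: a bounded input that is complete before processing starts.
records = [
    (datetime(2024, 1, 1, 0, 5), 10),
    (datetime(2024, 1, 1, 0, 40), 20),
    (datetime(2024, 1, 1, 1, 15), 30),
    (datetime(2024, 1, 1, 1, 50), 40),
    (datetime(2024, 1, 1, 2, 10), 50),
]

# Batches defined by record count: two records per batch.
count_batches = list(batches_by_count(records, 2))

# Batches defined by a date/time range: one batch per hour.
hourly = batches_by_window(records, timedelta(hours=1))

# Results become available once each complete batch has been processed.
hourly_totals = {start: sum(values) for start, values in hourly.items()}
```

In both cases a result for a batch is computed only after that batch is complete, which is where the latency described above comes from.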
