From the course: Apache Spark Essential Training: Big Data Engineering

Batch processing use case: Problem statement

- [Instructor] Having discussed the capabilities of Apache Spark in the earlier chapter, let's now design and implement a batch processing pipeline in this chapter. Let's start by discussing the business use case we are trying to solve with this pipeline. We want to build a pipeline that performs stock aggregation across multiple locations for an enterprise. This enterprise has warehouses across the globe. Warehouses maintain stock of items by location and distribute them to local stores. Each warehouse also has a local data center with all the required hardware and software deployed in that center. A stock management application runs in each warehouse. It is the same software, but independent local instances are deployed in each local data center. A local MariaDB database keeps track of warehouse stock information. Stock is maintained for each item for each day. We keep track of the opening stock, receipts, and issues for each item for each day. The enterprise wants to create a consolidated central stock database. Item stock information needs to be collected from across warehouses on a daily basis, and then consolidated in a central MariaDB database. We need to set up batch processing to upload local warehouse data into a central cloud database, and then aggregate stock across locations. The solution should be scalable to hundreds of warehouses. This is a simple use case, but representative of a number of real-life use cases that require data gathering, consolidation, and aggregation.
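To make the problem concrete, here is a minimal sketch of what such a consolidation job could look like in PySpark. All host names, credentials, and the item_stock/global_item_stock table and column names are hypothetical placeholders, not the course's actual schema; the real pipeline is designed and implemented step by step in this chapter.

```python
from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("WarehouseStockConsolidation")
         .getOrCreate())

# Hypothetical JDBC endpoints for each warehouse's local MariaDB instance.
warehouse_urls = [
    "jdbc:mariadb://warehouse-london:3306/stockdb",
    "jdbc:mariadb://warehouse-tokyo:3306/stockdb",
]

def read_stock(url):
    # Read the daily stock table from one warehouse (assumed schema:
    # item_id, stock_date, opening_stock, receipts, issues).
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("driver", "org.mariadb.jdbc.Driver")  # MariaDB Connector/J assumed on the classpath
            .option("dbtable", "item_stock")
            .option("user", "spark")
            .option("password", "****")
            .load()
            .withColumn("warehouse", F.lit(url)))

# Union the per-warehouse extracts into a single DataFrame.
all_stock = reduce(lambda a, b: a.unionByName(b),
                   [read_stock(url) for url in warehouse_urls])

# Aggregate opening stock, receipts, and issues by item and day
# across all warehouse locations.
consolidated = (all_stock
                .groupBy("item_id", "stock_date")
                .agg(F.sum("opening_stock").alias("opening_stock"),
                     F.sum("receipts").alias("receipts"),
                     F.sum("issues").alias("issues")))

# Write the consolidated result to the central MariaDB database.
(consolidated.write.format("jdbc")
 .option("url", "jdbc:mariadb://central-db:3306/stockdb")
 .option("driver", "org.mariadb.jdbc.Driver")
 .option("dbtable", "global_item_stock")
 .option("user", "spark")
 .option("password", "****")
 .mode("append")
 .save())
```

In practice, the daily uploads from each warehouse and the central aggregation would run as separate scheduled batch steps, which is the structure this pipeline will follow.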
