From the course: Data Planning, Strategy, and Compliance for AI Initiatives
Batch processing systems
- [Instructor] We process data in AI using different processing patterns, one of which is called batch processing. This is where we process large volumes of data in predefined groups rather than continuously. Batch processing is optimized for throughput over immediate response, so we often schedule executions using pre-allocated resources. And really, batch processing is the foundation for training large AI models and processing massive data sets. Now, batch processing has been around for a long time, so it's a fairly mature paradigm with well-established tools and frameworks.
A key characteristic of batch processing is that data is collected, stored, and processed in discrete batches. The processing occurs at scheduled intervals, or it's triggered by certain events or conditions. The nice thing about the way batch processing is triggered is that we get some predictability: we have predictable resource allocation based on batch size, and we know when batches will run. Now, we need to keep in mind that the results of a batch process are typically only available when the entire batch completes. This makes batch processing well suited for non-time-sensitive AI workloads.
AI training is a prime example. We may, for example, train a large neural network on fairly large historical data sets, or we might retrain a model periodically as we accumulate new data. We also use batch processing for hyperparameter optimization, and we typically do this through sequential runs. We also use batch processing when we do cross-validation for comprehensive model evaluation.
Data preparation is another workload where batch processing makes sense. We use ETL, or extract, transform, and load, processes for creating training sets. And we also use batch processing for feature engineering at scale when we're working with, say, millions of records.
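To make the idea of feature engineering in discrete batches concrete, here is a minimal Python sketch. It standardizes one numeric feature in two passes over the data, which also shows why results are only available once the whole batch job has finished: the mean and standard deviation depend on every record. The names `iter_batches` and `batch_standardize`, and the default batch size, are hypothetical choices for illustration and are not taken from the course.

```python
import math
from itertools import islice


def iter_batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def batch_standardize(values, batch_size=1000):
    """Standardize values (zero mean, unit variance) one batch at a time.

    Assumes `values` can be iterated twice (for example a list). A real job
    would re-read each batch from storage instead of holding it in memory.
    """
    # Pass 1: accumulate summary statistics across all batches.
    count, total, total_sq = 0, 0.0, 0.0
    for batch in iter_batches(values, batch_size):
        count += len(batch)
        total += sum(batch)
        total_sq += sum(v * v for v in batch)
    if count == 0:
        return []

    mean = total / count
    # Sum-of-squares variance is a simplification; it can lose precision
    # on very large values.
    variance = max(total_sq / count - mean * mean, 0.0)
    std = math.sqrt(variance) or 1.0  # avoid dividing by zero for constant data

    # Pass 2: apply the transformation batch by batch.
    standardized = []
    for batch in iter_batches(values, batch_size):
        standardized.extend((v - mean) / std for v in batch)
    return standardized
```

Calling `batch_standardize([1, 2, 3, 4, 5], batch_size=2)` processes three batches in each pass. No standardized value can be produced until the first pass has seen all of the data.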
Comprehensive data cleansing and normalization operations are also a good fit for batch processing, as is generating training, validation, and test sets. We can also use batch processing when we're working with complex engineered feature calculations.
Now, when we're talking about the technical architecture of batch processing, we may want to use a distributed computing platform like Spark or Hadoop. We also want to take advantage of accelerators like GPUs and TPUs and use those in clusters, especially when we're working with neural network training. Batch processing involves jobs, so a job scheduling system for resource management is an important tool to have in our technical architecture. We also want to make sure we're working with storage systems that are optimized for large sequential reads.
There are several benefits that come with batch processing, the first of which is cost efficiency, and we get that through optimized resource utilization. We can also take advantage of off-peak processing, which sometimes has lower cost and more availability with regard to our computing resources. Batch processing allows us to do fairly predictable budgeting and resource allocation, and there's also more operational stability, because with batch processing we can have clearly defined maintenance windows. Batch processing also allows for simpler monitoring and troubleshooting than we might have if we're working with, say, a stream processing model, which we'll talk about shortly.
Now, when it comes to implementing batch processing, you want to start with well-defined smaller workloads before scaling them up.
As you're defining your workloads, be sure that you design for failure recovery. One of the ways to do this is to use what we call checkpointing, or saving intermediate results periodically, so you can roll back to, or restart from, one of those checkpoints rather than going all the way back to the beginning of the workload. We also want to implement comprehensive logging and monitoring, and be sure that we establish automated quality checks on both the original data and the intermediate steps in the process. And then finally, we want to balance batch size against processing time and resources.
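The checkpointing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the function names, the JSON checkpoint format, and the `fail_at_batch` parameter (used only to simulate a crash) are all hypothetical. The sketch also keeps results inside the checkpoint file for simplicity, where a real job would more likely write them to durable storage and checkpoint only its position.

```python
import json
import os


def process_record(value):
    """Stand-in for real per-record work (hypothetical)."""
    return value * value


def run_batch_job(records, checkpoint_path, batch_size=100, fail_at_batch=None):
    """Process records in batches, saving a checkpoint after each batch.

    If a checkpoint file already exists, the job resumes from it instead of
    starting over. `fail_at_batch` only exists to simulate a failure.
    """
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    batch_number = state["next_index"] // batch_size
    while state["next_index"] < len(records):
        if fail_at_batch is not None and batch_number == fail_at_batch:
            raise RuntimeError(f"simulated failure in batch {batch_number}")

        start = state["next_index"]
        batch = records[start:start + batch_size]
        state["results"].extend(process_record(r) for r in batch)
        state["next_index"] = start + len(batch)

        # Write to a temporary file, then rename it into place, so a crash
        # mid-write does not leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
        batch_number += 1

    return state["results"]
```

If a run over ten records with `batch_size=3` fails in its third batch, the checkpoint records that six records are done, and calling `run_batch_job` again with the same `checkpoint_path` picks up at the seventh record. The batch size sets the trade-off mentioned above: smaller batches mean less repeated work after a failure, but more frequent checkpoint writes.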