From the course: Architecting Big Data Applications: Batch Mode Application Engineering


Reprocessing

- [Instructor] Reprocessing of data in a big data pipeline is a critical function that needs to be optimally designed. Why do we need reprocessing of data? There can be infrastructure issues while running a batch job, like out-of-memory errors, network timeouts, or process crashes. There can be errors in earlier processing that necessitated a code change; this requires the data to be reprocessed with the new code. Sometimes new late data arrives for a batch that has already been processed, and that batch needs to be reprocessed with the late data added. Developers may add new logic or analytics, and the requirement would be to reprocess old batches to apply them to old data. Whatever the case, reprocessing is unavoidable and can prove to be a headache. It's better to architect the pipeline from the start to allow reprocessing of old data. What are some of the best practices to enable reprocessing when…
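One common way to make a pipeline safe to reprocess is to write each batch to its own output partition and replace that partition wholesale on every run, so a re-run never leaves duplicates or partial results behind. The sketch below is a minimal, hypothetical illustration of that pattern (the function name, directory layout, and uppercase "transform" are all illustrative assumptions, not code from the course):

```python
import shutil
from pathlib import Path

def process_batch(records, batch_date, output_root="output"):
    """Idempotently (re)process one batch.

    The output partition for batch_date is deleted and rewritten in full,
    so re-running the batch after a crash, a code fix, or the arrival of
    late data produces a clean, duplicate-free result.
    """
    partition = Path(output_root) / f"date={batch_date}"
    if partition.exists():
        shutil.rmtree(partition)          # wipe any prior (possibly partial) run
    partition.mkdir(parents=True)
    results = [r.upper() for r in records]  # stand-in for the real transform
    (partition / "part-0000.txt").write_text("\n".join(results))
    return results

# First run of the batch:
process_batch(["a", "b"], "2024-01-01")
# Late record "c" arrives; re-running the same batch is safe:
process_batch(["a", "b", "c"], "2024-01-01")
```

Because each run overwrites its own partition, reprocessing is just "run the same batch again with the corrected code or the fuller input" rather than a special cleanup procedure.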
