From the course: AWS Certified Data Engineer Associate (DEA-C01) Cert Prep

Apache Spark

- [Instructor] Apache Spark was created to address the limitations of MapReduce by replacing frequent disk reads with in-memory processing, supporting faster processing and analytic queries against data of any size. In this lesson, we'll explain the different components of Spark and how it works. Spark is a versatile framework that supports batch processing, stream processing, and interactive queries. Data can be in plain text or Hadoop file formats, and Spark can read it from multiple sources, including S3, HDFS, and databases. It provides development APIs in Java, Scala, Python, and R.

Spark Core is the foundation of the platform. It's responsible for memory management; fault recovery; scheduling, distributing, and monitoring jobs; and interacting with storage systems. Spark SQL is a query engine that supports standard SQL or the Hive Query Language. It supports various data sources and formats, including JDBC, ODBC, JSON, HDFS, Hive, ORC, and Parquet. Spark Streaming ingests data in mini-batches for streaming analytics, allowing developers to use the same code for batch processing as well as real-time streaming applications. GraphX provides ETL, exploratory analysis, and iterative graph computation, enabling users to interactively build and transform a graph data structure. Spark also includes MLlib, a library of machine learning algorithms. Data scientists can train machine learning models using R or Python on any Hadoop data source, save them using MLlib, and then import them into Java- or Scala-based pipelines.

When you deploy Spark on Amazon EMR, the Spark processing engine is deployed to each node in the cluster. You interact with the Scala interpreter or the Spark shell by connecting to the master node over SSH. In this scenario, the Spark framework replaces the MapReduce framework. Spark and Spark SQL may still interact with HDFS if needed, but ideally the nodes will pull their input data directly from S3 and load it into memory.
The Spark programming model uses something called resilient distributed datasets, or RDDs, which are basically read-only distributed collections of objects that are cached in memory across the cluster nodes. They're manipulated through various operators and are automatically rebuilt after the failure of a node. RDDs can be reused across operations and provide a way to manipulate and persist intermediate datasets. Spark also uses a special type of RDD called a DataFrame, which stores the data in a tabular format with rows and columns. The use of RDDs dramatically lowers processing latency, making Spark many times faster than MapReduce, especially for machine learning and interactive analytics.