From the course: AWS Certified Data Engineer Associate (DEA-C01) Cert Prep
Apache Spark
- [Instructor] Apache Spark was created to address the limitations of MapReduce by replacing frequent disk reads with in-memory processing, enabling faster processing and analytic queries against data of any size. In this lesson, we'll explain the different components of Spark and how it works.

Spark is a versatile framework that supports batch and stream processing as well as interactive queries. Data can be text or Hadoop file formats, and Spark can read data from multiple sources, including S3, HDFS, and other databases. It provides development APIs in Java, Scala, Python, and R.

Spark Core is the foundation of the platform. It's responsible for memory management, fault recovery, scheduling, distributing, and monitoring jobs, and interacting with storage systems. Spark SQL is a query engine that supports standard SQL or the Hive query language. It supports various data sources and formats, including JDBC, ODBC, JSON, HDFS, Hive, ORC, and Parquet. Spark Streaming ingests data in mini-batches for streaming analytics, which allows developers to use the same code for batch processing as well as real-time streaming applications. GraphX provides ETL, exploratory analysis, and iterative graph computation, enabling users to interactively build and transform graph data structures. Spark also includes MLlib, a library of machine learning algorithms. Data scientists can train machine learning models using R or Python on any Hadoop data source, save them using MLlib, and then import them into Java or Scala-based pipelines.

When you deploy Spark on Amazon EMR, the Spark processing engine is deployed to each node in the cluster. You interact with the Scala interpreter or with a shell by connecting to the master node over SSH. In this scenario, the Spark framework replaces the MapReduce framework. Spark and Spark SQL may still interact with HDFS if needed, but ideally the nodes will pull their input data directly from S3 and load it into memory.
The Spark programming model uses resilient distributed datasets, or RDDs, which are basically read-only distributed collections of objects that are cached in memory across the cluster nodes. They're manipulated through various operators and are automatically rebuilt after the failure of a node. RDDs can be reused across operations and provide a way to manipulate and persist intermediate datasets. Spark also uses a special type of RDD called a DataFrame, which stores the data in a tabular format with rows and columns. The use of RDDs dramatically lowers processing latency, making Spark many times faster than MapReduce, especially for machine learning and interactive analytics.