Dataproc Serverless lets you run Spark workloads without requiring you to provision and manage your own Dataproc cluster. There are two ways to run Dataproc Serverless workloads:
Dataproc Serverless batch workloads
Submit a batch workload to the Dataproc Serverless service using the Google Cloud console, Google Cloud CLI, or Dataproc API. The service runs the workload on managed compute infrastructure and autoscales resources as needed. Dataproc Serverless charges apply only while the workload is executing.
To get started, see Run an Apache Spark batch workload.
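As a rough illustration of the Google Cloud CLI path, the following Python sketch assembles a `gcloud dataproc batches submit pyspark` command line without running it. The helper name `build_pyspark_batch_command`, the bucket path, and the region are hypothetical placeholders, not values from this page; substitute your own.

```python
import shlex


def build_pyspark_batch_command(main_python_file, region):
    """Return the argv list for submitting a PySpark batch workload.

    Both arguments are supplied by the caller; nothing is executed here.
    """
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        main_python_file,          # Cloud Storage URI of the PySpark script
        f"--region={region}",      # region where the batch workload runs
    ]


# Hypothetical values, for illustration only.
command = build_pyspark_batch_command("gs://my-bucket/job.py", "us-central1")

# Print a shell-safe string you could paste into a terminal.
print(shlex.join(command))
```

Building the argument list separately makes it easy to review the command before handing it to `subprocess.run` or a terminal.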
Dataproc Serverless interactive sessions
Write and run code in Jupyter notebooks during a Dataproc Serverless for Spark interactive session. You can create a notebook session in the following ways:
- Run PySpark code in BigQuery Studio notebooks. Use the BigQuery Python notebook to create a Spark-Connect-based Dataproc Serverless interactive session. Each BigQuery notebook can have only one active Dataproc Serverless session associated with it.
- Use the Dataproc JupyterLab plugin to create multiple Jupyter notebook sessions from templates that you create and manage. When you install the plugin on a local machine or Compute Engine VM, cards that correspond to different Spark kernel configurations appear on the JupyterLab launcher page. Click a card to create a Dataproc Serverless notebook session, then start writing and testing your code in the notebook.
The Dataproc JupyterLab plugin also lets you use the JupyterLab launcher page to take the following actions:
- Create Dataproc on Compute Engine clusters.
- Submit jobs to Dataproc on Compute Engine clusters.
- View Google Cloud and Spark logs.
Dataproc Serverless compared to Dataproc on Compute Engine
If you want to provision and manage infrastructure, and then execute workloads on Spark and other open source processing frameworks, use Dataproc on Compute Engine. The following table lists key differences between Dataproc Serverless and Dataproc on Compute Engine.
Capability | Dataproc Serverless | Dataproc on Compute Engine |
---|---|---|
Processing frameworks | Batch workloads: Spark 3.5 and earlier versions. Interactive sessions: Spark 3.5 and earlier versions | Spark 3.5 and earlier versions. Other open source frameworks, such as Hive, Flink, Trino, and Kafka |
Serverless | Yes | No |
Startup time | 60s | 90s |
Infrastructure control | No | Yes |
Resource management | Spark based | YARN based |
GPU support | Yes | Yes |
Interactive sessions | Yes | No |
Custom containers | Yes | No |
VM access (for example, SSH) | No | Yes |
Java versions | Java 17, 11 | Previous versions supported |
OS Login support * | No | Yes |
Notes:
- An OS Login policy is not applicable to or supported by Dataproc Serverless. If your organization enforces an OS Login policy, its Dataproc Serverless workloads will fail.
Dataproc Serverless security compliance
Dataproc Serverless adheres to all data residency, CMEK, VPC-SC, and other security requirements that Dataproc complies with.
Dataproc Serverless batch workload capabilities
You can run the following Dataproc Serverless batch workload types:
- PySpark
- Spark SQL
- Spark R
- Spark (Java or Scala)
You can specify Spark properties when you submit a Dataproc Serverless batch workload.
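For example, when submitting through the Google Cloud CLI, Spark properties can be passed as a comma-separated list to the `--properties` flag. The sketch below formats a dictionary of properties into that flag. The helper name `format_spark_properties` and the property values are illustrative assumptions, and the helper rejects values containing commas rather than handling gcloud's alternate-delimiter escaping.

```python
def format_spark_properties(properties):
    """Format Spark properties as a single gcloud --properties flag.

    Assumes simple keys and values: commas would clash with the
    list delimiter, so they are rejected instead of being escaped.
    """
    pairs = []
    for key, value in properties.items():
        value = str(value)
        if "," in key or "=" in key or "," in value:
            raise ValueError(f"unsupported character in property: {key!r}")
        pairs.append(f"{key}={value}")
    return "--properties=" + ",".join(pairs)


# Hypothetical sizing values, for illustration only.
flag = format_spark_properties({
    "spark.executor.cores": 4,
    "spark.executor.instances": 2,
})
print(flag)
```

The resulting string can be appended to a `gcloud dataproc batches submit` command alongside the other flags.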