1,674 questions
1 vote · 0 answers · 30 views
How to authenticate with a service account in a Dataproc cluster for a DuckDB connection to BigQuery
I'm trying to authenticate with the service account (SA) that is already signed in on a Dataproc cluster.
I'm configuring a DuckDB connection with the BigQuery extension and I can't seem to reuse the ...
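For context, the community duckdb-bigquery extension (which this question appears to use) authenticates through Application Default Credentials, which on a Dataproc VM normally resolve to the cluster's service account via the metadata server. A minimal sketch, assuming that extension; the project, dataset, and table names are placeholders:

import duckdb

# Assumes the community "bigquery" DuckDB extension; "my-project",
# "my_dataset", and "my_table" are placeholders.
con = duckdb.connect()
con.execute("INSTALL bigquery FROM community;")
con.execute("LOAD bigquery;")
# The extension picks up Application Default Credentials; on a Dataproc VM
# these resolve to the cluster's service account, so no key file is needed.
con.execute("ATTACH 'project=my-project' AS bq (TYPE bigquery);")
rows = con.execute("SELECT * FROM bq.my_dataset.my_table LIMIT 10").fetchall()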
1 vote · 1 answer · 51 views
DataprocSparkSession package in Python: "RuntimeError: Error while creating Dataproc Session"
I am using the code below to create a Dataproc Spark session to run a job:
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session = Session(...
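For reference, a minimal sketch of session creation with this package, based on its README as I understand it; the subnet value is a placeholder, and project/region are typically picked up from the environment (e.g. GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION):

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

# "my-subnet" is a placeholder; a missing or inaccessible subnetwork is a
# common cause of "Error while creating Dataproc Session".
session = Session()
session.environment_config.execution_config.subnetwork_uri = "my-subnet"
spark = DataprocSparkSession.builder.dataprocSessionConfig(session).getOrCreate()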
0 votes · 0 answers · 56 views
What connector can be used for Google Cloud Pub/Sub with Cloud Dataproc (Spark 3.5.x)?
I am using Spark 3.5.x and would like to use the readStream() API for Structured Streaming in Java.
I don't see any Pub/Sub connector available. I couldn't try Pub/Sub Lite because it is deprecated ...
0 votes · 1 answer · 65 views
ModuleNotFoundError in GCP after trying to submit a job
New to GCP: I am trying to submit a job in Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
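The usual cause is that the zip never reaches the executors' PYTHONPATH; with gcloud the archive is attached via --py-files. A hedged sketch (bucket and cluster names are placeholders), assuming the zip contains a top-level package with an __init__.py:

gcloud dataproc jobs submit pyspark gs://my-bucket/main.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --py-files=gs://my-bucket/pythonproject.zip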
3 votes · 1 answer · 124 views
Facing BigQuery write failure after upgrading Spark and Dataproc: "Schema mismatch: corresponding field path to Parquet column has 0 repeated fields"
We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error:
Caused by: com.google.cloud.spark.bigquery.repackaged....
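Two workarounds often suggested for this class of Parquet/BigQuery schema mismatch are to bypass the Parquet staging step entirely (direct write) or to stage in a different intermediate format; whether either applies depends on the actual schemas involved. A sketch, with placeholder table and bucket names:

# Option 1: direct write via the Storage Write API, skipping Parquet staging.
df.write.format("bigquery") \
    .option("writeMethod", "direct") \
    .save("my_dataset.my_table")

# Option 2: keep the indirect write but stage as ORC instead of Parquet.
df.write.format("bigquery") \
    .option("intermediateFormat", "orc") \
    .option("temporaryGcsBucket", "my-staging-bucket") \
    .save("my_dataset.my_table")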
1 vote · 0 answers · 48 views
Google Cloud Dataproc Cluster Creation Fails: "Failed to validate permissions for default service account"
Despite the Default Compute Engine Service Account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...
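Dataproc's validation requires the VM service account to hold roles/dataproc.worker on the project, which is distinct from holding Editor-style roles. A sketch of granting it, with placeholder project and project-number values:

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"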
2 votes · 1 answer · 162 views
Spark memory error in thread spark-listener-group-eventLog
I have a pyspark application which is using Graphframes to compute connected components on a DataFrame.
The edges DataFrame I generate has 2.7M records.
When I run the code it is slow, but slowly ...
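Two things commonly matter here: GraphFrames' connectedComponents() requires a checkpoint directory before it will run, and an OOM in the event-log listener thread is driver-side, so spark.driver.memory is the first knob to try. A minimal setup sketch, with placeholder DataFrame names and bucket path:

from graphframes import GraphFrame

# connectedComponents() requires a checkpoint directory to be set first.
spark.sparkContext.setCheckpointDir("gs://my-bucket/checkpoints")
# vertices_df needs an "id" column; edges_df needs "src" and "dst" columns.
g = GraphFrame(vertices_df, edges_df)
components = g.connectedComponents()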
1 vote · 0 answers · 74 views
Out of memory for a smaller dataset
I have a PySpark job reading an input volume of just ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
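Worth noting that 50-55 GB is the compressed on-disk size; Parquet typically inflates several-fold in memory, so partitioning often matters more than raw volume. A generic tuning sketch with illustrative values only, not a diagnosis:

# Illustrative values; tune to your data and worker shape.
spark.conf.set("spark.sql.shuffle.partitions", "400")
df = spark.read.format("delta").load("gs://my-bucket/my_table")
df = df.repartition(400)  # spread rows before wide transformations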
1 vote · 2 answers · 83 views
How do you run Python Hadoop Jobs on Dataproc?
I am trying to run my Python code for a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal:
gcloud dataproc jobs submit hadoop \
--cluster=my-...
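Mapper/reducer scripts have to run through the Hadoop streaming jar rather than as a plain Hadoop job. A hedged sketch of the full command (paths and cluster name are placeholders, and the streaming-jar location can vary by image version):

gcloud dataproc jobs submit hadoop \
    --cluster=my-cluster \
    --region=us-central1 \
    --jar=file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -- -files gs://my-bucket/mapper.py,gs://my-bucket/reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input gs://my-bucket/input -output gs://my-bucket/output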
2 votes · 0 answers · 54 views
Why does Spark raise an IOException while running an aggregation on a streaming DataFrame in Dataproc 2.2?
I am trying to migrate a job that runs on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11).
However, I encounter an error on one of my queries. After further ...
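When moving a streaming job from Spark 3.3 to Spark 3.5, one common first test is rerunning the query against a fresh checkpoint location, since state-store formats can differ across versions. A minimal repro sketch of such an aggregation; the schema and paths are placeholders:

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema and paths, for isolating the failing query.
schema = StructType([StructField("key", StringType())])
stream = spark.readStream.schema(schema).parquet("gs://my-bucket/input")
query = (
    stream.groupBy("key").count()
    .writeStream
    .outputMode("complete")
    .option("checkpointLocation", "gs://my-bucket/fresh-checkpoints")
    .format("console")
    .start()
)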
1 vote · 0 answers · 41 views
Partial records being read in PySpark through Dataproc
I have a Google Dataproc job that reads a CSV file from Google Cloud Storage with the following headers:
Content-type : application/octet-stream
Content-encoding : gzip
FileName: gs://...
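Objects stored with Content-Encoding: gzip are served with decompressive transcoding, so the size a reader expects and the bytes it receives can disagree, a classic cause of truncated or partial reads. I believe the Hadoop GCS connector has a switch for this case (worth verifying against the connector docs for your version); a sketch with a placeholder path:

from pyspark.sql import SparkSession

# Assumption: the gzip Content-Encoding on the object is the culprit.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.gs.inputstream.support.gzip.encoding.enable", "true")
    .getOrCreate()
)
df = spark.read.option("header", True).csv("gs://my-bucket/my_file.csv")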
2 votes · 1 answer · 127 views
Default behavior of Spark 3.5.1 when writing NUMERIC/BIGNUMERIC to BigQuery
As per the documentation, with Spark 3.5.1 and the latest spark-bigquery-connector, a Spark Decimal(38,0) should be written as NUMERIC in BigQuery.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...
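BigQuery's NUMERIC type has precision 38 and scale 9, so it can hold at most 29 integer digits; a Decimal(38,0) therefore cannot fit and plausibly gets promoted to BIGNUMERIC. If NUMERIC is required, one option is casting down, assuming the values fit in 29 digits (the column and table names are placeholders):

from pyspark.sql import functions as F

# Decimal(29,0) fits inside NUMERIC's 29 integer digits (precision 38, scale 9).
df = df.withColumn("amount", F.col("amount").cast("decimal(29,0)"))
df.write.format("bigquery").option("writeMethod", "direct").save("my_dataset.my_table")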
1 vote · 2 answers · 147 views
How to pass arguments from GCP Workflows into Dataproc
I'm using GCP Workflows to define the steps of a data engineering project. The input of the workflow consists of multiple parameters which are provided through the Workflows API.
I defined a GCP ...
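However the job is submitted (a Workflows connector call or an HTTP step), values placed in the Dataproc job's args field reach the PySpark script as ordinary argv entries. A sketch of the receiving side, with illustrative argument names:

import sys

# Values from the job's "args" list arrive as plain command-line arguments;
# the names and order here are illustrative.
input_path = sys.argv[1]
run_date = sys.argv[2]
print(f"processing {input_path} for {run_date}")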
1 vote · 0 answers · 108 views
error "google.api_core.exceptions.InvalidArgument: 400 Cluster name is required" while trying to use airflow DataprocSubmitJobOperator
I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....
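That particular 400 usually means the job payload's placement block lacks a cluster_name. A hedged sketch of the operator with the placement filled in; all names are placeholders:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    # Omitting placement.cluster_name yields "400 Cluster name is required".
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/main.py"},
}

submit_job = DataprocSubmitJobOperator(
    task_id="submit_job",
    project_id="my-project",
    region="us-central1",
    job=PYSPARK_JOB,
)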
3 votes · 2 answers · 407 views
GoogleHadoopOutputStream: "hflush(): No-op due to rate limit": increase in Class A operations for GCS bucket
We are running Spark ingestion jobs which process multiple files in batches.
We read CSV or TSV files in batches, create a DataFrame, and do some transformations before loading it into BigQuery ...
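That log line comes from the GCS connector's hflush() rate limiter, which exists precisely to cap Class A operations. I believe the interval is controlled by fs.gs.outputstream.sync.min.interval.ms (worth confirming for your connector version); a sketch with an illustrative value:

from pyspark.sql import SparkSession

# Raising the minimum interval between hflush() syncs reduces Class A GCS
# operations; 300000 ms (5 minutes) is illustrative only.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.gs.outputstream.sync.min.interval.ms", "300000")
    .getOrCreate()
)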