1,674 questions
1 vote · 0 answers · 30 views
How to authenticate with a service account in a Dataproc cluster for a DuckDB connection to BigQuery
I'm trying to authenticate with the service account (SA) that is already signed in on a Dataproc cluster.
I'm configuring a DuckDB connection with the BigQuery extension and I can't seem to reuse the ...
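For context, the community duckdb-bigquery extension (which this question appears to use) authenticates through Application Default Credentials, which on a Dataproc VM normally resolve to the cluster's service account via the metadata server. A minimal sketch, assuming that extension; the project, dataset, and table names are placeholders:

import duckdb

# Assumes the community "bigquery" DuckDB extension; "my-project",
# "my_dataset", and "my_table" are placeholders.
con = duckdb.connect()
con.execute("INSTALL bigquery FROM community;")
con.execute("LOAD bigquery;")
# The extension picks up Application Default Credentials; on a Dataproc VM
# these resolve to the cluster's service account, so no key file is needed.
con.execute("ATTACH 'project=my-project' AS bq (TYPE bigquery);")
rows = con.execute("SELECT * FROM bq.my_dataset.my_table LIMIT 10").fetchall()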
1 vote · 1 answer · 51 views
DataprocSparkSession package in Python: "RuntimeError: Error while creating Dataproc Session"
I am using the code below to create a Dataproc Spark session to run a job:
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session = Session(...
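For reference, a minimal sketch of session creation with this package, based on its README as I understand it; the subnet value is a placeholder, and project/region are typically picked up from the environment (e.g. GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION):

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

# "my-subnet" is a placeholder; a missing or inaccessible subnetwork is a
# common cause of "Error while creating Dataproc Session".
session = Session()
session.environment_config.execution_config.subnetwork_uri = "my-subnet"
spark = DataprocSparkSession.builder.dataprocSessionConfig(session).getOrCreate()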
0 votes · 0 answers · 56 views
What connector can be used for Google Cloud Pub/Sub with Cloud Dataproc (Spark 3.5.x)?
I am using Spark 3.5.x and would like to use the readStream() API for Structured Streaming in Java.
I don't see any Pub/Sub connector available. I couldn't try Pub/Sub Lite because it is deprecated ...
0 votes · 1 answer · 65 views
ModuleNotFoundError in GCP after trying to submit a job
New to GCP: I am trying to submit a job in Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
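The usual cause is that the zip never reaches the executors' PYTHONPATH; with gcloud the archive is attached via --py-files. A hedged sketch (bucket and cluster names are placeholders), assuming the zip contains a top-level package with an __init__.py:

gcloud dataproc jobs submit pyspark gs://my-bucket/main.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --py-files=gs://my-bucket/pythonproject.zip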
3 votes · 1 answer · 124 views
Facing BigQuery write failure after upgrading Spark and Dataproc: "Schema mismatch: corresponding field path to Parquet column has 0 repeated fields"
We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error:
Caused by: com.google.cloud.spark.bigquery.repackaged....
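Two workarounds often suggested for this class of Parquet/BigQuery schema mismatch are to bypass the Parquet staging step entirely (direct write) or to stage in a different intermediate format; whether either applies depends on the actual schemas involved. A sketch, with placeholder table and bucket names:

# Option 1: direct write via the Storage Write API, skipping Parquet staging.
df.write.format("bigquery") \
    .option("writeMethod", "direct") \
    .save("my_dataset.my_table")

# Option 2: keep the indirect write but stage as ORC instead of Parquet.
df.write.format("bigquery") \
    .option("intermediateFormat", "orc") \
    .option("temporaryGcsBucket", "my-staging-bucket") \
    .save("my_dataset.my_table")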
1 vote · 0 answers · 48 views
Google Cloud Dataproc Cluster Creation Fails: "Failed to validate permissions for default service account"
Despite the Default Compute Engine Service Account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...
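Dataproc's validation requires the VM service account to hold roles/dataproc.worker on the project, which is distinct from holding Editor-style roles. A sketch of granting it, with placeholder project and project-number values:

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"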
2 votes · 1 answer · 162 views
Spark memory error in thread spark-listener-group-eventLog
I have a pyspark application which is using Graphframes to compute connected components on a DataFrame.
The edges DataFrame I generate has 2.7M records.
When I run the code it is slow, but slowly ...
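Two things commonly matter here: GraphFrames' connectedComponents() requires a checkpoint directory before it will run, and an OOM in the event-log listener thread is driver-side, so spark.driver.memory is the first knob to try. A minimal setup sketch, with placeholder DataFrame names and bucket path:

from graphframes import GraphFrame

# connectedComponents() requires a checkpoint directory to be set first.
spark.sparkContext.setCheckpointDir("gs://my-bucket/checkpoints")
# vertices_df needs an "id" column; edges_df needs "src" and "dst" columns.
g = GraphFrame(vertices_df, edges_df)
components = g.connectedComponents()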
1 vote · 0 answers · 74 views
Out of memory for a smaller dataset
I have a PySpark job reading an input volume of just ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
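Worth noting that 50-55 GB is the compressed on-disk size; Parquet typically inflates several-fold in memory, so partitioning often matters more than raw volume. A generic tuning sketch with illustrative values only, not a diagnosis:

# Illustrative values; tune to your data and worker shape.
spark.conf.set("spark.sql.shuffle.partitions", "400")
df = spark.read.format("delta").load("gs://my-bucket/my_table")
df = df.repartition(400)  # spread rows before wide transformations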
1 vote · 2 answers · 83 views
How do you run Python Hadoop Jobs on Dataproc?
I am trying to run my Python code for a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal:
gcloud dataproc jobs submit hadoop \
--cluster=my-...
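Mapper/reducer scripts have to run through the Hadoop streaming jar rather than as a plain Hadoop job. A hedged sketch of the full command (paths and cluster name are placeholders, and the streaming-jar location can vary by image version):

gcloud dataproc jobs submit hadoop \
    --cluster=my-cluster \
    --region=us-central1 \
    --jar=file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -- -files gs://my-bucket/mapper.py,gs://my-bucket/reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input gs://my-bucket/input -output gs://my-bucket/output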
2 votes · 0 answers · 54 views
Why does Spark raise an IOException while running an aggregation on a streaming DataFrame in Dataproc 2.2?
I am trying to migrate a job that runs on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11).
However, I encounter an error on one of my queries. After further ...
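When moving a streaming job from Spark 3.3 to Spark 3.5, one common first test is rerunning the query against a fresh checkpoint location, since state-store formats can differ across versions. A minimal repro sketch of such an aggregation; the schema and paths are placeholders:

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema and paths, for isolating the failing query.
schema = StructType([StructField("key", StringType())])
stream = spark.readStream.schema(schema).parquet("gs://my-bucket/input")
query = (
    stream.groupBy("key").count()
    .writeStream
    .outputMode("complete")
    .option("checkpointLocation", "gs://my-bucket/fresh-checkpoints")
    .format("console")
    .start()
)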
1 vote · 0 answers · 41 views
Partial records being read in PySpark through Dataproc
I have a Google Dataproc job that reads a CSV file from Google Cloud Storage with the following headers:
Content-type : application/octet-stream
Content-encoding : gzip
FileName: gs://...
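Objects stored with Content-Encoding: gzip are served with decompressive transcoding, so the size a reader expects and the bytes it receives can disagree, a classic cause of truncated or partial reads. I believe the Hadoop GCS connector has a switch for this case (worth verifying against the connector docs for your version); a sketch with a placeholder path:

from pyspark.sql import SparkSession

# Assumption: the gzip Content-Encoding on the object is the culprit.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.gs.inputstream.support.gzip.encoding.enable", "true")
    .getOrCreate()
)
df = spark.read.option("header", True).csv("gs://my-bucket/my_file.csv")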
2 votes · 1 answer · 127 views
Default behavior of Spark 3.5.1 when writing NUMERIC/BIGNUMERIC to BigQuery
As per the documentation, with Spark 3.5.1 and the latest spark-bigquery-connector, a Spark Decimal(38,0) should be written as NUMERIC in BigQuery.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...
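BigQuery's NUMERIC type has precision 38 and scale 9, so it can hold at most 29 integer digits; a Decimal(38,0) therefore cannot fit and plausibly gets promoted to BIGNUMERIC. If NUMERIC is required, one option is casting down, assuming the values fit in 29 digits (the column and table names are placeholders):

from pyspark.sql import functions as F

# Decimal(29,0) fits inside NUMERIC's 29 integer digits (precision 38, scale 9).
df = df.withColumn("amount", F.col("amount").cast("decimal(29,0)"))
df.write.format("bigquery").option("writeMethod", "direct").save("my_dataset.my_table")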
1 vote · 2 answers · 147 views
How to pass arguments from GCP Workflows into Dataproc
I'm using GCP Workflows to define the steps of a data engineering project. The input of the workflow consists of multiple parameters which are provided through the Workflows API.
I defined a GCP ...
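However the job is submitted (a Workflows connector call or an HTTP step), values placed in the Dataproc job's args field reach the PySpark script as ordinary argv entries. A sketch of the receiving side, with illustrative argument names:

import sys

# Values from the job's "args" list arrive as plain command-line arguments;
# the names and order here are illustrative.
input_path = sys.argv[1]
run_date = sys.argv[2]
print(f"processing {input_path} for {run_date}")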
1 vote · 0 answers · 108 views
error "google.api_core.exceptions.InvalidArgument: 400 Cluster name is required" while trying to use airflow DataprocSubmitJobOperator
I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....
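That particular 400 usually means the job payload's placement block lacks a cluster_name. A hedged sketch of the operator with the placement filled in; all names are placeholders:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    # Omitting placement.cluster_name yields "400 Cluster name is required".
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/main.py"},
}

submit_job = DataprocSubmitJobOperator(
    task_id="submit_job",
    project_id="my-project",
    region="us-central1",
    job=PYSPARK_JOB,
)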
3 votes · 2 answers · 407 views
GoogleHadoopOutputStream: "hflush(): No-op due to rate limit": increase in Class A operations for GCS bucket
We are running Spark ingestion jobs which process multiple files in batches.
We read CSV or TSV files in batches, create a DataFrame, and do some transformations before loading it into BigQuery ...
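That log line comes from the GCS connector's hflush() rate limiter, which exists precisely to cap Class A operations. I believe the interval is controlled by fs.gs.outputstream.sync.min.interval.ms (worth confirming for your connector version); a sketch with an illustrative value:

from pyspark.sql import SparkSession

# Raising the minimum interval between hflush() syncs reduces Class A GCS
# operations; 300000 ms (5 minutes) is illustrative only.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.gs.outputstream.sync.min.interval.ms", "300000")
    .getOrCreate()
)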