1 vote · 0 answers · 30 views
I'm trying to authenticate with an already signed-in service account (SA) in a Dataproc cluster. I'm configuring a DuckDB connection with the BigQuery extension and I can't seem to reuse the ...
Aleksander Lipka
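For the DuckDB question above, a minimal sketch of what reusing the VM's signed-in service account might look like, assuming the community bigquery extension and that it honours Application Default Credentials; the extension name, ATTACH string, and project/table names are assumptions, not details from the question:

    import duckdb

    con = duckdb.connect()

    # Assumption: the community "bigquery" extension is used and picks up
    # Application Default Credentials, which on a Dataproc/GCE VM resolve to the
    # VM's attached service account, so no key file is passed explicitly.
    con.execute("INSTALL bigquery FROM community;")
    con.execute("LOAD bigquery;")

    # Placeholder project, dataset, and table names.
    con.execute("ATTACH 'project=my-gcp-project' AS bq (TYPE bigquery, READ_ONLY);")
    print(con.sql("SELECT * FROM bq.my_dataset.my_table LIMIT 5").fetchall())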
1 vote · 1 answer · 51 views
I am using the code below to create a Dataproc Spark session to run a job: from google.cloud.dataproc_spark_connect import DataprocSparkSession; from google.cloud.dataproc_v1 import Session; session = Session(...)
Siddiq Syed
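A heavily hedged sketch of how the excerpt's imports are typically wired together; the builder method name (dataprocSessionConfig) and the runtime version are assumptions about the google-cloud-dataproc-spark-connect API, so check the package README before relying on them:

    from google.cloud.dataproc_spark_connect import DataprocSparkSession
    from google.cloud.dataproc_v1 import Session

    # Session is the dataproc_v1 proto from the excerpt; setting a serverless
    # runtime version here is an assumption ("2.2" is a placeholder).
    session = Session()
    session.runtime_config.version = "2.2"

    # Assumption: the builder accepts the Session proto via dataprocSessionConfig();
    # the real method name may differ between package versions.
    spark = (
        DataprocSparkSession.builder
        .dataprocSessionConfig(session)
        .getOrCreate()
    )

    spark.range(10).show()  # quick smoke test against the remote session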
0 votes · 0 answers · 56 views
I am using Spark 3.5.x and would like to use the readStream() API for Structured Streaming in Java. I don't see any Pub/Sub connector available. I couldn't try Pub/Sub Lite because it is deprecated ...
Sunil
0 votes · 1 answer · 65 views
New to GCP: I am trying to submit a job in Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
SofiaNiki
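For the ModuleNotFoundError question above, the usual way to make a zipped project importable by a Dataproc PySpark job is to ship it with --py-files; a minimal sketch, where the cluster name, region, and GCS paths are placeholders, and the zip is assumed to contain the package (e.g. pythonproject/__init__.py) at its top level:

    # Placeholders: cluster name, region, and bucket paths.
    # The zip must contain the package at its top level so that
    # "import pythonproject" resolves on the executors.
    gcloud dataproc jobs submit pyspark gs://my-bucket/main.py \
        --cluster=my-cluster \
        --region=us-central1 \
        --py-files=gs://my-bucket/pythonproject.zip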
3 votes · 1 answer · 124 views
We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error: Caused by: com.google.cloud.spark.bigquery.repackaged....
Anshul Dubey
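Errors surfacing from com.google.cloud.spark.bigquery.repackaged classes after a Spark upgrade are often a connector build that does not match the new Spark/Scala version; a hedged sketch of pinning a Scala 2.12 build explicitly at submit time, where the cluster, region, job path, and connector version are placeholders rather than a confirmed fix:

    # Placeholders: cluster, region, job path, and connector version.
    # gs://spark-lib/bigquery/ hosts the published spark-bigquery-connector builds.
    gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
        --cluster=my-cluster \
        --region=us-central1 \
        --jars=gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.0.jar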
1 vote · 0 answers · 48 views
Despite the Default Compute Engine Service Account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...
Lê Văn Đức
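For the service-account validation error above, a hedged sketch of the two pieces that usually matter: granting the Dataproc Worker role to the default Compute Engine service account and then passing it explicitly at cluster creation; the project ID, cluster name, and region are placeholders:

    # Placeholders: project ID, cluster name, and region.
    PROJECT=my-project
    SA=$(gcloud iam service-accounts list \
          --project="$PROJECT" \
          --filter="displayName:Compute Engine default service account" \
          --format="value(email)")

    # "Failed to validate" errors often mean the VM service account lacks the
    # Dataproc Worker role on the project.
    gcloud projects add-iam-policy-binding "$PROJECT" \
        --member="serviceAccount:${SA}" \
        --role="roles/dataproc.worker"

    gcloud dataproc clusters create my-cluster \
        --project="$PROJECT" \
        --region=us-central1 \
        --service-account="${SA}"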
2 votes · 1 answer · 162 views
I have a PySpark application that uses GraphFrames to compute connected components on a DataFrame. The edges DataFrame I generate has 2.7M records. When I run the code it is slow, but slowly ...
Jesus Diaz Rivero
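For the GraphFrames question above, one detail worth checking is that connectedComponents() requires a Spark checkpoint directory, which on a cluster should live on durable storage; a minimal sketch, assuming an existing SparkSession (spark), an edges DataFrame with src/dst columns, and placeholder GCS paths:

    from graphframes import GraphFrame

    # Assumptions: "spark" is an existing SparkSession, "edges_df" has "src" and
    # "dst" columns, and the checkpoint bucket is a placeholder.
    spark.sparkContext.setCheckpointDir("gs://my-bucket/checkpoints")

    # Build the vertex DataFrame from the distinct endpoints of the edges.
    vertices = (
        edges_df.select("src")
        .union(edges_df.select("dst"))
        .distinct()
        .withColumnRenamed("src", "id")
    )

    g = GraphFrame(vertices, edges_df)
    components = g.connectedComponents()  # adds a "component" column per vertex
    components.groupBy("component").count().show()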
1 vote · 0 answers · 74 views
I have a PySpark job reading only ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
user16798185
1 vote · 2 answers · 83 views
I am trying to run my Python code as a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal: gcloud dataproc jobs submit hadoop \ --cluster=my-...
The Beast
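For the Hadoop streaming question above, a hedged sketch of a complete submit command; the cluster, region, and GCS paths are placeholders, and the streaming jar location is the usual path on Dataproc images:

    # Placeholders: cluster, region, and bucket paths.
    # -files is a generic option and must come before the streaming options;
    # invoking the scripts through python3 avoids relying on executable bits/shebangs.
    gcloud dataproc jobs submit hadoop \
        --cluster=my-cluster \
        --region=us-central1 \
        --jar=file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -- \
        -files gs://my-bucket/mapper.py,gs://my-bucket/reducer.py \
        -mapper "python3 mapper.py" \
        -reducer "python3 reducer.py" \
        -input gs://my-bucket/input/ \
        -output gs://my-bucket/output/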
2 votes · 0 answers · 54 views
I am trying to migrate a job running on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11). However, I encounter an error on one of my queries. After further ...
AlexisBRENON
1 vote · 0 answers · 41 views
I have a Google Dataproc job that reads a CSV file from Google Cloud Storage with the following headers: Content-Type: application/octet-stream, Content-Encoding: gzip, FileName: gs://...
Bob
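For the gzip CSV question above, one relevant behaviour is that Spark picks a decompression codec from the file extension, not from the object's Content-Encoding metadata; a minimal sketch, with placeholder paths, assuming an existing SparkSession:

    # Assumptions: "spark" is an existing SparkSession and the path is a placeholder.
    # An object named *.csv.gz is decompressed transparently by extension; an object
    # served with "Content-Encoding: gzip" but no .gz suffix generally is not.
    df = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("gs://my-bucket/data/file.csv.gz")
    )
    df.show(5)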
2 votes · 1 answer · 127 views
According to the documentation for Spark 3.5.1 with the latest spark-bigquery-connector, Spark Decimal(38,0) should be written as NUMERIC in BigQuery. https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...
Abhilash
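For the Decimal(38,0) question above, a minimal sketch of writing such a column with the connector so the resulting BigQuery type can be inspected; the dataset/table name, write method, and session are placeholders, and the sketch does not assert which BigQuery type the connector will actually choose:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DecimalType

    # Assumptions: "spark" is an existing SparkSession; dataset/table names are placeholders.
    df = spark.range(3).withColumn("amount", F.col("id").cast(DecimalType(38, 0)))
    df.printSchema()  # confirm the column really is decimal(38,0) before writing

    (
        df.write.format("bigquery")
        .option("writeMethod", "direct")  # the direct path skips the temporary GCS bucket
        .mode("append")
        .save("my_dataset.my_table")
    )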
1 vote · 2 answers · 147 views
I'm using GCP Workflows to define the steps for a data engineering project. The workflow input consists of multiple parameters which are provided through the Workflows API. I defined a GCP ...
54m
1 vote · 0 answers · 108 views
I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....
Abhijit Aravind
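For the Airflow question above, a minimal sketch of DataprocSubmitJobOperator inside a DAG, following the operator documented at the linked page; the project, region, cluster, and the job's GCS path are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    # Placeholders: project, cluster, and GCS path. The job dict follows the
    # Dataproc Job shape (here a PySpark job).
    PYSPARK_JOB = {
        "reference": {"project_id": "my-project"},
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl.py"},
    }

    with DAG(
        dag_id="dataproc_submit_example",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        submit_job = DataprocSubmitJobOperator(
            task_id="submit_pyspark_job",
            project_id="my-project",
            region="us-central1",
            job=PYSPARK_JOB,
        )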
3 votes · 2 answers · 407 views
We are running Spark ingestion jobs that process multiple files in batches. We read CSV or TSV files in batches, create a DataFrame, and do some transformations before loading it into BigQuery ...
Vikrant Singh Rana
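For the batch ingestion question above, a minimal sketch of the read-transform-load pattern it describes; the paths, separator, stand-in transformation, and table names are all placeholders, assuming an existing SparkSession with the spark-bigquery-connector available:

    # Assumptions: "spark" is an existing SparkSession with the spark-bigquery-connector
    # on the classpath; paths, bucket, and table names are placeholders.
    df = (
        spark.read
        .option("header", True)
        .option("sep", "\t")  # use "," for the CSV batches
        .csv("gs://my-bucket/incoming/batch_001/*.tsv")
    )

    transformed = df.dropDuplicates()  # stand-in for the real transformations

    (
        transformed.write.format("bigquery")
        .option("temporaryGcsBucket", "my-temp-bucket")  # required for the indirect write path
        .mode("append")
        .save("my_dataset.ingested_table")
    )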
