41,013 questions
1 vote · 0 answers · 49 views
Pytest spark fixture failing on startup
I have been trying hard to test my PySpark transformation on my local Windows machine.
Here is what I have done so far.
I installed the latest version of Spark, downloaded hadoop.dll and winutils, ...
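For reference, a minimal pytest fixture for a local SparkSession, as a sketch assuming pytest and pyspark are installed and HADOOP_HOME/winutils are already configured (fixture and app names are illustrative):

# conftest.py - minimal local SparkSession fixture (illustrative)
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # local[2] keeps the test JVM small; adjust parallelism as needed
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pytest-spark")
        .getOrCreate()
    )
    yield session
    session.stop()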
2 votes · 0 answers · 59 views
'JavaPackage' object is not callable error when trying to getOrCreate() local spark session
I have set up a small Xubuntu machine with the intention of making it my single-node playground Spark cluster. The cluster seems to be set up correctly: I can access the WebUI at port 8080, and it shows a ...
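For comparison, a bare local getOrCreate() as a sketch; if even this raises 'JavaPackage' object is not callable, a version mismatch between the pip-installed pyspark package and the Spark distribution pointed to by SPARK_HOME is a common culprit:

from pyspark.sql import SparkSession

# Minimal smoke test: a purely local session, independent of the standalone master.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("javapackage-smoke-test")
    .getOrCreate()
)
print(spark.version)
spark.stop()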
1 vote · 1 answer · 56 views
AWS Glue PySpark job taking 4 hours to process small JSON files from S3
I have an AWS Glue job that processes thousands of small JSON files from S3 (historical data load for Adobe Experience Platform). The job is taking approximately 4 hours to complete, which is ...
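A sketch of one common mitigation, Glue's file grouping, which batches many small S3 objects into larger input splits; the bucket path and group size below are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Group many small JSON files into larger input splits to cut per-file overhead.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/adobe-historical/"],  # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # placeholder group size (~128 MB)
    },
    format="json",
)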
0 votes · 1 answer · 101 views
Optimize code to flatten Meta ads metrics data in Spark
I have two Spark scripts. The first is a bronze script that needs to read data from Kafka topics; each topic carries ads platform data (tiktok_insights, meta_insights, google_insights). The structure is the same:
(id, ...
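A minimal sketch of the bronze-side Kafka read with Structured Streaming; the bootstrap server is a placeholder and the topic names are taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("bronze-ads").getOrCreate()

# One stream subscribed to all three ads topics; value holds the raw JSON payload.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "tiktok_insights,meta_insights,google_insights")
    .load()
    .select(col("topic"), col("value").cast("string").alias("json_payload"))
)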
0 votes · 0 answers · 70 views
Spark (Databricks) fails to read SPSS .sav files extracted from ZIP
I’m reading various file types in Databricks using Spark — including PDF, DOCX, PPTX, XLSX, and CSV.
Some inputs are ZIP archives that contain multiple files, including SPSS .sav files.
My workflow is:...
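Spark has no built-in SPSS reader, so one workaround sketch is to extract the ZIP with Python, read the .sav via pyreadstat, and convert to a Spark DataFrame; all paths below are placeholders and pyreadstat is an extra dependency:

import zipfile

import pyreadstat  # assumed installed separately (pip install pyreadstat)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark cannot read inside a ZIP archive directly, so extract it first.
with zipfile.ZipFile("/dbfs/tmp/archive.zip") as zf:  # placeholder path
    zf.extractall("/dbfs/tmp/extracted")

# read_sav returns a pandas DataFrame plus metadata; convert the frame for Spark use.
pdf, meta = pyreadstat.read_sav("/dbfs/tmp/extracted/survey.sav")  # placeholder file
sdf = spark.createDataFrame(pdf)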
1 vote · 0 answers · 120 views
Warning and performance issues when scanning delta tables
Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning a Delta table with pl.scan_delta(temp_path) that ...
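For context, a minimal scan of the kind described, assuming temp_path already points at a Delta table; the WARN lines come from the delta_kernel engine underneath Polars:

import polars as pl

temp_path = "/tmp/delta_table"  # placeholder; the question uses its own temp_path

# Lazily scan the Delta table and materialize it once; selecting only the
# needed columns before collect() tends to reduce repeated reads.
lazy = pl.scan_delta(temp_path)
result = lazy.collect()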
-1 votes · 0 answers · 54 views
PySpark - Azure Databricks: string values written with saveAsTable differ between the DataFrame and the table?
My team is new to PySpark, and we are specifically using Azure Databricks. We have a piece of code where we are essentially
Displaying a dataframe
Saving it to a table
Displaying the output of the ...
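A sketch of the display/save/re-read pattern being described, with a placeholder table name, which makes it straightforward to diff the string column before and after the write:

# df is the DataFrame being inspected; the table name is a placeholder.
display(df)  # Databricks notebook display()
df.write.mode("overwrite").saveAsTable("my_schema.my_table")
display(spark.table("my_schema.my_table"))  # compare the string values here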
2 votes · 1 answer · 90 views
Using a loop or automated approach to join all possible elements in one dataframe together based on defined criteria
Summary: I want to be able to recreate my SQL code via Python so that I don't have to manually type out each join when the combinations get too large to handle
I have one table
import ...
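One common way to avoid hand-writing every join is to fold a list of DataFrames with functools.reduce; the DataFrame names and key column below are illustrative:

from functools import reduce

def join_all(dataframes, key="id", how="inner"):
    # Chain df1.join(df2).join(df3)... across the whole list.
    return reduce(lambda left, right: left.join(right, on=key, how=how), dataframes)

joined = join_all([df_a, df_b, df_c])  # illustrative DataFrames sharing the "id" column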
-1 votes · 2 answers · 65 views
How to join two RDDs in PySpark with nested tuples
I need to join two RDDs as part of my programming assignment. The problem is that the first RDD is nested, while the other is flat. I tried different things, but nothing seemed to work. Is there any ...
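The usual approach is to reshape both sides into plain (key, value) pairs before calling join; the tuple shapes below are assumptions for illustration:

# Assume rdd_nested holds ((key, a), b) and rdd_flat holds (key, c) - shapes are illustrative.
keyed_nested = rdd_nested.map(lambda rec: (rec[0][0], (rec[0][1], rec[1])))

# join() needs both RDDs as flat (key, value) pairs.
joined = keyed_nested.join(rdd_flat)  # -> (key, ((a, b), c))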
0 votes · 1 answer · 76 views
PySpark - Flatten nested structure
I have MongoDB collections forms and submissions where forms define dynamic UI components (textfield, checkbox, radio, selectboxes, columns, tables, datagrids) and submissions contain the user data in ...
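A small sketch of the standard flattening tools, explode for arrays and dotted paths for structs; the column and field names are placeholders, not the actual forms/submissions schema:

from pyspark.sql.functions import col, explode

# submissions_df is assumed to have a struct column "data" with an array field "components".
flat = (
    submissions_df
    .withColumn("component", explode(col("data.components")))  # one row per array element
    .select(
        col("_id"),
        col("component.key").alias("component_key"),
        col("component.value").alias("component_value"),
    )
)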
2 votes · 0 answers · 64 views
How to log a model in MLflow using Spark Connect
I have the following setup:
Kubernetes cluster with Spark Connect 4.0.1 and
MLflow tracking server 3.5.0
The MLflow tracking server should serve all artifacts and is configured this way:
--backend-store-...
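A minimal client-side sketch against a remote tracking server, with placeholder URIs; whether the model-logging call itself behaves under Spark Connect is exactly what the question asks, so only the setup is shown:

import mlflow

mlflow.set_tracking_uri("http://mlflow-tracking:5000")  # placeholder URI
mlflow.set_experiment("spark-connect-demo")              # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("spark_connect", True)
    # mlflow.spark.log_model(model, "model")  # the step under discussion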
0 votes · 1 answer · 74 views
Handle corrupted files in Spark load()
I have a Spark job that runs daily to load data from S3.
These data are composed of thousands of gzip files. However, in some cases, there are one or two corrupted files in S3, and that causes the whole ...
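One commonly used mitigation is Spark's ignoreCorruptFiles setting, sketched here with a placeholder S3 path; note that it skips whole corrupt files rather than repairing bad records:

# Skip files that fail to read instead of failing the entire job.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = spark.read.format("json").load("s3://my-bucket/daily-load/")  # placeholder path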
0 votes · 0 answers · 35 views
Why do I get a "list index out of range" error when writing a SharePoint list to Azure Delta Lake using PySpark on Azure Databricks?
Writing a SharePoint list to Delta file format, I get this error: list index out of range. I have included all the required columns to be fetched from SharePoint and checked the datatype when writing ...
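For reference, the Delta write itself is usually the simple part; a placeholder sketch of writing an already-built DataFrame to Delta (the SharePoint fetch that raises the error is not shown):

# sharepoint_df is assumed to be the DataFrame built from the fetched SharePoint rows.
(
    sharepoint_df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://container@account.dfs.core.windows.net/lists/my_list")  # placeholder path
)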
-1 votes · 2 answers · 63 views
Connectivity issues in standalone Spark 4.0
On an Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
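A minimal sketch of pointing a notebook session at a standalone master on the same VM; spark://localhost:7077 is the standalone default and may need adjusting, and the 8080 WebUI is only the monitoring port:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")  # standalone master URL, not the 8080 WebUI port
    .appName("standalone-connectivity-test")
    .getOrCreate()
)
print(spark.range(5).count())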
1 vote · 1 answer · 144 views
PicklingError: Could not serialize object: RecursionError in pyspark code in Jupyter Notebook
I am very new to Spark (I have just started learning it), and I have encountered a recursion error in a very simple piece of code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...
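A frequent trigger for this particular PicklingError/RecursionError is a UDF whose closure captures the SparkSession or SparkContext; as a sketch, a minimal UDF that keeps its closure clean, for comparison:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The function body references only its argument, never the SparkSession,
# so it can be pickled and shipped to executors without recursion issues.
@udf(returnType=IntegerType())
def double(x):
    return x * 2

spark.range(3).select(double(col("id")).alias("doubled")).show()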