41,013 questions
1 vote · 0 answers · 49 views
Pytest spark fixture failing on startup
I have been trying hard to test my PySpark transformation on my local Windows machine.
Here is what I have done so far.
I installed the latest version of Spark, downloaded hadoop.dll and winutils, ...
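For reference, a minimal pytest fixture for a local SparkSession, as a sketch assuming pytest and pyspark are installed and HADOOP_HOME/winutils are already configured (fixture and app names are illustrative):

# conftest.py - minimal local SparkSession fixture (illustrative)
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # local[2] keeps the test JVM small; adjust parallelism as needed
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pytest-spark")
        .getOrCreate()
    )
    yield session
    session.stop()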
2 votes · 0 answers · 59 views
'JavaPackage' object is not callable error when trying to getOrCreate() local spark session
I have set up a small Xubuntu machine with the intention of making it my single-node playground Spark cluster. The cluster seems to be set up correctly: I can access the WebUI at port 8080, and it shows a ...
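For comparison, a bare local getOrCreate() as a sketch; if even this raises 'JavaPackage' object is not callable, a version mismatch between the pip-installed pyspark package and the Spark distribution pointed to by SPARK_HOME is a common culprit:

from pyspark.sql import SparkSession

# Minimal smoke test: a purely local session, independent of the standalone master.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("javapackage-smoke-test")
    .getOrCreate()
)
print(spark.version)
spark.stop()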
1 vote · 1 answer · 56 views
AWS Glue PySpark job taking 4 hours to process small JSON files from S3
I have an AWS Glue job that processes thousands of small JSON files from S3 (historical data load for Adobe Experience Platform). The job is taking approximately 4 hours to complete, which is ...
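A sketch of one common mitigation, Glue's file grouping, which batches many small S3 objects into larger input splits; the bucket path and group size below are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Group many small JSON files into larger input splits to cut per-file overhead.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/adobe-historical/"],  # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # placeholder group size (~128 MB)
    },
    format="json",
)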
0 votes · 1 answer · 101 views
Optimize code to flatten Meta ads metrics data in Spark
I have two Spark scripts. The first is a bronze script that needs to read data from Kafka topics; each topic carries ads platform data (tiktok_insights, meta_insights, google_insights). The structure is the same:
(id, ...
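A minimal sketch of the bronze-side Kafka read with Structured Streaming; the bootstrap server is a placeholder and the topic names are taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("bronze-ads").getOrCreate()

# One stream subscribed to all three ads topics; value holds the raw JSON payload.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "tiktok_insights,meta_insights,google_insights")
    .load()
    .select(col("topic"), col("value").cast("string").alias("json_payload"))
)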
0 votes · 0 answers · 70 views
Spark (Databricks) fails to read SPSS .sav files extracted from ZIP
I’m reading various file types in Databricks using Spark — including PDF, DOCX, PPTX, XLSX, and CSV.
Some inputs are ZIP archives that contain multiple files, including SPSS .sav files.
My workflow is:...
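Spark has no built-in SPSS reader, so one workaround sketch is to extract the ZIP with Python, read the .sav via pyreadstat, and convert to a Spark DataFrame; all paths below are placeholders and pyreadstat is an extra dependency:

import zipfile

import pyreadstat  # assumed installed separately (pip install pyreadstat)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark cannot read inside a ZIP archive directly, so extract it first.
with zipfile.ZipFile("/dbfs/tmp/archive.zip") as zf:  # placeholder path
    zf.extractall("/dbfs/tmp/extracted")

# read_sav returns a pandas DataFrame plus metadata; convert the frame for Spark use.
pdf, meta = pyreadstat.read_sav("/dbfs/tmp/extracted/survey.sav")  # placeholder file
sdf = spark.createDataFrame(pdf)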
1 vote · 0 answers · 120 views
Warning and performance issues when scanning delta tables
Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning a Delta table with pl.scan_delta(temp_path) that ...
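For context, a minimal scan of the kind described, assuming temp_path already points at a Delta table; the WARN lines come from the delta_kernel engine underneath Polars:

import polars as pl

temp_path = "/tmp/delta_table"  # placeholder; the question uses its own temp_path

# Lazily scan the Delta table and materialize it once; selecting only the
# needed columns before collect() tends to reduce repeated reads.
lazy = pl.scan_delta(temp_path)
result = lazy.collect()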
-1 votes · 0 answers · 54 views
PySpark - Azure Databricks: string values written with saveAsTable differ between the DataFrame and the table?
My team is new to PySpark, and we are specifically using Azure Databricks. We have a piece of code where we are essentially
Displaying a dataframe
Saving it to a table
Displaying the output of the ...
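A sketch of the display/save/re-read pattern being described, with a placeholder table name, which makes it straightforward to diff the string column before and after the write:

# df is the DataFrame being inspected; the table name is a placeholder.
display(df)  # Databricks notebook display()
df.write.mode("overwrite").saveAsTable("my_schema.my_table")
display(spark.table("my_schema.my_table"))  # compare the string values here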
2 votes · 1 answer · 90 views
Using a loop or automated approach to join all possible elements in one dataframe together based on defined criteria
Summary: I want to be able to recreate my SQL code via Python so that I don't have to manually type out each join when the combinations get too large to handle
I have one table
import ...
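One common way to avoid hand-writing every join is to fold a list of DataFrames with functools.reduce; the DataFrame names and key column below are illustrative:

from functools import reduce

def join_all(dataframes, key="id", how="inner"):
    # Chain df1.join(df2).join(df3)... across the whole list.
    return reduce(lambda left, right: left.join(right, on=key, how=how), dataframes)

joined = join_all([df_a, df_b, df_c])  # illustrative DataFrames sharing the "id" column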
-1 votes · 2 answers · 65 views
How to join two RDDs in PySpark with nested tuples
I need to join two RDDs as part of my programming assignment. The problem is that the first RDD is nested, while the other is flat. I tried different things, but nothing seemed to work. Is there any ...
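The usual approach is to reshape both sides into plain (key, value) pairs before calling join; the tuple shapes below are assumptions for illustration:

# Assume rdd_nested holds ((key, a), b) and rdd_flat holds (key, c) - shapes are illustrative.
keyed_nested = rdd_nested.map(lambda rec: (rec[0][0], (rec[0][1], rec[1])))

# join() needs both RDDs as flat (key, value) pairs.
joined = keyed_nested.join(rdd_flat)  # -> (key, ((a, b), c))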
0 votes · 1 answer · 76 views
PySpark - Flatten nested structure
I have MongoDB collections forms and submissions where forms define dynamic UI components (textfield, checkbox, radio, selectboxes, columns, tables, datagrids) and submissions contain the user data in ...
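A small sketch of the standard flattening tools, explode for arrays and dotted paths for structs; the column and field names are placeholders, not the actual forms/submissions schema:

from pyspark.sql.functions import col, explode

# submissions_df is assumed to have a struct column "data" with an array field "components".
flat = (
    submissions_df
    .withColumn("component", explode(col("data.components")))  # one row per array element
    .select(
        col("_id"),
        col("component.key").alias("component_key"),
        col("component.value").alias("component_value"),
    )
)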
2 votes · 0 answers · 64 views
How to log a model in MLflow using Spark Connect
I have the following setup:
Kubernetes cluster with Spark Connect 4.0.1 and
MLflow tracking server 3.5.0
The MLflow tracking server should serve all artifacts and is configured this way:
--backend-store-...
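A minimal client-side sketch against a remote tracking server, with placeholder URIs; whether the model-logging call itself behaves under Spark Connect is exactly what the question asks, so only the setup is shown:

import mlflow

mlflow.set_tracking_uri("http://mlflow-tracking:5000")  # placeholder URI
mlflow.set_experiment("spark-connect-demo")              # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("spark_connect", True)
    # mlflow.spark.log_model(model, "model")  # the step under discussion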
0 votes · 1 answer · 74 views
Handle corrupted files in Spark load()
I have a Spark job that runs daily to load data from S3.
These data are composed of thousands of gzip files. However, in some cases, there are one or two corrupted files in S3, and that causes the whole ...
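One commonly used mitigation is Spark's ignoreCorruptFiles setting, sketched here with a placeholder S3 path; note that it skips whole corrupt files rather than repairing bad records:

# Skip files that fail to read instead of failing the entire job.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = spark.read.format("json").load("s3://my-bucket/daily-load/")  # placeholder path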
0 votes · 0 answers · 35 views
Why do I get a "list index out of range" error when writing a SharePoint list to Azure Delta Lake using PySpark on Azure Databricks?
Writing a SharePoint list to Delta file format, I get this error: list index out of range. I have included all the required columns to be fetched from SharePoint and checked the datatype when writing ...
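For reference, the Delta write itself is usually the simple part; a placeholder sketch of writing an already-built DataFrame to Delta (the SharePoint fetch that raises the error is not shown):

# sharepoint_df is assumed to be the DataFrame built from the fetched SharePoint rows.
(
    sharepoint_df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://container@account.dfs.core.windows.net/lists/my_list")  # placeholder path
)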
-1 votes · 2 answers · 63 views
Connectivity issues in standalone Spark 4.0
On an Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
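A minimal sketch of pointing a notebook session at a standalone master on the same VM; spark://localhost:7077 is the standalone default and may need adjusting, and the 8080 WebUI is only the monitoring port:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")  # standalone master URL, not the 8080 WebUI port
    .appName("standalone-connectivity-test")
    .getOrCreate()
)
print(spark.range(5).count())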
1 vote · 1 answer · 144 views
PicklingError: Could not serialize object: RecursionError in pyspark code in Jupyter Notebook
I am very new to Spark (I have just started learning it), and I have encountered a recursion error in a very simple piece of code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...
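A frequent trigger for this particular PicklingError/RecursionError is a UDF whose closure captures the SparkSession or SparkContext; as a sketch, a minimal UDF that keeps its closure clean, for comparison:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The function body references only its argument, never the SparkSession,
# so it can be pickled and shipped to executors without recursion issues.
@udf(returnType=IntegerType())
def double(x):
    return x * 2

spark.range(3).select(double(col("id")).alias("doubled")).show()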