[GLUTEN-9163][VL] Separate compression buffer and disk write buffer configuration #9356

Merged: marin-ma merged 3 commits into apache:main from marin-ma:shuffle-compression-config on Apr 23, 2025

@marin-ma (Contributor) commented Apr 17, 2025

A follow-up to #9278

spark.shuffle.spill.diskWriteBufferSize sets the size of the buffer that holds the sorted rows before a spill. The spiller writes the data in this buffer to the output stream.

spark.io.compression.lz4.blockSize and spark.io.compression.zstd.bufferSize set the compression buffer size in the compressed output stream, depending on which compression codec is configured.

In Spark, the memory allocated for these two buffers is counted as overhead memory, so we use arrow::default_memory_pool to allocate them.

Add spark.gluten.sql.columnar.shuffle.sort.deserializerBufferSize: the buffer size, in bytes, that the sort-based shuffle reader uses when deserializing raw input into columnar batches.
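
To make the split between the two buffers concrete, here is a minimal Java sketch of how the compression buffer size could be chosen per codec while the disk write buffer is read from its own key. It is not the code in this PR: the class ShuffleBufferConfig, its methods, the plain Map standing in for SparkConf, and the fallback values are all assumptions for illustration. Only the three configuration keys come from the description above. The sketch also assumes values are plain byte counts and skips size suffixes such as 32k.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not Gluten's GlutenShuffleUtils.
class ShuffleBufferConfig {
  // Configuration keys named in the PR description.
  static final String DISK_WRITE_BUFFER_SIZE = "spark.shuffle.spill.diskWriteBufferSize";
  static final String LZ4_BLOCK_SIZE = "spark.io.compression.lz4.blockSize";
  static final String ZSTD_BUFFER_SIZE = "spark.io.compression.zstd.bufferSize";

  // Assumed fallbacks for this sketch only; consult the Spark docs for real defaults.
  static final int ASSUMED_COMPRESSION_BUFFER_BYTES = 32 * 1024;
  static final int ASSUMED_DISK_WRITE_BUFFER_BYTES = 1024 * 1024;

  // The compression buffer size depends on which codec is configured.
  static int compressionBufferSize(Map<String, String> conf, String codec) {
    switch (codec.toLowerCase(Locale.ROOT)) {
      case "lz4":
        return intOrDefault(conf, LZ4_BLOCK_SIZE, ASSUMED_COMPRESSION_BUFFER_BYTES);
      case "zstd":
        return intOrDefault(conf, ZSTD_BUFFER_SIZE, ASSUMED_COMPRESSION_BUFFER_BYTES);
      default:
        return ASSUMED_COMPRESSION_BUFFER_BYTES;
    }
  }

  // The disk write buffer is configured independently of the codec.
  static int diskWriteBufferSize(Map<String, String> conf) {
    return intOrDefault(conf, DISK_WRITE_BUFFER_SIZE, ASSUMED_DISK_WRITE_BUFFER_BYTES);
  }

  // Reads a plain byte count; suffixes such as "32k" are not handled here.
  private static int intOrDefault(Map<String, String> conf, String key, int fallback) {
    String raw = conf.get(key);
    return raw == null ? fallback : Integer.parseInt(raw.trim());
  }
}
```

With a map containing only spark.io.compression.lz4.blockSize, a lookup for zstd falls back to the assumed value, and the disk write buffer size is unaffected by either codec setting.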

@github-actions bot added the CORE, VELOX, RSS, and CLICKHOUSE labels Apr 17, 2025
@github-actions

Run Gluten Clickhouse CI on x86

@marin-ma force-pushed the shuffle-compression-config branch from 22ad109 to b514a4d April 17, 2025 16:41

@marin-ma (Contributor, Author)

@zhouyuan Could you help to review? Thanks!

@github-actions bot added the DOCS label Apr 22, 2025
@marin-ma requested a review from zhouyuan April 23, 2025 17:52

GlutenShuffleUtils.getSortEvictBufferSize(sparkConf, compressionCodec);
GlutenShuffleUtils.getCompressionBufferSize(sparkConf, compressionCodec);
diskWriteBufferSize =
(int) (long) sparkConf.get(package$.MODULE$.SHUFFLE_DISK_WRITE_BUFFER_SIZE());

Member

The code is a little difficult to understand. Is it necessary to cast to long and then cast to int?

@marin-ma (Contributor, Author) Apr 23, 2025

Spark's own source code converts these configurations the same way. Here's an explanation: apache/spark#24187 (comment)

If we don't convert to long first, it throws an exception like this:
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
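
The behavior can be reproduced without Spark. The sketch below assumes the config value reaches Java as an Object that actually holds a java.lang.Long, which is consistent with the exception above. The class CastDemo, its method names, and the 1 MiB value are hypothetical and only for illustration.

```java
// Hypothetical demo class; it only reproduces the cast behavior discussed above.
class CastDemo {
  // Same pattern as the diff: unbox the Long to long, then narrow to int.
  static int viaLong(Object boxed) {
    return (int) (long) boxed;
  }

  // Casting straight to int makes Java cast the object to Integer first,
  // which fails with ClassCastException when the object is a Long.
  static int direct(Object boxed) {
    return (int) boxed;
  }

  public static void main(String[] args) {
    // Illustrative value: a Long-typed setting seen through an Object reference.
    Object value = Long.valueOf(1024L * 1024L);

    System.out.println("via long: " + viaLong(value));
    try {
      direct(value);
    } catch (ClassCastException e) {
      System.out.println("direct cast failed: " + e.getClass().getName());
    }
  }
}
```

The double cast is therefore not redundant: the first cast selects the correct unboxing, and the second performs the narrowing to int.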

Member

Thanks.

@zhouyuan (Member) left a comment

+1

@marin-ma merged commit d077f93 into apache:main Apr 23, 2025
marin-ma added a commit to marin-ma/gluten that referenced this pull request Jul 16, 2025
warrenzhu25 pushed a commit to warrenzhu25/gluten that referenced this pull request Jan 10, 2026