
Conversation

@auden-woolfson auden-woolfson (Contributor) commented Oct 27, 2025

Add support for skip_header_line_count and skip_footer_line_count in hive

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
@auden-woolfson auden-woolfson requested a review from a team as a code owner October 27, 2025 21:32
@prestodb-ci prestodb-ci added the from:IBM (PR from IBM) label Oct 27, 2025
@prestodb-ci prestodb-ci requested review from a team, bibith4 and wanglinsong and removed request for a team October 27, 2025 21:32
@sourcery-ai sourcery-ai bot commented Oct 27, 2025

Reviewer's Guide

This PR implements skip_header_line_count and skip_footer_line_count support in the Hive connector by adding metadata definitions and validation, extending the write pipeline to emit header lines for CSV/text formats, and enriching integration tests to cover expected behaviors and error cases.
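For orientation, here is a minimal usage sketch in the style of the PR's integration tests. The test method, table name, and values are illustrative and not taken from the PR; assertUpdate and assertQuery are the Presto test-framework helpers, and the behavior shown assumes (as in Comment 6 below) that skip_header_line_count = 1 writes a header row on write and skips it on read:

```java
// Hypothetical usage sketch in the style of TestHiveIntegrationSmokeTest.
// assertUpdate/assertQuery come from Presto's test framework; the table name
// and values are illustrative, not taken from the PR.
@Test
public void testCsvHeaderSkipUsageSketch()
{
    assertUpdate("CREATE TABLE csv_header_example (col1 varchar, col2 varchar) " +
            "WITH (format = 'CSV', skip_header_line_count = 1)");
    assertUpdate("INSERT INTO csv_header_example VALUES ('1', '2')", 1);
    // The header line written to the file is skipped on read, so only data rows return.
    assertQuery("SELECT * FROM csv_header_example", "VALUES ('1', '2')");
    assertUpdate("DROP TABLE csv_header_example");
}
```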

Sequence diagram for CSV/text file write with header line support

```mermaid
sequenceDiagram
    participant "RecordFileWriter"
    participant "HiveWriteUtils"
    participant "TextCSVHeaderWriter"
    participant "OutputStream"
    "RecordFileWriter"->>"HiveWriteUtils": createRecordWriter(..., textHeaderWriter)
    "HiveWriteUtils"->>"TextCSVHeaderWriter": write(compressedOutput, rowSeparatorByte)
    "TextCSVHeaderWriter"->>"OutputStream": write header line
    "HiveWriteUtils"->>"OutputStream": write data rows
```

Entity relationship diagram for new table properties

```mermaid
erDiagram
    HIVE_TABLE_PROPERTIES {
        INTEGER skip_header_line_count
        INTEGER skip_footer_line_count
    }
    HIVE_TABLE_PROPERTIES ||--o| HIVE_METADATA : "used in"
    HIVE_METADATA {
        STRING skip_header_line_count
        STRING skip_footer_line_count
    }
```

Class diagram for new and updated Hive connector classes

```mermaid
classDiagram
    class HiveTableProperties {
        +CSV_SEPARATOR : String
        +CSV_QUOTE : String
        +CSV_ESCAPE : String
        +SKIP_HEADER_LINE_COUNT : String
        +SKIP_FOOTER_LINE_COUNT : String
        +getHeaderSkipCount(Map<String, Object>) : Optional<Integer>
        +getFooterSkipCount(Map<String, Object>) : Optional<Integer>
    }
    class HiveMetadata {
        +SKIP_HEADER_COUNT_KEY : String
        +SKIP_FOOTER_COUNT_KEY : String
        +getTableMetadata(...)
        +getEmptyTableProperties(...)
        +beginCreateTable(...)
        +beginInsertInternal(...)
        +checkFormatForProperty(...)
    }
    class HiveWriteUtils {
        +createRecordWriter(...)
        +createTextCsvFileWriter(...)
    }
    class TextCSVHeaderWriter {
        +serializer : Serializer
        +headerType : Type
        +fileColumnNames : List<String>
        +write(OutputStream, int)
    }
    class RecordFileWriter {
        +recordWriter : RecordWriter
    }
    HiveTableProperties --> HiveMetadata : used by
    HiveWriteUtils --> TextCSVHeaderWriter : uses
    RecordFileWriter --> HiveWriteUtils : uses
    RecordFileWriter --> TextCSVHeaderWriter : uses
```
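For readers skimming the diagram, here is a hypothetical sketch of the write(OutputStream, int) contract. The real TextCSVHeaderWriter in this PR serializes the header through a Hive Serializer; this simplified version (class name, constructor, and joining logic are all illustrative) only shows the shape of the operation:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch only: joins the column names with the table's column
// separator and terminates the line with the row separator byte, matching the
// write(OutputStream, int) signature in the class diagram above.
final class HeaderWriterSketch
{
    private final List<String> fileColumnNames;
    private final char columnSeparator;

    HeaderWriterSketch(List<String> fileColumnNames, char columnSeparator)
    {
        this.fileColumnNames = fileColumnNames;
        this.columnSeparator = columnSeparator;
    }

    void write(OutputStream out, int rowSeparatorByte)
            throws IOException
    {
        String header = String.join(String.valueOf(columnSeparator), fileColumnNames);
        out.write(header.getBytes(StandardCharsets.UTF_8));
        out.write(rowSeparatorByte); // terminate the header line with the row separator
    }
}
```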

File-Level Changes

Definition and validation of header/footer skip properties in Hive metadata
  • Defined SKIP_HEADER_LINE_COUNT and SKIP_FOOTER_LINE_COUNT constants and table property metadata
  • Added getters (getHeaderSkipCount/getFooterSkipCount) and integrated them into HiveTableProperties
  • Validated non-negative values and format compatibility, and enforced restrictions in the CTAS/INSERT paths
  Files: HiveMetadata.java, HiveTableProperties.java

Extended record writing to handle header skipping for CSV/text formats
  • Added a createRecordWriter overload that accepts an optional TextCSVHeaderWriter
  • Implemented createTextCsvFileWriter to emit a header line when skip_header_line_count = 1
  • Introduced TextCSVHeaderWriter and wired it into RecordFileWriter for text output formats
  Files: HiveWriteUtils.java, TextCSVHeaderWriter.java, RecordFileWriter.java

Enhanced Hive integration tests for header/footer properties
  • Added tests for invalid negative skip_header_line_count and skip_footer_line_count values
  • Covered CREATE TABLE, SHOW CREATE TABLE, CTAS, and INSERT behaviors with header/footer properties
  • Verified INSERT restrictions and corresponding error messages for the CSV format
  Files: TestHiveIntegrationSmokeTest.java


@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • There are now two sets of constants for header/footer properties (Serde keys vs table metadata keys) that differ only by a dot vs underscore; consider consolidating these into a single source of truth to avoid naming mismatches (see the sketch after this list).
  • The delimiter parsing and header‐writing logic in createTextCsvFileWriter is fairly dense—extracting that into a small, well-tested utility method would improve readability and make edge‐case handling (e.g. multi‐char delimiters) easier to verify.
  • The new tests for skip_header_line_count and skip_footer_line_count repeat very similar patterns for different formats; parameterizing or data-driven testing could reduce boilerplate and help ensure consistent coverage.
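A minimal sketch of the single-source-of-truth idea from the first bullet; the class and helper names are illustrative and do not exist in the PR:

```java
// Hypothetical consolidation sketch: derive the underscore-style table property
// names from the dot-style Hive serde keys so the two spellings cannot drift.
public final class HeaderFooterProperties
{
    // Hive serde keys use dots...
    public static final String SKIP_HEADER_COUNT_KEY = "skip.header.line.count";
    public static final String SKIP_FOOTER_COUNT_KEY = "skip.footer.line.count";

    // ...while Presto table property names use underscores.
    public static final String SKIP_HEADER_LINE_COUNT = toTablePropertyName(SKIP_HEADER_COUNT_KEY);
    public static final String SKIP_FOOTER_LINE_COUNT = toTablePropertyName(SKIP_FOOTER_COUNT_KEY);

    private HeaderFooterProperties() {}

    private static String toTablePropertyName(String serdeKey)
    {
        return serdeKey.replace('.', '_');
    }
}
```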
## Individual Comments

### Comment 1
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:243-177` </location>
<code_context>
+    {
+        String rowSeparatorString = properties.getProperty(serdeConstants.LINE_DELIM, "\n");
+
+        int rowSeparatorByte;
+        try {
+            rowSeparatorByte = Byte.parseByte(rowSeparatorString);
+        }
</code_context>

<issue_to_address>
**issue:** Row separator parsing may not handle multi-byte delimiters correctly.

The logic only uses the first character of the delimiter, so multi-byte values like "\r\n" will not be handled as intended. Please validate or restrict input to single-byte delimiters, or document this limitation.
</issue_to_address>
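A minimal sketch of one way to enforce that restriction; the helper name is hypothetical, and it assumes the Hive convention (which the existing code follows) of first trying to parse the delimiter as a numeric byte value:

```java
// Hypothetical helper: accept a line delimiter as either a numeric byte value
// (Hive convention, e.g. "10" for '\n') or a single character, and reject
// anything longer instead of silently using only the first character.
private static byte parseSingleByteRowSeparator(String rowSeparatorString)
{
    try {
        return Byte.parseByte(rowSeparatorString); // numeric form, e.g. "10"
    }
    catch (NumberFormatException ignored) {
        // fall through to the single-character form
    }
    if (rowSeparatorString.length() != 1) {
        throw new IllegalArgumentException(
                "Multi-character line delimiters are not supported: " + rowSeparatorString);
    }
    return (byte) rowSeparatorString.charAt(0);
}
```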

### Comment 2
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:255-257` </location>
<code_context>
+        OutputStream compressedOutput = createCompressedStream(conf, output, compress);
+        TextRecordWriter writer = new TextRecordWriter();
+        writer.initialize(compressedOutput, conf);
+        Optional<String> skipHeaderLine = Optional.ofNullable(properties.getProperty("skip.header.line.count"));
+        if (skipHeaderLine.isPresent()) {
+            if (parseInt(skipHeaderLine.get()) == 1) {
+                textCSVHeaderWriter
+                                .orElseThrow(() -> new IllegalArgumentException("TextHeaderWriter must not be empty when skip.header.line.count is set to 1"))
</code_context>

<issue_to_address>
**issue (bug_risk):** Direct use of parseInt on property value may throw NumberFormatException.

Validate that the property value is a valid integer before calling parseInt, or catch NumberFormatException to handle invalid input and provide a clear error message.
</issue_to_address>
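A minimal sketch of the defensive parse this comment asks for; the helper name is hypothetical:

```java
import java.util.Properties;

// Hypothetical helper: parse skip.header.line.count defensively so a malformed
// value fails with a clear message instead of a raw NumberFormatException.
static int getSkipHeaderLineCount(Properties properties)
{
    String value = properties.getProperty("skip.header.line.count");
    if (value == null) {
        return 0;
    }
    try {
        return Integer.parseInt(value);
    }
    catch (NumberFormatException e) {
        throw new IllegalArgumentException(
                "Invalid value for skip.header.line.count: " + value, e);
    }
}
```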

### Comment 3
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:284-290` </location>
<code_context>
+            public void close(boolean abort)
+                            throws IOException
+            {
+                writer.close();
+                if (!abort) {
+                    length = target.getFileSystem(conf).getFileStatus(target).getLen();
+                }
</code_context>

<issue_to_address>
**suggestion (bug_risk):** File length is only updated on non-abort close, which may lead to inaccurate byte count if abort is called.

After an abort, getWrittenBytes() may return a stale value. To prevent this, reset or invalidate the length field when abort is true.

```suggestion
            public void close(boolean abort)
                            throws IOException
            {
                writer.close();
                if (!abort) {
                    length = target.getFileSystem(conf).getFileStatus(target).getLen();
                } else {
                    length = -1; // Invalidate length on abort to prevent stale value
                }
```
</issue_to_address>

### Comment 4
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java:764-765` </location>
<code_context>
         }

+        // Textfile and CSV specific properties
+        getSerdeProperty(table.get(), SKIP_HEADER_COUNT_KEY)
+                        .ifPresent(skipHeaderCount -> properties.put(SKIP_HEADER_LINE_COUNT, Integer.valueOf(skipHeaderCount)));
+        getSerdeProperty(table.get(), SKIP_FOOTER_COUNT_KEY)
+                        .ifPresent(skipFooterCount -> properties.put(SKIP_FOOTER_LINE_COUNT, Integer.valueOf(skipFooterCount)));
</code_context>

<issue_to_address>
**suggestion:** Inconsistent use of property key types for skip header/footer count.

Standardize the use of property keys for header/footer counts to prevent confusion and potential errors.

Suggested implementation:

```java
        getSerdeProperty(table.get(), SKIP_HEADER_COUNT_KEY)
                        .ifPresent(skipHeaderCount -> properties.put(SKIP_HEADER_COUNT_KEY, Integer.valueOf(skipHeaderCount)));
        getSerdeProperty(table.get(), SKIP_FOOTER_COUNT_KEY)
                        .ifPresent(skipFooterCount -> properties.put(SKIP_FOOTER_COUNT_KEY, Integer.valueOf(skipFooterCount)));

```

If `SKIP_HEADER_LINE_COUNT` and `SKIP_FOOTER_LINE_COUNT` are used elsewhere in the codebase, you should refactor those usages to use `SKIP_HEADER_COUNT_KEY` and `SKIP_FOOTER_COUNT_KEY` for consistency. Also, ensure that any documentation or comments referring to these keys are updated accordingly.
</issue_to_address>

### Comment 5
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:6862-292` </location>
<code_context>
+    private void testCreateTableWithHeaderAndFooter(String format)
</code_context>

<issue_to_address>
**suggestion (testing):** Missing tests for skip_header_line_count and skip_footer_line_count with value zero and with values greater than supported.

Add tests for skip_header_line_count and skip_footer_line_count set to zero and values above the supported range to confirm correct validation behavior.

Suggested implementation:

```java
        assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_header (col1 varchar) WITH (format = 'CSV', skip_header_line_count = -1)"))
                .hasMessageMatching("Invalid value for skip_header_line_count property: -1");
        assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_footer (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = -1)"))
                .hasMessageMatching("Invalid value for skip_footer_line_count property: -1");

        // Test skip_header_line_count = 0 (boundary value)
        assertUpdate("CREATE TABLE test_skip_header_zero (col1 varchar) WITH (format = 'CSV', skip_header_line_count = 0)");
        // Test skip_footer_line_count = 0 (boundary value)
        assertUpdate("CREATE TABLE test_skip_footer_zero (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = 0)");

        // Test skip_header_line_count above supported range
        assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_header_high (col1 varchar) WITH (format = 'CSV', skip_header_line_count = 1000000)"))
                .hasMessageMatching("Invalid value for skip_header_line_count property: 1000000");
        // Test skip_footer_line_count above supported range
        assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_footer_high (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = 1000000)"))
                .hasMessageMatching("Invalid value for skip_footer_line_count property: 1000000");
    }

```

If the supported range for skip_header_line_count and skip_footer_line_count is different, adjust the value `1000000` to the correct upper bound + 1 for your codebase.
If zero is not a valid value, change `assertUpdate` to `assertThatThrownBy` and check for the correct error message.
</issue_to_address>

### Comment 6
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:6916-6924` </location>
<code_context>
+                        ") AS SELECT CAST(1 AS VARCHAR) AS col_name1, CAST(2 AS VARCHAR) as col_name2",
+                catalog, schema, name, format);
+
+        assertUpdate(createTableSql, 1);
+        assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('3', '4')", catalog, schema, format), 1);
+        MaterializedResult materializedRows = computeActual(format("SELECT * FROM %s_table_skip_header", format));
+        assertEqualsIgnoreOrder(materializedRows, resultBuilder(getSession(), VARCHAR, VARCHAR)
</code_context>

<issue_to_address>
**suggestion (testing):** Test for header/footer line skipping does not verify actual data skipping behavior.

Please add tests that confirm header/footer lines are skipped by inserting such lines and verifying that only valid data rows are returned.

```suggestion
        assertUpdate(createTableSql, 1);
        // Insert a row that simulates a header line (should be skipped)
        assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('header1', 'header2')", catalog, schema, format), 1);
        // Insert valid data row
        assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('3', '4')", catalog, schema, format), 1);

        // Query the table and verify that the header line is skipped and only valid data rows are returned
        MaterializedResult materializedRows = computeActual(format("SELECT * FROM %s_table_skip_header", format));
        assertEqualsIgnoreOrder(materializedRows, resultBuilder(getSession(), VARCHAR, VARCHAR)
                .row("1", "2")
                .row("3", "4")
                .build()
                .getMaterializedRows());

        // Additional check: ensure header line is not present in results
        // getMaterializedRows() returns List<MaterializedRow>, so compare each row's fields
        for (MaterializedRow row : materializedRows.getMaterializedRows()) {
            assertNotEquals(row.getFields(), Arrays.asList("header1", "header2"), "Header line should be skipped");
        }

        assertUpdate(format("DROP TABLE %s_table_skip_header", format));
```
</issue_to_address>



@auden-woolfson auden-woolfson force-pushed the fix_skip_header_line_count_hive branch from c77d2a9 to 4603e95 on October 28, 2025 17:58
@auden-woolfson auden-woolfson changed the title to Add support for skip_header_line_count and skip_footer_line_count in hive Oct 28, 2025
@steveburnett steveburnett left a comment

Thanks for the doc! A couple of formatting nits, and a question.

``skip_header_line_count`` Number of header lines to skip when reading CSV or TEXTFILE tables. None (ignored if not set). Must be non-negative. Only valid for
When set to `1`, a header row will be written when creating new `CSV` and `TEXTFILE` formats. Values greater than `0` are not

Suggested change
When set to `1`, a header row will be written when creating new `CSV` and `TEXTFILE` formats. Values greater than `0` are not
When set to ``1``, a header row will be written when creating new CSV and TEXTFILE formats. Values greater than ``0`` are not

CSV or TEXTFILE tables. supported for `CREATE TABLE AS` or `INSERT` operations.

Suggested change
CSV or TEXTFILE tables. supported for `CREATE TABLE AS` or `INSERT` operations.
CSV or TEXTFILE tables. supported for ``CREATE TABLE AS`` or ``INSERT`` operations.

``skip_footer_line_count`` Number of footer lines to skip when reading CSV or TEXTFILE tables. None (ignored if not set). Must be non-negative. Only valid for
Cannot be used when inserting into a table. `CSV` and `TEXTFILE` formats. Values greater than `0` are not

Suggested change
Cannot be used when inserting into a table. `CSV` and `TEXTFILE` formats. Values greater than `0` are not
Cannot be used when inserting into a table. CSV and TEXTFILE formats. Values greater than ``0`` are not

Values greater than 0 are not what?

