feat: Add support for skip_header_line_count and skip_footer_line_count in … #26446
base: master
Conversation
Reviewer's Guide

This PR implements skip_header_line_count and skip_footer_line_count support in the Hive connector by adding metadata definitions and validation, extending the write pipeline to emit header lines for CSV/text formats, and enriching integration tests to cover expected behaviors and error cases.

Sequence diagram for CSV/text file write with header line support:

    sequenceDiagram
        participant "RecordFileWriter"
        participant "HiveWriteUtils"
        participant "TextCSVHeaderWriter"
        participant "OutputStream"
        "RecordFileWriter"->>"HiveWriteUtils": createRecordWriter(..., textHeaderWriter)
        "HiveWriteUtils"->>"TextCSVHeaderWriter": write(compressedOutput, rowSeparatorByte)
        "TextCSVHeaderWriter"->>"OutputStream": write header line
        "HiveWriteUtils"->>"OutputStream": write data rows
Entity relationship diagram for new table properties:

    erDiagram
        HIVE_TABLE_PROPERTIES {
            INTEGER skip_header_line_count
            INTEGER skip_footer_line_count
        }
        HIVE_TABLE_PROPERTIES ||--o| HIVE_METADATA : "used in"
        HIVE_METADATA {
            STRING skip_header_line_count
            STRING skip_footer_line_count
        }
Class diagram for new and updated Hive connector classes:

    classDiagram
class HiveTableProperties {
+CSV_SEPARATOR : String
+CSV_QUOTE : String
+CSV_ESCAPE : String
+SKIP_HEADER_LINE_COUNT : String
+SKIP_FOOTER_LINE_COUNT : String
+getHeaderSkipCount(Map<String, Object>) : Optional<Integer>
+getFooterSkipCount(Map<String, Object>) : Optional<Integer>
}
class HiveMetadata {
+SKIP_HEADER_COUNT_KEY : String
+SKIP_FOOTER_COUNT_KEY : String
+getTableMetadata(...)
+getEmptyTableProperties(...)
+beginCreateTable(...)
+beginInsertInternal(...)
+checkFormatForProperty(...)
}
class HiveWriteUtils {
+createRecordWriter(...)
+createTextCsvFileWriter(...)
}
class TextCSVHeaderWriter {
+serializer : Serializer
+headerType : Type
+fileColumnNames : List<String>
+write(OutputStream, int)
}
class RecordFileWriter {
+recordWriter : RecordWriter
}
HiveTableProperties --> HiveMetadata : used by
HiveWriteUtils --> TextCSVHeaderWriter : uses
RecordFileWriter --> HiveWriteUtils : uses
RecordFileWriter --> TextCSVHeaderWriter : uses
Hey there - I've reviewed your changes - here's some feedback:
- There are now two sets of constants for header/footer properties (Serde keys vs table metadata keys) that differ only by a dot vs underscore; consider consolidating these into a single source of truth to avoid naming mismatches.
- The delimiter parsing and header‐writing logic in createTextCsvFileWriter is fairly dense—extracting that into a small, well-tested utility method would improve readability and make edge‐case handling (e.g. multi‐char delimiters) easier to verify.
- The new tests for skip_header_line_count and skip_footer_line_count repeat very similar patterns for different formats; parameterizing or data-driven testing could reduce boilerplate and help ensure consistent coverage.
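On the first point, one possible shape for a single source of truth is sketched below. The enum name and methods are hypothetical, not from the PR; the idea is that each constant owns the dotted Serde key and derives the underscore table-property spelling from it, so the two can never drift apart.

```java
// Hypothetical consolidation sketch (not the PR's code): the dotted serde key
// is the only stored spelling; the table-property name is derived from it.
public enum SkipLineProperty
{
    SKIP_HEADER("skip.header.line.count"),
    SKIP_FOOTER("skip.footer.line.count");

    private final String serdeKey;

    SkipLineProperty(String serdeKey)
    {
        this.serdeKey = serdeKey;
    }

    public String serdeKey()
    {
        return serdeKey;
    }

    public String tablePropertyName()
    {
        // "skip.header.line.count" -> "skip_header_line_count"
        return serdeKey.replace('.', '_');
    }
}
```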
## Individual Comments
### Comment 1
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:243-177` </location>
<code_context>
+ {
+ String rowSeparatorString = properties.getProperty(serdeConstants.LINE_DELIM, "\n");
+
+ int rowSeparatorByte;
+ try {
+ rowSeparatorByte = Byte.parseByte(rowSeparatorString);
+ }
</code_context>
<issue_to_address>
**issue:** Row separator parsing may not handle multi-byte delimiters correctly.
The logic only uses the first character of the delimiter, so multi-byte values like "\r\n" will not be handled as intended. Please validate or restrict input to single-byte delimiters, or document this limitation.
</issue_to_address>
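One way to enforce the single-byte restriction this comment asks for is sketched below. The class and method names are illustrative, not the PR's code: the point is to reject multi-character delimiters such as "\r\n" outright rather than silently using only the first character.

```java
// Hypothetical validation helper: fail fast on anything that is not a single
// one-byte character instead of truncating "\r\n" to '\r'.
public class RowSeparators
{
    public static byte parseSingleByteRowSeparator(String rowSeparator)
    {
        if (rowSeparator.length() != 1 || rowSeparator.charAt(0) > 127) {
            throw new IllegalArgumentException(
                    "Row separator must be a single-byte character");
        }
        return (byte) rowSeparator.charAt(0);
    }
}
```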
### Comment 2
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:255-257` </location>
<code_context>
+ OutputStream compressedOutput = createCompressedStream(conf, output, compress);
+ TextRecordWriter writer = new TextRecordWriter();
+ writer.initialize(compressedOutput, conf);
+ Optional<String> skipHeaderLine = Optional.ofNullable(properties.getProperty("skip.header.line.count"));
+ if (skipHeaderLine.isPresent()) {
+ if (parseInt(skipHeaderLine.get()) == 1) {
+ textCSVHeaderWriter
+ .orElseThrow(() -> new IllegalArgumentException("TextHeaderWriter must not be empty when skip.header.line.count is set to 1"))
</code_context>
<issue_to_address>
**issue (bug_risk):** Direct use of parseInt on property value may throw NumberFormatException.
Validate that the property value is a valid integer before calling parseInt, or catch NumberFormatException to handle invalid input and provide a clear error message.
</issue_to_address>
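A defensive-parsing shape for this is sketched below; class and method names are illustrative, not the PR's code. It surfaces a clear error for a malformed property value instead of letting a bare NumberFormatException propagate.

```java
import java.util.OptionalInt;
import java.util.Properties;

// Hypothetical sketch: parse the serde property once, with an explicit error
// message naming the offending key and value on bad input.
public class SkipHeaderProperty
{
    public static OptionalInt parseSkipHeaderLineCount(Properties properties)
    {
        String value = properties.getProperty("skip.header.line.count");
        if (value == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(value.trim()));
        }
        catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "Invalid value for skip.header.line.count: " + value, e);
        }
    }
}
```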
### Comment 3
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java:284-290` </location>
<code_context>
+ public void close(boolean abort)
+ throws IOException
+ {
+ writer.close();
+ if (!abort) {
+ length = target.getFileSystem(conf).getFileStatus(target).getLen();
+ }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** File length is only updated on non-abort close, which may lead to inaccurate byte count if abort is called.
After an abort, getWrittenBytes() may return a stale value. To prevent this, reset or invalidate the length field when abort is true.
```suggestion
public void close(boolean abort)
throws IOException
{
writer.close();
if (!abort) {
length = target.getFileSystem(conf).getFileStatus(target).getLen();
} else {
length = -1; // Invalidate length on abort to prevent stale value
}
}
```
</issue_to_address>
### Comment 4
<location> `presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java:764-765` </location>
<code_context>
}
+ // Textfile and CSV specific properties
+ getSerdeProperty(table.get(), SKIP_HEADER_COUNT_KEY)
+ .ifPresent(skipHeaderCount -> properties.put(SKIP_HEADER_LINE_COUNT, Integer.valueOf(skipHeaderCount)));
+ getSerdeProperty(table.get(), SKIP_FOOTER_COUNT_KEY)
+ .ifPresent(skipFooterCount -> properties.put(SKIP_FOOTER_LINE_COUNT, Integer.valueOf(skipFooterCount)));
</code_context>
<issue_to_address>
**suggestion:** Inconsistent use of property key types for skip header/footer count.
Standardize the use of property keys for header/footer counts to prevent confusion and potential errors.
Suggested implementation:
```java
getSerdeProperty(table.get(), SKIP_HEADER_COUNT_KEY)
.ifPresent(skipHeaderCount -> properties.put(SKIP_HEADER_COUNT_KEY, Integer.valueOf(skipHeaderCount)));
getSerdeProperty(table.get(), SKIP_FOOTER_COUNT_KEY)
.ifPresent(skipFooterCount -> properties.put(SKIP_FOOTER_COUNT_KEY, Integer.valueOf(skipFooterCount)));
```
If `SKIP_HEADER_LINE_COUNT` and `SKIP_FOOTER_LINE_COUNT` are used elsewhere in the codebase, you should refactor those usages to use `SKIP_HEADER_COUNT_KEY` and `SKIP_FOOTER_COUNT_KEY` for consistency. Also, ensure that any documentation or comments referring to these keys are updated accordingly.
</issue_to_address>
### Comment 5
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:6862-292` </location>
<code_context>
+ private void testCreateTableWithHeaderAndFooter(String format)
</code_context>
<issue_to_address>
**suggestion (testing):** Missing tests for skip_header_line_count and skip_footer_line_count with value zero and with values greater than supported.
Add tests for skip_header_line_count and skip_footer_line_count set to zero and values above the supported range to confirm correct validation behavior.
Suggested implementation:
```java
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_header (col1 varchar) WITH (format = 'CSV', skip_header_line_count = -1)"))
.hasMessageMatching("Invalid value for skip_header_line_count property: -1");
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_footer (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = -1)"))
.hasMessageMatching("Invalid value for skip_footer_line_count property: -1");
// Test skip_header_line_count = 0 (boundary value)
assertUpdate("CREATE TABLE test_skip_header_zero (col1 varchar) WITH (format = 'CSV', skip_header_line_count = 0)");
// Test skip_footer_line_count = 0 (boundary value)
assertUpdate("CREATE TABLE test_skip_footer_zero (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = 0)");
// Test skip_header_line_count above supported range
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_header_high (col1 varchar) WITH (format = 'CSV', skip_header_line_count = 1000000)"))
.hasMessageMatching("Invalid value for skip_header_line_count property: 1000000");
// Test skip_footer_line_count above supported range
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_invalid_skip_footer_high (col1 varchar) WITH (format = 'CSV', skip_footer_line_count = 1000000)"))
.hasMessageMatching("Invalid value for skip_footer_line_count property: 1000000");
}
```
If the supported range for skip_header_line_count and skip_footer_line_count is different, adjust the value `1000000` to the correct upper bound + 1 for your codebase.
If zero is not a valid value, change `assertUpdate` to `assertThatThrownBy` and check for the correct error message.
</issue_to_address>
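On the data-driven angle, one way to keep coverage symmetric is to generate the format x property cross product once, as sketched below. The class and method names are hypothetical, not from the test class; the generated statements could then feed a parameterized test or a simple loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerate every (format, property) combination once so
// each one gets identical negative-value coverage with no copy-paste drift.
public class SkipLineTestCases
{
    public static List<String> negativeValueStatements()
    {
        List<String> statements = new ArrayList<>();
        for (String format : new String[] {"CSV", "TEXTFILE"}) {
            for (String property : new String[] {"skip_header_line_count", "skip_footer_line_count"}) {
                statements.add(String.format(
                        "CREATE TABLE test_invalid (col1 varchar) WITH (format = '%s', %s = -1)",
                        format, property));
            }
        }
        return statements;
    }
}
```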
### Comment 6
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:6916-6924` </location>
<code_context>
+ ") AS SELECT CAST(1 AS VARCHAR) AS col_name1, CAST(2 AS VARCHAR) as col_name2",
+ catalog, schema, name, format);
+
+ assertUpdate(createTableSql, 1);
+ assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('3', '4')", catalog, schema, format), 1);
+ MaterializedResult materializedRows = computeActual(format("SELECT * FROM %s_table_skip_header", format));
+ assertEqualsIgnoreOrder(materializedRows, resultBuilder(getSession(), VARCHAR, VARCHAR)
</code_context>
<issue_to_address>
**suggestion (testing):** Test for header/footer line skipping does not verify actual data skipping behavior.
Please add tests that confirm header/footer lines are skipped by inserting such lines and verifying that only valid data rows are returned.
```suggestion
assertUpdate(createTableSql, 1);
// Insert a row that simulates a header line (should be skipped)
assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('header1', 'header2')", catalog, schema, format), 1);
// Insert valid data row
assertUpdate(format("INSERT INTO %s.%s.%s_table_skip_header VALUES('3', '4')", catalog, schema, format), 1);
// Query the table and verify that the header line is skipped and only valid data rows are returned
MaterializedResult materializedRows = computeActual(format("SELECT * FROM %s_table_skip_header", format));
assertEqualsIgnoreOrder(materializedRows, resultBuilder(getSession(), VARCHAR, VARCHAR)
.row("1", "2")
.row("3", "4")
.build()
.getMaterializedRows());
// Additional check: ensure header line is not present in results
List<List<Object>> actualRows = materializedRows.getMaterializedRows();
for (List<Object> row : actualRows) {
assertNotEquals(row, Arrays.asList("header1", "header2"), "Header line should be skipped");
}
assertUpdate(format("DROP TABLE %s_table_skip_header", format));
```
</issue_to_address>
Force-pushed c77d2a9 to 4603e95.
Should these be documented in https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/connector/hive.rst#hive-configuration-properties ?
Thanks for the doc! A couple of formatting nits, and a question.
Formatting nits on the new rows in the Hive connector properties table:

``skip_header_line_count``: Number of header lines to skip when reading CSV or TEXTFILE tables. None (ignored if not set). Must be non-negative. Only valid for CSV or TEXTFILE tables.

Nit 1 — use RST double backticks for literals, and no backticks for plain format names:

    - When set to `1`, a header row will be written when creating new `CSV` and `TEXTFILE` formats. Values greater than `0` are not
    + When set to ``1``, a header row will be written when creating new CSV and TEXTFILE formats. Values greater than ``0`` are not

Nit 2:

    - CSV or TEXTFILE tables. supported for `CREATE TABLE AS` or `INSERT` operations.
    + CSV or TEXTFILE tables. supported for ``CREATE TABLE AS`` or ``INSERT`` operations.

``skip_footer_line_count``: Number of footer lines to skip when reading CSV or TEXTFILE tables. None (ignored if not set). Must be non-negative. Only valid for CSV and TEXTFILE formats. Cannot be used when inserting into a table.

Nit 3:

    - Cannot be used when inserting into a table. `CSV` and `TEXTFILE` formats. Values greater than `0` are not
    + Cannot be used when inserting into a table. CSV and TEXTFILE formats. Values greater than ``0`` are not

And a question: "Values greater than ``0`` are not" — not what? The sentence appears to be cut off.
…hive
Description
Motivation and Context
Impact
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.