-
Notifications
You must be signed in to change notification settings - Fork 5k
feat(md5): support md5 in inserting #33416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 3.0
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @stephenkgu, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances data manipulation flexibility for Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for md5() and from_base64() functions during data insertion, along with a new to_base64() function. The changes include a new base64 implementation and updates to the SQL parser to handle these functions for BINARY data types. The overall implementation is good, but I've identified a critical issue with error handling in the new tbase64_decode function that could lead to silent data corruption. I've also suggested refactoring to reduce code duplication in the parser and fixing some type-mismatches for safer type handling. Addressing these points will improve the robustness and maintainability of the new features.
| tbase64_decode(*pData, input, inputBytes, outputBytes); | ||
| *nData = outputBytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return value of tbase64_decode should be checked. It can fail for invalid input, and the current implementation doesn't handle this, potentially leading to incorrect data. Assuming tbase64_decode is modified to return a status code, you should check it and return an error. Also, tbase64_decode expects const uint8_t* for input, but char* is passed. It's safer to cast it.
if (tbase64_decode(*pData, (const uint8_t *)input, inputBytes, outputBytes) != TSDB_CODE_SUCCESS) {
return TSDB_CODE_INVALID_DATA_FMT;
}
*nData = outputBytes;
source/libs/scalar/src/sclfunc.c
Outdated
| tbase64_decode(out, varDataVal(input), inputLen, outputLength); | ||
| varDataSetLen(outputBuf, outputLength); | ||
|
|
||
| SCL_ERR_JRET(colDataSetVal(pOutputData, i, outputBuf, false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to tbase64_decode has two issues:
- It passes
char*whereuint8_t*is expected. This can cause issues with signed characters. Explicit casting to(uint8_t*)is recommended. - The return value is not checked.
tbase64_decodecan fail on invalid input. Assumingtbase64_decodeis modified to return a status, this status should be checked to prevent processing of invalid data. For example, on failure, the column value could be set to NULL.
if (tbase64_decode((uint8_t *)out, (const uint8_t *)varDataVal(input), inputLen, outputLength) != TSDB_CODE_SUCCESS) {
colDataSetNULL(pOutputData, i);
continue;
}
varDataSetLen(outputBuf, outputLength);
SCL_ERR_JRET(colDataSetVal(pOutputData, i, outputBuf, false));
source/util/src/tbase64.c
Outdated
| void tbase64_decode(uint8_t *out, const uint8_t *input, size_t in_len, VarDataLenT out_len) { | ||
| (void)taosThreadOnce(&tbase64_decoding_table_building, tbase64_build_decoding_table); | ||
|
|
||
| if (in_len % 4 != 0) { | ||
| out[0] = 0; | ||
| return; | ||
| } | ||
|
|
||
| if (input[in_len - 1] == '=') out_len--; | ||
| if (input[in_len - 2] == '=') out_len--; | ||
|
|
||
| for (int i = 0, j = 0; i < in_len;) { | ||
| uint32_t sextet_a = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]]; | ||
| uint32_t sextet_b = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]]; | ||
| uint32_t sextet_c = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]]; | ||
| uint32_t sextet_d = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]]; | ||
|
|
||
| uint32_t triple = (sextet_a << 3 * 6) + (sextet_b << 2 * 6) + (sextet_c << 1 * 6) + (sextet_d << 0 * 6); | ||
|
|
||
| if (j < out_len) out[j++] = (triple >> 2 * 8) & 0xFF; | ||
| if (j < out_len) out[j++] = (triple >> 1 * 8) & 0xFF; | ||
| if (j < out_len) out[j++] = (triple >> 0 * 8) & 0xFF; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tbase64_decode function should return a status to indicate success or failure. Currently, it returns void and handles invalid input length (in_len % 4 != 0) by writing a zero to out[0] and returning. This silent failure can lead to callers processing invalid data. Returning a status code would allow for proper error handling. The function signature should be changed to int32_t tbase64_decode(...) and return TSDB_CODE_INVALID_DATA_FMT on error and TSDB_CODE_SUCCESS on success.
int32_t tbase64_decode(uint8_t *out, const uint8_t *input, size_t in_len, VarDataLenT out_len) {
(void)taosThreadOnce(&tbase64_decoding_table_building, tbase64_build_decoding_table);
if (in_len % 4 != 0) {
return TSDB_CODE_INVALID_DATA_FMT;
}
if (input[in_len - 1] == '=') out_len--;
if (input[in_len - 2] == '=') out_len--;
for (int i = 0, j = 0; i < in_len;) {
uint32_t sextet_a = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
uint32_t sextet_b = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
uint32_t sextet_c = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
uint32_t sextet_d = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
uint32_t triple = (sextet_a << 3 * 6) + (sextet_b << 2 * 6) + (sextet_c << 1 * 6) + (sextet_d << 0 * 6);
if (j < out_len) out[j++] = (triple >> 2 * 8) & 0xFF;
if (j < out_len) out[j++] = (triple >> 1 * 8) & 0xFF;
if (j < out_len) out[j++] = (triple >> 0 * 8) & 0xFF;
}
return TSDB_CODE_SUCCESS;
}| static int32_t parseBinary(SInsertParseContext* pCxt, const char** ppSql, SToken* pToken, SColVal* pVal, | ||
| SSchema* pSchema) { | ||
| int32_t bytes = pSchema->bytes; | ||
| uint8_t** pData = &pVal->value.pData; | ||
| uint32_t* nData = &pVal->value.nData; | ||
|
|
||
| if (TK_NK_ID == pToken->type) { | ||
| char* input = NULL; | ||
| int32_t inputBytes = 0; | ||
| char tmpTokenBuf[TSDB_MAX_BYTES_PER_ROW]; | ||
|
|
||
| if (0 == strncasecmp(pToken->z, "from_base64(", 12)) { | ||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_LP != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z); | ||
| } | ||
|
|
||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NULL == pToken->type) { | ||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_RP == pToken->type) { | ||
| pVal->flag = CV_FLAG_NULL; | ||
|
|
||
| return TSDB_CODE_SUCCESS; | ||
| } | ||
| } else if (TK_NK_STRING != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z); | ||
| } | ||
|
|
||
| inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW); | ||
| input = tmpTokenBuf; | ||
|
|
||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_RP != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z); | ||
| } | ||
|
|
||
| int32_t outputBytes = tbase64_decode_len(inputBytes); | ||
| if (outputBytes + VARSTR_HEADER_SIZE > bytes) { | ||
| return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name); | ||
| } | ||
|
|
||
| *pData = taosMemoryMalloc(outputBytes); | ||
| if (NULL == *pData) { | ||
| return terrno; | ||
| } | ||
|
|
||
| tbase64_decode(*pData, input, inputBytes, outputBytes); | ||
| *nData = outputBytes; | ||
| } else if (0 == strncasecmp(pToken->z, "to_base64(", 10)) { | ||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_LP != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z); | ||
| } | ||
|
|
||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NULL == pToken->type) { | ||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_RP == pToken->type) { | ||
| pVal->flag = CV_FLAG_NULL; | ||
|
|
||
| return TSDB_CODE_SUCCESS; | ||
| } | ||
| } else if (TK_NK_STRING != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z); | ||
| } | ||
|
|
||
| inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW); | ||
| input = tmpTokenBuf; | ||
|
|
||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_RP != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z); | ||
| } | ||
|
|
||
| int32_t outputBytes = tbase64_encode_len(inputBytes); | ||
| if (outputBytes + VARSTR_HEADER_SIZE > bytes) { | ||
| return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name); | ||
| } | ||
|
|
||
| *pData = taosMemoryMalloc(outputBytes); | ||
| if (NULL == *pData) { | ||
| return terrno; | ||
| } | ||
|
|
||
| tbase64_encode(*pData, input, inputBytes, outputBytes); | ||
| *nData = outputBytes; | ||
| } else if (0 == strncasecmp(pToken->z, "md5(", 4)) { | ||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_LP != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z); | ||
| } | ||
|
|
||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NULL == pToken->type) { | ||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_RP == pToken->type) { | ||
| pVal->flag = CV_FLAG_NULL; | ||
|
|
||
| return TSDB_CODE_SUCCESS; | ||
| } | ||
| } else if (TK_NK_STRING != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z); | ||
| } | ||
|
|
||
| inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW); | ||
| input = tmpTokenBuf; | ||
|
|
||
| NEXT_VALID_TOKEN(*ppSql, *pToken); | ||
| if (TK_NK_RP != pToken->type) { | ||
| return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z); | ||
| } | ||
|
|
||
| if (MD5_OUTPUT_LEN + VARSTR_HEADER_SIZE > bytes) { | ||
| return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name); | ||
| } | ||
|
|
||
| int32_t bufLen = TMAX(MD5_OUTPUT_LEN + VARSTR_HEADER_SIZE + 1, inputBytes); | ||
| *pData = taosMemoryMalloc(bufLen); | ||
| if (NULL == *pData) { | ||
| return terrno; | ||
| } | ||
|
|
||
| (void)memcpy(*pData, input, inputBytes); | ||
| int32_t len = taosCreateMD5Hash(*pData, inputBytes); | ||
| *nData = len; | ||
| } else { | ||
| return buildSyntaxErrMsg(&pCxt->msg, "invalid identifier", pToken->z); | ||
| } | ||
| } else { | ||
| if (pToken->n + VARSTR_HEADER_SIZE > bytes) { | ||
| return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name); | ||
| } | ||
|
|
||
| *pData = taosMemoryMalloc(pToken->n); | ||
| if (NULL == *pData) { | ||
| return terrno; | ||
| } | ||
| memcpy(*pData, pToken->z, pToken->n); | ||
| *nData = pToken->n; | ||
| } | ||
|
|
||
| pVal->flag = CV_FLAG_VALUE; | ||
| return TSDB_CODE_SUCCESS; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function parseBinary contains duplicated code for parsing function arguments for from_base64, to_base64, and md5. This logic for parsing a single string argument from a function call can be extracted into a helper function to improve maintainability and reduce code duplication.
A helper function could look like this:
static int32_t parseStringArgument(SInsertParseContext* pCxt, const char** ppSql, SToken* pToken, char* outBuf, int32_t bufSize, int32_t* outBytes, bool* isNull) {
*isNull = false;
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_LP != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
}
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NULL == pToken->type) {
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_RP == pToken->type) {
*isNull = true;
return TSDB_CODE_SUCCESS;
}
} else if (TK_NK_STRING != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
}
*outBytes = trimString(pToken->z, pToken->n, outBuf, bufSize);
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_RP != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
}
return TSDB_CODE_SUCCESS;
}Then you can use this helper in parseBinary to simplify the code.
source/libs/scalar/src/sclfunc.c
Outdated
| base64Out[j++] = base64Table[(triple >> 6) & 0x3F]; | ||
| base64Out[j++] = base64Table[triple & 0x3F]; | ||
| } | ||
| tbase64_encode(out, varDataVal(input), inputLen, outputLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a type mismatch in the call to tbase64_encode. It expects uint8_t* and const uint8_t*, but char* is passed. This can lead to issues with signed characters on some platforms. Please cast the pointers to the correct types to ensure correctness and improve code clarity.
tbase64_encode((uint8_t *)out, (const uint8_t *)varDataVal(input), inputLen, outputLength);
Description
Please briefly describe the code changes in this pull request.
Checklist
Please check the items in the checklist if applicable.