Skip to content

Conversation

@stephenkgu
Copy link
Contributor

Description

Please briefly describe the code changes in this pull request.

Checklist

Please check the items in the checklist if applicable.

  • Is the user manual updated?
  • Are the test cases passed and automated?
  • Is there no significant decrease in test coverage?
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @stephenkgu, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances data manipulation flexibility for BINARY column insertions. It integrates new functionalities that allow users to apply cryptographic hashing (MD5) and Base64 encoding/decoding directly within INSERT statements. This streamlines data preparation and storage by performing these transformations at the point of data ingestion, leveraging a newly introduced and more robust internal Base64 utility.

Highlights

  • MD5 Support in INSERT Statements: The system now supports the direct use of the MD5() function when inserting data into BINARY columns, allowing for on-the-fly hashing of input values.
  • Base64 Encoding/Decoding in INSERT Statements: New TO_BASE64() and FROM_BASE64() functions are introduced, enabling users to encode or decode Base64 strings directly during INSERT operations for BINARY columns.
  • Refactored Base64 Utility: A new, self-contained tbase64 utility has been implemented and integrated, replacing previous ad-hoc Base64 logic and providing more robust encoding and decoding capabilities.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for md5() and from_base64() functions during data insertion, along with a new to_base64() function. The changes include a new base64 implementation and updates to the SQL parser to handle these functions for BINARY data types. The overall implementation is good, but I've identified a critical issue with error handling in the new tbase64_decode function that could lead to silent data corruption. I've also suggested refactoring to reduce code duplication in the parser and fixing some type-mismatches for safer type handling. Addressing these points will improve the robustness and maintainability of the new features.

Comment on lines 557 to 558
tbase64_decode(*pData, input, inputBytes, outputBytes);
*nData = outputBytes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The return value of tbase64_decode should be checked. It can fail for invalid input, and the current implementation doesn't handle this, potentially leading to incorrect data. Assuming tbase64_decode is modified to return a status code, you should check it and return an error. Also, tbase64_decode expects const uint8_t* for input, but char* is passed. It's safer to cast it.

      if (tbase64_decode(*pData, (const uint8_t *)input, inputBytes, outputBytes) != TSDB_CODE_SUCCESS) {
        return TSDB_CODE_INVALID_DATA_FMT;
      }
      *nData = outputBytes;
Comment on lines 2330 to 2333
tbase64_decode(out, varDataVal(input), inputLen, outputLength);
varDataSetLen(outputBuf, outputLength);

SCL_ERR_JRET(colDataSetVal(pOutputData, i, outputBuf, false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The call to tbase64_decode has two issues:

  1. It passes char* where uint8_t* is expected. This can cause issues with signed characters. Explicit casting to (uint8_t*) is recommended.
  2. The return value is not checked. tbase64_decode can fail on invalid input. Assuming tbase64_decode is modified to return a status, this status should be checked to prevent processing of invalid data. For example, on failure, the column value could be set to NULL.
    if (tbase64_decode((uint8_t *)out, (const uint8_t *)varDataVal(input), inputLen, outputLength) != TSDB_CODE_SUCCESS) {
      colDataSetNULL(pOutputData, i);
      continue;
    }
    varDataSetLen(outputBuf, outputLength);

    SCL_ERR_JRET(colDataSetVal(pOutputData, i, outputBuf, false));
Comment on lines 143 to 166
void tbase64_decode(uint8_t *out, const uint8_t *input, size_t in_len, VarDataLenT out_len) {
(void)taosThreadOnce(&tbase64_decoding_table_building, tbase64_build_decoding_table);

if (in_len % 4 != 0) {
out[0] = 0;
return;
}

if (input[in_len - 1] == '=') out_len--;
if (input[in_len - 2] == '=') out_len--;

for (int i = 0, j = 0; i < in_len;) {
uint32_t sextet_a = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
uint32_t sextet_b = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
uint32_t sextet_c = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
uint32_t sextet_d = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];

uint32_t triple = (sextet_a << 3 * 6) + (sextet_b << 2 * 6) + (sextet_c << 1 * 6) + (sextet_d << 0 * 6);

if (j < out_len) out[j++] = (triple >> 2 * 8) & 0xFF;
if (j < out_len) out[j++] = (triple >> 1 * 8) & 0xFF;
if (j < out_len) out[j++] = (triple >> 0 * 8) & 0xFF;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The tbase64_decode function should return a status to indicate success or failure. Currently, it returns void and handles invalid input length (in_len % 4 != 0) by writing a zero to out[0] and returning. This silent failure can lead to callers processing invalid data. Returning a status code would allow for proper error handling. The function signature should be changed to int32_t tbase64_decode(...) and return TSDB_CODE_INVALID_DATA_FMT on error and TSDB_CODE_SUCCESS on success.

int32_t tbase64_decode(uint8_t *out, const uint8_t *input, size_t in_len, VarDataLenT out_len) {
  (void)taosThreadOnce(&tbase64_decoding_table_building, tbase64_build_decoding_table);

  if (in_len % 4 != 0) {
    return TSDB_CODE_INVALID_DATA_FMT;
  }

  if (input[in_len - 1] == '=') out_len--;
  if (input[in_len - 2] == '=') out_len--;

  for (int i = 0, j = 0; i < in_len;) {
    uint32_t sextet_a = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
    uint32_t sextet_b = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
    uint32_t sextet_c = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
    uint32_t sextet_d = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];

    uint32_t triple = (sextet_a << 3 * 6) + (sextet_b << 2 * 6) + (sextet_c << 1 * 6) + (sextet_d << 0 * 6);

    if (j < out_len) out[j++] = (triple >> 2 * 8) & 0xFF;
    if (j < out_len) out[j++] = (triple >> 1 * 8) & 0xFF;
    if (j < out_len) out[j++] = (triple >> 0 * 8) & 0xFF;
  }
  return TSDB_CODE_SUCCESS;
}
Comment on lines 510 to 654
static int32_t parseBinary(SInsertParseContext* pCxt, const char** ppSql, SToken* pToken, SColVal* pVal,
SSchema* pSchema) {
int32_t bytes = pSchema->bytes;
uint8_t** pData = &pVal->value.pData;
uint32_t* nData = &pVal->value.nData;

if (TK_NK_ID == pToken->type) {
char* input = NULL;
int32_t inputBytes = 0;
char tmpTokenBuf[TSDB_MAX_BYTES_PER_ROW];

if (0 == strncasecmp(pToken->z, "from_base64(", 12)) {
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_LP != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
}

NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NULL == pToken->type) {
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_RP == pToken->type) {
pVal->flag = CV_FLAG_NULL;

return TSDB_CODE_SUCCESS;
}
} else if (TK_NK_STRING != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
}

inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
input = tmpTokenBuf;

NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_RP != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
}

int32_t outputBytes = tbase64_decode_len(inputBytes);
if (outputBytes + VARSTR_HEADER_SIZE > bytes) {
return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
}

*pData = taosMemoryMalloc(outputBytes);
if (NULL == *pData) {
return terrno;
}

tbase64_decode(*pData, input, inputBytes, outputBytes);
*nData = outputBytes;
} else if (0 == strncasecmp(pToken->z, "to_base64(", 10)) {
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_LP != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
}

NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NULL == pToken->type) {
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_RP == pToken->type) {
pVal->flag = CV_FLAG_NULL;

return TSDB_CODE_SUCCESS;
}
} else if (TK_NK_STRING != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
}

inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
input = tmpTokenBuf;

NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_RP != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
}

int32_t outputBytes = tbase64_encode_len(inputBytes);
if (outputBytes + VARSTR_HEADER_SIZE > bytes) {
return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
}

*pData = taosMemoryMalloc(outputBytes);
if (NULL == *pData) {
return terrno;
}

tbase64_encode(*pData, input, inputBytes, outputBytes);
*nData = outputBytes;
} else if (0 == strncasecmp(pToken->z, "md5(", 4)) {
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_LP != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
}

NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NULL == pToken->type) {
NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_RP == pToken->type) {
pVal->flag = CV_FLAG_NULL;

return TSDB_CODE_SUCCESS;
}
} else if (TK_NK_STRING != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
}

inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
input = tmpTokenBuf;

NEXT_VALID_TOKEN(*ppSql, *pToken);
if (TK_NK_RP != pToken->type) {
return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
}

if (MD5_OUTPUT_LEN + VARSTR_HEADER_SIZE > bytes) {
return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
}

int32_t bufLen = TMAX(MD5_OUTPUT_LEN + VARSTR_HEADER_SIZE + 1, inputBytes);
*pData = taosMemoryMalloc(bufLen);
if (NULL == *pData) {
return terrno;
}

(void)memcpy(*pData, input, inputBytes);
int32_t len = taosCreateMD5Hash(*pData, inputBytes);
*nData = len;
} else {
return buildSyntaxErrMsg(&pCxt->msg, "invalid identifier", pToken->z);
}
} else {
if (pToken->n + VARSTR_HEADER_SIZE > bytes) {
return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
}

*pData = taosMemoryMalloc(pToken->n);
if (NULL == *pData) {
return terrno;
}
memcpy(*pData, pToken->z, pToken->n);
*nData = pToken->n;
}

pVal->flag = CV_FLAG_VALUE;
return TSDB_CODE_SUCCESS;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The function parseBinary contains duplicated code for parsing function arguments for from_base64, to_base64, and md5. This logic for parsing a single string argument from a function call can be extracted into a helper function to improve maintainability and reduce code duplication.

A helper function could look like this:

static int32_t parseStringArgument(SInsertParseContext* pCxt, const char** ppSql, SToken* pToken, char* outBuf, int32_t bufSize, int32_t* outBytes, bool* isNull) {
  *isNull = false;
  NEXT_VALID_TOKEN(*ppSql, *pToken);
  if (TK_NK_LP != pToken->type) {
    return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
  }

  NEXT_VALID_TOKEN(*ppSql, *pToken);
  if (TK_NULL == pToken->type) {
    NEXT_VALID_TOKEN(*ppSql, *pToken);
    if (TK_NK_RP == pToken->type) {
      *isNull = true;
      return TSDB_CODE_SUCCESS;
    }
  } else if (TK_NK_STRING != pToken->type) {
    return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
  }

  *outBytes = trimString(pToken->z, pToken->n, outBuf, bufSize);

  NEXT_VALID_TOKEN(*ppSql, *pToken);
  if (TK_NK_RP != pToken->type) {
    return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
  }

  return TSDB_CODE_SUCCESS;
}

Then you can use this helper in parseBinary to simplify the code.

base64Out[j++] = base64Table[(triple >> 6) & 0x3F];
base64Out[j++] = base64Table[triple & 0x3F];
}
tbase64_encode(out, varDataVal(input), inputLen, outputLength);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a type mismatch in the call to tbase64_encode. It expects uint8_t* and const uint8_t*, but char* is passed. This can lead to issues with signed characters on some platforms. Please cast the pointers to the correct types to ensure correctness and improve code clarity.

    tbase64_encode((uint8_t *)out, (const uint8_t *)varDataVal(input), inputLen, outputLength);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants