feat(md5): support md5 in inserting #33416

stephenkgu · 2025-10-29T09:05:43Z

Description

Please briefly describe the code changes in this pull request.

Checklist

Please check the items in the checklist if applicable.

Is the user manual updated?
Are the test cases passed and automated?
Is there no significant decrease in test coverage?

gemini-code-assist · 2025-10-29T09:06:05Z

Summary of Changes

Hello @stephenkgu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances data manipulation flexibility for BINARY column insertions. It integrates new functionalities that allow users to apply cryptographic hashing (MD5) and Base64 encoding/decoding directly within INSERT statements. This streamlines data preparation and storage by performing these transformations at the point of data ingestion, leveraging a newly introduced and more robust internal Base64 utility.

Highlights

MD5 Support in INSERT Statements: The system now supports the direct use of the MD5() function when inserting data into BINARY columns, allowing for on-the-fly hashing of input values.
Base64 Encoding/Decoding in INSERT Statements: New TO_BASE64() and FROM_BASE64() functions are introduced, enabling users to encode or decode Base64 strings directly during INSERT operations for BINARY columns.
Refactored Base64 Utility: A new, self-contained tbase64 utility has been implemented and integrated, replacing previous ad-hoc Base64 logic and providing more robust encoding and decoding capabilities.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces support for md5() and from_base64() functions during data insertion, along with a new to_base64() function. The changes include a new base64 implementation and updates to the SQL parser to handle these functions for BINARY data types. The overall implementation is good, but I've identified a critical issue with error handling in the new tbase64_decode function that could lead to silent data corruption. I've also suggested refactoring to reduce code duplication in the parser and fixing some type-mismatches for safer type handling. Addressing these points will improve the robustness and maintainability of the new features.

gemini-code-assist · 2025-10-29T09:09:53Z

source/libs/parser/src/parInsertSql.c

+      tbase64_decode(*pData, input, inputBytes, outputBytes);
+      *nData = outputBytes;


The return value of tbase64_decode should be checked. It can fail for invalid input, and the current implementation doesn't handle this, potentially leading to incorrect data. Assuming tbase64_decode is modified to return a status code, you should check it and return an error. Also, tbase64_decode expects const uint8_t* for input, but char* is passed. It's safer to cast it.

if (tbase64_decode(*pData, (const uint8_t *)input, inputBytes, outputBytes) != TSDB_CODE_SUCCESS) { return TSDB_CODE_INVALID_DATA_FMT; } *nData = outputBytes;

gemini-code-assist · 2025-10-29T09:09:53Z

source/libs/scalar/src/sclfunc.c

+    tbase64_decode(out, varDataVal(input), inputLen, outputLength);
    varDataSetLen(outputBuf, outputLength);
+
    SCL_ERR_JRET(colDataSetVal(pOutputData, i, outputBuf, false));


The call to tbase64_decode has two issues:

It passes char* where uint8_t* is expected. This can cause issues with signed characters. Explicit casting to (uint8_t*) is recommended.

The return value is not checked. tbase64_decode can fail on invalid input. Assuming tbase64_decode is modified to return a status, this status should be checked to prevent processing of invalid data. For example, on failure, the column value could be set to NULL.

if (tbase64_decode((uint8_t *)out, (const uint8_t *)varDataVal(input), inputLen, outputLength) != TSDB_CODE_SUCCESS) { colDataSetNULL(pOutputData, i); continue; } varDataSetLen(outputBuf, outputLength); SCL_ERR_JRET(colDataSetVal(pOutputData, i, outputBuf, false));

gemini-code-assist · 2025-10-29T09:09:54Z

source/util/src/tbase64.c

+void tbase64_decode(uint8_t *out, const uint8_t *input, size_t in_len, VarDataLenT out_len) {
+  (void)taosThreadOnce(&tbase64_decoding_table_building, tbase64_build_decoding_table);
+
+  if (in_len % 4 != 0) {
+    out[0] = 0;
+    return;
+  }
+
+  if (input[in_len - 1] == '=') out_len--;
+  if (input[in_len - 2] == '=') out_len--;
+
+  for (int i = 0, j = 0; i < in_len;) {
+    uint32_t sextet_a = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
+    uint32_t sextet_b = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
+    uint32_t sextet_c = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
+    uint32_t sextet_d = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]];
+
+    uint32_t triple = (sextet_a << 3 * 6) + (sextet_b << 2 * 6) + (sextet_c << 1 * 6) + (sextet_d << 0 * 6);
+
+    if (j < out_len) out[j++] = (triple >> 2 * 8) & 0xFF;
+    if (j < out_len) out[j++] = (triple >> 1 * 8) & 0xFF;
+    if (j < out_len) out[j++] = (triple >> 0 * 8) & 0xFF;
+  }
+}


The tbase64_decode function should return a status to indicate success or failure. Currently, it returns void and handles invalid input length (in_len % 4 != 0) by writing a zero to out[0] and returning. This silent failure can lead to callers processing invalid data. Returning a status code would allow for proper error handling. The function signature should be changed to int32_t tbase64_decode(...) and return TSDB_CODE_INVALID_DATA_FMT on error and TSDB_CODE_SUCCESS on success.

int32_t tbase64_decode(uint8_t *out, const uint8_t *input, size_t in_len, VarDataLenT out_len) { (void)taosThreadOnce(&tbase64_decoding_table_building, tbase64_build_decoding_table); if (in_len % 4 != 0) { return TSDB_CODE_INVALID_DATA_FMT; } if (input[in_len - 1] == '=') out_len--; if (input[in_len - 2] == '=') out_len--; for (int i = 0, j = 0; i < in_len;) { uint32_t sextet_a = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]]; uint32_t sextet_b = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]]; uint32_t sextet_c = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]]; uint32_t sextet_d = input[i] == '=' ? 0 & i++ : tbase64_decoding_table[input[i++]]; uint32_t triple = (sextet_a << 3 * 6) + (sextet_b << 2 * 6) + (sextet_c << 1 * 6) + (sextet_d << 0 * 6); if (j < out_len) out[j++] = (triple >> 2 * 8) & 0xFF; if (j < out_len) out[j++] = (triple >> 1 * 8) & 0xFF; if (j < out_len) out[j++] = (triple >> 0 * 8) & 0xFF; } return TSDB_CODE_SUCCESS; }

gemini-code-assist · 2025-10-29T09:09:54Z

source/libs/parser/src/parInsertSql.c

+static int32_t parseBinary(SInsertParseContext* pCxt, const char** ppSql, SToken* pToken, SColVal* pVal,
+                           SSchema* pSchema) {
+  int32_t   bytes = pSchema->bytes;
+  uint8_t** pData = &pVal->value.pData;
+  uint32_t* nData = &pVal->value.nData;
+
+  if (TK_NK_ID == pToken->type) {
+    char*   input = NULL;
+    int32_t inputBytes = 0;
+    char    tmpTokenBuf[TSDB_MAX_BYTES_PER_ROW];
+
+    if (0 == strncasecmp(pToken->z, "from_base64(", 12)) {
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_LP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NULL == pToken->type) {
+        NEXT_VALID_TOKEN(*ppSql, *pToken);
+        if (TK_NK_RP == pToken->type) {
+          pVal->flag = CV_FLAG_NULL;
+
+          return TSDB_CODE_SUCCESS;
+        }
+      } else if (TK_NK_STRING != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
+      }
+
+      inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
+      input = tmpTokenBuf;
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_RP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
+      }
+
+      int32_t outputBytes = tbase64_decode_len(inputBytes);
+      if (outputBytes + VARSTR_HEADER_SIZE > bytes) {
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+      }
+
+      *pData = taosMemoryMalloc(outputBytes);
+      if (NULL == *pData) {
+        return terrno;
+      }
+
+      tbase64_decode(*pData, input, inputBytes, outputBytes);
+      *nData = outputBytes;
+    } else if (0 == strncasecmp(pToken->z, "to_base64(", 10)) {
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_LP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NULL == pToken->type) {
+        NEXT_VALID_TOKEN(*ppSql, *pToken);
+        if (TK_NK_RP == pToken->type) {
+          pVal->flag = CV_FLAG_NULL;
+
+          return TSDB_CODE_SUCCESS;
+        }
+      } else if (TK_NK_STRING != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
+      }
+
+      inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
+      input = tmpTokenBuf;
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_RP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
+      }
+
+      int32_t outputBytes = tbase64_encode_len(inputBytes);
+      if (outputBytes + VARSTR_HEADER_SIZE > bytes) {
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+      }
+
+      *pData = taosMemoryMalloc(outputBytes);
+      if (NULL == *pData) {
+        return terrno;
+      }
+
+      tbase64_encode(*pData, input, inputBytes, outputBytes);
+      *nData = outputBytes;
+    } else if (0 == strncasecmp(pToken->z, "md5(", 4)) {
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_LP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NULL == pToken->type) {
+        NEXT_VALID_TOKEN(*ppSql, *pToken);
+        if (TK_NK_RP == pToken->type) {
+          pVal->flag = CV_FLAG_NULL;
+
+          return TSDB_CODE_SUCCESS;
+        }
+      } else if (TK_NK_STRING != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
+      }
+
+      inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
+      input = tmpTokenBuf;
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_RP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
+      }
+
+      if (MD5_OUTPUT_LEN + VARSTR_HEADER_SIZE > bytes) {
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+      }
+
+      int32_t bufLen = TMAX(MD5_OUTPUT_LEN + VARSTR_HEADER_SIZE + 1, inputBytes);
+      *pData = taosMemoryMalloc(bufLen);
+      if (NULL == *pData) {
+        return terrno;
+      }
+
+      (void)memcpy(*pData, input, inputBytes);
+      int32_t len = taosCreateMD5Hash(*pData, inputBytes);
+      *nData = len;
+    } else {
+      return buildSyntaxErrMsg(&pCxt->msg, "invalid identifier", pToken->z);
+    }
+  } else {
+    if (pToken->n + VARSTR_HEADER_SIZE > bytes) {
+      return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+    }
+
+    *pData = taosMemoryMalloc(pToken->n);
+    if (NULL == *pData) {
+      return terrno;
+    }
+    memcpy(*pData, pToken->z, pToken->n);
+    *nData = pToken->n;
+  }
+
+  pVal->flag = CV_FLAG_VALUE;
+  return TSDB_CODE_SUCCESS;
+}


The function parseBinary contains duplicated code for parsing function arguments for from_base64, to_base64, and md5. This logic for parsing a single string argument from a function call can be extracted into a helper function to improve maintainability and reduce code duplication.

A helper function could look like this:

static int32_t parseStringArgument(SInsertParseContext* pCxt, const char** ppSql, SToken* pToken, char* outBuf, int32_t bufSize, int32_t* outBytes, bool* isNull) { *isNull = false; NEXT_VALID_TOKEN(*ppSql, *pToken); if (TK_NK_LP != pToken->type) { return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z); } NEXT_VALID_TOKEN(*ppSql, *pToken); if (TK_NULL == pToken->type) { NEXT_VALID_TOKEN(*ppSql, *pToken); if (TK_NK_RP == pToken->type) { *isNull = true; return TSDB_CODE_SUCCESS; } } else if (TK_NK_STRING != pToken->type) { return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z); } *outBytes = trimString(pToken->z, pToken->n, outBuf, bufSize); NEXT_VALID_TOKEN(*ppSql, *pToken); if (TK_NK_RP != pToken->type) { return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z); } return TSDB_CODE_SUCCESS; }

Then you can use this helper in parseBinary to simplify the code.

gemini-code-assist · 2025-10-29T09:09:54Z

source/libs/scalar/src/sclfunc.c

-    base64Out[j++] = base64Table[(triple >> 6) & 0x3F];
-    base64Out[j++] = base64Table[triple & 0x3F];
-  }
+    tbase64_encode(out, varDataVal(input), inputLen, outputLength);


There is a type mismatch in the call to tbase64_encode. It expects uint8_t* and const uint8_t*, but char* is passed. This can lead to issues with signed characters on some platforms. Please cast the pointers to the correct types to ensure correctness and improve code clarity.

tbase64_encode((uint8_t *)out, (const uint8_t *)varDataVal(input), inputLen, outputLength);

stephenkgu added 4 commits October 27, 2025 16:47

feat(encoding): new security funcs

8b3608b

Merge branch '3.0' into fix/TS-7235

36b1751

fix null with {from, to}_base64

c154911

feat(md5): support md5 in inserting

25876aa

stephenkgu requested review from dapan1121, guanshengliang and hzcheng as code owners October 29, 2025 09:05

taosdata-bot bot added the team engine label Oct 29, 2025

gemini-code-assist bot reviewed Oct 29, 2025

View reviewed changes

stephenkgu and others added 4 commits October 29, 2025 18:23

fix failed decoding cases

096c37a

new sha/sha1 digesting funcs

d6c20c2

fix CI compiling issue

1174581

Merge branch '3.0' into fix/TS-7235-md5

da093d8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(md5): support md5 in inserting #33416

feat(md5): support md5 in inserting #33416

stephenkgu commented Oct 29, 2025

gemini-code-assist bot commented Oct 29, 2025

gemini-code-assist bot left a comment

gemini-code-assist bot Oct 29, 2025

gemini-code-assist bot Oct 29, 2025

gemini-code-assist bot Oct 29, 2025

gemini-code-assist bot Oct 29, 2025

gemini-code-assist bot Oct 29, 2025

Labels

2 participants

		tbase64_decode(*pData, input, inputBytes, outputBytes);
		*nData = outputBytes;

feat(md5): support md5 in inserting #33416

Are you sure you want to change the base?

feat(md5): support md5 in inserting #33416

Conversation

stephenkgu commented Oct 29, 2025

Description

Checklist

gemini-code-assist bot commented Oct 29, 2025

Summary of Changes

Highlights

Footnotes

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

gemini-code-assist bot Oct 29, 2025

Choose a reason for hiding this comment

gemini-code-assist bot Oct 29, 2025

Choose a reason for hiding this comment

gemini-code-assist bot Oct 29, 2025

Choose a reason for hiding this comment

gemini-code-assist bot Oct 29, 2025

Choose a reason for hiding this comment

gemini-code-assist bot Oct 29, 2025

Choose a reason for hiding this comment

Labels

2 participants