fix(bigquery): retry on tcp timeout errors #13269
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes: Hello @MartinSahlen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request enhances the resilience of the BigQuery client's retry logic.
Code Review
This pull request provides a valuable fix for an important resilience issue in the BigQuery client. The detailed description accurately identifies that dial tcp: i/o timeout errors were not being automatically retried. The implemented change, which adds a check for timeout errors using the Timeout() method in the retryableError function, is the correct and modern approach in Go for handling such transient network issues. This small and targeted modification will significantly improve the client's robustness and user experience by handling common network timeouts automatically. The change is well-implemented and I have no further suggestions. It is ready for merging.
@MartinSahlen can you sign the CLA?
Yes, I did shortly after submitting the PR, and I also retriggered the test, which has already passed.
Thanks for the PR and the detailed writeup! One minor request: could you add a test case for this to TestRetryableErrors in bigquery_test.go? It looks like https://pkg.go.dev/net#DNSError gives us an easy-to-use example error for this case.
Thanks! I can give it a go. First, I need to understand the test structure and how to set up some mock errors/responses; I'll give a shout if I get stuck.
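(For illustration only: a minimal sketch of the suggested example error, not the test case that would actually be added. The package and test names here are hypothetical; the real change would be a new entry in the existing TestRetryableErrors table in bigquery_test.go.)

```go
package bigquery_test

import (
	"net"
	"testing"
)

// Sketch only: shows that a net.DNSError with IsTimeout set satisfies the
// Timeout() check this fix relies on. The real test case would assert the
// retryableError predicate's result for such an error.
func TestDNSTimeoutExampleError(t *testing.T) {
	var err net.Error = &net.DNSError{Err: "i/o timeout", Name: "bigquery.googleapis.com", IsTimeout: true}
	if !err.Timeout() {
		t.Fatal("expected Timeout() to report true for a DNS timeout error")
	}
}
```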
Description
The `cloud.google.com/go/bigquery` client does not automatically retry API calls that fail with a `dial tcp: i/o timeout` error. This type of error is a common transient network failure, especially in distributed cloud environments, and often occurs when initiating a connection.
The underlying Go error wrapper correctly identifies this as a retryable error (as seen by `retryable: true` in the error message), but the BigQuery client's internal retry predicate fails to catch it and immediately propagates the error to the user. This forces developers to build their own complex retry wrappers around the client to handle something that should ideally be covered by the library's built-in resilience mechanisms.
Expected Behavior
When an API call (such as `jobs.insert` or `jobs.query`) fails with a `dial tcp: i/o timeout`, the client library should recognize this as a transient, retryable error and automatically retry the operation using its built-in exponential backoff strategy.
Actual Behavior
The API call fails immediately and returns the i/o timeout error directly to the caller. No retry is attempted by the library.
The full error message is similar to the following:
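(The original error output was not preserved in this writeup. The following is an illustrative reconstruction of the shape of a wrapped Go dial timeout; the endpoint, project, and IP address are placeholders, and the exact wrapping, including the `retryable: true` annotation mentioned above, may differ.)

```
Post "https://bigquery.googleapis.com/bigquery/v2/projects/my-project/jobs": dial tcp 142.250.74.202:443: i/o timeout
```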
Code Snippet
The issue can be observed with any standard API call that initiates a network request. For example, when using a Loader to start a job:
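(The original snippet was not captured above; the following is a minimal, hypothetical reproduction sketch. The project ID, dataset, table, and GCS URI are placeholders.)

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()

	// Placeholder project ID.
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// Placeholder GCS URI, dataset, and table.
	gcsRef := bigquery.NewGCSReference("gs://my-bucket/data.json")
	gcsRef.SourceFormat = bigquery.JSON

	loader := client.Dataset("my_dataset").Table("my_table").LoaderFrom(gcsRef)

	// If the initial connection hits a dial timeout, this call currently
	// returns the error immediately instead of being retried by the library.
	job, err := loader.Run(ctx)
	if err != nil {
		log.Fatalf("loader.Run: %v", err) // e.g. "... dial tcp ...: i/o timeout"
	}
	fmt.Println("started load job:", job.ID())
}
```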
Additional Context & Analysis
The root cause appears to be in the library's internal `retryableError` predicate. This function does not check for errors that satisfy the `net.Error` `Timeout()` method. The current implementation checks for `interface{ Temporary() bool }`:
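(The code excerpt that belongs here was not preserved; this is a sketch of the style of check being described, not the library's exact source.)

```go
// Illustrative sketch of the existing style of check: only the deprecated
// Temporary() method is consulted, so errors that only report Timeout() are
// not recognized as retryable.
if t, ok := err.(interface{ Temporary() bool }); ok && t.Temporary() {
	return true
}
```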
However, a `dial tcp: i/o timeout` is a `net.Error` where `Timeout()` returns `true`, but `Temporary()` may not. The `Temporary()` method was deprecated in Go 1.18 because its definition was ambiguous and ill-defined; most errors that were once "temporary" are now more accurately classified as timeouts. Because the library's predicate relies on this deprecated method and omits a check for the `Timeout()` method, it fails to identify one of the most common types of transient network errors.

The proposed fix in this PR is to update the `retryableError` predicate to also include a check for timeout errors, for example:
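(Likewise a sketch of the kind of check being proposed; see this PR's diff for the actual change.)

```go
// Proposed addition (illustrative): also treat errors that report a timeout
// as retryable.
if t, ok := err.(interface{ Timeout() bool }); ok && t.Timeout() {
	return true
}
```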
Adding this case will improve the client's resilience and align its behavior with the expectation that transient network timeouts are handled automatically.
Hoping for positive feedback on this one and that we can get it merged quickly. Cheers!