This repository was archived by the owner on Jul 18, 2025. It is now read-only.
Add backoff mechanism to google driver operations checks #4600
This PR fixes a problem that can be observed when the Google driver is used in a highly concurrent environment.
Background
After each operation scheduled by Docker Machine on GCE (e.g. instance creation, instance deletion, disk deletion etc.), a loop that queries the Operations API is started, and it runs until the relevant operation is marked as DONE. When creating instances manually this is mostly not a problem. But when creating instances automatically, especially in highly concurrent environments, this becomes an issue.
From tests in the GitLab.com Shared Runners environment we've found that a simple disk deletion ends with more than 50 iterations of the loop, one per second. Looking only at today, we schedule up to 200 machine creation and 200 machine deletion events per minute (with an average of ~90 creation and ~90 deletion events each minute). Each event starts several operations (e.g. machine deletion first requests GCE instance deletion and then disk deletion).
In the end this results in thousands of Operation read requests executed via the GCP API. By default these API requests share a limit of 4000 requests per 100 seconds. In an environment like ours, the current implementation exceeds the API rate limits multiple times a day, whenever the CI job load grows above some minimum level.
A detailed investigation and an explanation of why the current implementation is bad can be found at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5375.
During this investigation we prepared the patch that is proposed in this PR.
A patched version of Docker Machine was deployed to our CI Runners fleet and it resolved the API rate limiting problem immediately. Measurements taken after the patch was deployed can also be found at the linked page.
What this PR does
The change is simple: the infinite loop that queries the Operations API once every second is replaced with a loop that:
vendor/.
The backoff used can be configured with command line options, so every user may adjust it to their own needs.
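To illustrate the idea, here is a minimal sketch of polling an operation with an exponential backoff instead of a fixed one-second interval. It uses only the Go standard library and is not the code from this PR: the `backoffConfig` fields, the default values in `exampleBackoff`, and the `waitForOperation` and `simulateRequests` helpers are all hypothetical names chosen for the example, and the status check is a simulated stand-in for a real Operations API call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoffConfig holds hypothetical tuning knobs. The names are illustrative
// and are not the command line options added by the PR.
type backoffConfig struct {
	InitialInterval time.Duration // delay before the second request
	MaxInterval     time.Duration // upper bound for a single delay
	Multiplier      float64       // growth factor applied after each request
	MaxElapsedTime  time.Duration // give up once this much time was spent waiting
}

// exampleBackoff contains assumed values, picked only for the demonstration.
var exampleBackoff = backoffConfig{
	InitialInterval: time.Second,
	MaxInterval:     30 * time.Second,
	Multiplier:      2,
	MaxElapsedTime:  5 * time.Minute,
}

var errTimeout = errors.New("operation not DONE before the deadline")

// waitForOperation calls check until it reports "DONE", waiting longer after
// every unsuccessful request. It returns the number of requests it made.
// sleep is injected so the loop can be exercised without real waiting.
func waitForOperation(check func() (string, error), cfg backoffConfig, sleep func(time.Duration)) (int, error) {
	interval := cfg.InitialInterval
	var elapsed time.Duration
	requests := 0
	for {
		requests++
		status, err := check()
		if err != nil {
			return requests, err
		}
		if status == "DONE" {
			return requests, nil
		}
		if elapsed >= cfg.MaxElapsedTime {
			return requests, errTimeout
		}
		sleep(interval)
		elapsed += interval
		// Grow the delay, but never beyond the configured maximum.
		interval = time.Duration(float64(interval) * cfg.Multiplier)
		if interval > cfg.MaxInterval {
			interval = cfg.MaxInterval
		}
	}
}

// simulateRequests runs the loop against a fake operation that becomes DONE
// after doneAfter of virtual time, so no real API or real sleeping is involved.
func simulateRequests(doneAfter time.Duration, cfg backoffConfig) (int, error) {
	var clock time.Duration
	check := func() (string, error) {
		if clock >= doneAfter {
			return "DONE", nil
		}
		return "RUNNING", nil
	}
	sleep := func(d time.Duration) { clock += d }
	return waitForOperation(check, cfg, sleep)
}

func main() {
	n, err := simulateRequests(50*time.Second, exampleBackoff)
	fmt.Println(n, err) // prints "7 <nil>"
}
```

With these assumed values, an operation that needs 50 seconds is observed as finished after 7 requests, where a fixed one-second loop would need about 50. The trade-off is that completion may be noticed later than with the fixed interval (here at 61 seconds of virtual time), which is why making the intervals configurable matters.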
I know about the maintenance mode and I've read that new features, drivers or provisioners will not be merged. But looking at our findings at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5375, I consider this change a bug fix, not a new feature. The current implementation, at a slightly bigger scale, just doesn't work :)
Signed-off-by: Tomasz Maczukin <tomasz@maczukin.pl>