Add backoff mechanism to google driver operations checks #4600

tmaczukin · 2018-10-29T21:00:18Z

This PR fixes a problem that can be observed when the Google driver is used in a highly concurrent environment.

Background

After each operation scheduled by Docker Machine on GCE (e.g. instance creation, instance deletion, disk deletion), a loop that polls the Operations API is started, and it runs until the relevant operation is marked as DONE.

When creating instances manually this is mostly not a problem. But when creating instances automatically, especially in highly concurrent environments, this becomes an issue.

From tests in the GitLab.com Shared Runners environment we've found that a simple disk deletion takes more than 50 iterations of the loop, one per second. Looking only at today, we schedule up to 200 machine creation and 200 machine deletion events per minute (with an average of ~90 creation and ~90 deletion events per minute). Each event starts several operations (e.g. machine deletion first requests GCE instance deletion and then disk deletion).

In the end this results in thousands of Operation Read Requests executed via the GCP API. By default these API requests share a limit of 4000 requests per 100 seconds. In an environment like ours, the current implementation exceeds the API rate limits multiple times a day, as soon as the CI job load grows over some minimum level.

A detailed investigation and an explanation of why the current implementation is bad can be found at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5375.

During this investigation we prepared a patch, which is proposed in this PR.

A patched version of Docker Machine was deployed to our CI Runners fleet and it resolved the API rate limiting problem immediately. Measurements taken after the patch was deployed can also be found at the linked page.

What this PR does

The change is simple: the infinite loop that requests the Operations API once per second is replaced with a loop that:

has a finite (and configurable) maximum duration,
replaces the 1s delay with a backoff mechanism, which reduces the number of requests when the operation takes longer,
uses a backoff library that was already part of Docker Machine's vendor/.

The backoff can be configured with command line options, so every user can adjust it to their own needs.

I know about the maintenance mode and I've read that new features, drivers or provisioners will not be merged. But looking at our findings at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5375, I consider this change a bug fix, not a new feature. The current implementation, at a slightly bigger scale, just doesn't work :)

Signed-off-by: Tomasz Maczukin <tomasz@maczukin.pl>

Add backoff mechanism to google driver operations checks

99ee007

Signed-off-by: Tomasz Maczukin <tomasz@maczukin.pl>

tmaczukin force-pushed the add-backoff-mechanism-to-google-driver-operations-checks branch from 9248719 to 99ee007 Compare December 5, 2018 19:12
