
Conversation

@nwnt
Member

@nwnt nwnt commented Jun 28, 2025

For #20137

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nwnt
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nwnt
Member Author

nwnt commented Jun 28, 2025

@serathius Not sure if this is the same as what you have in mind. I probably missed a lot of things, so I'd definitely love your feedback on it.

@codecov

codecov bot commented Jun 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@26dc1bc). Learn more about missing BASE report.
⚠️ Report is 366 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20239   +/-   ##
=======================================
  Coverage        ?   69.44%           
=======================================
  Files           ?      418           
  Lines           ?    34696           
  Branches        ?        0           
=======================================
  Hits            ?    24096           
  Misses          ?     9219           
  Partials        ?     1381           

MemberReplace Failpoint = memberReplace{}
MemberDowngrade Failpoint = memberDowngrade{}
MemberDowngradeUpgrade Failpoint = memberDowngradeUpgrade{}
GrowMember Failpoint = &growMember{
Member

We are not growing members here :P

We usually talk about the "cluster" rather than "members", and we call the process "scale up". Maybe ClusterScaleUp?

Member Author

How about ClusterScaleInAndOut because we're doing both? :)

}

type growMember struct {
currentSize int
Member

Which "current" you mean? Maybe initialSize.

Member Author

I no longer need this initialSize. It's now scalingTarget.

}

func (f *growMember) Available(config e2e.EtcdProcessClusterConfig, member e2e.EtcdProcess, profile traffic.Profile) bool {
return f.currentSize < f.targetSize
Member

Here is the tricky part. If you start from a cluster of size 1, then the traffic will only talk to that single initial member. I would recommend instead starting from 3 members, removing 2 and adding back 2. This is still tricky because StartNewProc will allocate unique port numbers, so we will need to work around that.

Member Author

Yep. How about the current one?
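
As a rough illustration of the remove-then-add flow suggested above (a sketch only, not the PR's implementation): the hypothetical scaleInAndOut below uses the plain clientv3 membership API, whereas the real failpoint drives the e2e framework's process helpers (StartNewProc and friends), which is where the port-allocation workaround comes in. The endpoints, member IDs, and peer URLs are placeholders.

```go
// Sketch only: scale a 3-member cluster in by removing two members, then
// scale it back out by re-adding two. The robustness failpoint would also
// have to stop/start the corresponding etcd processes via the e2e cluster.
package sketch

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func scaleInAndOut(ctx context.Context, endpoints []string, removeIDs []uint64, peerURLs [][]string) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Scale in: remove two members (their processes would be stopped by the framework).
	for _, id := range removeIDs {
		if _, err := cli.MemberRemove(ctx, id); err != nil {
			return fmt.Errorf("remove member %x: %w", id, err)
		}
	}

	// Scale out: add the members back. The new processes must come up with matching
	// peer URLs, which conflicts with StartNewProc allocating fresh ports.
	for _, urls := range peerURLs {
		if _, err := cli.MemberAdd(ctx, urls); err != nil {
			return fmt.Errorf("add member %v: %w", urls, err)
		}
	}
	return nil
}
```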

@serathius
Member

serathius commented Jun 30, 2025

Good work on the implementation; implementing a new failpoint is pretty tricky.

Got an error: main_test.go:114: failed to read WAL, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 4093, Term: 3], len(ents): 0

The problem we are hitting is that PersistedRequestsCluster cannot read WAL files from the added members. The current implementation of PersistedRequestsCluster assumes that a quorum of members have all entries in the WAL. In this case only the initial member has the full WAL, so it fails. I think we could rewrite it to consider each index individually and accept it as long as there is no conflict.

Unfortunately, to merge this feature we will need to rewrite PersistedRequestsCluster first.
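
To make the "each index individually" idea concrete, here is a minimal sketch under assumed names (walEntry and mergePersistedEntries are hypothetical, standing in for the real WAL/raft entry types): accept an index if any member persisted it, and only fail when two members disagree about the same index. A real rewrite would additionally have to deal with uncommitted entries that raft legitimately overwrites at a higher term, which this sketch simply reports as a conflict.

```go
// Sketch of merging per-member persisted entries index by index.
package sketch

import (
	"bytes"
	"fmt"
	"sort"
)

// walEntry is a stand-in for the real raft/WAL entry type.
type walEntry struct {
	Index, Term uint64
	Data        []byte
}

func mergePersistedEntries(perMember [][]walEntry) ([]walEntry, error) {
	merged := map[uint64]walEntry{}
	for member, ents := range perMember {
		for _, e := range ents {
			prev, ok := merged[e.Index]
			if !ok {
				merged[e.Index] = e
				continue
			}
			// Members that joined late simply miss older indexes; that is fine.
			// Two different entries at the same index are not.
			if prev.Term != e.Term || !bytes.Equal(prev.Data, e.Data) {
				return nil, fmt.Errorf("conflicting entries at index %d (seen again in member %d)", e.Index, member)
			}
		}
	}
	// Return the union of entries in index order.
	idxs := make([]uint64, 0, len(merged))
	for i := range merged {
		idxs = append(idxs, i)
	}
	sort.Slice(idxs, func(a, b int) bool { return idxs[a] < idxs[b] })
	out := make([]walEntry, 0, len(idxs))
	for _, i := range idxs {
		out = append(out, merged[i])
	}
	return out, nil
}
```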

@serathius
Member

serathius commented Aug 11, 2025

With #20435 we might be able to test it again. Can you rebase and see what errors we get? I assume we might need to change how errors are handled when reading the WAL.

@nwnt nwnt force-pushed the nwnt/test-growing-clusters branch from 1d4a309 to 86a87df on August 19, 2025 at 02:23
Signed-off-by: Nont <nont@duck.com>
@nwnt nwnt force-pushed the nwnt/test-growing-clusters branch from 86a87df to 17b929e on August 19, 2025 at 03:37
@serathius
Member

From pull-etcd-robustness-arm64, it looks like we might need to debug watch.

    traffic.go:126: Traffic finished before failure was injected: context canceled
    main_test.go:175: Client didn't collect all events, max revision not set
    main_test.go:180: context canceled
    main_test.go:107: stat .: no such file or directory
@serathius
Member

From pull-etcd-integration-1-cpu-amd64, this looks like a flake.

@serathius
Member

/retest

1 similar comment
@serathius
Member

/retest

@serathius
Member

/retest pull-etcd-robustness-arm64
/retest pull-etcd-robustness-amd64

@k8s-ci-robot

@serathius: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test pull-etcd-build
/test pull-etcd-contrib-mixin
/test pull-etcd-coverage-report
/test pull-etcd-e2e-386
/test pull-etcd-e2e-amd64
/test pull-etcd-e2e-arm64
/test pull-etcd-fuzzing-v3rpc
/test pull-etcd-govulncheck-main
/test pull-etcd-grpcproxy-e2e-amd64
/test pull-etcd-grpcproxy-e2e-arm64
/test pull-etcd-grpcproxy-integration-amd64
/test pull-etcd-grpcproxy-integration-arm64
/test pull-etcd-integration-1-cpu-amd64
/test pull-etcd-integration-1-cpu-arm64
/test pull-etcd-integration-2-cpu-amd64
/test pull-etcd-integration-2-cpu-arm64
/test pull-etcd-integration-4-cpu-amd64
/test pull-etcd-integration-4-cpu-arm64
/test pull-etcd-markdown-lint
/test pull-etcd-release-tests
/test pull-etcd-robustness-amd64
/test pull-etcd-robustness-arm64
/test pull-etcd-unit-test-386
/test pull-etcd-unit-test-amd64
/test pull-etcd-unit-test-arm64
/test pull-etcd-verify

Use /test all to run the following jobs that were automatically triggered:

pull-etcd-build
pull-etcd-contrib-mixin
pull-etcd-coverage-report
pull-etcd-e2e-386
pull-etcd-e2e-amd64
pull-etcd-e2e-arm64
pull-etcd-fuzzing-v3rpc
pull-etcd-govulncheck-main
pull-etcd-grpcproxy-e2e-amd64
pull-etcd-grpcproxy-e2e-arm64
pull-etcd-grpcproxy-integration-amd64
pull-etcd-grpcproxy-integration-arm64
pull-etcd-integration-1-cpu-amd64
pull-etcd-integration-1-cpu-arm64
pull-etcd-integration-2-cpu-amd64
pull-etcd-integration-2-cpu-arm64
pull-etcd-integration-4-cpu-amd64
pull-etcd-integration-4-cpu-arm64
pull-etcd-release-tests
pull-etcd-robustness-amd64
pull-etcd-robustness-arm64
pull-etcd-unit-test-386
pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64
pull-etcd-verify

In response to this:

/retest pull-etcd-robustness-arm64
/retest pull-etcd-robustness-amd64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Member

/test pull-etcd-robustness-arm64
/test pull-etcd-robustness-amd64

@k8s-ci-robot

@nwnt: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-coverage-report 17b929e link true /test pull-etcd-coverage-report
pull-etcd-robustness-arm64 17b929e link true /test pull-etcd-robustness-arm64
pull-etcd-robustness-amd64 17b929e link true /test pull-etcd-robustness-amd64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nwnt
Member Author

nwnt commented Aug 20, 2025

From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-arm64/1957745366933704704

{"level":"warn","ts":"2025-08-19T10:08:20.850392Z","caller":"etcdserver/server.go:854","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2025-08-19T10:08:20.850415Z","caller":"etcdserver/server.go:855","msg":"data-dir used by this member must be removed"}

Should I remove each process's data directory after scaling down during Inject?

@nwnt
Member Author

nwnt commented Aug 20, 2025

From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-amd64/1957745366849818624

/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (TestRobustnessExploratoryEtcdTrafficDeleteLeasesClusterOfSize3-test-2) (24040): {"level":"warn","ts":"2025-08-19T10:20:00.008061Z","caller":"etcdserver/server.go:1639","msg":"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"bf19ae4419db00dc","requested-member-remove":"eabdbb777cf498cb","active-peers":1,"error":"etcdserver: unhealthy cluster"}

This looks like the condition I should add to Available. What do you think @serathius?

@serathius
Member

> Should I remove each process's data directory after scaling down during Inject?

No, please don't read random logs.

> This looks like the condition I should add to Available. What do you think @serathius?

No, this log is related to the fact that membership changes are rejected if there is a suspicion that cluster members are unstable. It requires at least 5 seconds of being connected to all members. This could be mitigated by adding a 5 second sleep before membership changes.
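
In other words, the mitigation inside the failpoint's Inject would look roughly like this (a sketch with a hypothetical helper name, written against clientv3; the 5 seconds corresponds to the connectivity window described above):

```go
// Sketch only: wait out the peer-connectivity health window before issuing a
// membership change, so the leader does not reject it as "unhealthy cluster".
package sketch

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func removeMemberWhenStable(ctx context.Context, cli *clientv3.Client, memberID uint64) error {
	// The server requires roughly 5 seconds of stable connections to all peers
	// before it accepts reconfiguration requests.
	time.Sleep(5 * time.Second)
	_, err := cli.MemberRemove(ctx, memberID)
	return err
}
```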

@serathius
Member

Please read the MemberReplace failpoint; it seems there is some overlap with your case.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added stale and removed stale labels Oct 27, 2025