
Conversation

@nwnt
Member

@nwnt nwnt commented Jun 28, 2025

For #20137

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nwnt
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nwnt
Member Author

nwnt commented Jun 28, 2025

@serathius Not sure if this is the same as what you have in mind. I probably missed a lot of things, so I'd definitely love your feedback on it.

@codecov

codecov bot commented Jun 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@26dc1bc). Learn more about missing BASE report.
⚠️ Report is 366 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20239   +/-   ##
=======================================
  Coverage        ?   69.44%           
=======================================
  Files           ?      418           
  Lines           ?    34696           
  Branches        ?        0           
=======================================
  Hits            ?    24096           
  Misses          ?     9219           
  Partials        ?     1381           

MemberReplace Failpoint = memberReplace{}
MemberDowngrade Failpoint = memberDowngrade{}
MemberDowngradeUpgrade Failpoint = memberDowngradeUpgrade{}
GrowMember Failpoint = &growMember{
Member

We are not growing members here :P

We usually talk about the "cluster" rather than "members", and we call the process "scale up". Maybe ClusterScaleUp?

Member Author

How about ClusterScaleInAndOut because we're doing both? :)

}

type growMember struct {
currentSize int
Member

Which "current" you mean? Maybe initialSize.

Member Author

I no longer need this initialSize. It's now scalingTarget.

}

func (f *growMember) Available(config e2e.EtcdProcessClusterConfig, member e2e.EtcdProcess, profile traffic.Profile) bool {
return f.currentSize < f.targetSize
Member

Here is the tricky part. If you start from a cluster of size 1, then the traffic will only talk to that single initial member. I would recommend instead starting from 3 members, removing 2 and adding back 2. This is still tricky because StartNewProc will allocate unique port numbers, so we will need to work around that.

Member Author

Yep. How about the current one?
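
As a rough illustration of the remove-then-add flow suggested above (a sketch only, not the PR's implementation): the hypothetical scaleInAndOut below uses the plain clientv3 membership API, whereas the real failpoint drives the e2e framework's process helpers (StartNewProc and friends), which is where the port-allocation workaround comes in. The endpoints, member IDs, and peer URLs are placeholders.

```go
// Sketch only: scale a 3-member cluster in by removing two members, then
// scale it back out by re-adding two. The robustness failpoint would also
// have to stop/start the corresponding etcd processes via the e2e cluster.
package sketch

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func scaleInAndOut(ctx context.Context, endpoints []string, removeIDs []uint64, peerURLs [][]string) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Scale in: remove two members (their processes would be stopped by the framework).
	for _, id := range removeIDs {
		if _, err := cli.MemberRemove(ctx, id); err != nil {
			return fmt.Errorf("remove member %x: %w", id, err)
		}
	}

	// Scale out: add the members back. The new processes must come up with matching
	// peer URLs, which conflicts with StartNewProc allocating fresh ports.
	for _, urls := range peerURLs {
		if _, err := cli.MemberAdd(ctx, urls); err != nil {
			return fmt.Errorf("add member %v: %w", urls, err)
		}
	}
	return nil
}
```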

@serathius
Member

serathius commented Jun 30, 2025

Good work on the implementation; implementing a new failpoint is pretty tricky.

Got an error: main_test.go:114: failed to read WAL, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 4093, Term: 3], len(ents): 0

The problem we are hitting is that PersistedRequestsCluster cannot read WAL files from the added members. The current implementation of PersistedRequestsCluster assumes that a quorum of members have all entries in the WAL. In this case only the initial member has the full WAL, so it fails. I think we could rewrite it to consider each index individually and accept it as long as there is no conflict.

Unfortunately, to merge this feature we will need to rewrite PersistedRequestsCluster first.
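
To make the "each index individually" idea concrete, here is a minimal sketch under assumed names (walEntry and mergePersistedEntries are hypothetical, standing in for the real WAL/raft entry types): accept an index if any member persisted it, and only fail when two members disagree about the same index. A real rewrite would additionally have to deal with uncommitted entries that raft legitimately overwrites at a higher term, which this sketch simply reports as a conflict.

```go
// Sketch of merging per-member persisted entries index by index.
package sketch

import (
	"bytes"
	"fmt"
	"sort"
)

// walEntry is a stand-in for the real raft/WAL entry type.
type walEntry struct {
	Index, Term uint64
	Data        []byte
}

func mergePersistedEntries(perMember [][]walEntry) ([]walEntry, error) {
	merged := map[uint64]walEntry{}
	for member, ents := range perMember {
		for _, e := range ents {
			prev, ok := merged[e.Index]
			if !ok {
				merged[e.Index] = e
				continue
			}
			// Members that joined late simply miss older indexes; that is fine.
			// Two different entries at the same index are not.
			if prev.Term != e.Term || !bytes.Equal(prev.Data, e.Data) {
				return nil, fmt.Errorf("conflicting entries at index %d (seen again in member %d)", e.Index, member)
			}
		}
	}
	// Return the union of entries in index order.
	idxs := make([]uint64, 0, len(merged))
	for i := range merged {
		idxs = append(idxs, i)
	}
	sort.Slice(idxs, func(a, b int) bool { return idxs[a] < idxs[b] })
	out := make([]walEntry, 0, len(idxs))
	for _, i := range idxs {
		out = append(out, merged[i])
	}
	return out, nil
}
```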

@serathius
Member

serathius commented Aug 11, 2025

With #20435 we might be able to test it again. Can you rebase and see what errors we get? I assume we might need to change how errors are handled when reading the WAL.

@nwnt nwnt force-pushed the nwnt/test-growing-clusters branch from 1d4a309 to 86a87df on August 19, 2025 at 02:23
Signed-off-by: Nont <nont@duck.com>
@nwnt nwnt force-pushed the nwnt/test-growing-clusters branch from 86a87df to 17b929e on August 19, 2025 at 03:37
@serathius
Member

From pull-etcd-robustness-arm64, it looks like we might need to debug watch.

    traffic.go:126: Traffic finished before failure was injected: context canceled
    main_test.go:175: Client didn't collect all events, max revision not set
    main_test.go:180: context canceled
    main_test.go:107: stat .: no such file or directory
@serathius
Member

From pull-etcd-integration-1-cpu-amd64, this looks like a flake.

@serathius
Member

/retest

1 similar comment
@serathius
Member

/retest

@serathius
Member

/retest pull-etcd-robustness-arm64
/retest pull-etcd-robustness-amd64

@k8s-ci-robot

@serathius: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test pull-etcd-build
/test pull-etcd-contrib-mixin
/test pull-etcd-coverage-report
/test pull-etcd-e2e-386
/test pull-etcd-e2e-amd64
/test pull-etcd-e2e-arm64
/test pull-etcd-fuzzing-v3rpc
/test pull-etcd-govulncheck-main
/test pull-etcd-grpcproxy-e2e-amd64
/test pull-etcd-grpcproxy-e2e-arm64
/test pull-etcd-grpcproxy-integration-amd64
/test pull-etcd-grpcproxy-integration-arm64
/test pull-etcd-integration-1-cpu-amd64
/test pull-etcd-integration-1-cpu-arm64
/test pull-etcd-integration-2-cpu-amd64
/test pull-etcd-integration-2-cpu-arm64
/test pull-etcd-integration-4-cpu-amd64
/test pull-etcd-integration-4-cpu-arm64
/test pull-etcd-markdown-lint
/test pull-etcd-release-tests
/test pull-etcd-robustness-amd64
/test pull-etcd-robustness-arm64
/test pull-etcd-unit-test-386
/test pull-etcd-unit-test-amd64
/test pull-etcd-unit-test-arm64
/test pull-etcd-verify

Use /test all to run the following jobs that were automatically triggered:

pull-etcd-build
pull-etcd-contrib-mixin
pull-etcd-coverage-report
pull-etcd-e2e-386
pull-etcd-e2e-amd64
pull-etcd-e2e-arm64
pull-etcd-fuzzing-v3rpc
pull-etcd-govulncheck-main
pull-etcd-grpcproxy-e2e-amd64
pull-etcd-grpcproxy-e2e-arm64
pull-etcd-grpcproxy-integration-amd64
pull-etcd-grpcproxy-integration-arm64
pull-etcd-integration-1-cpu-amd64
pull-etcd-integration-1-cpu-arm64
pull-etcd-integration-2-cpu-amd64
pull-etcd-integration-2-cpu-arm64
pull-etcd-integration-4-cpu-amd64
pull-etcd-integration-4-cpu-arm64
pull-etcd-release-tests
pull-etcd-robustness-amd64
pull-etcd-robustness-arm64
pull-etcd-unit-test-386
pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64
pull-etcd-verify

In response to this:

/retest pull-etcd-robustness-arm64
/retest pull-etcd-robustness-amd64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Member

/test pull-etcd-robustness-arm64
/test pull-etcd-robustness-amd64

@k8s-ci-robot

@nwnt: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-coverage-report 17b929e link true /test pull-etcd-coverage-report
pull-etcd-robustness-arm64 17b929e link true /test pull-etcd-robustness-arm64
pull-etcd-robustness-amd64 17b929e link true /test pull-etcd-robustness-amd64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nwnt
Member Author

nwnt commented Aug 20, 2025

From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-arm64/1957745366933704704

{"level":"warn","ts":"2025-08-19T10:08:20.850392Z","caller":"etcdserver/server.go:854","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2025-08-19T10:08:20.850415Z","caller":"etcdserver/server.go:855","msg":"data-dir used by this member must be removed"}

Should I remove each process's data directory after scaling down during Inject?

@nwnt
Member Author

nwnt commented Aug 20, 2025

From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-amd64/1957745366849818624

/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (TestRobustnessExploratoryEtcdTrafficDeleteLeasesClusterOfSize3-test-2) (24040): {"level":"warn","ts":"2025-08-19T10:20:00.008061Z","caller":"etcdserver/server.go:1639","msg":"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"bf19ae4419db00dc","requested-member-remove":"eabdbb777cf498cb","active-peers":1,"error":"etcdserver: unhealthy cluster"}

This looks like the condition I should add to Available. What do you think @serathius?

@serathius
Member

> Should I remove each process's data directory after scaling down during Inject?

No, please don't read random logs.

> This looks like the condition I should add to Available. What do you think @serathius?

No, this log is related to the fact that membership changes are rejected if there is a suspicion that cluster members are unstable. It requires at least 5 seconds of being connected to all members. This could be mitigated by adding a 5 second sleep before membership changes.
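
In other words, the mitigation inside the failpoint's Inject would look roughly like this (a sketch with a hypothetical helper name, written against clientv3; the 5 seconds corresponds to the connectivity window described above):

```go
// Sketch only: wait out the peer-connectivity health window before issuing a
// membership change, so the leader does not reject it as "unhealthy cluster".
package sketch

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func removeMemberWhenStable(ctx context.Context, cli *clientv3.Client, memberID uint64) error {
	// The server requires roughly 5 seconds of stable connections to all peers
	// before it accepts reconfiguration requests.
	time.Sleep(5 * time.Second)
	_, err := cli.MemberRemove(ctx, memberID)
	return err
}
```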

@serathius
Member

Please read the MemberReplace failpoint; it seems there is some overlap with your case.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added stale and removed stale labels Oct 27, 2025