-
Notifications
You must be signed in to change notification settings - Fork 693
Controller forced reconfiguration: pare down #28307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
controller_forced_recovery will, for now, only recovery the controller leader. The burden of executing nodewise recovery and decomission will then fall on the caller. This commit updates the fixture tests per this expectation such that they execute nodewise recovery and decomission after controller forced recovery.
352f2b8 to
776eaad
Compare
The burden of calling nodewise recovery and decomission now falls on the operator. This commit updates controller_forced_reconfiguration-test.py to match those expectations. It now executes forced recovery, executes nodewise recovery, decomissions the dead brokers, and then checks cluster health.
Removes nodewise recovery and node decommission from the steps that controller forced recovery executes. These, for now, will need to be executed by the operator.
776eaad to
7fa83f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors controller forced recovery (CFR) to only handle raft0 reconfiguration and controller leader election, removing automatic partition recovery and broker decommissioning steps. The changes address limitations where node-wise recovery cannot override existing partition moves, making the manual workflow (node-wise recovery followed by decommission) more reliable.
Key Changes:
- CFR now stops after establishing a controller leader, leaving partition recovery to operators
- Tests updated to explicitly call node-wise recovery and decommission after CFR
- Documentation updated to reflect the reduced scope of CFR
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
tests/rptest/tests/controller_forced_reconfiguration_test.py |
Updated Python test to manually execute node-wise recovery and decommission after CFR completes |
tests/rptest/services/admin.py |
Added type hints to get_partition_balancer_status method |
tests/rptest/clients/rpk.py |
Added type hints to force_partition_recovery method parameters |
src/v/cluster/tests/controller_forced_reconfiguration_test.cc |
Added C++ test infrastructure for node-wise recovery and decommission workflows |
src/v/cluster/tests/BUILD |
Added dependency on redpanda test fixture |
src/v/cluster/controller_forced_reconfiguration_manager.h |
Updated documentation to remove steps 3 and 4 from CFR description |
src/v/cluster/controller_forced_reconfiguration_manager.cc |
Removed partition force recovery and broker decommission logic from CFR implementation |
proto/redpanda/core/admin/internal/v1/breakglass.proto |
Updated API documentation to clarify CFR only handles controller recovery |
Updates the breakglass service documentation per the changes to controller forced recovery. Removes indications of nodewise recovery and decomissioning as controller_forced_recovery will no longer perform these.
7fa83f9 to
9cfed73
Compare
CI test resultstest results on build#75443
|
Node-wise recovery today does not support force moving partitions which are already moving.
As a result, the existing implementation of controller forced recovery would often trigger partition moves from the brokers to be decommissioned before the force_moves from node-wise recovery were attempted.
The longterm resolution is to make node-wise recovery override existing moves.
For now, though, CFR is substantially more reliable if the operators execute node-wise recovery and then decommission.
This PR pares back controller forced recovery to only execute the raft0 update and wait for a leader, alongside the relevant test updates per the new expected operator workflow.
Backports Required
Release Notes
Bug Fixes