Controller forced reconfiguration: pare down #28307

joe-redpanda · 2025-10-31T18:25:37Z

Node-wise recovery today does not support force moving partitions which are already moving.

As a result, the existing implementation of controller forced recovery would often trigger partition moves from the brokers to be decommissioned before the force_moves from node-wise recovery were attempted.

The long-term resolution is to make node-wise recovery override existing moves.

For now, though, CFR is substantially more reliable if operators run node-wise recovery first and only then decommission the dead brokers.
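The ordering problem can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Redpanda code: `Partition`, `decommission`, `nodewise_recovery`, and `stuck_partitions` are hypothetical names, and the only behavior modeled is the one described above, namely that node-wise recovery skips partitions that are already moving.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Partition:
    replicas: set[int]
    moving: bool = False  # True once a regular (non-forced) move is in flight


def decommission(partitions: list[Partition], dead: set[int]) -> None:
    """Toy decommission: start a regular move for every partition on a dead broker."""
    for p in partitions:
        if p.replicas & dead:
            p.moving = True


def nodewise_recovery(partitions: list[Partition], dead: set[int]) -> None:
    """Toy node-wise recovery: force-move off dead brokers, skipping moving partitions."""
    for p in partitions:
        if p.replicas & dead and not p.moving:
            p.replicas -= dead


def stuck_partitions(order) -> int:
    """Run the steps in the given order and count partitions still on dead brokers."""
    dead = {3, 4}
    partitions = [Partition({1, 3, 4}), Partition({2, 3, 4}), Partition({1, 2, 5})]
    for step in order:
        step(partitions, dead)
    return sum(1 for p in partitions if p.replicas & dead)


old_order = stuck_partitions([decommission, nodewise_recovery])
new_order = stuck_partitions([nodewise_recovery, decommission])
```

In this model, decommissioning first leaves both affected partitions stuck on the dead brokers, while running node-wise recovery first leaves none.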

This PR pares back controller forced recovery so that it only executes the raft0 update and waits for a leader. The tests are updated to follow the new expected operator workflow.

Backports Required

Release Notes

Bug Fixes

Makes CFR cleanup manual to avoid a race with node-wise recovery.

controller_forced_recovery will, for now, only recover the controller leader. The burden of executing nodewise recovery and decommission will then fall on the caller. This commit updates the fixture tests per this expectation such that they execute nodewise recovery and decommission after controller forced recovery.

The burden of calling nodewise recovery and decommission now falls on the operator. This commit updates controller_forced_reconfiguration_test.py to match those expectations. It now executes forced recovery, executes nodewise recovery, decommissions the dead brokers, and then checks cluster health.
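The four-step sequence the updated test follows can be sketched as a small driver. This is a minimal illustration, not the test's actual code: the function and every callable it takes (`run_cfr`, `run_nodewise_recovery`, `decommission_broker`, `cluster_is_healthy`) are hypothetical stand-ins for whatever admin API or rpk calls the operator uses.

```python
from typing import Callable, Iterable, List


def recover_after_controller_loss(
    dead_brokers: Iterable[int],
    *,
    run_cfr: Callable[[List[int]], None],
    run_nodewise_recovery: Callable[[List[int]], None],
    decommission_broker: Callable[[int], None],
    cluster_is_healthy: Callable[[], bool],
) -> bool:
    """Run the operator workflow in the order the PR description recommends."""
    dead = sorted(dead_brokers)
    run_cfr(dead)                # 1. restore a controller leader
    run_nodewise_recovery(dead)  # 2. force-recover partitions first
    for broker_id in dead:       # 3. only then decommission dead brokers
        decommission_broker(broker_id)
    return cluster_is_healthy()  # 4. final health check


# Usage with stand-in callables that only record the order of calls.
steps: List[str] = []
healthy = recover_after_controller_loss(
    [4, 3],
    run_cfr=lambda dead: steps.append(f"cfr {dead}"),
    run_nodewise_recovery=lambda dead: steps.append(f"nodewise {dead}"),
    decommission_broker=lambda b: steps.append(f"decommission {b}"),
    cluster_is_healthy=lambda: True,
)
```

Passing the steps in as callables keeps the ordering explicit and lets the sequence be exercised without a running cluster.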

Removes nodewise recovery and node decommission from the steps that controller forced recovery executes. These, for now, will need to be executed by the operator.

Copilot

Pull Request Overview

This PR refactors controller forced recovery (CFR) to only handle raft0 reconfiguration and controller leader election, removing automatic partition recovery and broker decommissioning steps. The changes address limitations where node-wise recovery cannot override existing partition moves, making the manual workflow (node-wise recovery followed by decommission) more reliable.

Key Changes:

CFR now stops after establishing a controller leader, leaving partition recovery to operators
Tests updated to explicitly call node-wise recovery and decommission after CFR
Documentation updated to reflect the reduced scope of CFR

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

File	Description
`tests/rptest/tests/controller_forced_reconfiguration_test.py`	Updated Python test to manually execute node-wise recovery and decommission after CFR completes
`tests/rptest/services/admin.py`	Added type hints to `get_partition_balancer_status` method
`tests/rptest/clients/rpk.py`	Added type hints to `force_partition_recovery` method parameters
`src/v/cluster/tests/controller_forced_reconfiguration_test.cc`	Added C++ test infrastructure for node-wise recovery and decommission workflows
`src/v/cluster/tests/BUILD`	Added dependency on redpanda test fixture
`src/v/cluster/controller_forced_reconfiguration_manager.h`	Updated documentation to remove steps 3 and 4 from CFR description
`src/v/cluster/controller_forced_reconfiguration_manager.cc`	Removed partition force recovery and broker decommission logic from CFR implementation
`proto/redpanda/core/admin/internal/v1/breakglass.proto`	Updated API documentation to clarify CFR only handles controller recovery

proto/redpanda/core/admin/internal/v1/breakglass.proto

src/v/cluster/tests/controller_forced_reconfiguration_test.cc

Updates the breakglass service documentation per the changes to controller forced recovery. Removes indications of nodewise recovery and decommissioning as controller_forced_recovery will no longer perform these.

vbotbuildovich · 2025-11-01T01:16:10Z

CI test results

test results on build#75443

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
ShadowLinkBasicTests	test_task_pausing	{"shuffle_leadership": false}	integration	https://buildkite.com/redpanda/redpanda/builds/75443#019a3c8c-9fb6-41ac-ada6-5ef880f3a235	FLAKY	20/21	upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_task_pausing
ShadowLinkConsumeGroupsMirroringTest	test_continuous_group_sync	{"source_cluster_spec": {"cluster_type": "redpanda"}, "with_failures": false}	integration	https://buildkite.com/redpanda/redpanda/builds/75443#019a3c84-1628-4864-9627-55864f23c9db	FLAKY	15/21	upstream reliability is '96.17590822179733'. current run reliability is '71.42857142857143'. drift is 24.74734 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkConsumeGroupsMirroringTest&test_method=test_continuous_group_sync
ShadowLinkingReplicationTests	test_replication_timestamps_match	{"source_cluster_spec": {"cluster_type": "redpanda"}, "timestamp_type": "CreateTime"}	integration	https://buildkite.com/redpanda/redpanda/builds/75443#019a3c84-1624-43e5-902c-1610ba015c14	FLAKY	20/21	upstream reliability is '95.13888888888889'. current run reliability is '95.23809523809523'. drift is -0.09921 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_timestamps_match
ShadowLinkingReplicationTests	test_replication_timestamps_match	{"source_cluster_spec": {"cluster_type": "redpanda"}, "timestamp_type": "CreateTime"}	integration	https://buildkite.com/redpanda/redpanda/builds/75443#019a3c8c-9fad-43ed-8ba5-519972e385b3	FLAKY	20/21	upstream reliability is '95.14563106796116'. current run reliability is '95.23809523809523'. drift is -0.09246 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_timestamps_match
DataMigrationsApiTest	test_creating_and_listing_migrations	null	integration	https://buildkite.com/redpanda/redpanda/builds/75443#019a3c8c-9fad-43ed-8ba5-519972e385b3	FLAKY	19/21	upstream reliability is '96.15720524017468'. current run reliability is '90.47619047619048'. drift is 5.68101 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations
topic_properties_syncer_test	topic_properties_sync		unit	https://buildkite.com/redpanda/redpanda/builds/75443#019a3bce-ebd0-4cab-92d1-bf37791042f2	FAIL	0/1
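The figures in the `reason` column are consistent with a simple calculation: current run reliability is `passed / total * 100`, and drift is upstream reliability minus current reliability, rounded to five decimal places. The sketch below reproduces that arithmetic. It is inferred from the numbers in the table, not taken from the CI bot's source; the function names and the `drift <= allowed` pass rule are assumptions.

```python
def run_reliability(passed: int, total: int) -> float:
    """Percentage of passing runs in the current build."""
    return passed / total * 100.0


def drift(upstream: float, current: float) -> float:
    """How far the current run fell below upstream reliability (assumed formula)."""
    return round(upstream - current, 5)


def verdict(upstream: float, passed: int, total: int, allowed: float = 50.0) -> str:
    """Assumed rule: the test should pass while drift stays within the allowed drift."""
    return "PASS" if drift(upstream, run_reliability(passed, total)) <= allowed else "FAIL"


# Figures from the test_task_pausing and test_continuous_group_sync rows.
task_pausing_drift = drift(100.0, run_reliability(20, 21))
group_sync_drift = drift(96.17590822179733, run_reliability(15, 21))
group_sync_verdict = verdict(96.17590822179733, 15, 21)
```

Under these assumptions the two rows yield drifts of 4.7619 and 24.74734, matching the values reported by the bot, and both stay well within the allowed drift of 50.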

github-actions bot added area/build area/redpanda labels Oct 31, 2025

joe-redpanda force-pushed the cfr_pare_down branch from 352f2b8 to 776eaad Compare October 31, 2025 18:37

joe-redpanda added 4 commits October 31, 2025 12:11

rpk.py: chore typing improvement

eb49111

admin.py: chore typing improvements

90bbff0

cluster/cfr_m: remove nodewise recovery and decom

9e76380

Removes nodewise recovery and node decommission from the steps that controller forced recovery executes. These, for now, will need to be executed by the operator.

joe-redpanda force-pushed the cfr_pare_down branch from 776eaad to 7fa83f9 Compare October 31, 2025 19:11

joe-redpanda requested review from bashtanov, bharathv and mmaslankaprv October 31, 2025 19:33

joe-redpanda marked this pull request as ready for review October 31, 2025 19:33

Copilot AI review requested due to automatic review settings October 31, 2025 19:33

Copilot AI reviewed Oct 31, 2025

breakglass: documentation updates

9cfed73

Updates the breakglass service documentation per the changes to controller forced recovery. Removes indications of nodewise recovery and decommissioning as controller_forced_recovery will no longer perform these.

joe-redpanda force-pushed the cfr_pare_down branch from 7fa83f9 to 9cfed73 Compare October 31, 2025 19:46

dotnwat changed the title from "Cfr pare down" to "Controller forced reconfiguration: pare down" Nov 1, 2025
