Skip to content

Conversation

@joe-redpanda
Copy link
Contributor

@joe-redpanda joe-redpanda commented Oct 31, 2025

Node-wise recovery today does not support force moving partitions which are already moving.

As a result, the existing implementation of controller forced recovery would often trigger partition moves from the brokers to be decommissioned before the force_moves from node-wise recovery were attempted.

The longterm resolution is to make node-wise recovery override existing moves.

For now, though, CFR is substantially more reliable if the operators execute node-wise recovery and then decommission.

This PR pares back controller forced recovery to only execute the raft0 update and wait for a leader, alongside the relevant test updates per the new expected operator workflow.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • makes CFR cleanup manual to avoid nodewise recovery race
controller_forced_recovery will, for now, only recovery the controller
leader. The burden of executing nodewise recovery and decomission will
then fall on the caller.

This commit updates the fixture tests per this expectation such that
they execute nodewise recovery and decomission after controller forced
recovery.
The burden of calling nodewise recovery and decomission now falls on the
operator. This commit updates controller_forced_reconfiguration-test.py
to match those expectations. It now executes forced recovery, executes
nodewise recovery, decomissions the dead brokers, and then checks
cluster health.
Removes nodewise recovery and node decommission from the steps that
controller forced recovery executes.

These, for now, will need to be executed by the operator.
@joe-redpanda joe-redpanda marked this pull request as ready for review October 31, 2025 19:33
Copilot AI review requested due to automatic review settings October 31, 2025 19:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors controller forced recovery (CFR) to only handle raft0 reconfiguration and controller leader election, removing automatic partition recovery and broker decommissioning steps. The changes address limitations where node-wise recovery cannot override existing partition moves, making the manual workflow (node-wise recovery followed by decommission) more reliable.

Key Changes:

  • CFR now stops after establishing a controller leader, leaving partition recovery to operators
  • Tests updated to explicitly call node-wise recovery and decommission after CFR
  • Documentation updated to reflect the reduced scope of CFR

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/rptest/tests/controller_forced_reconfiguration_test.py Updated Python test to manually execute node-wise recovery and decommission after CFR completes
tests/rptest/services/admin.py Added type hints to get_partition_balancer_status method
tests/rptest/clients/rpk.py Added type hints to force_partition_recovery method parameters
src/v/cluster/tests/controller_forced_reconfiguration_test.cc Added C++ test infrastructure for node-wise recovery and decommission workflows
src/v/cluster/tests/BUILD Added dependency on redpanda test fixture
src/v/cluster/controller_forced_reconfiguration_manager.h Updated documentation to remove steps 3 and 4 from CFR description
src/v/cluster/controller_forced_reconfiguration_manager.cc Removed partition force recovery and broker decommission logic from CFR implementation
proto/redpanda/core/admin/internal/v1/breakglass.proto Updated API documentation to clarify CFR only handles controller recovery
Updates the breakglass service documentation per the changes to
controller forced recovery.

Removes indications of nodewise recovery and decomissioning as
controller_forced_recovery will no longer perform these.
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#75443
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ShadowLinkBasicTests test_task_pausing {"shuffle_leadership": false} integration https://buildkite.com/redpanda/redpanda/builds/75443#019a3c8c-9fb6-41ac-ada6-5ef880f3a235 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_task_pausing
ShadowLinkConsumeGroupsMirroringTest test_continuous_group_sync {"source_cluster_spec": {"cluster_type": "redpanda"}, "with_failures": false} integration https://buildkite.com/redpanda/redpanda/builds/75443#019a3c84-1628-4864-9627-55864f23c9db FLAKY 15/21 upstream reliability is '96.17590822179733'. current run reliability is '71.42857142857143'. drift is 24.74734 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkConsumeGroupsMirroringTest&test_method=test_continuous_group_sync
ShadowLinkingReplicationTests test_replication_timestamps_match {"source_cluster_spec": {"cluster_type": "redpanda"}, "timestamp_type": "CreateTime"} integration https://buildkite.com/redpanda/redpanda/builds/75443#019a3c84-1624-43e5-902c-1610ba015c14 FLAKY 20/21 upstream reliability is '95.13888888888889'. current run reliability is '95.23809523809523'. drift is -0.09921 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_timestamps_match
ShadowLinkingReplicationTests test_replication_timestamps_match {"source_cluster_spec": {"cluster_type": "redpanda"}, "timestamp_type": "CreateTime"} integration https://buildkite.com/redpanda/redpanda/builds/75443#019a3c8c-9fad-43ed-8ba5-519972e385b3 FLAKY 20/21 upstream reliability is '95.14563106796116'. current run reliability is '95.23809523809523'. drift is -0.09246 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_timestamps_match
DataMigrationsApiTest test_creating_and_listing_migrations null integration https://buildkite.com/redpanda/redpanda/builds/75443#019a3c8c-9fad-43ed-8ba5-519972e385b3 FLAKY 19/21 upstream reliability is '96.15720524017468'. current run reliability is '90.47619047619048'. drift is 5.68101 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations
topic_properties_syncer_test topic_properties_sync unit https://buildkite.com/redpanda/redpanda/builds/75443#019a3bce-ebd0-4cab-92d1-bf37791042f2 FAIL 0/1
@dotnwat dotnwat changed the title Cfr pare down Nov 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment