Skip to content

Conversation

@kattz-kawa
Copy link
Contributor

@kattz-kawa kattz-kawa commented Nov 18, 2025

Why we need this PR

  • After new placement rule introduced in Use strict pod placement for controller-manager #383, the manager pods can only be scheduled on two specific nodes. Under these conditions, the default strategy may cause rollout deadlocks during node recovery.
  • Certificate rotation managed by OLM can trigger Deployment rollouts by changing the olmcahash annotation. With the current strategy, this may lead to deadlocks if a node is down during recovery, causing rollout failures. To improve reliability and ensure safe updates under these conditions, we need a more controlled approach.

Changes made

  • Added RollingUpdate strategy with maxSurge: 0 and maxUnavailable: 1 for safer deployment updates.

  • Before (default): When a rollout starts, the Deployment tries to keep the old pod until the new one is ready. On clusters with only 2 nodes, this can fail because there is no room to schedule the new pod, leading to a stuck rollout.

  • After: Pods are updated one at a time. The controller deletes one pod first, then creates a new one, ensuring that rollout can proceed safely even when only 2 nodes are available.

Which issue(s) this PR fixes

RHWA-366

Summary by CodeRabbit

Chores

  • Updated deployment strategies for the node-healthcheck-operator and controller-manager to use controlled RollingUpdate parameters, ensuring no pod surge and sequential pod replacement during rollouts for improved reliability.

…vailable=1) to mitigate rollout deadlocks during automatic certificate rotation by OLM

Signed-off-by: Katsuya Kawakami <kkawakami-je@nec.com>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 18, 2025

Hi @kattz-kawa. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link

coderabbitai bot commented Nov 18, 2025

📝 Walkthrough

Walkthrough

Two deployment configuration files were updated to define explicit RollingUpdate strategies with conservative settings: maxSurge: 0 and maxUnavailable: 1. This ensures controlled pod rollouts by replacing one pod at a time without over-provisioning during updates.

Changes

Cohort / File(s) Summary
Deployment strategy configuration
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml, config/manager/manager.yaml
Added RollingUpdate deployment strategy with maxSurge: 0 and maxUnavailable: 1 to both operator and controller-manager deployments for controlled rollout behavior

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Identical, homogeneous configuration changes applied consistently across both files
  • No logic modifications or complex structural changes
  • Straightforward YAML configuration additions

Suggested labels

do-not-merge/hold, needs-ok-to-test

Suggested reviewers

  • clobrano
  • slintes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Change Deployment UpdateStrategy' accurately reflects the main change in the PR, which modifies the Deployment update strategy configuration in two files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1)

428-432: Strategy configuration is correct; align field order with config/manager/manager.yaml.

The RollingUpdate strategy settings match those in config/manager/manager.yaml and are correct for the stated purpose. However, the field order differs: this file places rollingUpdate before type, whereas the peer Deployment places type first. While both orderings are functionally equivalent in Kubernetes YAML, standardizing the order improves consistency and maintainability.

Suggested reorder for consistency:

           strategy:
-            rollingUpdate:
-              maxSurge: 0
-              maxUnavailable: 1
             type: RollingUpdate
+            rollingUpdate:
+              maxSurge: 0
+              maxUnavailable: 1

This aligns the field order with config/manager/manager.yaml (type → rollingUpdate).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09db172 and 008ad55.

📒 Files selected for processing (2)
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • config/manager/manager.yaml (1 hunks)
🔇 Additional comments (1)
config/manager/manager.yaml (1)

20-24: RollingUpdate strategy appropriate for two-node deployment constraint.

The configuration correctly implements one-at-a-time pod replacement (maxSurge: 0, maxUnavailable: 1) on a 2-replica Deployment, avoiding the scheduling deadlock described in the PR objectives. With this strategy, the controller will delete one pod before scheduling its replacement, ensuring forward progress on capacity-constrained clusters during certificate rotation–induced rollouts.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kattz-kawa, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Dec 2, 2025
@slintes
Copy link
Member

slintes commented Dec 2, 2025

/ok-to-test

@slintes
Copy link
Member

slintes commented Dec 2, 2025

/retest

1 similar comment
@slintes
Copy link
Member

slintes commented Dec 2, 2025

/retest

@slintes slintes marked this pull request as ready for review December 2, 2025 14:34
@openshift-ci openshift-ci bot requested review from beekhof and clobrano December 2, 2025 14:34
@slintes slintes changed the title [WIP] Change Deployment UpdateStrategy Change Deployment UpdateStrategy Dec 2, 2025
@slintes
Copy link
Member

slintes commented Dec 2, 2025

only 4.19 wasn't green yet

/override ci/prow/4.20-openshift-e2e ci/prow/4.18-openshift-e2e ci/prow/4.17-openshift-e2e ci/prow/4.16-openshift-e2e

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 2, 2025

@slintes: Overrode contexts on behalf of slintes: ci/prow/4.16-openshift-e2e, ci/prow/4.17-openshift-e2e, ci/prow/4.18-openshift-e2e, ci/prow/4.20-openshift-e2e

In response to this:

only 4.19 wasn't green yet

/override ci/prow/4.20-openshift-e2e ci/prow/4.18-openshift-e2e ci/prow/4.17-openshift-e2e ci/prow/4.16-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@slintes
Copy link
Member

slintes commented Dec 3, 2025

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 3, 2025

@kattz-kawa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.19-openshift-e2e 008ad55 link true /test 4.19-openshift-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@slintes
Copy link
Member

slintes commented Dec 3, 2025

probably not related to this PR, but we need to investigate test failures...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants