Commit 7ad4ddd

Update in-place update proposal
1 parent 7c722bc commit 7ad4ddd

File tree

7 files changed

+517
-370
lines changed

docs/book/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,7 @@
3939
- [Implementing Runtime Extensions](./tasks/experimental-features/runtime-sdk/implement-extensions.md)
4040
- [Implementing Lifecycle Hook Extensions](./tasks/experimental-features/runtime-sdk/implement-lifecycle-hooks.md)
4141
- [Implementing Topology Mutation Hook Extensions](./tasks/experimental-features/runtime-sdk/implement-topology-mutation-hook.md)
42+
- [Implementing In-Place Update Hooks Extensions](./tasks/experimental-features/runtime-sdk/implement-in-place-update-hooks.md)
4243
- [Deploying Runtime Extensions](./tasks/experimental-features/runtime-sdk/deploy-runtime-extension.md)
4344
- [Ignition Bootstrap configuration](./tasks/experimental-features/ignition.md)
4445
- [Running multiple providers](./tasks/multiple-providers.md)

docs/book/src/developer/providers/contracts/control-plane.md

Lines changed: 30 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -68,6 +68,7 @@ repo or add an item to the agenda in the [Cluster API community meeting](https:/
6868
| [ControlPlane: version] | No | Mandatory if control plane allows direct management of the Kubernetes version in use; Mandatory for cluster class support. |
6969
| [ControlPlane: machines] | No | Mandatory if control plane instances are represented with a set of Cluster API Machines. |
7070
| [ControlPlane: initialization completed] | Yes | |
71+
| [ControlPlane: in-place updates] | No | Only supported for control plane providers with control plane machines |
7172
| [ControlPlane: conditions] | No | |
7273
| [ControlPlane: terminal failures] | No | |
7374
| [ControlPlaneTemplate, ControlPlaneTemplateList resource definition] | No | Mandatory for ClusterClasses support |
@@ -616,8 +617,34 @@ the ControlPlane resource will be ignored.
616617

617618
</aside>
618619

619-
### ControlPlane: conditions
620+
### ControlPlane: in-place updates
621+
622+
Control plane providers that want to support in-place updates should first read the [proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240807-in-place-updates.md).
623+
624+
Supporting in-place updates requires:
625+
- calling the registered `CanUpdateMachine` hook when making the "can update in-place" decision.
626+
- when the decision is to update in-place:
627+
  - the Machine spec must be updated to the desired state, together with the specs of the corresponding infrastructure machine and bootstrap config
628+
  - while updating those objects, the `in-place-updates.internal.cluster.x-k8s.io/update-in-progress` annotation must also be set
629+
  - once all objects are updated, the `UpdateMachine` hook must be marked as pending on the Machine object
630+
631+
After the above steps are completed, the Machine controller takes over and completes the in-place update.
632+
633+
<aside class="note warning">
634+
635+
<h1>High complexity</h1>
620636

637+
Implementing the in-place update transition in a race-condition-free, re-entrant way is more complex than it might seem.
638+
639+
Please read the proposal's [implementation notes](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240807-in-place-updates-implementation-notes.md)
640+
carefully.
641+
642+
Also, it is highly recommended to use the KCP implementation as a reference.
643+
644+
</aside>
645+
646+
647+
### ControlPlane: conditions
621648

622649
According to [Kubernetes API Conventions], Conditions provide a standard mechanism for higher-level
623650
status reporting from a controller.
@@ -873,7 +900,8 @@ is implemented in ControlPlane controllers:
873900
[ControlPlane: machines]: #controlplane-machines
874901
[In place propagation of changes affecting Kubernetes objects only]: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20221003-In-place-propagation-of-Kubernetes-objects-only-changes.md
875902
[ControlPlane: version]: #controlplane-version
876-
[ControlPlane: initialization completed]: #controlplane-initialization-completed
903+
[ControlPlane: initialization completed]: #controlplane-initialization-completed
904+
[ControlPlane: in-place updates]: #controlplane-in-place-updates
877905
[ControlPlane: conditions]: #controlplane-conditions
878906
[Kubernetes API Conventions]: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
879907
[Improving status in CAPI resources]: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md

docs/book/src/reference/glossary.md

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Table of Contents
22

3-
[A](#a) | [B](#b) | [C](#c) | [D](#d) | [E](#e) | [H](#h) | [I](#i) | [K](#k) | [L](#l)| [M](#m) | [N](#n) | [O](#o) | [P](#p) | [R](#r) | [S](#s) | [T](#t) | [W](#w)
3+
[A](#a) | [B](#b) | [C](#c) | [D](#d) | [E](#e) | [H](#h) | [I](#i) | [K](#k) | [L](#l) | [M](#m) | [N](#n) | [O](#o) | [P](#p) | [R](#r) | [S](#s) | [T](#t) | [U](#u) | [W](#w)
44

55
# A
66
---
@@ -264,6 +264,12 @@ are propagated in place by CAPI controllers to avoid the more elaborated mechani
264264
They include metadata, MinReadySeconds, NodeDrainTimeout, NodeVolumeDetachTimeout and NodeDeletionTimeout but are
265265
not limited to be expanded in the future.
266266

267+
### In-place update
268+
269+
Any change to a Machine spec that is performed without deleting the Machine and creating a new one.
270+
271+
Note: changing [in-place mutable fields](#in-place-mutable-fields) is not considered an in-place update.
272+
267273
### Instance
268274

269275
see [Server](#server)
@@ -460,6 +466,17 @@ A [Runtime Hook](#runtime-hook) that allows external components to generate [pat
460466

461467
See [Topology Mutation](../tasks/experimental-features/runtime-sdk/implement-topology-mutation-hook.md)
462468

469+
# U
470+
---
471+
472+
### Update Extension
473+
474+
A [runtime extension provider](#runtime-extension-provider) that implements [Update Lifecycle Hooks](#update-lifecycle-hooks).
475+
476+
### Update Lifecycle Hooks
477+
A set of Cluster API [Runtime Hooks](#runtime-hook) called when making the "can update in-place" decision or
478+
when performing an [in-place update](#in-place-update).
479+
463480
# W
464481
---
465482

docs/book/src/tasks/experimental-features/runtime-sdk/implement-in-place-update-hooks.md

Lines changed: 269 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,269 @@
1+
# Implementing in-place update hooks
2+
3+
<aside class="note warning">
4+
5+
<h1>Caution</h1>
6+
7+
Please note Runtime SDK is an advanced feature. If implemented incorrectly, a failing Runtime Extension can severely impact the Cluster API runtime.
8+
9+
</aside>
10+
11+
## Introduction
12+
13+
The proposal for [In-place updates in Cluster API](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240807-in-place-updates.md)
14+
introduced extensions that allow users to apply changes to existing Machines without deleting them and creating new ones.
15+
16+
Notably, the Cluster API user experience remains the same as today whether or not the in-place update feature is enabled;
17+
for example, to trigger a MachineDeployment rollout you still have to rotate a template.
18+
19+
Users should care ONLY about the desired state, as they do today.
20+
21+
Cluster API is responsible for choosing the best strategy to achieve the desired state; with the introduction of
22+
update extensions, it expands the set of tools it can use to do so.
23+
24+
If external update extensions cannot cover all of the desired changes, Cluster API falls back to its default,
25+
immutable rollouts.
26+
27+
Cluster API is also responsible for determining which Machine/MachineSet should be updated, and for handling rollout
28+
options like MaxSurge/MaxUnavailable. In this regard:
29+
30+
- Machines updating in-place are considered not available, because in-place updates are always considered as potentially disruptive.
31+
- For control plane machines, if maxSurge is one, a new machine must be created first, then as soon as there is
32+
“buffer” for in-place, in-place update can proceed.
33+
- KCP will not use in-place updates if it detects that doing so could impact the health of the control plane.
34+
- For worker machines, if maxUnavailable is zero, a new machine must be created first, then as soon as there
35+
is “buffer” for in-place, in-place update can proceed.
36+
- When in-place is possible, the system should try to in-place update as many machines as possible.
37+
In practice, this means that maxSurge might not be fully used (it is used only to scale up by one if maxUnavailable=0).
38+
- No in-place updates are performed for worker machines when using the `OnDelete` rollout strategy.
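
The availability rules above can be read as a small budget check. The following is a minimal sketch of that reasoning, not Cluster API's actual rollout logic: the function name and its parameters are hypothetical, and it assumes the Machine selected for the update is currently counted as available.

```python
def can_start_in_place_update(desired_replicas: int,
                              available_replicas: int,
                              max_unavailable: int) -> bool:
    """Hypothetical helper: return True if one more Machine can become
    unavailable (because it is being updated in-place) without dropping
    below the availability floor implied by maxUnavailable."""
    min_available = desired_replicas - max_unavailable
    # A Machine updating in-place is considered not available, so starting
    # an update reduces the available count by one.
    return available_replicas - 1 >= min_available
```

With three desired replicas and maxUnavailable=0, `can_start_in_place_update(3, 3, 0)` is false, which corresponds to "a new machine must be created first". Once the extra Machine is available, `can_start_in_place_update(3, 4, 0)` is true and there is "buffer" for an in-place update.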
39+
40+
<!-- TOC -->
41+
* [Implementing in-place update hooks](#implementing-in-place-update-hooks)
42+
* [Introduction](#introduction)
43+
* [Guidelines](#guidelines)
44+
* [Definitions](#definitions)
45+
* [CanUpdateMachine](#canupdatemachine)
46+
* [CanUpdateMachineSet](#canupdatemachineset)
47+
* [UpdateMachine](#updatemachine)
48+
<!-- TOC -->
49+
50+
## Guidelines
51+
52+
All guidelines defined in [Implementing Runtime Extensions](implement-extensions.md#guidelines) apply to the
53+
implementation of Runtime Extensions for in-place update hooks as well.
54+
55+
In summary, Runtime Extensions are components that should be designed, written and deployed with great caution given
56+
that they can affect the proper functioning of the Cluster API runtime. A poorly implemented Runtime Extension could
57+
potentially block in-place updates from happening.
58+
59+
The following recommendations are especially relevant:
60+
61+
* [Timeouts](implement-extensions.md#timeouts)
62+
* [Idempotence](implement-extensions.md#idempotence)
63+
* [Deterministic result](implement-extensions.md#deterministic-result)
64+
* [Error messages](implement-extensions.md#error-messages)
65+
* [Error management](implement-extensions.md#error-management)
66+
* [Avoid dependencies](implement-extensions.md#avoid-dependencies)
67+
68+
## Definitions
69+
70+
For additional details about the OpenAPI spec of the in-place update hooks, please download the [`runtime-sdk-openapi.yaml`]({{#releaselink repo:"https://github.com/kubernetes-sigs/cluster-api" gomodule:"sigs.k8s.io/cluster-api" asset:"runtime-sdk-openapi.yaml" version:"1.11.x"}})
71+
file and then open it from the [Swagger UI](https://editor.swagger.io/).
72+
73+
### CanUpdateMachine
74+
75+
This hook is called by KCP when making the "can update in-place" decision for a control plane machine.
76+
77+
Example request
78+
79+
```yaml
apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
kind: CanUpdateMachineRequest
settings: <Runtime Extension settings>
current:
  machine:
    apiVersion: cluster.x-k8s.io/v1beta2
    kind: Machine
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
  infrastructureMachine:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachine
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
  bootstrapConfig:
    apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
    kind: KubeadmConfig
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
desired:
  machine:
    ...
  infrastructureMachine:
    ...
  bootstrapConfig:
    ...
```
116+
117+
Note:
118+
- All the objects will have the latest API version known by Cluster API.
119+
- Only spec is provided, status fields are not included
120+
- When more than one extension is supported, the current state will already include the changes that other runtime extensions can handle in-place.
121+
122+
Example Response
123+
124+
```yaml
apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
kind: CanUpdateMachineResponse
status: Success # or Failure
message: "error message if status == Failure"
machinePatch:
  patchType: JSONPatch
  patch: <JSON-patch>
infrastructureMachinePatch:
  ...
bootstrapConfigPatch:
  ...
```
137+
138+
Note:
139+
- Extensions should return per-object patches to be applied on current objects to indicate which changes they can handle in-place.
140+
- Only fields in Machine/InfraMachine/BootstrapConfig spec have to be covered by patches
141+
- Patches must be in JSONPatch or JSONMergePatch format
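
To illustrate the notes above, here is a minimal sketch of how an extension might build the JSON Patch (RFC 6902) operations for the changes it can handle in-place. It is an assumption-laden example, not part of the Runtime SDK: `build_json_patch`, `supported_fields`, and the field values are hypothetical, only top-level `spec` fields are compared, and field removals are not handled.

```python
from typing import Any


def build_json_patch(current_spec: dict[str, Any],
                     desired_spec: dict[str, Any],
                     supported_fields: set[str]) -> list[dict[str, Any]]:
    """Hypothetical helper: emit JSON Patch operations only for the
    top-level spec fields that differ between current and desired AND that
    this extension knows how to update in-place."""
    ops: list[dict[str, Any]] = []
    for field in sorted(supported_fields):
        if field not in desired_spec:
            continue  # nothing desired for this field; removals are out of scope
        if current_spec.get(field) == desired_spec[field]:
            continue  # already at the desired value
        op = "replace" if field in current_spec else "add"
        # Plain field names need no JSON Pointer escaping; names containing
        # "/" or "~" would have to be escaped per RFC 6901.
        ops.append({"op": op, "path": f"/spec/{field}", "value": desired_spec[field]})
    return ops
```

Fields that differ but are not listed in `supported_fields` are left out of the patch on purpose, which signals that this extension cannot handle them in-place.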
142+
143+
### CanUpdateMachineSet
144+
145+
This hook is called by the MachineDeployment controller when making the "can update in-place" decision for all the Machines controlled by
146+
a MachineSet.
147+
148+
Example request
149+
150+
```yaml
apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
kind: CanUpdateMachineSetRequest
settings: <Runtime Extension settings>
current:
  machineSet:
    apiVersion: cluster.x-k8s.io/v1beta2
    kind: MachineSet
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
  infrastructureMachineTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachineTemplate
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
  bootstrapConfigTemplate:
    apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
    kind: KubeadmConfigTemplate
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
desired:
  machineSet:
    ...
  infrastructureMachineTemplate:
    ...
  bootstrapConfigTemplate:
    ...
```
187+
188+
Note:
189+
- All the objects will have the latest API version known by Cluster API.
190+
- Only spec is provided, status fields are not included
191+
- When more than one extension is supported, the current state will already include the changes that other runtime extensions can handle in-place.
192+
193+
Example Response
194+
195+
```yaml
apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
kind: CanUpdateMachineSetResponse
status: Success # or Failure
message: "error message if status == Failure"
machineSetPatch:
  patchType: JSONPatch
  patch: <JSON-patch>
infrastructureMachineTemplatePatch:
  ...
bootstrapConfigTemplatePatch:
  ...
```
208+
209+
Note:
210+
- Extensions should return per-object patches to be applied on current objects to indicate which changes they can handle in-place.
211+
- Only fields in MachineSet/InfraMachineTemplate/BootstrapConfigTemplate spec have to be covered by patches
212+
- Patches must be in JSONPatch or JSONMergePatch format
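
Because Cluster API falls back to an immutable rollout when the returned patches do not cover all desired changes, it can help to check coverage before responding. The sketch below assumes JSONMergePatch means JSON Merge Patch as defined in RFC 7386 and shows one way to test whether a patch turns the current object into the desired one. `apply_merge_patch` and `fully_covered` are hypothetical helpers, and how Cluster API itself evaluates the patches is not specified here.

```python
import copy
from typing import Any


def apply_merge_patch(target: Any, patch: Any) -> Any:
    """Apply a JSON Merge Patch (RFC 7386) and return the patched copy."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target entirely.
        return copy.deepcopy(patch)
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "remove this member"
        else:
            result[key] = apply_merge_patch(result.get(key), value)
    return result


def fully_covered(current: dict, desired: dict, patch: dict) -> bool:
    """Hypothetical check: True if applying the patch to the current object
    yields exactly the desired object."""
    return apply_merge_patch(current, patch) == desired
```

A result of `False` means some desired change is not covered by the patch, so an in-place update based on this patch alone would not reach the desired state.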
213+
214+
### UpdateMachine
215+
216+
This hook is called by the Machine controller when performing the in-place updates for a Machine.
217+
218+
Example request
219+
220+
```yaml
apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
kind: UpdateMachineRequest
settings: <Runtime Extension settings>
desired:
  machine:
    apiVersion: cluster.x-k8s.io/v1beta2
    kind: Machine
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
  infrastructureMachine:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachine
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
  bootstrapConfig:
    apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
    kind: KubeadmConfig
    metadata:
      name: test-cluster
      namespace: test-ns
    spec:
      ...
```
250+
251+
Note:
252+
- Only the desired state is provided (the update extension is expected to know the current state of the Machine).
253+
- Only spec is provided, status fields are not included
254+
255+
Example Response
256+
257+
```yaml
apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
kind: UpdateMachineResponse
status: Success # or Failure
message: "error message if status == Failure"
retryAfterSeconds: 10
```
264+
265+
Note:
266+
- The status of the update operation is determined by the CommonRetryResponse fields:
267+
  - Status=Success + RetryAfterSeconds > 0: update is in progress
268+
  - Status=Success + RetryAfterSeconds = 0: update completed successfully
269+
  - Status=Failure: update failed
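
The three cases above can be expressed as a small lookup. This is a minimal sketch for illustration only: the function name and the returned labels are hypothetical, and only the `status` and `retryAfterSeconds` fields shown in the example response are used.

```python
def interpret_update_machine_response(status: str,
                                      retry_after_seconds: int = 0) -> str:
    """Hypothetical helper: map the CommonRetryResponse fields of an
    UpdateMachine response to the state of the in-place update."""
    if status == "Failure":
        return "failed"
    if status == "Success":
        # A positive retryAfterSeconds asks the caller to call the hook again
        # later, which means the update is still running.
        return "in-progress" if retry_after_seconds > 0 else "done"
    raise ValueError(f"unexpected status: {status!r}")
```

For the example response shown earlier (`status: Success`, `retryAfterSeconds: 10`), this returns `"in-progress"`.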

docs/book/src/tasks/experimental-features/runtime-sdk/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,5 +31,6 @@ Additional documentation:
3131
* [Implementing Runtime Extensions](./implement-extensions.md)
3232
* [Implementing Lifecycle Hook Extensions](./implement-lifecycle-hooks.md)
3333
* [Implementing Topology Mutation Hook Extensions](./implement-topology-mutation-hook.md)
34+
* [Implementing In-Place Update Hooks Extensions](./implement-in-place-update-hooks.md)
3435
* For Cluster operators:
3536
* [Deploying Runtime Extensions](./deploy-runtime-extension.md)

docs/proposals/20240807-in-place-updates-implementation-notes.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,8 @@ into the proposal or into the user-facing documentation for this feature.
77

88
## Notes about in-place update implementation for machine deployments
99

10-
- In place is always considered as potentially disruptive
11-
- in place must respect maxUnavailable
10+
- In-place update is always considered as potentially disruptive
11+
- in-place update must respect maxUnavailable
1212
- if maxUnavailable is zero, a new machine must be created first, then as soon as there is “buffer” for in-place, in-place update can proceed
1313
- when in-place is possible, the system should try to in-place update as many machines as possible.
1414
- maxSurge is not fully used (it is used only for scale up by one if maxUnavailable =0)
