Pulumi update issues with AKS clusters #959
Comments
@michaelbpalmer can you share the …
We're having this issue too (in our case we're updating a node pool).
@michaelbpalmer Thank you for reporting this.

Regarding 1: It looks like the OpenAPI specs are missing annotations for which properties are immutable and must cause replacements. The same goes for the error reported by @nesl247. We may need to add manual overrides here. Also, once pulumi/pulumi#7226 lands, you'll be able to do that yourself ad hoc. For now, the only workaround is probably changing the name of the resource.

Regarding 2: I don't immediately see how we can solve this. If those taints are added automatically, you may have to add them in your code to avoid overwrite attempts. Would that work for you?
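For illustration, a minimal sketch of that workaround in TypeScript, based on the agent pool inputs shown later in this issue (the resource group and cluster names are placeholders): the taint AKS adds automatically to spot pools is declared alongside the user-defined one, so Pulumi's desired state matches what the service returns.

```typescript
import * as azure_native from "@pulumi/azure-native";

// Hypothetical spot agent pool. The second taint mirrors the one AKS adds
// automatically to spot node pools, so Pulumi should not try to remove it
// when other properties (such as maxCount) change.
const spotPool = new azure_native.containerservice.AgentPool("gpuspot", {
    agentPoolName: "gpuspot",
    resourceGroupName: "example-rg",      // placeholder
    resourceName: "example-aks-cluster",  // placeholder: the managed cluster's name
    mode: "User",
    type: "VirtualMachineScaleSets",
    enableAutoScaling: true,
    minCount: 0,
    maxCount: 6,
    scaleSetPriority: "Spot",
    spotMaxPrice: -1,
    vmSize: "Standard_NC8as_T4_v3",
    nodeTaints: [
        "sku=gpu:NoSchedule",                                    // user-defined taint
        "kubernetes.azure.com/scalesetpriority=spot:NoSchedule", // added automatically by AKS
    ],
});
```

With both taints declared, an update that only changes maxCount should no longer produce a diff on nodeTaints.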
Quick fix for #1808: remove all manually curated agentPoolProfile properties from the forceNewMap, as we're currently also forcing recreation when adding or removing agent pools. This method of replacing agent pools is documented here: https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli#system-and-user-node-pools

If the current behaviour is still wanted (replacing the whole cluster instead of updating the agent pools in place), the replaceOnChanges resource option can be used: https://www.pulumi.com/docs/concepts/options/replaceonchanges/

These forceNewMap entries were originally introduced by #1210 to fix #959, but that change didn't take into account rotating agent pools within the existing cluster.
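For reference, a minimal sketch of the replaceOnChanges option mentioned above, written against a hypothetical TypeScript program (every name and size is a placeholder, not taken from this issue); it opts a cluster back into replacing the whole ManagedCluster whenever its agentPoolProfiles change:

```typescript
import * as azure_native from "@pulumi/azure-native";

// Hypothetical cluster; names, sizes, and the resource group are placeholders.
const cluster = new azure_native.containerservice.ManagedCluster("example-aks", {
    resourceGroupName: "example-rg",
    dnsPrefix: "example-aks",
    identity: { type: "SystemAssigned" },
    agentPoolProfiles: [{
        name: "systempool",
        mode: "System",
        count: 1,
        vmSize: "Standard_DS2_v2",
        osType: "Linux",
    }],
}, {
    // Replace the entire cluster when the pool profiles change,
    // instead of updating the agent pools in place.
    replaceOnChanges: ["agentPoolProfiles"],
});
```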
We keep hitting issues with AKS and Pulumi. Namely, when we try to update properties on existing AKS clusters (managed by Pulumi), we often run into errors that we can only resolve by manually editing our stack. Here are some examples:
Steps to reproduce
Issue 1:
Expected: VM size updates (even if cluster needs to be replaced).
Actual: The update fails with an error and is not performed. The only way to recover is to manually delete the cluster, edit the stack, and rerun pulumi up.
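The exact program for issue 1 isn't included above. As a hedged illustration only, reusing the hypothetical cluster from the sketch in the earlier comment (all names are placeholders), the kind of change that triggers this is editing vmSize on an existing cluster's agent pool profile, a property the API treats as immutable:

```typescript
import * as azure_native from "@pulumi/azure-native";

// Hypothetical cluster, used only to illustrate issue 1: vmSize on an existing
// agentPoolProfile is changed. At the time of this issue the provider planned
// an in-place update rather than a replacement, and the Azure API rejected it.
const cluster = new azure_native.containerservice.ManagedCluster("example-aks", {
    resourceGroupName: "example-rg",
    dnsPrefix: "example-aks",
    identity: { type: "SystemAssigned" },
    agentPoolProfiles: [{
        name: "systempool",
        mode: "System",
        count: 1,
        vmSize: "Standard_DS3_v2", // changed from e.g. "Standard_DS2_v2" in an existing stack
        osType: "Linux",
    }],
});
```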
Issue 2:
Expected: pulumi up to succeed with maxCount updated.
Actual: pulumi up fails:

```
azure-native:containerservice:AgentPool (dg-gpuspot-agentpool-cpu-staging-dev):
  error: Code="PropertyChangeNotAllowed" Message="Changing property 'properties.nodeTaints' is not allowed." Target="properties.nodeTaints"
```
We weren’t trying to update the node taints, and the diff shows that Pulumi only detected a change to maxCount:
```
~ azure-native:containerservice:AgentPool dg-gpuspot-agentpool-cpu-staging-dev updating [diff: ~maxCount]; error: Code="PropertyChangeNotAllowed" Message="Changing property 'properties.nodeTaints' is not allowed." Target="properties.nodeTaints"
~ azure-native:containerservice:AgentPool dg-gpuspot-agentpool-cpu-staging-dev **updating failed** [diff: ~maxCount]; error: Code="PropertyChangeNotAllowed" Message="Changing property 'properties.nodeTaints' is not allowed." Target="properties.nodeTaints"
```
Taking a look at the stack output for this resource, I think the issue is that Pulumi is trying to remove the taints that Azure automatically adds to spot node pools. In this case we haven’t been able to get this to work even after editing the stack.
{ "urn": "urn:pulumi:spark-staging::spark-deployment::azure-native:containerservice:AgentPool::dg-gpuspot-agentpool-cpu-staging-dev", "custom": true, "id": "/subscriptions/a8c3fdb1-94c2-4db4-bc18-470696fa4bd4/resourcegroups/cp-staging/providers/Microsoft.ContainerService/managedClusters/dg-spark-kubernetescluster-cpu-staging-dev/agentPools/gpuspot", "type": "azure-native:containerservice:AgentPool", "inputs": { "agentPoolName": "gpuspot", "count": 1, "enableAutoScaling": true, "maxCount": 6, "minCount": 0, "mode": "User", "nodeTaints": [ "sku=gpu:NoSchedule" ], "resourceGroupName": "cp-staging", "resourceName": "dg-spark-kubernetescluster-cpu-staging-dev", "scaleSetPriority": "Spot", "spotMaxPrice": -1, "type": "VirtualMachineScaleSets", "vmSize": "Standard_NC8as_T4_v3", "vnetSubnetID": "/subscriptions/a8c3fdb1-94c2-4db4-bc18-470696fa4bd4/resourceGroups/cp-staging/providers/Microsoft.Network/virtualNetworks/dg-spark-vnet-cpu-staging-dev/subnets/default" }, "outputs": { "__inputs": { "4dabf18193072939515e22adb298388d": "1b47061264138c4ac30d75fd1eb44270", "ciphertext": "redacted" }, "count": 1, "enableAutoScaling": true, "enableFIPS": false, "id": "/subscriptions/a8c3fdb1-94c2-4db4-bc18-470696fa4bd4/resourcegroups/cp-staging/providers/Microsoft.ContainerService/managedClusters/dg-spark-kubernetescluster-cpu-staging-dev/agentPools/gpuspot", "kubeletDiskType": "OS", "maxCount": 6, "maxPods": 110, "minCount": 0, "mode": "User", "name": "gpuspot", "nodeImageVersion": "AKSUbuntu-1804gen2containerd-2021.06.02", "nodeLabels": { "kubernetes.azure.com/scalesetpriority": "spot" }, "nodeTaints": [ "sku=gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" ], "orchestratorVersion": "1.19.11", "osDiskSizeGB": 128, "osDiskType": "Ephemeral", "osSKU": "Ubuntu", "osType": "Linux", "powerState": { "code": "Running" }, "provisioningState": "Succeeded", "scaleSetEvictionPolicy": "Delete", "scaleSetPriority": "Spot", "spotMaxPrice": -1, "type": "VirtualMachineScaleSets", "vmSize": "Standard_NC8as_T4_v3", "vnetSubnetID": "/subscriptions/a8c3fdb1-94c2-4db4-bc18-470696fa4bd4/resourceGroups/cp-staging/providers/Microsoft.Network/virtualNetworks/dg-spark-vnet-cpu-staging-dev/subnets/default"