
Pulumi update issues with AKS clusters #959


Closed
michaelbpalmer opened this issue Jun 30, 2021 · 4 comments · Fixed by #1210 or #2774
Assignees: mikhailshilkov
Labels: customer/feedback (Feedback from customers), kind/bug (Some behavior is incorrect or out of spec), resolution/fixed (This issue was fixed)
Milestone: 0.63

Comments

@michaelbpalmer

We keep hitting issues with AKS and Pulumi. Namely, when we try to update properties on existing AKS clusters (managed by Pulumi), we often get into error cases where we end up having to manually edit our stack. Here are some examples:

  1. We are unable to update the VM size. pulumi up fails, and we have to manually delete the old AKS cluster, edit the stack, and then run pulumi up to create the new cluster.
  2. Updating maxCount on spot instance agent pools fails.

Steps to reproduce

Issue 1:

  1. Create an AKS cluster in Pulumi with any VM size.
  2. Change the VM size and run pulumi up.

Expected: VM size updates (even if cluster needs to be replaced).
Actual: An error occurs and the update is not performed. The only way to recover is to manually delete the cluster, edit the stack, and rerun pulumi up.
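
For reference, the cluster definition looks roughly like this (a simplified sketch using the TypeScript SDK; names, resource group, and VM size are placeholders rather than our real values):

```typescript
import * as azure_native from "@pulumi/azure-native";

// Placeholder names; any resource group and supported VM size will do.
const cluster = new azure_native.containerservice.ManagedCluster("example-cluster", {
    resourceGroupName: "example-rg",
    dnsPrefix: "example-aks",
    identity: { type: "SystemAssigned" },
    agentPoolProfiles: [{
        name: "systempool",
        mode: "System",
        count: 1,
        osType: "Linux",
        vmSize: "Standard_DS2_v2", // step 2: change this value and run `pulumi up`
    }],
});
```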

Issue 2:

  1. Create an AKS cluster with a spot instance AgentPool. Set the maxCount property to 1.
  2. Update the maxCount property to 2 and run pulumi up.

Expected: pulumi up to succeed with maxCount updated.
Actual: pulumi up fails with:

azure-native:containerservice:AgentPool (dg-gpuspot-agentpool-cpu-staging-dev): error: Code="PropertyChangeNotAllowed" Message="Changing property 'properties.nodeTaints' is not allowed." Target="properties.nodeTaints"

We weren’t trying to update the node taints. It seems like Pulumi knows it was a change to maxCount:

~ azure-native:containerservice:AgentPool dg-gpuspot-agentpool-cpu-staging-dev updating [diff: ~maxCount]; error: Code="PropertyChangeNotAllowed" Message="Changing property 'properties.nodeTaints' is not allowed." Target="properties.nodeTaints"
~ azure-native:containerservice:AgentPool dg-gpuspot-agentpool-cpu-staging-dev **updating failed** [diff: ~maxCount]; error: Code="PropertyChangeNotAllowed" Message="Changing property 'properties.nodeTaints' is not allowed." Target="properties.nodeTaints"

Taking a look at the stack output for this resource, I think the issue is that Pulumi is trying to remove the taints that Azure automatically adds to spot node pools. In this case we haven’t been able to get this to work even after editing the stack.

{ "urn": "urn:pulumi:spark-staging::spark-deployment::azure-native:containerservice:AgentPool::dg-gpuspot-agentpool-cpu-staging-dev", "custom": true, "id": "/subscriptions/a8c3fdb1-94c2-4db4-bc18-470696fa4bd4/resourcegroups/cp-staging/providers/Microsoft.ContainerService/managedClusters/dg-spark-kubernetescluster-cpu-staging-dev/agentPools/gpuspot", "type": "azure-native:containerservice:AgentPool", "inputs": { "agentPoolName": "gpuspot", "count": 1, "enableAutoScaling": true, "maxCount": 6, "minCount": 0, "mode": "User", "nodeTaints": [ "sku=gpu:NoSchedule" ], "resourceGroupName": "cp-staging", "resourceName": "dg-spark-kubernetescluster-cpu-staging-dev", "scaleSetPriority": "Spot", "spotMaxPrice": -1, "type": "VirtualMachineScaleSets", "vmSize": "Standard_NC8as_T4_v3", "vnetSubnetID": "/subscriptions/a8c3fdb1-94c2-4db4-bc18-470696fa4bd4/resourceGroups/cp-staging/providers/Microsoft.Network/virtualNetworks/dg-spark-vnet-cpu-staging-dev/subnets/default" }, "outputs": { "__inputs": { "4dabf18193072939515e22adb298388d": "1b47061264138c4ac30d75fd1eb44270", "ciphertext": "redacted" }, "count": 1, "enableAutoScaling": true, "enableFIPS": false, "id": "/subscriptions/a8c3fdb1-94c2-4db4-bc18-470696fa4bd4/resourcegroups/cp-staging/providers/Microsoft.ContainerService/managedClusters/dg-spark-kubernetescluster-cpu-staging-dev/agentPools/gpuspot", "kubeletDiskType": "OS", "maxCount": 6, "maxPods": 110, "minCount": 0, "mode": "User", "name": "gpuspot", "nodeImageVersion": "AKSUbuntu-1804gen2containerd-2021.06.02", "nodeLabels": { "kubernetes.azure.com/scalesetpriority": "spot" }, "nodeTaints": [ "sku=gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" ], "orchestratorVersion": "1.19.11", "osDiskSizeGB": 128, "osDiskType": "Ephemeral", "osSKU": "Ubuntu", "osType": "Linux", "powerState": { "code": "Running" }, "provisioningState": "Succeeded", "scaleSetEvictionPolicy": "Delete", "scaleSetPriority": "Spot", "spotMaxPrice": -1, "type": "VirtualMachineScaleSets", "vmSize": "Standard_NC8as_T4_v3", "vnetSubnetID": "/subscriptions/a8c3fdb1-94c2-4db4-bc18-470696fa4bd4/resourceGroups/cp-staging/providers/Microsoft.Network/virtualNetworks/dg-spark-vnet-cpu-staging-dev/subnets/default"

michaelbpalmer added the kind/bug (Some behavior is incorrect or out of spec) label Jun 30, 2021
@clstokes

@michaelbpalmer can you share the Cluster resource code too?

@nesl247

nesl247 commented Jun 30, 2021

We're having this issue too (in our case we're updating a node pool).

error: Code="PropertyChangeNotAllowed" Message="Changing property 'properties.vmSize' is not allowed." Target="properties.vmSize"

@mikhailshilkov
Member

mikhailshilkov commented Jul 1, 2021

@michaelbpalmer Thank you for reporting this.

Re:1 It looks like the Open API specs are missing annotations for which properties are immutable and must cause replacements; the same goes for the error reported by @nesl247. We may need to add manual overrides here. Also, once pulumi/pulumi#7226 lands, you'll be able to do that yourself ad hoc. For now, the only workaround is probably changing the name of the resource.

Re:2 I don't immediately see how we can solve this. If those taints are added automatically, you may have to add them in your code to avoid overwrite attempts. Would that work for you?
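
Something along these lines, based on the inputs in the stack output above (an untested sketch):

```typescript
import * as azure_native from "@pulumi/azure-native";

const spotPool = new azure_native.containerservice.AgentPool("dg-gpuspot-agentpool-cpu-staging-dev", {
    agentPoolName: "gpuspot",
    resourceGroupName: "cp-staging",
    resourceName: "dg-spark-kubernetescluster-cpu-staging-dev",
    mode: "User",
    vmSize: "Standard_NC8as_T4_v3",
    enableAutoScaling: true,
    minCount: 0,
    maxCount: 2,
    scaleSetPriority: "Spot",
    spotMaxPrice: -1,
    nodeTaints: [
        "sku=gpu:NoSchedule",
        // AKS adds this taint to every Spot pool automatically; declaring it
        // explicitly should stop Pulumi from trying to remove it on update.
        "kubernetes.azure.com/scalesetpriority=spot:NoSchedule",
    ],
});
```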

@mikhailshilkov
Member

replaceOnChanges has now landed, so the workaround for such issues is to add this option to the resource in question to force its replacement. See pulumi/pulumi#7226 for details.
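
For example (a sketch with placeholder names, assuming the TypeScript SDK):

```typescript
import * as azure_native from "@pulumi/azure-native";

const pool = new azure_native.containerservice.AgentPool("example-pool", {
    agentPoolName: "nodepool2",
    resourceGroupName: "example-rg",
    resourceName: "example-aks",
    mode: "User",
    count: 1,
    vmSize: "Standard_DS3_v2",
}, {
    // Replace the agent pool whenever vmSize changes instead of attempting
    // an in-place update that the AKS API rejects.
    replaceOnChanges: ["vmSize"],
    // The replacement cannot be created while a pool with the same name exists,
    // so delete the old pool first.
    deleteBeforeReplace: true,
});
```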

infin8x added the customer/feedback (Feedback from customers) label Aug 12, 2021
mikhailshilkov added this to the 0.63 milestone Oct 6, 2021
mikhailshilkov self-assigned this Oct 8, 2021
pulumi-bot added the resolution/fixed (This issue was fixed) label Oct 13, 2021
danielrbradley added a commit that referenced this issue Sep 26, 2023

Quick fix for #1808

Remove all manually curated agentPoolProfile properties from the forceNewMap as we're currently also forcing recreation when adding or removing agent pools. This method of replacing agent pools is documented here: https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli#system-and-user-node-pools

If the current behaviour is still wanted (to replace the whole cluster instead of updating the agent pools in place), then the replaceOnChanges resource option can be used: https://www.pulumi.com/docs/concepts/options/replaceonchanges/

This was originally introduced by #1210 to fix #959 but it didn't take into account rotating agent pools within the existing cluster.
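
To illustrate the replaceOnChanges option mentioned above, opting back in to whole-cluster replacement looks roughly like this (a sketch with placeholder names, using the TypeScript SDK):

```typescript
import * as azure_native from "@pulumi/azure-native";

const cluster = new azure_native.containerservice.ManagedCluster("example-cluster", {
    resourceGroupName: "example-rg",
    dnsPrefix: "example-aks",
    identity: { type: "SystemAssigned" },
    agentPoolProfiles: [{
        name: "systempool",
        mode: "System",
        count: 1,
        osType: "Linux",
        vmSize: "Standard_DS2_v2",
    }],
}, {
    // Restore the previous behaviour: replace the whole cluster whenever
    // any agent pool profile changes.
    replaceOnChanges: ["agentPoolProfiles"],
});
```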