EKS Windows Node 30s Boot Now Possible - Start as Fast as Linux! (eks-windows-bootstrapper - Part 2)

15 May 2024

Previously…

https://awstip.com/reducing-eks-windows-node-5-min-start-time-to-90s-eks-windows-bootstrapper-4314f72367a0?embedable=true

In the previous story, I shared some of the tricks I used to make our Windows EKS nodes start in 90s rather than the 5 minutes a standard EKS Windows node takes. I also introduced the git repo eks-windows-bootstrapper, the application that replaces the AWS-provided configuration scripts for EKS Windows.

https://github.com/atg-cloudops/eks-windows-bootstrapper?embedable=true

In today’s story, I will walk you through how to make your Windows nodes take ~30s to start!

Recap

In the previous story, we got to the point where our start times looked like this:

Let’s take a closer look at our latencies now:

To get better launch times, there are now 2 areas where we can improve timing:

  • Windows Startup

  • Configure Network/Start K8s Services

Improving Windows Startup Performance

As discussed in the previous story, we use AWS Image Builder (https://aws.amazon.com/image-builder/) to build our custom AMIs, so any changes to the image need to be made there. Outside of EC2 Fast Launch, there really isn’t much we can do to help the Windows boot process. If you are able to, you can uninstall Windows Defender, which will massively improve start performance.

Side note: I found that in our use case we do not actually need Fast Launch, because we do not need to run the OOBE process. We can set our culture using Set-Culture, etc. inside one of our components, and as long as we remove the sections in Unattend.xml related to OOBE, Windows will skip much of the setup process that EC2 Fast Launch deals with. When you start the output AMI, it will start as quickly as a Fast Launch AMI, but without the added expense of pre-provisioning EBS snapshots.

The downside to this approach is that you won’t be able to configure Windows settings on the first boot: what you bake into the AMI is what you will start with. So make sure you set culture/timezone information correctly somewhere in one of your Image Builder components.
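As a rough sketch of that last point, an Image Builder component step that bakes regional settings into the AMI could look something like the following. The step name and the en-GB / GMT Standard Time values are placeholders picked for illustration, not taken from our real component, so swap in whatever your workloads expect.

```yaml
      - name: SetRegionalSettings   # hypothetical step name
        action: ExecutePowerShell
        inputs:
          commands:
            - |
              # Example values only; replace with your own culture and time zone.
              Set-Culture -CultureInfo 'en-GB'
              Set-WinSystemLocale -SystemLocale 'en-GB'
              Set-TimeZone -Id 'GMT Standard Time'
```

Note that Set-Culture applies to the account running the step, so it is worth checking the result on a node launched from the finished AMI.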

For reference:

The 'components' under 'oobeSystem' can be deleted to avoid the need for 'Fast Launch'.

So with all avenues to make Windows do less stuff on boot exhausted, there was only one choice left…

Changes were made to the EKS Windows Bootstrapper code so it could be run as a Windows Service. The idea is that we can squeeze out a few more seconds by running our code while Windows is starting services on boot, instead of waiting for the userdata script to fire. This worked. However, it broke the AWS SSM Agent, which also starts early. The installation script included in the repo works around this by setting the SSM Agent startup type to Automatic (Delayed).
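For illustration, delaying the SSM Agent comes down to a single service configuration change. The step below is a sketch of the idea rather than a copy of the install script: it assumes the agent is registered under its default Windows service name, AmazonSSMAgent, and the step name is made up.

```yaml
      - name: DelaySsmAgentStart   # hypothetical step name
        action: ExecutePowerShell
        inputs:
          commands:
            - |
              # "delayed-auto" corresponds to Automatic (Delayed Start) in the Services console.
              # Keep the space after "start=": older versions of sc.exe require it.
              sc.exe config AmazonSSMAgent start= delayed-auto
```

If you install the bootstrapper with the provided script, you should not need this step; it is only here to show what the workaround amounts to.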

Faster Configure Network/Start K8s Services

This basically involved translating the remainder of the code in the standard AWS bootstrapping script, EKS-StartupTask.ps1:

  • Add routes to created NIC

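            ...
              // Append one host route (/32) per IP address, bound to the new vNIC
              routeAddCommands.Append($"route ADD {ipAddrs[i]} MASK 255.255.255.255 0.0.0.0 IF {vNICIndex}");
            ...
            // Run all the route commands in a single hidden cmd.exe invocation
            using Process process = new Process();
            process.StartInfo.FileName = "cmd.exe";
            process.StartInfo.Arguments = $"/C {routeAddCommands}";
            process.StartInfo.RedirectStandardOutput = true;
            process.StartInfo.UseShellExecute = false;
            process.StartInfo.CreateNoWindow = true;
            process.Start();
            // Drain stdout so a full pipe buffer cannot block the child process
            await process.StandardOutput.ReadToEndAsync();
            await process.WaitForExitAsync();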
    

    https://github.com/atg-cloudops/eks-windows-bootstrapper/blob/8f519953209e3caea590652832acc74585ab1a45/BootstrapperService.cs#L586

  • Start kube-proxy and kubelet in parallel once all configuration files have been written (all the previously converted code)

Installation

To install the bootstrapper, add this to a component in Image Builder:

      - name: InstallEksWindowsBootstrapper
        action: ExecutePowerShell
        inputs:
          commands:
            - |
              Invoke-WebRequest -Uri 'https://github.com/atg-cloudops/eks-windows-bootstrapper/releases/download/v1.29.1/Install-Service.ps1' -OutFile 'Install-Service.ps1'; 
              .\Install-Service.ps1; 
              Remove-Item 'Install-Service.ps1';

Putting everything together, I tested the output image using these settings:

  • EC2 Instance
    • m4.xlarge (4 vCPU)
    • EBS
      • GP3 - 120 GB / 600 MB/s / 3,600 IOPS

Testing with these settings got me start times of around 55s! We are finally under a minute. Maybe spot instances are viable for Windows EKS nodes now that the 5-minute penalty is down to about 1 minute.

Hold On… You Said A 30s Start Time! You Said as Fast as Linux!

A 30s start time is possible! It just costs a lot. 😢

To find out how fast the software alone could go, I eliminated as many hardware bottlenecks as I could:

  • Switched from GP3 → IO2 30K IOPS

  • Switched from m4.xlarge → c6a.4xlarge

  • Require Nitro Hypervisor

It’s quite easy to change in Karpenter: change the storage options in the EC2NodeClass, then add the relevant requirement keys to the NodePool, e.g.

  # EC2NodeClass (under spec)
  blockDeviceMappings:
    - deviceName: /dev/sda1
      ebs:
        iops: 30000
        volumeSize: 120Gi
        volumeType: io2

  # NodePool (under spec.template.spec.requirements)
  - key: karpenter.k8s.aws/instance-hypervisor
    operator: In
    values:
      - 'nitro'
  - key: karpenter.k8s.aws/instance-cpu
    operator: In
    values:
      - '8'

The beast that was produced by Karpenter was visible in K8s in 32s. It would be sub-30s with antivirus disabled.

Our final tuning produces nodes in around 45s (using GP3). You can tune cost against performance for your own scenario.

Looking for Help

EKS Windows bootstrapper is a simple app that writes config and starts Kubernetes services. It is currently set up for EKS 1.29. As Kubernetes moves through 1.30, 1.31, 1.32 and beyond, it will need to be kept up to date, and I am looking for help with this. 🙂