AWS spot/on-demand Copr builders

Most of the builders we currently use in AWS are started as spot instances. The reason is simple - they are significantly cheaper than on-demand instances, and it isn’t a big problem when a builder machine is suddenly terminated (if that happens, we simply start a new VM that retries the build task).

We also keep several complementary “on-demand” builders in case of a spot request allocation outage. Admittedly, this has not happened to us so far - probably because we set the maximum spot price at roughly the on-demand price, and our builders usually run only for a short period of time.

Pricing

Currently, we use the i3.large instance type for x86_64 builders and a1.xlarge for aarch64. The a1.xlarge has insufficient storage, so for those ARM-based machines we allocate an additional 160G gp2 volume. Per EBS pricing, that costs:

160G * $0.1 (per month) / 30 days / 24 hours = $0.022/h

Fedora Copr runs in the N. Virginia region, so the on-demand prices are:

$0.156/h for i3.large
$0.102/h for a1.xlarge + $0.022/h for 160G volume = $0.124/h
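
The arithmetic above is easy to reproduce. The following is only a small sketch: the constant and function names are illustrative, the prices are the figures quoted in this post, and a 30-day month is assumed, as in the formula above.

```python
# Prices quoted in this post (N. Virginia); all names here are illustrative.
GP2_PRICE_PER_GB_MONTH = 0.10   # USD per GB-month for a gp2 volume
HOURS_PER_MONTH = 30 * 24       # the post assumes a 30-day month


def volume_hourly_cost(size_gb: float) -> float:
    """Hourly cost in USD of a gp2 volume of the given size."""
    return size_gb * GP2_PRICE_PER_GB_MONTH / HOURS_PER_MONTH


ON_DEMAND = {
    "i3.large": 0.156,
    # a1.xlarge needs the extra 160G volume, so its hourly cost is added
    "a1.xlarge": 0.102 + volume_hourly_cost(160),
}

if __name__ == "__main__":
    for instance_type, price in ON_DEMAND.items():
        print(f"{instance_type}: ${price:.3f}/h")
```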

How to check the spot pricing? Log into the AWS web UI => go to the “EC2” category => open the collapsible “Instances” menu on the left side => “Spot Requests” => press “Pricing history” in the top right corner => search for the instance types to see the spot price history:

$0.0468/h for i3.large, about 70% savings against on-demand
$0.0336/h for a1.xlarge + $0.022/h for the volume (nothing changes here) = $0.0556/h, that is about 55% savings (1 - 0.0556/0.124)
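
The same lookup can also be scripted instead of clicking through the console. The sketch below is not what we run in production. It assumes an object that behaves like boto3's EC2 client and uses its describe_spot_price_history call; the function names are made up for illustration, and only the first page of results is inspected.

```python
from datetime import datetime, timedelta, timezone


def lowest_spot_prices(ec2_client, instance_types, hours=24):
    """Return {instance_type: lowest spot price seen in the last `hours`}.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    Only the first page of the response is read (no pagination).
    """
    response = ec2_client.describe_spot_price_history(
        InstanceTypes=list(instance_types),
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=hours),
    )
    prices = {}
    for record in response["SpotPriceHistory"]:
        price = float(record["SpotPrice"])  # the API returns prices as strings
        itype = record["InstanceType"]
        prices[itype] = min(price, prices.get(itype, price))
    return prices


def savings_percent(on_demand, spot):
    """Savings of the spot price against the on-demand price, in percent."""
    return (1 - spot / on_demand) * 100


if __name__ == "__main__":
    # Figures quoted in this post; the volume cost is included for a1.xlarge.
    print(f"i3.large: {savings_percent(0.156, 0.0468):.0f}% cheaper")
    print(f"a1.xlarge: {savings_percent(0.124, 0.0556):.0f}% cheaper")
```

With boto3 installed and AWS credentials configured, passing boto3.client("ec2", region_name="us-east-1") as the client should return live prices for the N. Virginia region.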

Although most of the demand is for the x86 architecture, we have about a 50:50 ratio of x86 to ARM builders in AWS. That’s because we have our own in-house x86 hypervisors that provide additional computational power.

How do we start them?

We need a flexible builder allocation mechanism that adapts to the current usage. We also want to always have some machines preallocated, so that users don’t have to wait for the VMs to boot (2-3 minutes). We use the resalloc server for the VM allocation; it calls a script to start, stop and check each virtual machine.
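
To illustrate the preallocation idea, here is a minimal sketch of how a pool could decide how many new machines to start. This is not resalloc's actual algorithm, and the function and parameter names are hypothetical. It only shows the principle: keep a fixed number of spare machines ready, but never exceed an overall limit.

```python
def machines_to_start(ready, starting, in_use, prealloc, max_total):
    """Return how many new VMs to start right now.

    ready     - booted machines waiting for a build task
    starting  - machines that are still booting
    in_use    - machines currently running a build
    prealloc  - how many spare (ready or starting) machines we want
    max_total - hard limit on all machines in the pool
    """
    missing = prealloc - (ready + starting)
    capacity_left = max_total - (ready + starting + in_use)
    return max(0, min(missing, capacity_left))


if __name__ == "__main__":
    # Two spares ready, one booting, five busy; we want four spares, ten max.
    print(machines_to_start(ready=2, starting=1, in_use=5,
                            prealloc=4, max_total=10))
```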

The starting script is just a thin wrapper around a set of our “configuration” playbooks, which include the crucial spawning task file. The related configuration files are in the inventory.

Caveats

Because some AWS subnet (lab) locations may be temporarily unavailable (outage), we randomly pick from a predefined set of subnets. When an outage happens, Resalloc just stops and retries the allocation in a different location.
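
The random-pick-and-retry idea can be sketched as follows. In our setup the retry is done by Resalloc calling the starting script again; the loop below merely stands in for that. The subnet names, the start_vm callback and the AllocationError exception are all hypothetical. Because the pick is random, a retry can land on the same subnet again, it is just unlikely to do so repeatedly.

```python
import random


class AllocationError(Exception):
    """Raised when a VM could not be started in the chosen subnet."""


def start_in_random_subnet(subnets, start_vm, max_attempts=3, rng=random):
    """Try to start a VM, picking a random subnet for every attempt.

    `start_vm(subnet)` is a caller-supplied function that returns a VM
    handle or raises AllocationError.
    """
    last_error = None
    for _ in range(max_attempts):
        subnet = rng.choice(subnets)
        try:
            return start_vm(subnet)
        except AllocationError as err:
            last_error = err  # remember the failure and try again
    raise AllocationError(
        f"gave up after {max_attempts} attempts") from last_error
```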

The good thing about this Ansible approach is that we can “declaratively” describe the machine to be started, and it usually “just works”. The bad thing is that the starting process isn’t 100% under our control, and sometimes the playbook fails while the resource stays in some intermediate (but still running, and thus charged) state in AWS. So we also have a cron job doing periodic cleanups.
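
The core of such a cleanup is a comparison between what runs in the cloud and what the allocator knows about. The sketch below is not our actual cron job, and all names in it are hypothetical. It assumes the instance list has already been fetched from AWS, and it adds a grace period so that machines which are still being started are not removed by mistake.

```python
from datetime import datetime, timedelta, timezone


def find_orphans(cloud_instances, known_names, now,
                 grace=timedelta(minutes=30)):
    """Return names of instances that run in the cloud but are untracked.

    cloud_instances - iterable of (name, launch_time) pairs
    known_names     - set of instance names the allocator knows about
    now             - current time (timezone-aware, like launch_time)
    grace           - instances younger than this are never reported
    """
    return [
        name
        for name, launched in cloud_instances
        if name not in known_names and now - launched > grace
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    running = [
        ("builder-1", now - timedelta(hours=2)),    # tracked, kept
        ("builder-2", now - timedelta(hours=2)),    # untracked and old
        ("builder-3", now - timedelta(minutes=5)),  # untracked but too young
    ]
    print(find_orphans(running, {"builder-1"}, now))
```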

The ‘spot_price’ argument cannot easily be set to null or a float number (as specified by ansible-doc ec2) with the Jinja2 templating in Ansible. Therefore we pass either an empty string (“”) or the float written as a string (fortunately, it works as we expect).
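
On the calling side, this convention boils down to a tiny helper like the one below. It is only a sketch with a hypothetical function name. It also assumes that the empty string stands in for null, i.e. that no spot price is requested.

```python
def spot_price_value(max_price):
    """Return the string to pass as the spot price template variable.

    None means that no spot price should be set; we assume the empty
    string is accepted in place of null. Anything else is rendered as
    a float written as a string.
    """
    if max_price is None:
        return ""
    return str(float(max_price))


if __name__ == "__main__":
    print(repr(spot_price_value(None)))
    print(repr(spot_price_value(0.156)))
```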