Scaling iOS CI with Anka

Shopify has a growing number of software developers working on mobile apps such as Shopify, Shopify POS and Frenzy. As a result, the demand for a scalable and stable build system increased. Our Developer Acceleration team decided to invest in creating a single unified build system for all continuous integration and delivery (CI/CD) pipelines across Shopify, which includes support for Android and iOS.

We want our developers to build and test code in a reliable way, as often as they want, and a CI system should make this effortless. The result is that we can deliver new features quickly and with confidence, without sacrificing the stability of our products.

Shopify’s Build System

We have built our own CI system at Shopify, which we call Shopify Build. It’s based on Buildkite, and we run it on our own infrastructure. We’ve deployed our own version of the job bootstrap script that sets up the CI environment, rather than the one that ships with Buildkite. This allows us to accomplish the following goals:

  • Provide a standard way to define general purpose build pipelines
  • Ensure the build environment integrates well with our other developer tools and is consistent with our production environment
  • Ensure builds are resilient against infrastructure failures and flakiness of third-party dependencies
  • Provide disposable build environments so that subsequent jobs can’t interfere with each other
  • Support selective builds for monorepos, or repositories with multiple projects in them

Initially, Shopify Build only supported Linux environments using Docker to provide disposable environments, and it works extremely well for backend and Android projects. Previously, we had separate CI systems for iOS projects, but we wanted to provide our iOS developers with the same benefits as our other developers by integrating iOS into Shopify Build.

Build Infrastructure for iOS

Building infrastructure for iOS comes with its own unique set of challenges. It’s the only piece of infrastructure at Shopify that doesn’t run on top of Linux. For our Android build nodes, we can leverage the same Google Cloud infrastructure we already use in production. Unfortunately, cloud providers such as Amazon Web Services (AWS) and Google Cloud Platform (GCP) don’t provide infrastructure that can run macOS. The only feasible option for us is a non-cloud provider like MacStadium, but the tradeoff is that we can’t auto-scale the infrastructure based on demand.

2017: VMware ESXi and a Storage Area Network

Since we published our blog post on the VMware-based CI for Android and iOS, we’ve learned many lessons. We had a cluster of Mac Pros running ephemeral VMs on top of VMware ESXi. Although it served us well in 2017, it was a maintenance burden on the small team. We relied on tools such as Packer and ovftool, but we built many custom provisioning scripts to build and distribute VMware virtual machines.

On top of being difficult to maintain, the setup had a single point of failure: the Storage Area Network (SAN). Each Mac Pro shared this solid-state based infrastructure. By the end of 2017, we exceeded its write throughput, degrading build stability and speed for all of our mobile developers. Due to our write-heavy CI workload, the only solution was to upgrade to a substantially more expensive dedicated storage solution. Dedicated storage would push us a bit farther, but the system would still not be horizontally scalable.

2018: Disposable Infrastructure with Anka

While we were dealing with our challenges with VMware, a new virtualization technology called Anka was released by Veertu. Anka provides a Docker-like command line interface for spinning up lightweight macOS virtual machines, built on top of Apple’s Hypervisor.framework.

Anka has the concept of a container registry similar to Docker with push and pull functionality, fast boot times, and easy provisioning provided through a command line interface. With Anka, we can quickly provision a virtual machine with the preferred macOS version, disk, memory, CPU configuration and Xcode version.

Mac Minis Versus Mac Pros

Our VMware-based setup was running a small cluster of 12-core Mac Pros in MacStadium. The Mac Pros provided high bandwidth to the shared storage and ran multiple VMs in parallel. For that reason, they were the only viable choice for a SAN-based setup. However, Anka runs on local storage, and therefore it doesn’t require a SAN.

After further experimentation, we realized a cluster of Core i7 Mac Minis would be a better fit to run with Anka. They are more cost-effective than Mac Pros while providing the same or higher per-core CPU performance. For the price of a single Mac Pro, we could run about 6 Mac Minis. Mac Minis don’t provide 10 Gbit networking, but that isn’t a deal breaker in our Anka setup as we no longer need a SAN. We’re running only one Anka VM per Mac Mini, giving us four cores and up to 16 GB memory per build node. Running a single VM also avoids the performance degradation that we found when running multiple VMs on the same host, as they need to share resources.
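The core-count side of that comparison can be worked through with the figures quoted above: a 12-core Mac Pro, about 6 Mac Minis for the same price, and one four-core VM per Mini. This is a back-of-the-envelope sketch only. It ignores hyper-threading, hypervisor overhead, and per-core speed differences, and the constant and function names are made up for illustration.

```python
# Figures taken from the text; "about 6" is an approximation in the source.
MAC_PRO_CORES = 12            # one 12-core Mac Pro
MINIS_PER_MAC_PRO_PRICE = 6   # Mac Minis that cost roughly the same as one Mac Pro
CORES_PER_MINI_VM = 4         # one Anka VM with four cores per Mac Mini


def mini_cores_for_one_mac_pro_price() -> int:
    """Total VM cores available from the Mac Minis bought for one Mac Pro's price."""
    return MINIS_PER_MAC_PRO_PRICE * CORES_PER_MINI_VM


def core_ratio() -> float:
    """How many Mac Mini VM cores you get per Mac Pro core at equal spend."""
    return mini_cores_for_one_mac_pro_price() / MAC_PRO_CORES


if __name__ == "__main__":
    print(mini_cores_for_one_mac_pro_price())
    print(core_ratio())
```

Under these assumptions, the same budget buys roughly twice as many cores, spread over six isolated build nodes instead of one shared host.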

Distributing Anka Images to Nodes in Different Regions

We use a separate Mac Mini as a controller node that provisions an Anka VM with all dependencies such as Xcode, iOS simulators and Ruby. The command anka create generates the base macOS image in about 30 minutes and only needs a macOS installer (.app) from the Mac App Store as input.

Anka’s VM image management optimizes disk space usage and data transfer times when pushing and pulling the VMs on the Mac nodes. Our images build automatically in multiple layers to benefit from this mechanism. Multiple layers allow us to make small changes to an image quickly. By re-using previous layers, changing a small number of files in an image across our nodes can be done in under 10 minutes, and upgrading the Xcode version in about an hour.
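To illustrate why layering makes small changes cheap, here is a toy model of layer re-use. It is not Anka's actual storage format or API. It only shows that a node which already holds most of an image's layers needs to transfer just the ones it is missing. All layer names are hypothetical.

```python
from typing import List


def layers_to_pull(image_layers: List[str], local_layers: List[str]) -> List[str]:
    """Return the layers of an image that a node does not hold yet, in image order."""
    have = set(local_layers)
    return [layer for layer in image_layers if layer not in have]


# Hypothetical layer stacks, ordered from the bottom layer to the top layer.
base_image = ["macos-base", "ruby", "xcode-9.2", "simulators-9.2"]
small_change = base_image + ["config-fix"]
xcode_upgrade = ["macos-base", "ruby", "xcode-9.3", "simulators-9.3"]

if __name__ == "__main__":
    # A small change adds one thin layer on top of what the node already has.
    print(layers_to_pull(small_change, base_image))
    # An Xcode upgrade replaces the large upper layers, so more data moves.
    print(layers_to_pull(xcode_upgrade, base_image))
```

In this model, keeping rarely changing content in the lower layers is what lets a small change finish much faster than an Xcode upgrade.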

After the provisioning completes, our controller node continues by suspending the VM and pushing it to our Anka registries. The image is tagged with its unique git revision. We host the Anka Registry on machines with 10 Gbps networking. Since all nodes run Anka independently, we can run our cluster in two MacStadium data centers in parallel. If a regional outage occurs, we offload builds to just one of the two clusters, giving us extra resiliency.

The final step of the image distribution is a parallel pull on the Mac Minis. To speed up the process, each Mac Mini pulls only the new layers from the images available in its respective Anka Registry. Each Mac Mini has 500 GB of SSD storage, which is enough to store all our macOS image variants. We allow build pipelines to specify images with both name and tags, such as macos-xcode92:latest or macos-xcode93:<git-revision>, similar to how Docker manages images.

The Anka Image Distribution Process
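As a sketch of how a pipeline's name:tag image reference could be split into its parts, the snippet below parses strings of the form shown above. The ImageRef type and the parse_image_ref function are hypothetical names. Falling back to the latest tag when none is given is an assumption borrowed from Docker's convention, not something the text confirms for Shopify Build.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    name: str
    tag: str


def parse_image_ref(ref: str, default_tag: str = "latest") -> ImageRef:
    """Split 'name:tag' into its parts; a bare 'name' gets default_tag (assumption)."""
    name, sep, tag = ref.partition(":")
    if not name:
        raise ValueError(f"image reference has no name: {ref!r}")
    if sep and not tag:
        raise ValueError(f"image reference has an empty tag: {ref!r}")
    return ImageRef(name, tag if sep else default_tag)


if __name__ == "__main__":
    print(parse_image_ref("macos-xcode92:latest"))
    print(parse_image_ref("macos-xcode93:0123abc"))  # a git revision used as the tag
```

Pinning a pipeline to a git revision tag keeps its environment fixed, while latest follows whatever image was distributed most recently.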

Running Builds With Anka and Buildkite

We use Buildkite as the scheduler and front-end for CI at Shopify. It allows for fine-grained customization of pipelines and build scripts, which makes it a good fit for our needs.

We run a single Buildkite Agent on each Mac Mini and keep our git repositories cached on each of the hosts, for a fast git checkout. We also support shallow clones. We found that with a single large repository and many git submodules, a local cache gives the best performance. As mentioned before, we maintain copies of suspended Anka images on each of the Mac Minis. Suspended Anka VMs, rather than stopped ones, can boot in under a second, which is a significant improvement over our VMware VMs, which took about one minute to boot even from a suspended state.
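The text does not say how the host-level git cache is implemented. One common approach, shown here purely as an assumption, is to keep a local mirror up to date and clone with git's --reference option so that objects are borrowed from the mirror instead of being downloaded again. The function name and all paths are hypothetical, and submodule handling is left out.

```python
from typing import List


def checkout_commands(remote_url: str, cache_dir: str,
                      workdir: str, revision: str) -> List[List[str]]:
    """Build the git commands for a checkout that reuses a local cache.

    Assumes cache_dir already holds a clone of remote_url with an 'origin' remote.
    """
    return [
        # Refresh the shared cache so it contains the revision to build.
        ["git", "-C", cache_dir, "fetch", "--prune", "origin"],
        # Clone for this job, borrowing objects from the cache where possible.
        ["git", "clone", "--reference", cache_dir, "--no-checkout",
         remote_url, workdir],
        # Check out exactly the revision the build was triggered for.
        ["git", "-C", workdir, "checkout", "--detach", revision],
    ]


if __name__ == "__main__":
    for argv in checkout_commands(
        "git@example.com:org/ios-app.git",  # hypothetical repository
        "/var/cache/git/ios-app.git",       # hypothetical cache location
        "/tmp/build-1234/ios-app",          # hypothetical job directory
        "0123abc",
    ):
        print(" ".join(argv))
```

Each returned list can be passed to subprocess.run(argv, check=True) on a host that has git installed.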

As part of running a build, a sequence of Anka commands is invoked. First, we clone the base image to a temporary snapshot using anka clone. We then start the VM, wait for it to boot, and mount volumes to expose artifacts. With anka run, we execute the command corresponding to the Buildkite step and wait for it to finish. Artifacts are uploaded to cloud storage, and the Anka VM is deleted afterwards with anka delete.

The Lifecycle of a Build Job Using Anka Containers
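The lifecycle above can be sketched as a small wrapper around the Anka CLI. This is a minimal illustration, not Shopify's bootstrap script. It uses the subcommands named in the text (anka clone, anka run, anka delete) plus anka start. The --yes flag on delete and the exact argument order are assumptions to verify against the Anka version in use. Volume mounting, waiting for boot, and artifact upload are omitted. The runner parameter is a hypothetical hook that makes the sequence testable without Anka installed.

```python
import subprocess
import uuid
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(argv: Sequence[str]) -> None:
    """Default runner: execute the command and raise if it exits non-zero."""
    subprocess.run(list(argv), check=True)


def run_build_job(base_image: str, step_command: List[str],
                  runner: Runner = run_checked, vm_name: str = "") -> str:
    """Clone a disposable VM, run one build step in it, and always delete it."""
    vm = vm_name or f"build-{uuid.uuid4().hex[:8]}"
    # Clone the suspended base image into a temporary VM for this job only.
    runner(["anka", "clone", base_image, vm])
    try:
        runner(["anka", "start", vm])
        # Execute the Buildkite step's command inside the VM.
        runner(["anka", "run", vm, *step_command])
    finally:
        # Delete the clone even when the step fails, so jobs cannot
        # interfere with each other. "--yes" is an assumed flag.
        runner(["anka", "delete", "--yes", vm])
    return vm


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_build_job("macos-xcode92", ["bundle", "exec", "rake", "test"],
                  runner=lambda argv: print(" ".join(argv)))
```

Putting the delete in a finally block is the design choice that matters here: the temporary VM is removed whether the step succeeds or fails.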

We monitor the demand for build nodes and work with MacStadium to scale the number of Mac Minis in both data centers. It’s easier than managing Macs ourselves, but it’s still a challenge as we can’t scale our cluster dynamically. In the graph below, you can see the green line indicating the total number of available build nodes and the required agent count in orange.

Our workload is quite spiky, with peak load exceeding our capacity at certain times of the day. During those periods, queue times increase. We expect to add more Mac Minis to our cluster as we grow our developer teams to keep our queue times under control.

A Graph Showing Shopify's iOS CI Workload over 4 Hours During Our Work Day
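A rough way to reason about the required agent count is Little's law: the average number of jobs running at once equals the arrival rate multiplied by the average job duration. The sketch below applies it to the roughly 350 build jobs per hour quoted in the summary. The 10-minute average job duration is a made-up figure for illustration, since the text gives none, and required_agents is a hypothetical helper.

```python
import math


def required_agents(jobs_per_hour: float, avg_job_minutes: float) -> int:
    """Average number of busy build nodes in steady state (Little's law), rounded up."""
    return math.ceil(jobs_per_hour * avg_job_minutes / 60)


if __name__ == "__main__":
    # 350 jobs/hour is from the text; 10 minutes per job is an assumption.
    print(required_agents(350, 10))
```

Because this is only a steady-state average, a spiky workload needs headroom above this number to keep queue times short at peak.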

Summary

It took us about four months to implement the new infrastructure on top of Anka with a small team. Building your own CI system requires an investment in engineering time and infrastructure, and at Shopify, we believe it’s worth it for companies that plan to scale while continuing to iterate at a high pace on their iOS apps.

By using Anka, we substantially improved the maintainability and scalability of our iOS build infrastructure. We recommend it to anyone looking for macOS virtualization in a Docker-like fashion. During the day, our team of about 60 iOS developers runs about 350 iOS build jobs per hour. Anka’s fast boot times reduce the setup time of each build step. Upgrading to new versions of macOS and Xcode is easier than before. We have eliminated shared storage as a single point of failure, thereby increasing the reliability of our CI system. It also means the system is horizontally scalable, so we can easily scale with the growth of our engineering team.

Finally, the system is easier to use for our developers by being part of Shopify Build, sharing the same interface we use for CI across Shopify.

If you would like to chat with us about CI infrastructure or other developer productivity topics, join us at Developer Productivity on Slack.