Jack Li 11/14/2019

Successfully Merging the Work of 1000+ Developers

Collaboration with a large team is challenging, and even more so when everyone works on a single codebase, like the Shopify monolith. Shopify changes 40 times a day. We follow a trunk-based development workflow and merge around 400 commits to master daily. There are three rules that govern how we deploy safely, but they were hard to maintain at our growing scale. Soft conflicts broke master, slow deployments caused large drift between master and production, and emergency merges took longer to deploy because of a backlog of pull requests. To solve these issues, we upgraded the Merge Queue (our tool to automate and control the rate of merges going into master) so that it integrates with GitHub, runs continuous integration (CI) before merging to master to keep it green, removes pull requests that fail CI, and maximizes the deployment throughput of pull requests.

Our three essential rules about deploying safely and maintaining master:

  1. Master must always be green (passing CI). This matters because we must be able to deploy from master at all times. If master is not green, our developers cannot merge, which slows all development across Shopify.
  2. Master must stay close to production. Letting master drift too far ahead of what is deployed to production increases risk.
  3. Emergency merges must be fast. In case of emergencies, we must be able to quickly merge fixes intended to resolve the incident.

Merge Queue v1

Two years ago, we built the first iteration of the merge queue inside our open-source continuous deployment tool, Shipit. Our goal was to prevent master from drifting too far from production. Rather than merging directly to master, developers add pull requests to the merge queue, which merges them on their behalf.

Merge Queue v1: developers add pull requests to the merge queue, which merges pull requests on their behalf

Pull requests build up in the queue rather than merging to master all at once. Merge Queue v1 controlled the batch size of each deployment and prevented merging when there were too many undeployed pull requests on master. It reduced the risk of failure and possible drift from production. During incidents, we locked the queue to prevent any further pull requests from merging to master, giving space for emergency fixes.
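The gating rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, its parameters, and the threshold value are hypothetical, since the post does not say how Merge Queue v1 was implemented or what limit it used.

```python
def merges_allowed(queued: int, undeployed: int, max_undeployed: int, locked: bool) -> int:
    """Return how many queued pull requests may merge to master right now.

    queued:         pull requests waiting in the merge queue
    undeployed:     pull requests already on master but not yet in production
    max_undeployed: hypothetical cap on undeployed pull requests
    locked:         True while the queue is locked, e.g. during an incident
    """
    if locked:
        # A locked queue merges nothing, leaving room for emergency fixes.
        return 0
    # Only merge as many pull requests as fit under the cap.
    headroom = max(0, max_undeployed - undeployed)
    return min(queued, headroom)
```

For example, with a cap of 8, 5 undeployed pull requests, and 10 queued, this sketch would let 3 more merge, and none at all while the queue is locked.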

Merge Queue v1 browser extension

Merge Queue v1 used a browser extension that let developers send a pull request to the merge queue from within the GitHub UI. It also let them bypass the queue to merge fixes quickly during emergencies.

Problems with Merge Queue v1

Merge Queue v1 kept track of pull requests, but we were not running CI on pull requests while they sat in the queue. On some unfortunate days—ones with production incidents requiring a halt to deploys—we would have upwards of 50 pull requests waiting to be merged. A queue of this size could take hours to merge and deploy. There was also no guarantee that a pull request in the queue would pass CI after it was merged, since there could be soft conflicts (two pull requests that pass CI independently, but fail when merged together) between pull requests in the queue.

The browser extension was a major pain point because it gave our developers a poor experience. New developers sometimes forgot to install the extension, which resulted in accidental direct merges to master instead of going through the merge queue. These direct merges are disruptive if the deploy backlog is already large or if there is an ongoing incident.

Merge Queue v2

This year, we completed Merge Queue v2. We focused on optimizing our throughput by reducing the time that the queue is idle, and improving the user experience by replacing the browser extension with a more integrated experience. We also wanted to address the pieces that the first merge queue didn’t address: keeping master green and faster emergency merges. In addition, our solution needed to be resilient to flaky tests—tests that can fail nondeterministically.

No More Browser Extension

Merge Queue v2 came with a new user experience. We wanted an interface for our developers to interact with that felt native to GitHub. We drew inspiration from Atlantis, which we were already using for our Terraform setup, and went with a comment-based interface.

Merge Queue v2 went with a comment-based interface

A welcome message gets issued on every pull request with instructions on how to use the merge queue. Every merge now starts with a /shipit comment. This comment fires a webhook to our system to let us know that a merge request has been initiated. We check if Branch CI has passed and if the pull request has been approved by a reviewer before adding the pull request to the queue. If successful, we issue a thumbs up emoji reaction to the /shipit comment using the GitHub addReaction GraphQL mutation.
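To make the reaction step concrete, here is a small Python sketch that builds the request body for GitHub's addReaction GraphQL mutation with the THUMBS_UP reaction. It only constructs the JSON payload. Sending it (an authenticated POST to the GitHub GraphQL endpoint) is left out, and the function name and the comment node ID passed in are placeholders rather than anything taken from our system.

```python
import json

# GraphQL document for GitHub's addReaction mutation.
ADD_REACTION_MUTATION = """
mutation AddReaction($subjectId: ID!, $content: ReactionContent!) {
  addReaction(input: {subjectId: $subjectId, content: $content}) {
    reaction { content }
  }
}
"""


def thumbs_up_request_body(comment_node_id: str) -> str:
    """Build the JSON body that adds a thumbs up reaction to a comment.

    comment_node_id is the GraphQL node ID of the /shipit comment
    (a placeholder value here; a real one comes from the webhook payload).
    """
    return json.dumps(
        {
            "query": ADD_REACTION_MUTATION,
            "variables": {"subjectId": comment_node_id, "content": "THUMBS_UP"},
        }
    )
```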

In the case of errors, such as invalid base branch, or missing reviews, we surface the errors as additional comments on the pull request.

Jumping the queue by merging directly to master is bad for overall throughput. To ensure that everyone uses the queue, we programmatically disable direct merges to master using GitHub branch protection as part of the merge queue onboarding process.

However, we still need to be able to bypass the queue in an emergency, like resolving a service disruption. For these cases, we added a separate /shipit --emergency command that skips any checks and merges directly to master. This helps communicate to developers that this action is reserved for emergencies only and gives us auditability into the cases where this gets used.
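The two commands and the checks behind them can be sketched as follows. This is a simplified illustration, not the merge queue's actual code: the class, the function names, the parsing rules (for example, rejecting unknown flags), and the exact error strings are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MergeRequest:
    emergency: bool


def parse_shipit_comment(body: str) -> Optional[MergeRequest]:
    """Return a MergeRequest if the comment is a /shipit command, else None."""
    tokens = body.strip().split()
    if not tokens or tokens[0] != "/shipit":
        return None
    flags = tokens[1:]
    if not flags:
        return MergeRequest(emergency=False)
    if flags == ["--emergency"]:
        return MergeRequest(emergency=True)
    # Assumption: anything else is not treated as a valid command.
    return None


def validation_errors(
    request: MergeRequest,
    *,
    ci_passed: bool,
    approved: bool,
    base_branch: str,
    target_branch: str = "master",
) -> List[str]:
    """Return the errors to surface as comments; empty means the merge may proceed."""
    if request.emergency:
        # Emergency merges skip all checks and go straight to master.
        return []
    errors = []
    if base_branch != target_branch:
        errors.append("invalid base branch")
    if not approved:
        errors.append("missing reviews")
    if not ci_passed:
        errors.append("Branch CI has not passed")
    return errors
```

Keeping the emergency path as an explicit flag on the same command means every bypass leaves a comment on the pull request, which is what gives the audit trail.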

Keeping Master Green

In order to keep master green, we took another look at how and when we merged a change to master. If we run CI before merging to master, we ensure that only green changes merge. This improves the local development experience by eliminating the cases of pulling a broken master, and it speeds up the deploy process by removing delays caused by a failing build.

Our solution here is to have what we call a “predictive branch,” implemented as a git branch, onto which pull requests are merged and on which CI is run. The predictive branch serves as a possible future version of master, but one that we are still free to manipulate. We avoid maintaining a local checkout, which incurs the cost of running a stateful system that can easily be out of sync, and instead interact with this branch using the GitHub GraphQL API.

To ensure that the predictive branch on GitHub is consistent with our desired state, we use a pattern similar to React’s “Virtual DOM.” The system constructs an in-memory representation of the desired state and runs a reconciliation algorithm we developed that performs the necessary mutations to the state on GitHub. The reconciliation algorithm synchronizes our desired state to GitHub in two main steps. The first step is to discard obsolete merge commits. These are commits that we may have created in the past but that are no longer needed for the desired state of the tree. The second step is to create the missing desired merge commits. Once these merge commits are created, a corresponding CI run is triggered. This pattern allows us to alter our desired state freely when the queue changes and gives us a layer of resiliency in the case of desynchronization.
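The two reconciliation steps can be illustrated with a short Python sketch. It makes a simplifying assumption that the post does not state: the predictive branch is modeled as a linear chain in which each merge commit stacks one pull request on top of the previous one, so a commit stays valid only while everything before it is unchanged. The function name and the use of plain pull request identifiers are hypothetical.

```python
from typing import List, Tuple


def reconcile(desired: List[str], actual: List[str]) -> Tuple[List[str], List[str]]:
    """Compare the desired queue order with the merge commits that exist.

    desired: pull request IDs in the order we want them on the predictive branch
    actual:  pull request IDs whose merge commits currently exist, in order
    Returns (to_discard, to_create).
    """
    # Existing commits are reusable only along the common prefix, because each
    # merge commit contains every pull request merged before it.
    keep = 0
    while keep < len(desired) and keep < len(actual) and desired[keep] == actual[keep]:
        keep += 1
    to_discard = actual[keep:]   # step 1: obsolete merge commits
    to_create = desired[keep:]   # step 2: missing desired merge commits
    return to_discard, to_create
```

Under this model, if the branch holds merge commits for A, B, C, D and B is removed from the queue, the commits for B, C, and D are discarded and new ones are created for C and D on top of A. Because the function only compares two states, it can be rerun at any time and still converge.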

Merge Queue v2 runs CI in the queue

To meet our goal of keeping master green, we also need to remove pull requests that fail CI from the queue so that their failures do not cascade to every pull request behind them. However, like many other large codebases, our core Shopify monolith suffers from flaky tests. The existence of these flaky tests makes removing pull requests from the queue difficult because we lack certainty about whether failed tests are legitimate or flaky. While we have work underway to clean up the test suite, we have to be resilient to the situation we have today.

We added a failure-tolerance threshold, and only remove pull requests when the number of successive failures exceeds the failure tolerance. This is based on the idea that legitimate failures will propagate to all later CI runs, but flaky tests will not block later CI runs from passing. Larger failure tolerances increase accuracy, but at the cost of taking longer to remove problematic changes from the queue. To calculate the best value, we can look at the flakiness rate. To illustrate, let’s assume a flakiness rate of 25%. With a failure tolerance of n, a false positive (removing a pull request whose failures were all flaky) requires n + 1 successive flaky failures, so, treating runs as independent, its probability is 0.25 raised to the power n + 1. These are the resulting probabilities of a false positive for each failure tolerance.

Failure tolerance    Probability of a false positive
0                    25%
1                    6.25%
2                    1.56%
3                    0.39%
4                    0.098%


From these numbers, it’s clear that the probability decreases significantly with each increase to the failure tolerance. The probability will never reach exactly 0%, but in this case, a failure tolerance of 3 brings us sufficiently close. This means that on the fourth consecutive failure, we will remove the first pull request failing CI from the queue.
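The arithmetic and the removal rule can be sketched in Python as follows. Both functions are illustrative only. The first assumes CI runs fail independently at the flakiness rate. The second assumes a hypothetical data model in which entry i of the results list is the outcome of the CI run that includes pull requests 0 through i of the queue.

```python
from typing import List, Optional


def false_positive_probability(flakiness_rate: float, failure_tolerance: int) -> float:
    """Chance that failure_tolerance + 1 successive runs all fail from flakiness alone."""
    return flakiness_rate ** (failure_tolerance + 1)


def first_pr_to_remove(ci_passed: List[bool], failure_tolerance: int) -> Optional[int]:
    """Return the queue index of the pull request to remove, or None.

    A pull request is removed once the run that first included it and the
    runs that follow form a streak of failures longer than failure_tolerance.
    """
    streak_start = None
    for i, passed in enumerate(ci_passed):
        if passed:
            # A later passing run means the earlier failures were flaky.
            streak_start = None
            continue
        if streak_start is None:
            streak_start = i
        if i - streak_start + 1 > failure_tolerance:
            return streak_start
    return None
```

With a failure tolerance of 3, the results pass, fail, fail, fail, fail identify the pull request at index 1 for removal, while pass, fail, pass, fail, fail remove nothing because a later run passed.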

Increasing Throughput

An important objective for Merge Queue v2 was to maximize throughput. We should be continuously deploying and making sure that each deployment contains the maximum number of pull requests we deem acceptable.

To continuously deploy, we make sure that we have a constant flow of pull requests that are ready to go. Merge Queue v2 makes this possible by starting CI for pull requests as soon as they are added to the queue. The impact is especially noticeable during incidents when we lock the queue. Since CI runs before merging to master, we have pull requests passing and ready to deploy by the time the incident is resolved and the queue is unlocked. As the following graph shows, the number of queued pull requests rises as the queue gets locked, and then drops as the queue is unlocked and pull requests get merged immediately.

The number of queued pull requests rises as the queue gets locked, and then drops as the queue is unlocked and pull requests get merged immediately

To optimize the number of pull requests for each deploy, we split the pull requests in the merge queue up into batches. We define a batch as the maximum number of pull requests we can put in a single deploy. Larger batches result in higher theoretical throughput, but higher risk. In practice, the increased risk of larger batches impedes throughput by causing failures that are harder to isolate, and results in an increased number of rollbacks. In our application, we went with a batch size of 8 as a balance between throughput and risk.

At any given time, we run CI on three batches’ worth of pull requests in the queue. Having a bounded number of batches ensures that we’re only using CI resources on what we will need soon, rather than on the entire set of pull requests in the queue. This helps reduce cost and resource utilization.
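A minimal Python sketch of the batching described above, using the batch size of 8 and the limit of 3 batches under CI from the text. The function names and the representation of the queue as a simple ordered list are assumptions made for illustration.

```python
from typing import List, Sequence

BATCH_SIZE = 8          # maximum pull requests in a single deploy
BATCHES_UNDER_CI = 3    # batches' worth of pull requests that get CI at once


def split_into_batches(queue: Sequence[str], batch_size: int = BATCH_SIZE) -> List[List[str]]:
    """Split the queue, in order, into batches of at most batch_size pull requests."""
    return [list(queue[i:i + batch_size]) for i in range(0, len(queue), batch_size)]


def prs_to_run_ci_on(
    queue: Sequence[str],
    batch_size: int = BATCH_SIZE,
    max_batches: int = BATCHES_UNDER_CI,
) -> List[str]:
    """Return the pull requests at the head of the queue that should have CI running."""
    return list(queue[: batch_size * max_batches])
```

With 30 pull requests queued, this gives batches of 8, 8, 8, and 6, and CI runs only for the first 24. The remaining 6 start CI as earlier pull requests merge and the window moves forward.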

Conclusion

We improved the user experience, the safety of deploying to production, and the throughput of deploys by introducing Merge Queue v2. While we accomplished our goals for our current level of scale, there will be patterns and assumptions that we’ll need to revisit as we grow. Our next steps will focus on the user experience and on ensuring developers have the context to make decisions every step of the way. Merge Queue v2 has given us flexibility to build for the future, and this is only the beginning of our plans to scale deploys.

