Using DNS Traffic Management to Add Resiliency to Shopify’s Services

Aug 5, 2020

If you're unfamiliar with DNS or traffic management, or wonder why we would even use it, read Part 1: Introduction to DNS traffic management.

Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management (TM) is a widely used approach to building in that resilience. In this second part of the two-part series, we’re sharing Shopify’s DNS traffic management journey from numerous manually set up, maintained, and updated traffic management approaches to the fully automated self-served system used by 40+ domains owned by 12+ different teams, handling 100M+ requests per day.

Shopify’s Previous Approaches to DNS Traffic Management

DNS traffic management isn’t entirely new at Shopify. Before we automated it in 2019, a number of different teams each had their own way of doing traffic management through DNS changes, with different sets of features and different techniques to update records.

Streaming Platform Team

The team handling our Kafka pipelines used Kubernetes ConfigMaps to define target clusters. So, making changes required

creating a pull request (PR) to the git repository holding the deployment configuration
getting the PR approved
merging the PR
waiting for the tests to pass
waiting for the shipping pipeline to run and deploy the change, which can take a few minutes depending on the circumstances

Besides the duration of the process, which isn't ideal for failover time, this manual approach doesn't open the door to any active/active configuration (where we share the traffic between two active clusters): that would require using two target clusters at once, while this approach only allows using the one defined in the ConfigMaps. At the time, being able to share traffic wasn't considered necessary for Kafka.

Search Platform Team

The team handling Elasticsearch set up the beginnings of DNS updates automation. They used our chatops bot, spy, to run commands requesting failovers. The bot creates a PR in our record-store repository (which uses the record_store open source project) with the requested DNS change, which then needs to be approved, merged, and deployed after passing the tests. Except for the automation, it’s a similar approach to the manual one, hence it has the same limitations with failover delays and active/active capabilities.

Edgescale Team

Part of the Edgescale team's responsibilities is to handle our assets (images and static files) and Content Delivery Networks (CDN), which bring the assets as close to the client as possible thanks to a network of distributed servers that store those files. To succeed in this mission, the team wanted more control and active/active capabilities. They used DNS providers allowing for weighted traffic management: they set weights on their records to define which share of traffic went to which endpoint, which allowed them to share the traffic between their two CDN providers. To set this up, they used the DNS providers’ APIs with weights that could span from 0 (disabled) to 15, using DNS A records (hostname to IP address). To make their lives easier, they wrote a spy cdn command, responsible for making the API calls to the two DNS providers.

This approach reduced the limitations of failover time and provided active/active capabilities. However, we couldn’t produce a stable, easy to reproduce, and version-controlled configuration of the providers. Adding endpoints to the traffic management had to be done manually, and was thus prone to errors. In addition, the providers we use for our CDN needs don’t perform the same way in every region of the world and incur different costs in those regions. Using different traffic shares depending on the geographical location of the requesters is one of the traffic-management use-cases we presented in Part 1: Introduction to DNS traffic management, but that wasn’t available here.

Other Teams

Finally, a few other teams decided to manually create records in one of the DNS providers used by the team handling our assets. They had fast failover and active/active capabilities, but they had neither a stable configuration nor redundancy, since they relied on a single DNS provider.

We had too many setups going all over the place, which caused maintainability problems. Also, any impactful change needed a lot of coordination and moving pieces. All of these setups had similar use cases and needs. We started working on how to improve and consolidate our approach to DNS traffic management to

reduce the limitations encountered by teams
connect other teams with similar needs
improve maintainability
create ownership

Consolidation and Ownership of DNS Traffic Management

The first step of creating a better service used by many teams at Shopify was to define the requirements and goals. We wanted a reliable and redundant system that would provide

regionalized traffic
fine-grained traffic sharing
failover capabilities
easy setup and updating by the teams owning traffic-managed domains

The final state of our setup provides a unified way of setting up and handling our services. We use a git repository to store each domain's configuration, which is then deployed to two DNS providers. The configuration can be tweaked quickly and easily for both providers through a set of spy commands, allowing for efficient failovers. Let's talk about the choices we made and how we built the system.

Establish Reliability and Redundancy

Each domain name has a set of nameservers. A DNS client selects one of those nameservers and queries it first, then uses another one when a timeout occurs. Shopify used a single DNS provider until 2016, when a large DNS outage happened while our DNS provider was under a distributed denial of service (DDoS) attack, effectively dropping a large number of legitimate requests. We learned from our mistakes and increased our reliability and redundancy by using more than one DNS provider.

When CDN traffic management was set up, it used two different DNS providers, following in the steps of our static DNS records in the record_store. The decision for our new system was easy to make: since using two providers prevents being dependent on a single one, we followed the same approach and adopted two providers to build our new standard.

Define The Traffic Management Layers

Two of our DNS providers allowed for regionalized and weighted traffic management, as well as multiple failover layers. It was just a matter of defining how we wanted things to work and building the equivalent approach for both providers.

We defined our approach in layers of traffic management and considered that each layer had a decision to make that would reduce the set of options that the next layers can choose from.

Layer 1 Geographical Fencing

The geographical fencing layer selects the endpoints that best match the region where the request originates from. The layer also supports globally matching endpoints, which are mandatory: we defined a global region that is selected when nothing more specific matches, so we always have an answer to a DNS request for an existing traffic-managed domain, even if there is no region specifically matching the requester. For instance, we set a rule to have endpoint A answered for Canada and endpoint B for Quebec. When we get a request originating from Montreal, we return B. If the request originates from Ottawa, we return A.
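The Canada and Quebec example above can be sketched as a most-specific-match lookup. This is only an illustration, not how the DNS providers implement it: the `REGION_PARENTS` hierarchy, the region names, and the `select_by_region` function are hypothetical, and the sketch assumes regions form a simple parent chain ending at the global region.

```python
# Hypothetical region hierarchy: each region points to its parent,
# and every chain ends at the mandatory "global" region.
REGION_PARENTS = {
    "quebec": "canada",
    "ontario": "canada",
    "canada": "global",
}


def select_by_region(rules, client_region):
    """Return the endpoints of the most specific rule matching the client.

    `rules` maps a region name to a list of endpoints and must contain
    a "global" entry so that there is always an answer.
    """
    if "global" not in rules:
        raise ValueError("a global rule is mandatory")
    region = client_region
    while region != "global":
        if region in rules:
            return rules[region]
        # Regions missing from the hierarchy fall through to "global".
        region = REGION_PARENTS.get(region, "global")
    return rules["global"]


rules = {"global": ["G"], "canada": ["A"], "quebec": ["B"]}
montreal = select_by_region(rules, "quebec")   # request from Montreal
ottawa = select_by_region(rules, "ontario")    # request from Ottawa
elsewhere = select_by_region(rules, "france")  # no specific rule
```

With these rules, the Montreal request resolves to B, the Ottawa request to A, and a request from a region without any rule falls back to the global endpoints.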

Layer 2 Endpoint Status

We provided a way to enable and disable the use of endpoints depending on their status, which is manually or automatically set. Automatically setting the endpoint status depends on a process called health checking or monitoring, where we try to reach the endpoint regularly in order to verify whether it does (healthy) or does not (unhealthy) answer. We added a layer of traffic management based on the endpoint status, aimed at selecting only the endpoints currently considered healthy. However, we don’t want the requester to receive an empty answer for a domain that does exist, as the empty answer would be cached for the negative TTL, which is most of the time higher than our traffic-managed domain TTL. If any of the endpoints is healthy, then only the healthy endpoints are returned by Layer 2. If none of the endpoints are healthy, all the endpoints are returned. The logic behind this is simple: returning something that doesn’t work is better than not returning anything, as it allows the client to start using the service again as soon as endpoints are back online.
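The rule described above fits in a few lines. The sketch below is a minimal illustration under assumed names: `filter_by_status` and the `healthy` mapping are hypothetical, and an endpoint with no known status is treated as unhealthy.

```python
def filter_by_status(endpoints, healthy):
    """Keep only healthy endpoints; if none are healthy, keep them all.

    `healthy` maps an endpoint name to True (healthy) or False (unhealthy).
    Endpoints missing from the mapping are treated as unhealthy (assumption).
    """
    alive = [e for e in endpoints if healthy.get(e, False)]
    # Never return an empty answer for a domain that exists.
    return alive if alive else list(endpoints)


some_down = filter_by_status(["A", "B", "C"], {"A": False, "B": True, "C": True})
all_down = filter_by_status(["A", "B"], {"A": False, "B": False})
```

Here `some_down` contains only B and C, while `all_down` contains both A and B because an answer that currently fails is preferred over an empty one.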

Layer 3 Endpoint Priority

Another aspect we want control over is the failover approach for our endpoints. We allow users to define levels of priority for the traffic shares of their domains. For instance, they could define, as the highest priority, that three endpoints A, B, and C receive 100%, 0%, and 0% of the traffic respectively, and, as a second priority, that B receives 100% of the traffic when A is unhealthy, instead of B and C sharing it. This can’t be done without a layer selecting endpoints based on their priority: either we’d be sending a share of the normal traffic to B, or we’d have no automated control over how B and C share the traffic when A is failing. Endpoints with a weight of 0 (not receiving traffic) at a given priority level are also considered down for that level. This means that when health checking finds the endpoints receiving traffic at a priority level unhealthy, all of that level's endpoints get discarded at Layer 2, allowing Layer 3 to keep only the highest-priority endpoints left in the returned list.
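The A, B, and C example above can be sketched as follows. This is a simplified model, not the providers' actual logic: `select_priority_level` is a hypothetical name, the priority levels are assumed to be an ordered list of endpoint-to-weight mappings, and the final fallback to the highest-priority level when nothing is usable is an assumption that mirrors Layer 2's rule of always returning something.

```python
def select_priority_level(levels, healthy):
    """Return the endpoint weights of the highest-priority usable level.

    `levels` is ordered from highest to lowest priority; each level maps
    an endpoint name to its weight. An endpoint is usable at a level only
    if it is healthy and has a weight above 0 at that level.
    """
    for level in levels:
        usable = {
            name: weight
            for name, weight in level.items()
            if weight > 0 and healthy.get(name, False)
        }
        if usable:
            return usable
    # Assumption: with nothing usable anywhere, fall back to the
    # highest-priority level unchanged rather than returning nothing.
    return dict(levels[0])


levels = [
    {"A": 100, "B": 0, "C": 0},  # highest priority: A takes all traffic
    {"B": 100},                  # second priority: B takes over from A
]
normal = select_priority_level(levels, {"A": True, "B": True, "C": True})
failover = select_priority_level(levels, {"A": False, "B": True, "C": True})
```

In the normal case only A is returned. When A is unhealthy, B and C are not usable at the first level because their weight there is 0, so the second level applies and B receives all the traffic.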

Layer 4 Weighted Selection

This final layer deals with the weights defined for the endpoints. Each endpoint E reaching this layer has a probability P_E of being selected as the answer. P_E is obtained through the formula <weight of E>/<sum of the weights of all endpoints reaching Layer 4>. Any 0-weighted endpoints are automatically discarded, unless there are only 0-weighted endpoints, in which case all endpoints have an equal chance of being selected and returned to the requester.
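The formula and the 0-weight rule above translate directly into code. The function names below are hypothetical, and the random draw uses Python's standard `random.choices`; the DNS providers perform this selection on their side, so this is only a model of the behavior.

```python
import random


def selection_probabilities(weights):
    """Map each endpoint to its probability of being returned.

    `weights` maps an endpoint name to a non-negative weight.
    """
    positive = {name: w for name, w in weights.items() if w > 0}
    if not positive:
        # Only 0-weighted endpoints reached this layer: equal chances.
        return {name: 1 / len(weights) for name in weights}
    total = sum(positive.values())
    # P_E = weight of E / sum of the weights of all remaining endpoints.
    return {name: w / total for name, w in positive.items()}


def pick_endpoint(weights, rng=random):
    """Draw one endpoint according to the computed probabilities."""
    probabilities = selection_probabilities(weights)
    names = list(probabilities)
    return rng.choices(names, weights=[probabilities[n] for n in names], k=1)[0]


shares = selection_probabilities({"A": 3, "B": 1, "C": 0})
fallback = selection_probabilities({"A": 0, "B": 0})
```

With weights of 3, 1, and 0, endpoint A is returned for three quarters of the requests, B for one quarter, and C never. When every weight is 0, each endpoint is equally likely.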

Deploying and Maintaining The Traffic-Managed Domains

We try to build tooling in a self-service way. Creating a new standard required us to make the tooling easily accessible for other teams to deploy their traffic-managed domains. Since we use Terraform with Atlantis for a number of our deployments, we built a Terraform module that receives only the required parameters from an application owner and hides most of the work happening behind the scenes to configure our providers.

The above code represents the gist of what an application owner needs to provide to deploy their own traffic-managed domain.

We work to keep our deployments organized, so we derive the zone and subdomain parameters from the path of the domain being terraformed. For example, the path to this file, terraform/tm.shopifysvc.com/test/domain.tf, allows deriving that the zone is tm.shopifysvc.com and the subdomain is test.
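The derivation itself happens in Terraform, but the idea can be illustrated in a few lines of Python. The sketch below is not the module's code: `zone_and_subdomain` is a hypothetical helper, and it assumes the layout is always <zone>/<subdomain>/domain.tf with a single-level subdomain directory.

```python
from pathlib import PurePosixPath


def zone_and_subdomain(path):
    """Derive (zone, subdomain) from a .../<zone>/<subdomain>/domain.tf path."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[-1] != "domain.tf":
        raise ValueError(f"unexpected path layout: {path}")
    # The two directories right above domain.tf carry the parameters.
    return parts[-3], parts[-2]


zone, subdomain = zone_and_subdomain("terraform/tm.shopifysvc.com/test/domain.tf")
full_name = f"{subdomain}.{zone}"
```

For the path from the example, `zone` is tm.shopifysvc.com and `subdomain` is test, so the traffic-managed domain would be test.tm.shopifysvc.com.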

When an application owner wants to make changes to their traffic-managed domain, they just need to update the domain.tf file and apply the terraform change. There are a number of extra features that control

automated monitoring and failover for their domains
monitoring configuration for domains
whether or not to page when a failover is automatically triggered

When we make changes to how traffic-managed domains are deployed, or add new features, we update the module and move domains to the new module version one by one. Everything stays transparent for the application owners and easy to maintain for us, the Edgescale team.

Everyday Traffic Steering Operations

We wanted the users of our new standard to be able to make changes to their traffic-managed domains quickly and easily. We built a new command in our chatops bot, spy endpoints, to perform operations on the traffic-managed domains.

Those commands operate on relative and absolute domains. Relative domains automatically receive our default traffic-managed zone as a suffix. It’s also possible to specify in which regions the change should apply by using square brackets; for example, cdn[us-*,na] targets the cdn traffic-managed domain, but only in the na region and in the regions whose names start with us-.
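To make the selector syntax concrete, here is a sketch of how such an argument could be parsed. It is not spy's actual parser. Several details are assumptions: the default zone is taken from the tm.shopifysvc.com path example, a trailing dot is assumed to mark an absolute domain (as in DNS zone files), region patterns are assumed to follow shell-style globbing, and the function names and the list of available regions are hypothetical.

```python
import fnmatch
import re

# Assumed default zone, borrowed from the Terraform path example.
DEFAULT_ZONE = "tm.shopifysvc.com"

# <domain> optionally followed by [pattern,pattern,...]
SELECTOR = re.compile(r"^(?P<domain>[^\[\]]+)(?:\[(?P<regions>[^\[\]]*)\])?$")


def parse_selector(selector):
    """Split 'cdn[us-*,na]' into a full domain name and region patterns."""
    match = SELECTOR.match(selector)
    if match is None:
        raise ValueError(f"invalid selector: {selector}")
    domain = match.group("domain")
    if domain.endswith("."):
        # Assumption: a trailing dot marks an absolute domain.
        full_name = domain.rstrip(".")
    else:
        full_name = f"{domain}.{DEFAULT_ZONE}"
    regions = match.group("regions") or ""
    patterns = [p.strip() for p in regions.split(",") if p.strip()]
    return full_name, patterns


def matching_regions(patterns, available):
    """Return the available regions matched by at least one pattern."""
    if not patterns:
        return list(available)  # no brackets: the change applies everywhere
    return [
        region
        for region in available
        if any(fnmatch.fnmatchcase(region, p) for p in patterns)
    ]


name, patterns = parse_selector("cdn[us-*,na]")
targets = matching_regions(patterns, ["us-east", "us-west", "na", "eu"])
```

In this example the selector expands to cdn.tm.shopifysvc.com, and the change would apply to us-east, us-west, and na while leaving eu untouched.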

The spy endpoints get command gets the current traffic shares between endpoints for a given domain. If all providers hold the same data (which should be the case most of the time), the command returns the results without any mention of the DNS providers. When the results differ (the providers went out of sync), the data is shown per provider to make sure that the current traffic shares are known.
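The in-sync and out-of-sync cases described above can be modeled as a simple comparison. The sketch below is illustrative only: `summarize_shares`, the provider names, and the shape of the returned dictionary are hypothetical and do not reflect spy's real output.

```python
def summarize_shares(per_provider):
    """Collapse identical provider views, or keep them apart if they differ.

    `per_provider` maps a provider name to {endpoint: weight}.
    """
    views = list(per_provider.values())
    if not views:
        raise ValueError("at least one provider is required")
    if all(view == views[0] for view in views):
        # Every provider agrees: report the shares once, without names.
        return {"in_sync": True, "shares": dict(views[0])}
    # Providers disagree: report each provider's view separately.
    return {"in_sync": False, "shares": {p: dict(v) for p, v in per_provider.items()}}


synced = summarize_shares({
    "provider_a": {"central-4": 95, "central-5": 5},
    "provider_b": {"central-4": 95, "central-5": 5},
})
drifted = summarize_shares({
    "provider_a": {"central-4": 95, "central-5": 5},
    "provider_b": {"central-4": 50, "central-5": 50},
})
```

The `synced` result carries a single set of shares, while `drifted` keeps one entry per provider so the discrepancy is visible.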

The spy endpoints set command changes the traffic shares using specific weight values we provide. It updates every provider and then runs the spy endpoints get command to show the new traffic shares. Instead of specifying the weights for each endpoint, it’s possible to use a number of defined profiles that set predefined traffic shares. For example, on our test domain with its two endpoints, the profile mostly[central-4] defines weights of 95 for central-4 and 5 for central-5.
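A profile such as mostly[central-4] can be thought of as a function that expands into explicit weights. The sketch below only reproduces the two-endpoint case given above; `apply_mostly_profile` is a hypothetical name, and splitting the remaining 5 equally when there are more than two endpoints is an assumption, not documented behavior.

```python
def apply_mostly_profile(endpoints, favored, favored_weight=95):
    """Expand a 'mostly[<favored>]' profile into per-endpoint weights."""
    if favored not in endpoints:
        raise ValueError(f"unknown endpoint: {favored}")
    others = [e for e in endpoints if e != favored]
    if not others:
        return {favored: 100}
    # Assumption: the remaining share is split equally among the others.
    share = (100 - favored_weight) / len(others)
    weights = {e: share for e in others}
    weights[favored] = favored_weight
    return weights


weights = apply_mostly_profile(["central-4", "central-5"], "central-4")
```

For the test domain this yields 95 for central-4 and 5 for central-5, matching the profile described above.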

Our Success Stories

Moving to Elasticsearch 7.0.0 - We talked about the process that was used by the Elasticsearch team and the fact that it was limited to failovers and didn’t allow traffic shares. When we moved our internal Elasticsearch clusters to Elasticsearch 7.0.0, the team was able to use the weighted load balancing provided by our tools to move the traffic chunk by chunk and ensure everything was working properly. It allowed them to keep the regular traffic going and mitigate any issue they might have encountered along the way, making the transition to Elasticsearch 7.0.0 seamless to the different systems using it.

Recovering from Kafka overload during a flash sale - During a large flash sale, the Kafka brokers in one of our clusters started overloading from the traffic they had to handle. Once the problem was identified, it took a few minutes for the Kafka team to realize that they now had traffic share capabilities from our DNS traffic manager. They used it to divert half the traffic from the overloaded region and send it to another available region. Less than five minutes after making that change, the Kafka queues started recovering.

Relieving on-call stress - Being on-call is stressful, especially when we need to run errands and don’t want to be stuck at home waiting for our phone to ring. Even with the great on-call culture at Shopify, and people always happy to override parts of your shift, being able to use the DNS traffic manager to steer the traffic of an application to another cluster when something happens helps in so many cases. One aspect the different teams appreciated is that the work can be done easily from a phone (thanks spy!). Another is that Shopifolk can stay serene while solving incidents, which are mitigated thanks to traffic management and don’t impact our merchants and their clients. In summary, easy to use tooling and practical features together improve the experience of both our merchants and coworkers.

Since the creation of our new DNS traffic management standard in the middle of 2019, we’ve onboarded more than 40 different domains across more than 12 different teams.

Why Ownership Is Important. Demonstrated By Example

A few months ago, while we were moving teams to the initial version of our DNS traffic manager, we got an email from one of our DNS providers letting us know that they would discontinue their service because it would be merged with the services of the company that bought them a few years prior. Of course, we weren’t so lucky: merging their systems would require action on our part. We needed to manually migrate our zones to the new provider.

We launched a project to find our next DNS provider as a result. Since we needed to manually migrate our zones and consequently all of our tooling, we might as well evaluate our options. We looked at more than 40 providers, keeping in mind our needs for our static zones and traffic management requirements. We selected a few providers that fit our needs and decided on which one to sign a contract with.

Once we chose the provider, the big migration happened. First, we updated our terraform module to support the new provider and deployed the traffic-managed domains in the three providers. Then we updated our spy endpoints tooling to update all providers when making changes so everything was ready and in sync. Next, we moved the nameservers of our traffic-managed zone one by one from the DNS provider we were leaving to the new DNS provider, making sure that in case of a problem only a controlled share of the traffic would be affected. We explained our migration plans to the different teams owning domains in the traffic manager, letting them know when it would happen, and that it should be transparent to them, but if anything unexpected seems to be happening, they should contact us. We also told the incident manager on-call of the changes happening and the timelines.

Everything was planned for the change to happen. However, 30 minutes before change time, the provider we were leaving had an incident that prevented us from moving traffic, at the same time as one of the application owners had an incident that they wanted to mitigate with the traffic manager. It pushed our timeline back, but we continued with the change without any issue and it was fully transparent to all the application owners.

Looking back at how things were before we rolled out our new standard for DNS traffic management, we can easily say that moving to a new DNS provider wouldn’t have been that smooth. We would have had to

contact every team using their own approach to gather their needs and usage so we could find a good alternative to our current DNS provider (luckily this was done while preparing and building this project)
coordinate between those teams for the change to happen, and then chase after them to make sure they updated any tooling used

The change couldn’t be handled for all of them as a whole as there wasn’t one product that one team handles, but many products that many teams handle.

With our DNS traffic management system, we brought ownership to this aspect of our infrastructure: we understand the capabilities and requirements of teams, and we know how to maintain and evolve the system as our teams’ needs evolve, improving the experience of our merchants and their customers.

Our DNS traffic management journey took us from many manually set up, maintained, and updated traffic management approaches to a fully automated self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h.