Black Friday and Cyber Monday are the biggest days of the year at Shopify with respect to every metric. As the Infrastructure team started preparing for the upcoming seasonal traffic in the late summer of 2014, we were confident that we could cope, and determined resiliency to be the top priority. A resilient system is one that functions with one or more components being unavailable or unacceptably slow. Applications quickly become intertwined with their external services if not carefully monitored, leading to minor dependencies becoming single points of failure.
For example, the only part of Shopify that relies on the session store is user sign-in - if the session store is unavailable, customers can still purchase products as guests. Any other behaviour would be an unfortunate coupling of components. This post is an overview of the tools and techniques we used to make Shopify more resilient in preparation for the holiday season.