Toni Akinwumi 08/16/2017

Shopify is Going to Dublin for SREcon Europe 2017

Get excited because production engineers from Shopify will be speaking at this year's SREcon Europe in Dublin, Ireland! The conference runs from August 30 to September 1, and presentations will feature topics ranging from scriptable load balancers to resiliency testing with Toxiproxy.

This talk is about the tension between an urgency-driven organization and the diligent SRE teams that operate within it. We'll examine how to build, nurture, and support those teams. We'll look at how to celebrate and reward them for being prudent, cautious, and skeptical. And because it is the deliberate pace of these teams that allows the rest of the organization to move quickly, we'll dive into how to concretely measure the benefits and sell them as positives to the rest of the organization. Attendees will leave with tools and techniques to highlight the importance of their work as SREs, to decide when to trade speed for diligence, and to move fast and stay sane, all without cutting corners.

Wednesday, 30 August 2017 - 11:30am–12:00pm

Load balancers have the potential to provide application-aware middleware without making changes to the application itself. However, traditional load balancers can’t be easily and deeply customized or redeployed quickly without significant risk. Instead, we can embed a scripting language to fulfill these requirements.

At Shopify, we do this with Nginx and LuaJIT via OpenResty. Our Nginx scripts deploy in 10 seconds, run through a thorough suite of automated tests, and have allowed us to solve sharding across data centers, handle some of the world’s biggest flash sales, and respond quickly to layer 7 DDoS attacks. What once took a large team of engineers can now be accomplished by a team of any size.

The lessons learned from building this middleware framework are applicable to any service. By solving hard problems in your load balancers, you can benefit every application or service you run.

Wednesday, 30 August 2017 - 1:40pm–2:30pm

In a world with ever-growing DDoS attacks, L7 attacks give even the most experienced engineers the sweats. Imagine if, instead of following easy-to-detect patterns, bots could mimic the behavior of customers. Well, that’s exactly what Shopify sees every day during flash sales.

Come and learn how we block nearly all bot traffic on our load balancers without any human intervention. We will share the challenges of differentiating between web crawlers and bots, users behind NATs and bots rotating user agents, as well as fast humans and browser extensions. When the stakes are blocking a customer from completing a checkout, misclassification isn’t an option.

This is not yet another machine learning talk, but an example of how simple statistics, heuristics and some sane limits can give great results with minimal complexity. The lessons learned in this talk are applicable to any real-world problem with inexact constraints.

Wednesday, 30 August 2017 - 3:40pm–4:10pm
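
To give a flavor of what "simple statistics, heuristics and some sane limits" can look like, here is a minimal Python sketch of one such limit: a per-client sliding-window request counter. This is an illustration only, not Shopify's method. The class name, the threshold, the window size, and the choice of IP address as the client key are all assumptions made for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Rejects a client once it exceeds max_requests within window_seconds.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client key -> timestamps of recently allowed requests
        self._hits = defaultdict(deque)

    def allow(self, client_key, now):
        hits = self._hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: treat as suspicious
        hits.append(now)
        return True


# Assumed limit: at most 3 requests per 10 seconds per client.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 12.0)]
print(decisions)
```

A single limit like this would misclassify many users behind one NAT, which is exactly the kind of challenge the abstract describes. A real system would presumably combine several signals rather than rely on one counter.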

Fibers get cut, databases crash, and you’ve adopted Chaos Engineering to challenge your production environment as much as possible. But what are you doing to craft the resiliency test suites that minimize the impact of failure on your application as much as possible? How do you debug resiliency problems locally and make sure single points of failure don't creep into the application in the first place? We’ve used the open-source Toxiproxy for the past two years to emulate timeouts, latency and outages in development environments. This talk will equip you with the tools to start writing resiliency test suites to harden your own applications, to supplement other chaos engineering practices.

Thursday, 31 August 2017 - 9:50am–10:20am
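
The shape of a resiliency test like the ones described in the Toxiproxy talk can be sketched in a few lines of Python. To stay self-contained, this sketch does not use Toxiproxy or its API. It swaps in a hand-rolled local TCP server that delays its reply, standing in for injected latency. The function names, port handling, and timeout values are assumptions made for illustration.

```python
import socket
import threading
import time


def start_slow_server(delay_seconds):
    """Start a local TCP server that waits delay_seconds before replying."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve_once():
        conn, _ = server.accept()
        with conn:
            time.sleep(delay_seconds)  # the injected latency
            try:
                conn.sendall(b"ok")
            except OSError:
                pass  # the client already gave up
        server.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return port


def fetch_with_fallback(port, timeout_seconds, fallback=b"fallback"):
    """Read from the dependency; return fallback if it is too slow or down."""
    try:
        with socket.create_connection(
            ("127.0.0.1", port), timeout=timeout_seconds
        ) as conn:
            return conn.recv(16)
    except OSError:  # socket timeouts are a subclass of OSError
        return fallback


# The dependency answers after 0.5s, but the client only waits 0.1s.
slow_port = start_slow_server(delay_seconds=0.5)
result = fetch_with_fallback(slow_port, timeout_seconds=0.1)
print(result)
```

The point of such a test is to assert that the application degrades gracefully, here by returning a fallback, instead of hanging when a dependency becomes slow.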

SREs are expected to be incident management experts. Yet incident handling is hard, often messy, and exhausting. We encounter new incidents, look everywhere for possible explanations, sometimes get tunnel vision on symptoms, and, under pressure, forget some good practices.

At Shopify, we care not only about handling incidents quickly and efficiently, but also about SRE well-being. We have a special IMOC (Incident Manager On Call) rotation and an incident chatbot to assist IMOCs. In this talk, I’ll first explain the IMOC role and how training SREs for this duty is essential to handling incidents well.

Our chatbot assists the IMOC by reducing manual effort and context switching. We integrated the bot with our conversation tool and several third-party tools (PagerDuty, StatusPage, GitHub) to send timely reminders. It also binds the incident to a discussion channel where all communications happen, allows status page updates directly from the chat room, keeps notes and records event times, and generates service disruption content. To avoid burnout for long-running incidents, the chatbot also reaches out to other IMOCs.

Our chatbot supports best practices and streamlines incident response. Attendees will leave with strategies for incorporating chatbots into their incident management and considerations for automating precisely and smartly.

Thursday, 31 August 2017 - 4:30pm–5:00pm

  • 6 Ways a Culture of Communication Strengthens Your Team’s Resiliency - Jaime Woo

Shopify’s SRE team has grown quickly. The team used to fit in a single room, and now we span several offices and time zones. Proper channels of communication are vital and strengthen the team’s resiliency, delivering better work and happier SREs. This talk shares six ways our investment in a culture of communication has meaningfully improved the flow of information inside Shopify.

Thursday, 31 August 2017 - 5:00pm–6:00pm (during lightning talks)

Recently, Shopify began migrating from our custom container management system to Kubernetes. This switch will make us more efficient at running our large Rails monolith, as well as the current and future microservices that run alongside it. The first step in migrating was building a cluster using our own hardware.

Running Kubernetes on-premises requires building services that cloud providers hide from their customers: etcd, high-availability master nodes, scalable networking, Ingress, and persistent storage. We believe that understanding the challenges and tradeoffs in providing these services is beneficial not only to those who run their own cluster, but also to those who use cloud providers.

Beyond building the cluster, we also had to modify our core application and tooling to fit Kubernetes’ container-centric framework. We expect that most applications currently on homegrown deployment systems will similarly have to overcome host-based assumptions. In our case, these were unbounded jobs, hard-coded assumptions about hosts, and services exposed to external monitoring tools via global DNS.

Attendees will leave this talk equipped to decide if running their own Kubernetes cluster is right for them and how to make the shift as successful as possible.

Friday, 1 September 2017 - 9:00am–9:30am

If you're an engineer interested in life at Shopify, Emma and Sam will be at SREcon: reach out to them on Twitter. If you are interested in working at Shopify, head over to our careers page.