How an Intern Released 3 Terabytes Worth of Storage Before BFCM

by Gurpreet Gill
Infrastructure

Dec 14, 2018
4 minute read

Hi there! I’m Gurpreet, and I’m currently finishing up my second internship at Shopify. I was part of the Products team during both of my internships. The team is responsible for building and maintaining the products area of the Shopify admin. As a developer, every day is another opportunity to learn something new. Although I worked on many tasks during my internship, today I’ll talk about one particular problem I solved.

The Problem

As part of the Black Friday Cyber Monday (BFCM) preparations, we wanted to make sure our database was resilient enough to smoothly handle increased traffic during flash sales. After completing an analysis of our top SQL queries, we realized that the database was scanning a large number of fixed-size storage units, called InnoDB pages, just to return a single record. We identified the records that caused this excess scanning; they had historically been kept for reporting purposes. After talking with different teams and making sure that these records were safe to delete, the team decided to write a background job to delete them.

So how did we accomplish this task, which could have taken our database down and resulted in downtime for our merchants?

The Background Job

I built the Rails background job using existing libraries that Shopify built to avoid overloading the database during operations such as deletion. A naive way to perform deletions is to send either a single batch delete query or one delete query per record. MySQL operations are not easy to interrupt, and the naive approach could easily overload the database with thousands of operations.

The job-iteration library, one of the Shopify libraries I leveraged to overcome this issue, allows background jobs to run in iterations. The job runs in small chunks and can be paused between iterations to let higher-priority jobs run first or to perform certain checks. The job has two parts: the enumerator and the iterator. The enumerator fetches records in batches and passes one batch at a time to the iterator. The iterator then fetches the records in the given batch and deletes them.

While this made sure that we weren’t deleting a large number of records in a single SQL query, we still needed to make sure we weren’t deleting the batches too fast. Deleting batches too fast results in high replication lag and can affect the availability of the database. Thankfully, we have an existing internal throttling enumerator, which I also leveraged when writing the job.
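To make the enumerator/iterator split and the throttling step concrete, here is a minimal, self-contained sketch in Python with SQLite. It is not the job-iteration API or Shopify’s internal throttling enumerator. The table `records`, the `stale` column, and the functions `enumerate_batches`, `delete_batch`, `run_job`, and `db_is_healthy` are all hypothetical names chosen for illustration.

```python
import sqlite3
import time
from typing import Callable, Iterator, List


def enumerate_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[int]]:
    """Enumerator: yield primary keys of stale records, one small batch at a time."""
    last_id = 0  # assumes positive integer ids; tracks the last id already handed out
    while True:
        rows = conn.execute(
            "SELECT id FROM records WHERE stale = 1 AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        ids = [row[0] for row in rows]
        last_id = ids[-1]
        yield ids


def delete_batch(conn: sqlite3.Connection, ids: List[int]) -> int:
    """Iterator step: delete exactly one batch and report how many rows went away."""
    placeholders = ",".join("?" * len(ids))
    cur = conn.execute(f"DELETE FROM records WHERE id IN ({placeholders})", ids)
    conn.commit()
    return cur.rowcount


def run_job(
    conn: sqlite3.Connection,
    batch_size: int,
    db_is_healthy: Callable[[], bool],
    backoff_seconds: float = 0.0,
) -> int:
    """Delete stale records batch by batch, pausing whenever the health check fails."""
    deleted = 0
    for ids in enumerate_batches(conn, batch_size):
        deleted += delete_batch(conn, ids)
        # Throttle: after each iteration, wait until the database looks healthy again.
        # In production this check might look at replication lag (an assumption here).
        while not db_is_healthy():
            time.sleep(backoff_seconds)
    return deleted
```

Each DELETE touches at most `batch_size` rows, and the health check between iterations is the only place the job waits, so a struggling database gets room to recover before the next batch is sent.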

After each iteration, the throttling enumerator checks if we’re starting to overload the database. If so, it automatically pauses the job until the database is back in a healthy state. We ensured our fetch queries used proper indexes and the enumerator used a proper cursor for batches to avoid timeouts. A cursor can be thought of as a flag on the last record of the previous batch. The next batch is fetched using the flagged record as the pivot, which avoids re-fetching previous records and includes only the new ones in the current batch.
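The cursor idea can be sketched as a single function that takes the last seen primary key and returns the next batch plus the updated cursor. This is an illustrative Python/SQLite sketch, not the enumerator we used. The `records` table and the `fetch_batch` function are hypothetical, and it assumes the batches are ordered by an indexed primary key.

```python
import sqlite3
from typing import List, Optional, Tuple


def fetch_batch(
    conn: sqlite3.Connection, cursor: Optional[int], batch_size: int
) -> Tuple[List[int], Optional[int]]:
    """Return the ids that come after `cursor`, plus the cursor for the next call."""
    if cursor is None:
        # First batch: there is no pivot yet, so start from the beginning.
        rows = conn.execute(
            "SELECT id FROM records ORDER BY id LIMIT ?", (batch_size,)
        ).fetchall()
    else:
        # Later batches: use the last record of the previous batch as the pivot,
        # so the database seeks straight past everything already processed.
        rows = conn.execute(
            "SELECT id FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (cursor, batch_size),
        ).fetchall()
    ids = [row[0] for row in rows]
    new_cursor = ids[-1] if ids else cursor
    return ids, new_cursor
```

Because the cursor is just a value, it can be stored between iterations, so a paused job can resume from where it stopped instead of starting over.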

The Aftermath

We ran the background job approximately two weeks before BFCM. It was a big deal: not only did it free up three terabytes of storage and result in large cost savings, it also made our database more resilient to flash sales.

For example, after the deletion, as seen in the chart below, our database was scanning roughly 3x fewer pages to return a single record. Because the database was reading fewer pages per record, it could serve more requests during flash sales without getting overloaded by unnecessary page scans. This also helped make sure our merchants got the best BFCM experience, with minimal technical issues during flash sales.

Database Scanning After Deletion

Truth be told, I was very nervous watching the background job run, because if anything went wrong, that meant downtime for the merchants, which is the last thing we want. And man, what a horrible intern experience that would have been. At the peak, we were deleting approximately six million records a minute. The Shopify libraries I leveraged helped make deleting over 🔥5 billion records🔥 look like a piece of cake 🎂.

5 billion Records Deleted

What I Learned

I learned so much from this project. I got vital experience with open source projects when using Shopify’s job-iteration library. I also did independent research to better understand MySQL indexes and how cursors work. For example, I didn’t know that MySQL can use just part of a multi-column index, or how that works. MySQL picks the longest leftmost prefix of the index’s keys that matches predicates in the WHERE clause and uses that prefix to evaluate the query. Suppose we have an index on (A,B,C). A query with predicates on (A,C) in the WHERE clause will only use the key A from the index, but a query with predicates on (A,B) will use the keys A and B. I also learned how to use SQL EXPLAIN to analyze SQL queries. It shows which indexes the database considered using, which index it ended up using, an estimate of how many rows it expects to examine, and a lot of other useful information.

Apart from improving my technical skills, working on this project made me realize the importance of collecting as much context as one can before even attempting to solve the problem. My mentor helped me with cross-team communication. Overall, context gathering allowed me to identify possible complications ahead of time and make sure the background job ran smoothly.
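The prefix rule for the (A,B,C) index mentioned above can be written down as a few lines of Python. This is a simplified model for illustration only: the function name `usable_index_prefix` is hypothetical, it considers equality predicates only, and it ignores other optimizations the MySQL optimizer may apply.

```python
from typing import List, Sequence, Set


def usable_index_prefix(index_columns: Sequence[str], predicate_columns: Set[str]) -> List[str]:
    """Return the leftmost index columns that the WHERE predicates can use for lookup."""
    prefix: List[str] = []
    for column in index_columns:
        # The prefix stops at the first index column that has no predicate.
        if column not in predicate_columns:
            break
        prefix.append(column)
    return prefix


index = ["A", "B", "C"]
print(usable_index_prefix(index, {"A", "C"}))  # ['A']
print(usable_index_prefix(index, {"A", "B"}))  # ['A', 'B']
print(usable_index_prefix(index, {"B", "C"}))  # []
```

The last call shows why column order matters: with no predicate on A, none of the index’s keys can be used as a lookup prefix in this model.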

Can you see yourself as one of our interns? Applications for the Summer 2019 term will be available at shopify.com/careers/interns from January 7, 2019. The deadline for applications is Monday, January 21, 2019, at 9:00 AM EST!
