Building an Internal Cloud with Docker and CoreOS

This is the first in a series of posts about adding containers to our server farm to make it easier to scale, manage, and keep pace with our business.

The key ingredients are:

  • Docker: container technology for making applications portable and predictable
  • CoreOS: provides a minimal operating system, systemd for orchestration, and Docker to run containers (there’s a small sketch of this just below)
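
To make the CoreOS piece concrete: on a CoreOS host, each container is supervised by a plain systemd unit. Here’s a minimal sketch — the unit, container, and registry names are hypothetical, not our production setup:

    [Unit]
    Description=Hypothetical app container
    After=docker.service
    Requires=docker.service

    [Service]
    # Clean up any stale container from a previous run, then start fresh.
    # systemd supervises the docker run process and restarts it on failure.
    ExecStartPre=-/usr/bin/docker rm -f app
    ExecStart=/usr/bin/docker run --rm --name app registry.example.com/app:latest
    ExecStop=/usr/bin/docker stop app
    Restart=always

This keeps process babysitting in the host OS, where it belongs, rather than in the application.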

Shopify is a large Ruby on Rails application that has undergone massive scaling in recent years. Our production servers can handle over 8,000 requests per second by spreading the load across 1,700 cores and 6 TB of RAM.

Docker has been getting lots of attention for its ability to bundle applications into portable containers that can be version-controlled and easily distributed.

A container feels much like a virtual machine (VM) in that it has an isolated filesystem, network, and so on, but it is lighter weight and makes much more efficient use of hardware. Rather than emulating physical hardware like a VM, Docker containers allow safe sharing of a Linux host between applications using:

  • kernel namespaces to keep applications isolated
  • cgroups to provide resource limits and accounting (see the flags in the sketch after this list)
  • layered file systems to keep container sizes down
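
As a rough illustration of how those pieces surface in practice, Docker exposes the cgroup knobs as flags on docker run; the image, name, and limits below are placeholders:

    # The container gets its own namespaces (pid, net, mnt, ...) automatically;
    # the flags below add cgroup limits: a hard memory cap and a CPU weight.
    docker run -d --name worker \
      --memory 512m \
      --cpu-shares 512 \
      ubuntu:14.04 sleep infinity

No hypervisor starts and no guest kernel boots, which is where the startup and density wins described below come from.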

Docker is growing in popularity with both our development and operations teams, for many reasons. Here’s the not-so-short list:

  • Fast Startup: containers are up and running fast, single-digit seconds fast. If you’re smart, you can also do expensive operations once at build time instead of on every boot. Faster startup means faster deploys.
  • Density: containers share resources aggressively with the host machine (e.g. a single kernel), allowing much more efficient use of hardware than VMs. This means more applications per rack of hardware.
  • Consistency: containers ensure that each instance starts in an identical state.
  • Developer-friendly workflow: containers are built from plaintext Dockerfiles which can be version controlled, and you can diff container versions. In addition, containers can be distributed with familiar git-like push/pull primitives. There’s a small taste of this right after the list.
  • Ops-friendly workflow: containers bundle the operating system userland, so if you need a package or custom configuration your developers can make the change rather than having to pull in ops. This is a major shift in responsibility: developers own the containers, and ops can concentrate on providing bulletproof hardware and networks.
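
To give a flavour of that workflow, here’s a deliberately tiny, hypothetical Dockerfile; a real Rails container needs much more, which is exactly what the Containerizing post in this series will dig into:

    # Start from a stock Ubuntu base image.
    FROM ubuntu:14.04
    # Install runtime dependencies (massively simplified here).
    RUN apt-get update && apt-get install -y ruby
    # Bake the application code into the image.
    COPY . /app
    WORKDIR /app
    CMD ["bin/web"]

Because it’s plain text, it diffs and reviews like any other code, and shipping a build is a familiar push/pull dance (the registry name is made up):

    docker build -t registry.example.com/app:v42 .
    docker push registry.example.com/app:v42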

The killer feature of containers is that a single machine can serve multiple purposes: containers doing different work can coexist peacefully on the same hardware. This turns the data center into a general-purpose computing resource.

For example:

  • Scaling up the number of app server containers can be done quickly and easily, meaning we’re always ready to handle sudden traffic spikes (see the one-liner after this list).
  • Teams can easily borrow spare capacity from the production server farm to perform computationally heavy operations, such as data analysis.
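
For example, with a systemd template unit (app@.service) on each host, adding app server containers is a one-liner; the unit name and port range are invented for illustration:

    # Start four more app-server instances from the app@.service template;
    # each %i (here, a port number) becomes its own supervised container.
    sudo systemctl start app@{8081..8084}.service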

Sound compelling? We think so, and we’re ready to start talking about our experiences turning this vision into production-grade reality. Expect more than talk: we’ll be sharing as much of the code as we can with the community.

There’s a lot of ground to cover, so we’ll roll out the content in installments:

  • Containerizing: How do you jam an existing application into a container in a way that satisfies an opinionated, simplicity-obsessed development team?
  • Managing Secrets: Real apps are full of API keys and database passwords which you really want to keep safe (with bonus points for making them version controlled and easy to find when the next Heartbleed happens). We have a solution that works.
  • Routing: How do you connect a container to the outside world, and play nicely with production infrastructure like load balancers?
  • Monitoring: Our containerized stack needs to match or improve upon Shopify’s 99.97% uptime. Spotting and heading off trouble before it turns critical is a surprising amount of work, and it requires a different approach in a containerized world than on bare metal. Adapting our monitoring to this new environment is an essential part of keeping Shopify stable.
  • Bulletproofing: Failing gracefully and predictably is hard. Learn from our experience in handling out-of-memory, request queuing, and signal handling situations.
  • Provisioning: We’ve built our system on top of Chef & Ubuntu today, with a plan to evolve toward CoreOS running on bare metal. Learn about how we provision nodes, get them bootstrapped to run containers, and monitor them effectively in an ‘all containers’ world.

Keep an eye out for future deep-dives into these fascinating topics.