An introduction to Docker: images, containers, links and names

This article discusses the process of ‘containerizing’ a Node.js project I made a couple of years ago, Nodeflakes. This seasonal festive demo is dusted off once a year around Christmas time and while the code has hardly changed since 2011 it’s nice to breathe some new life into it each year. While perusing the GitHub repository (and squirming a bit—my Node.js-based JavaScript has come a long way in two years) it occurred to me that its decentralized worker approach made it a perfect candidate for Docker containerization.

Getting started: know your processes

One aspect of Docker which can take some getting used to—particularly when coming from a full-blown VM setup such as Vagrant—is the concept of each container only running one primary process. My first foray into setting up a simple Ubuntu LAMP stack resulted in a dismal failure, and whilst it is entirely possible to run something like supervisord to manage multiple long-running processes within a container, it's still limited to the one main process. It takes something of a mental shift to understand and embrace this apparent restriction when you're used to being able to simply SSH in and tweak your machine on the fly (something you can still do, but only if your main process runs an SSH daemon!).

I’ve found it more productive instead to think about an application in terms of its individual components and see if these could (or already do) map to individual processes. This isn’t always easy but the trade-off is worth it; the Single Responsibility Principle is a well understood development paradigm which lends itself just as well to the execution of code as it does to the structure of it.

As a real-world and very simple example of this, I initially struggled to convert an unrelated 'db' Vagrant VM running MongoDB and Redis services into a corresponding Docker image. Whilst I did get it working, I eventually realised that the 'db' tag was a misnomer and all I actually needed was two separate Docker containers: one running Redis and one running MongoDB. The two processes don't share anything after all, so isolating them from one another makes a lot of sense—not to mention that of course MongoDB and Redis images already exist in the Docker Index, whereas my clumsy hybrid does not.

The concept of singularity and encapsulation also aligns well with Service-Oriented Architecture; a Docker container maps neatly to the concept of a service, and enforcing separation between components encourages clean, robust external interfaces for communication between them. Fortunately, Nodeflakes already consisted of three reasonably well defined, well isolated processes communicating via ZeroMQ, and as such made the perfect testbed for containerization. The processes are simply:

  • The Consumer
  • The Processor(s)
  • The Server

These components are discussed in the original article and as such I won’t go into any more detail about them here. Let’s put them in some containers!

Defining our Docker images

With the individual processes identified the next step is to start defining our images to run them from. There are two main ways to achieve this:

  • interactively; by running a shell from a base image and manually installing the required software
  • imperatively; by defining a series of commands inside a Dockerfile

The first approach is undoubtedly better when experimenting or finding your feet, but when you want to start building reliable, reusable images I’ve found using a Dockerfile a must.

Our base image: makeusabrew/nodeflakes

Docker images are composed of a series of layers. This has a number of implications, one of which is that images can be built on top of other images. It's perfectly possible to bundle up an application's common dependencies into a base image and then create other images which inherit from this base. All three of our processes need NodeJS, a handful of npm modules and ZeroMQ, meaning we have a very simple base image which encompasses pretty much all of the instructions we're going to need:
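
(The Dockerfile below is a paraphrase rather than the exact file from the repository; in particular the Node.js install commands and the /opt/nodeflakes path are illustrative.)

    # Sketch of the base image: ZeroMQ comes from the parent image, so all
    # that's left is Node.js, the application source and its npm modules.
    FROM makeusabrew/zeromq
    MAINTAINER Nick Payne

    # install Node.js and npm (exact packages and versions are an assumption)
    RUN apt-get update && apt-get install -y nodejs npm

    # pull in the application source and install its dependencies
    ADD . /opt/nodeflakes
    RUN cd /opt/nodeflakes && npm install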

You'll probably have noticed that this Dockerfile itself inherits from another base image, makeusabrew/zeromq, again making full use of layers where appropriate.

Our child images: nodeflakes-{server, consumer, processor}

The base requirements for our three processes are in this case exactly the same, meaning our child images barely have anything in them at all. Such a homogeneous set of dependencies combined with no database requirement is probably the exception to the rule of modern, diverse web development—another project I’m currently working on has five distinct, very different node types—but it certainly makes for a simpler introduction to Docker:
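
(Again a sketch rather than the verbatim file; the consumer's, for example, only needs to expose its ZeroMQ port and name its entry script, and the script path here is an assumption.)

    FROM makeusabrew/nodeflakes

    # the consumer publishes received tweets on a ZeroMQ socket; 5554 is the
    # port we'll meet again later as CONSUMER_PORT_5554_TCP
    EXPOSE 5554

    # entry script path is illustrative
    ENTRYPOINT ["node", "consumer.js"]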

Our use case is in fact so simple we don’t need the three child images at all; the only variations are the commands we execute when running each image and which ports should be exposed. However, creating the three child images has the benefit of being able to run them without having to worry about their internal configuration. Yes, the `EXPOSE` and `ENTRYPOINT` directives could just be passed as options to `docker run`, but having separate images:

  1. Promotes each executable script as a first-class process; it’s a cleaner mental model to have three well isolated, neatly packaged images each responsible for one task than it is to have one which must be used for many
  2. Allows us to talk about each image using the application’s terminology; it’s easier to discuss, tweak and share ‘the server’ or ‘the consumer’ than it is the ‘NodeJS / ZeroMQ box’
  3. Reduces the margin for error; no risk of getting ports or entrypoint commands wrong
  4. Isolates each part of the application; their dependencies can change as long as they can still communicate.

The whole idea behind Docker is containerization; the less we have to know about a container's contents in order to successfully run it, the better.

Building our images

With our Dockerfiles defined we simply need to build an image from each of them. I won’t go into much detail about this here other than to note the names we’ll give each image, since these are referred to elsewhere in the article:

  • Base image: makeusabrew/nodeflakes
  • Consumer: makeusabrew/nodeflakes-consumer
  • Server: makeusabrew/nodeflakes-server
  • Processor: makeusabrew/nodeflakes-processor

Note that these image names are arbitrary, although a couple of restrictions come into play when pushing images to the Docker Index: images must have a <user>/ namespace (non-namespaced images are reserved for official images) and you cannot push to another user’s namespace.
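
For reference, each image is built with a plain `docker build`, run from wherever the relevant Dockerfile sits, along these lines:

    docker build -t makeusabrew/nodeflakes .
    docker build -t makeusabrew/nodeflakes-consumer .
    docker build -t makeusabrew/nodeflakes-server .
    docker build -t makeusabrew/nodeflakes-processor .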

Docker only stores the diffs for each new image so the overheads of having three very similar child images are minimal—something we can prove to ourselves by taking a look at the visual history graph for our images. Note that the history for makeusabrew/zeromq and its dependency on the base ubuntu image has been omitted for brevity:

[Image: tree visualisation of the Nodeflakes images]

Each of the hashes in the diagram represents a build step in the relevant image, most of which are cached and therefore reusable, making image generation cheap and efficient. The long chain between makeusabrew/zeromq and makeusabrew/nodeflakes tallies neatly with the number of steps in its Dockerfile (MAINTAINER counts as a build step), but most important is the fact that each of our child images clearly reuses the base image's layers. To be absolutely certain we're not creating undue overhead, we can take a look at the output from `docker images`, paying particular attention to the size column: each child image adds next to nothing on top of the base.

This efficiency applies everywhere you’d expect it to; pushing images, pulling images, building images—they all reuse any data they’ve already got.

Running our first container: the consumer

The consumer is the part of the application which connects to Twitter and streams incoming tweets, so it makes sense to test that part out first. It is an incredibly simple process, responsible solely for maintaining stream connectivity and placing all received tweets on a queue. The `docker run` command is by far the longest of the three due to Twitter now blocking basic API authentication, meaning we have to pass in a number of awkward-looking environment variables to avoid storing credentials in the code:
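
(The command below is a close approximation rather than a copy and paste: the container name is simply my choice, and the environment variable names are placeholders for whatever the consumer code expects.)

    docker run -d -name consumer \
        -e TWITTER_CONSUMER_KEY=<your app key> \
        -e TWITTER_CONSUMER_SECRET=<your app secret> \
        -e TWITTER_ACCESS_TOKEN=<your user token> \
        -e TWITTER_ACCESS_TOKEN_SECRET=<your user secret> \
        makeusabrew/nodeflakes-consumer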

If you want to try the consumer container for yourself you’ll need to have a valid app token & secret as well as your own valid user token & secret. This is a pity since all we want is to stream some read-only public data, but it seems basic authentication has long since been deprecated.

The name of the container—one of 0.6.5’s fantastic new features—is important. You’ll see why when we start the final part of the application in a couple of steps’ time.

Running the server

We run the server next—not the processor—because it forms the other stable part of the application. Processors are designed to be transient; they can come and go as they please, so it’s important they can always connect to the relevant queue endpoints when they do so. This is based on the Divide and Conquer approach in the fantastic ZeroMQ guide.
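
The run command itself is short; something like this, with the container name again being my own choice:

    docker run -d -p 7979:7979 -name server makeusabrew/nodeflakes-server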

The `-p 7979:7979` flag is important: it forwards the container's private port 7979 to port 7979 on our host machine. Which port we map it to on the host doesn't really matter, but I usually choose the same as the private one unless it's unavailable. Port 7979 is used—arbitrarily—by the Nodeflakes server to listen for incoming WebSocket connections, and by mapping it to a consistent port on the host we can now connect to the server on localhost:7979 whenever the container is running, regardless of its IP address (something which varies every time the image is run). Although we've previously discussed keeping runtime configuration to a minimum, this is one flag which must remain a configuration option, as we can't know in advance which port the user wants to map the WebSocket server to on their host machine. Note that this port mapping is unrelated to the ports we `EXPOSE`d in our earlier Dockerfiles; those are for inter-container communication, which we'll get onto next.

Running the processor

Finally we can start the processor, and it is at this stage that the importance of the previous containers’ names and start order becomes more apparent. We’re going to use the other major new feature of Docker 0.6.5, links:
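
(An approximation again; the container names match those chosen when starting the consumer and server.)

    docker run -d -name processor \
        -link consumer:consumer \
        -link server:server \
        makeusabrew/nodeflakes-processor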

The first part of each -link parameter is the name of an already running container which will be ‘linked’ into the new container and identifiable within it by the second part of the parameter. Docker will expose several environment variables to the new container relating to details of each link, allowing us to look up their values where needed in our code. Let’s have a look at what these environment variables look like inside our newly started container:
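
(The values below are illustrative, the IP addresses in particular will differ, and the output is trimmed to the consumer's variables; an equivalent SERVER_* set appears alongside them.)

    CONSUMER_PORT=tcp://172.17.0.2:5554
    CONSUMER_PORT_5554_TCP=tcp://172.17.0.2:5554
    CONSUMER_PORT_5554_TCP_ADDR=172.17.0.2
    CONSUMER_PORT_5554_TCP_PORT=5554
    CONSUMER_PORT_5554_TCP_PROTO=tcp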

Perfect! We can see numerous CONSUMER_* and SERVER_* environment variables—named as such after the link aliases we provided. These variables provide the details about the `EXPOSE`d ports from the linked containers, meaning we can consistently and predictably pick out the connection information required to connect to the relevant queue endpoints regardless of what IP addresses (or in fact ports) the consumer and server containers are allocated when run. The processor initialisation code should piece this all together:
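
(What follows is a simplified sketch of the idea rather than the actual Nodeflakes source; the server's port number in particular is illustrative.)

    // sketch: pull tweets from the consumer's queue, push results on to the server
    var zmq = require('zmq');

    var pull = zmq.socket('pull');
    var push = zmq.socket('push');

    // the link environment variables hand us ready-made tcp:// connection strings
    pull.connect(process.env.CONSUMER_PORT_5554_TCP);
    push.connect(process.env.SERVER_PORT_5555_TCP); // server port illustrative

    pull.on('message', function (tweet) {
        // ...do the interesting work, then pass the processed tweet on
        push.send(tweet);
    });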

ZeroMQ conveniently accepts connection details as a single string, meaning the {ALIAS}_PORT_{PORT}_TCP values are all we need to connect to our consumer queue, start processing the incoming tweets and then push them out to our server.

The benefit of the -name flag

Being able to name containers has numerous advantages but the main one from our point of view is predictability; we can always start our processor image with the same options because we know what the server and consumer containers are going to be called. If we had to look up their names each time we ran them we'd be adding an extra manual step to the start sequence, making the system much harder to automate.

The benefit of the -link flag

Links are the perfect complement to names; now that we have the ability to consistently address running containers we can start to model dependencies between them using predictable references. Whilst prior to Docker 0.6.5 it was perfectly possible to look up a running container's IP address and port information, this was again a manual step: we'd have to start our consumer and server, look up their details and then manually pass some arguments or environment variables to our processor when starting it. These steps would have to be repeated each time the consumer or server was restarted.

Additionally, 0.6.5 allows inter-container communication to be disabled entirely for added security. If ICC is disabled then links are the only way for containers to communicate with each other, so they're far more than a mere convenience. I wouldn't be too surprised if ICC were to be disabled by default in a future release.
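
In concrete terms, and assuming I have the flag right, that means starting the Docker daemon with something like:

    docker -d -icc=false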

Deploying our images

Once we’re happy with everything the last remaining step is to get our containers running on a production environment. For Nodeflakes this is simply a Rackspace Cloud server I spin up once a year, but—and this is where Docker really shines—it could be any machine running Docker, regardless of what other software it has installed or whether it is a VM, a dedicated server or even a Raspberry Pi. As long as the target platform has Docker installed the rest of your application’s dependencies are irrelevant and invisible to the host machine. The implications here for dependency simplification and process isolation are enormous.

Practically speaking there are a couple of ways to deploy containers; we could `docker export` each of them, but it's far simpler to push the images we've built to the public Docker Index—which I've done—and then `docker pull` them down on the production machine. Running them is then just a case of starting the containers exactly as before and voilà: self-contained, Docker-powered Nodeflakes! This also takes full advantage of the layers shared between the images; whichever one we `docker pull` first will take a while to download but the others will take seconds.
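
In practice that boils down to little more than the following, repeated for each image:

    # on the build machine
    docker push makeusabrew/nodeflakes-server

    # on the production box
    docker pull makeusabrew/nodeflakes-server
    docker run -d -p 7979:7979 -name server makeusabrew/nodeflakes-server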

Docker Gotchas

As with any software under heavy development, Docker is not without its quirks. None are by any means insurmountable but some are noteworthy; what follows is a collection of gotchas which caught me out when converting Nodeflakes to run on Docker.

Layer limitations

Dockerfile instructions can be repeated as often as you like, but each one creates a new layer, and at present the AUFS limit of 42 layers means you're encouraged to group similar commands where possible (e.g. combining separate apt-get install lines into one RUN command). The impact of this is that a single change to a long list of required apt-get packages invalidates Docker's build cache for that command and all those which follow it, which can mean a non-trivial amount of time spent tweaking Dockerfiles, rebuilding images and repeating until the image is just right. This can be particularly painful when building software from source, as it is typically a slow process.
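
In other words, prefer one combined instruction to several; the packages here are purely illustrative:

    # several instructions means several layers and several cache entries
    RUN apt-get update
    RUN apt-get install -y build-essential
    RUN apt-get install -y curl

    # ...whereas this gives one layer and one cache entry
    RUN apt-get update && apt-get install -y build-essential curl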

ADD commands do not cache

Although this is on the roadmap, it can catch you out at first. There are workarounds but the simplest solution is to group your ADD commands near the end of each Dockerfile so that they don't invalidate the cache for the instructions which follow them. I've also found that if a parent image contains ADD commands (as is the case with Nodeflakes) then the resultant image is correctly cached when used to build descendant images. That said, this can lead to another problem…

ADD commands do cache in a parent image

This is mostly great news if you want to work around the previous paragraph’s implications. However, if the files ADDed by the parent change, you have to rebuild both the parent and the child image in order for the child to inherit the newly ADDed files.

ADD commands need context

One thing I kept getting caught out by was not being able to ADD any directories which aren't at or below the level of the relevant Dockerfile (i.e. ADD ../src is invalid, but ADD ./src is fine). It's most likely just my project layout which is at fault, but it means my grouping of Dockerfiles under their own namespaced directories is awkward when I want to ADD the NodeJS code which sits in the root of the project folder. The solution is simply to copy or symlink each Dockerfile to the root of the project when building each image, but this is tedious; so much so that I've written a little helper utility to avoid doing it—something I'll write up separately.
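
To illustrate, with hypothetical paths:

    # invalid: the build context can't reach above the Dockerfile's directory
    ADD ../src /opt/nodeflakes/src

    # fine: the source sits at or below the Dockerfile
    ADD ./src /opt/nodeflakes/src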

In Summary

While Nodeflakes may not flex Docker's muscles to their limits it does provide a simple introduction to it (the diff consists mostly of the Dockerfile definitions) and hopefully illustrates the power of containers as a concept. In development they allow straightforward multi-machine topologies to be replicated with ease, and in production they offer overheads close to zero, all with blissful, portable and reusable isolation. But enough of that; head over and enjoy some Docker-powered nodeflakes!

Comments

Glen Mailer
Nice write-up Nick, I feel much more informed now!

One thing that struck me was that the ENV vars are quite docker-specific. I wonder if it would be cleaner to name them in relation to the app, then have a little bash script that is invoked by docker run to translate?

Perhaps I'm over-thinking it :)
Nick Payne
@Glen yeah, I know what you mean; they're not pretty. I'd always thought of the situation the other way round; put up with them and then just inject the same env vars should you ever not run it via docker. Ugly they might be, but I guess they're at least descriptive.

One thing I do hope for in future is the ability to alias port numbers: `CONSUMER_PORT_5554_TCP` would be far clearer and more meaningful as something like `CONSUMER_PORT_ZMQ_TCP` - plus having the '5554' bit as part of the variable seems to lose some isolation to me. Honestly not sure if port aliases are on the roadmap, but let's hope so :).
