As you may already know, I have built the Pi cluster to be a sandbox in which I can tinker with various technology stacks: experiment, tweak configs, and try out new ideas quickly and easily. My goal is to play with the currently fashionable stacks like Docker, Hadoop, ELK, you name it.
Everyone who has tried setting up a server stack knows that it is not an easy job: it may take days or weeks and hundreds of little steps just to get a stack running. Installation is an essential part of the fun, especially for hobby projects, but it also makes you more and more cautious as your system's complexity grows, ultimately limiting the speed of innovation. What if you mess up a configuration and need to restore the system to an earlier state? What would happen if you needed to start over from scratch? Will you still remember all the steps needed to install the stack in 6 months? If you just add customizations to a stack without any documentation, you will dread reinstalling it in just a few weeks. Even if you write very detailed setup documentation, repeating each step by hand on a 4-node cluster is tedious and error-prone, to say the least. Over time, you end up with snowflake servers: unique and irreproducible machines with configs you do not dare to touch.
To overcome the obstacle of configuration management, I decided to use "Infrastructure as Code" principles to set up the cluster. The basic idea is to formalize the state of the infrastructure in code (a Ruby DSL, YAML, etc.) and use provisioning / configuration management software to realize that setup on cloud or on-premise hardware. This code is usually declarative, describing properties of the desired system, and the configuration management software executes the operating system commands needed to build it. A configuration like this is usually written to be idempotent: it can be run multiple times without fear of corrupting the system. The source code is written using software engineering best practices: it is stored in a version control system, and it is reviewed and tested much like the source code of any other software product. With this approach the system configuration is well documented and can be executed by a machine, which leads to quicker setup times and fewer errors. Provisioning a system becomes very inexpensive. In the early 2010s, with the rise of cloud computing, "Infrastructure as Code" became popular among early adopters, and it appeared on the ThoughtWorks Technology Radar in 2011 as a philosophy mature enough to be adopted in projects. Since then the provisioning tools have matured even further and have become more enterprise grade. The most commonly used configuration management tools today are Puppet, Chef and Ansible.
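To make "declarative and idempotent" concrete, here is a minimal sketch of what such code can look like. It happens to use the Ansible syntax I introduce below; the package and service names are just illustrative examples, not part of my actual cluster config:

```yaml
# Illustrative only: a declarative, idempotent snippet (Ansible syntax).
# It states the desired end state; running it a second time changes nothing.
- hosts: all
  become: true
  tasks:
    - name: Ensure the NTP daemon is installed
      apt:
        name: ntp
        state: present

    - name: Ensure the NTP service is enabled and running
      service:
        name: ntp
        state: started
        enabled: true
```

Note that the tasks do not say "run apt-get install"; they say "this package should be present", and the tool figures out whether any work is needed.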
I have chosen Ansible from the above list as it seemed to have a very low barrier to entry. It is an agentless solution: you install the Ansible application on your "Control Machine", and no additional software needs to be installed on the nodes being provisioned, just standard SSH access. The control machine has access to the provisioning scripts, called "playbooks", and it uses SSH to execute them on the remote nodes. Playbooks are written in YAML and can be extended with new modules in any programming language. The built-in modules have been more than adequate to serve my provisioning needs so far, and I was able to quickly script my playbooks in YAML using the Emacs yaml-mode plugin. I found the following resources to be very good introductions to the topic (a minimal playbook sketch follows the list):
- Videos
- Books
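
To give a feel for the workflow, here is a minimal, hypothetical sketch of an inventory and a playbook; the hostnames, addresses and packages are made up for illustration and are not my actual cluster configuration:

```yaml
# inventory.yml - the nodes the control machine reaches over plain SSH
# (group name, hostnames and addresses are illustrative)
cluster:
  hosts:
    node1:
      ansible_host: 192.168.1.101
    node2:
      ansible_host: 192.168.1.102
    node3:
      ansible_host: 192.168.1.103
    node4:
      ansible_host: 192.168.1.104
```

```yaml
# basic-tools.yml - a tiny playbook run from the control machine;
# it uses only the built-in apt module
- hosts: cluster
  become: true
  tasks:
    - name: Install a few basic command line tools
      apt:
        name:
          - vim
          - htop
          - git
        state: present
        update_cache: true
```

A single command on the control machine then applies the playbook to every node in the inventory: `ansible-playbook -i inventory.yml basic-tools.yml`.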
To illustrate the power of this technology: with the help of my custom Ansible playbooks I can provision the 4-node cluster from scratch in less than 2 hours. It takes about 30 minutes to flash the memory cards, perform basic network setup and install Ansible on the controller node; then the scripts take over and install basic tooling, NFS, collectd/Graphite-based telemetry, Docker and noip/SSH-based remote access in 70 minutes without user interaction. The Raspberry Pi's slow SD card I/O is the main bottleneck in the process. On real-world systems with modern SSDs the provisioning would take much less time.
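My actual playbooks are more involved than this, but the top level of such a setup could look roughly like the sketch below; the role names are hypothetical and simply mirror the components listed above:

```yaml
# site.yml - a hypothetical top-level playbook tying the pieces together;
# each role bundles the tasks for one of the components mentioned above
- hosts: cluster
  become: true
  roles:
    - common          # basic tooling
    - nfs             # shared storage
    - monitoring      # collectd + Graphite telemetry
    - docker          # container runtime
    - remote-access   # noip + SSH remote access
```

Once the controller node is up, the entire unattended part of the install boils down to one command: `ansible-playbook -i inventory.yml site.yml`.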