First impressions of Saltstack and Reclass

Dennis van Dok

HEPiX Spring 2018 Workshop — Madison, WI, Thursday 2018-05-17

A new Configuration Management system?

We've been using Quattor since the early DataGrid days.

The landscape is changing: grid services see less innovation, and new CM systems have emerged alongside growing cloud deployments.

If there ever was a moment to do it, this was it!

About this talk

  • not a technical talk
  • the journey is more interesting than the destination
  • we've got plenty of road ahead of us

A new system!

Credits to Andrew Pickford!

We looked at a Quattor upgrade:

  • a lot of work
  • smallness of the Quattor community
    • they certainly wanted to help
    • not easy to get going based on available documentation

(photo: Andrew Pickford)

Considering several alternatives

(But some were rejected outright based on personal prejudice.)

Two candidates came very close: Saltstack and Ansible, with no obvious winner.

Saltstack came out ahead by a nose on technicalities.

(Ansible would have served us just fine.)

What we liked

(Based on previous experiences)

  • We really liked the state concept of Saltstack (similar to Quattor).
  • Everything is YAML and Python. (And, ok, Jinja2.)
  • Nice integration with Reclass (more later).
  • Test mode shows what would change.

A first look at Saltstack

Discussed (a bit) at HEPiX before.

Widely used in various open source communities.

This is not a technical talk

(But anyway…)

  • master/minion system
  • minions controlled by defined states
  • static data provided by pillars
  • states are logically bundled into formulas
  • states are implicitly ordered by dependencies (see the sketch below)
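
As a flavour of what a state looks like, here is a minimal sketch of an SLS file (the package and service names are illustrative, not taken from our tree):

# ntp/init.sls -- a hypothetical state: install a package and keep its service running
ntp_package:
  pkg.installed:
    - name: ntp

ntp_service:
  service.running:
    - name: ntpd
    - enable: True
    - require:
      - pkg: ntp_package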

What goes where

data source   kind of data                         typical examples
pillar        static, per-node data                server name, IP address
formula       states related to a single aspect    mysql, iptables
state         elementary settings                  installed packages, running services
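
For illustration, the pillar for one node might contain something like this (keys and values are hypothetical):

# per-node pillar data -- hypothetical keys and values
network:
  hostname: node01.example.org
  ipv4: 192.0.2.10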

Example of state run in test mode

(screenshot: output of a salt state run in test mode)
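
The screenshot shows Salt's dry-run mode; such output comes from adding the test flag to a state run, for example (the minion target is hypothetical):

salt 'node01*' state.apply test=True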

Organising our data with Reclass

We separated the

  • moving parts (states) that are the same for all our nodes from the
  • static data specific to each node (pillar).

The pillar is provided by Reclass.

Reclass

A recursive classifier that collects static, hierarchical information about nodes and provides it as pillar data.

Originally at http://reclass.pantsfullofunix.net/; the most active fork at the moment is https://github.com/salt-formulas/reclass/. The version we currently use is https://github.com/AndrewPickford/reclass/.

Reclass in a nutshell

(Remember, not a technical talk!)

  • Each node specifies which classes it belongs to;
  • each class is a file in a hierarchy (i.e. directory structure);
  • each class file lists more classes and/or parameters;
  • later classes override simple values from earlier classes, or merge them in the case of lists.

Reclass example

Example, slightly simplified. This is a dCache master node in our testbed.

classes:
  - cluster.ndpf.testbed.dcache
  - hardware.vm.xen.standard
  - os.linux.redhat.centos.7
  - role.server.dcache.plain.master
environment: pre-prod
parameters:
  _hardware_: (here be the VM provisioning parameters)

Here is cluster/ndpf/testbed/dcache/init.yml:

classes:
  - cluster.ndpf.testbed
parameters:
  _cluster_:
    name: dcache testbed
    dcache_version: 3.1
    dcache_carbon_server: ${_cluster_:monitoring_satellite}
    dcache_nfs_allowed_ipv4:
      - ${_site_:networks:ipv4:stbcnet}
      - ${_site_:networks:ipv4:wnnet}

cluster/ndpf/testbed/init.yml:

classes:
  - cluster.ndpf
parameters:
  _cluster_:
    name: testbed
    monitoring_satellite: vaars-03.nikhef.nl

Note that _cluster_:name is given here, but the class cluster.ndpf.testbed.dcache overrides it.
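
Putting the snippets together, the effective _cluster_ pillar for this node resolves to roughly the following (a sketch of the merge result; the _site_ references are left out since those classes are not shown):

_cluster_:
  name: dcache testbed                        # from cluster.ndpf.testbed.dcache, overriding "testbed"
  monitoring_satellite: vaars-03.nikhef.nl    # from cluster.ndpf.testbed
  dcache_version: 3.1
  dcache_carbon_server: vaars-03.nikhef.nl    # ${_cluster_:monitoring_satellite} resolved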

What data goes where

  • Reclass allows more freedom in the layout of data
  • the data follows a logical structure rather than one imposed by the system
  • only simple constructs are allowed; complicated programming is relegated to the states

Shortcomings

Reclass is not without its shortcomings. It needed work to make it do what we wanted, and was (therefore) almost rejected.

We still went ahead and fixed it.

Redeeming qualities

Written in Python, which is nice and forgiving to programmers.

Our patches are available on GitHub, and we're looking to integrate them with the version maintained by the salt-formulas people.

Added features

  • Exports: allow extraction of information from other nodes; conceptually related to the salt mine, but coming in at an earlier stage of the processing chain.
  • References: enhanced to allow nesting; overriding values will merge instead of replace when the values are lists or dicts (illustrated below).
  • Git backend: works just like the git backend for Salt, so data is taken straight from a repository/branch.
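
For instance, with these patches a list defined in two classes is merged rather than replaced; a schematic sketch (class and parameter names are made up):

# classes/firewall/base.yml -- hypothetical base class
parameters:
  firewall:
    open_ports:
      - 22

# classes/firewall/web.yml -- hypothetical derived class; the lists merge, giving [22, 80]
classes:
  - firewall.base
parameters:
  firewall:
    open_ports:
      - 80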

Improved error handling and reporting.

     - Failed to load ext_pillar reclass: ext_pillar.reclass: -> cc2.cloud.ipmi.nikhef.nl
       Cannot resolve ${_cluster_:some:value}, at _cluster_:monitoring_satellite, in yaml_fs:///srv/salt/env/dennisvd/classes/cluster/ndpf/cloud/init.yml

Formulas

All the moving parts are grouped into formulas.

apache, authconfig, autofs, backupninja, bind, certificates, cinder, cobbler, contrailctl, cups, cvmfs, dcache, dell_mdsm, docker, elasticsearch, eos, galera, git, glance, grafana, graphite, grid, haproxy, hardware, horizon, icinga, iptables, keepalived, kerberos, keystone, kibana, linux, logrotate, logstash, maui, memcached, munge, mysql, neutron, nfs, nikhef, nova, ntp, pacemaker, pakiti, php, postfix, postgresql, prometheus, python, rabbitmq, reclass, repo-mirrors, rsync, rsyslog, salt, sanity-check, secure, tftpd_hpa, torque, zookeeper

Pros and cons

Pros:

  • encapsulate a functional element
  • forms a clear conceptual boundary
  • places complexity where we want to handle it

Cons:

  • many repositories (requires scripting)
  • mixed quality (often only tested on Debian)

Single or separate repositories?

Choice:

  • put all formulas in a single repository, or
  • keep all formulas in their own repository

Formulas and reclass

  • Formulas are driven by pillar data
  • This makes them integrate well with reclass (see the sketch below).
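
A rough sketch of how a formula state can pick up its configuration from pillar data (the keys and defaults are illustrative, not our actual dcache formula):

# dcache/init.sls -- hypothetical formula state driven by pillar data
{% set version = salt['pillar.get']('_cluster_:dcache_version', '3.1') %}

dcache_package:
  pkg.installed:
    - name: dcache
    - version: {{ version }}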

Information flow and relationships

Version control

  • keep everything in a private GitLab instance
  • the master branch in GitLab defines what is in production
  • other branches correspond to environments

Git as a workflow driver

  • git push to master determines what is in production
  • a manual deploy initiated afterwards is still necessary
  • we needed a pre-production testbed to test changes before the push
  • we needed a way to sync up the many formula repositories

Pre-production

  • Each type of system has its counterpart in pre-production.
  • Pre-production looks at a local checkout of the master branch.
  • Variants for treating updates:
    • minor changes can be applied and tested before committing
    • major updates are tested in other environments and handled via git merging of branches

Pepper wrapper

High-level pepper scripts replace the low-level salt commands.

  • dealing with multiple repositories
  • test
  • deploy
  • commit
  • other git commands

Pepper-deploy will stagger updates to prevent overload on the master.

Environments

Environments correspond to branches in git.

  • Each newly introduced formula must have branches for every environment.
  • Pre-production is the exception, because it looks at the master branch (but actually a local checkout).
  • People have their 'own' environment for testing and development purposes.
  • Possibility to ‘move’ a machine between environments (see the sketch below).
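
Since the environment is part of the reclass node definition (as in the dCache example earlier), moving a machine can be a one-line change in its node file; a sketch, using a personal environment name like the one in the error-message example above:

# nodes/<node>.yml -- switch this node to a personal environment
environment: dennisvd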

Monitoring

(screenshot: Icinga overview)

  • Relies on the exports mechanism discussed earlier
  • Nodes specify
    • what type of thing they are, and
    • the kinds of things anyone interested in monitoring should be looking for.

The monitoring system defines how the actual monitoring is done for all of those things. It gets the list of nodes and services from the inventory.

Deployment

  • cobbler
  • based on exports.
  • supported by scripts
  • hardware description of a node
    • prescriptive for VMs
    • descriptive for actual hardware

The cobbler node has to manage both production and pre-production, and is the 'odd one out' as it has no pre-production counterpart.

Repositories

The cobbler server also collects mirrors of various repositories for software installation.

  • time-based snapshots
  • no dependencies on external repositories in production
  • support for both apt and yum repos

Systems saltified so far

  • dcache
  • salt master
  • cobbler
  • torque/maui (local cluster)
  • DNS (in high availability setup)
  • monitoring (grafana, icinga)
  • NFS server
  • EOS
  • OpenStack (still experimental)
  • more to come

Conclusions

Open Problems

  • Running the inventory with 'broken' nodes
  • Performance issues with large deployments

Future

  • fully automated installations
  • pre-provisioning keys (salt, ssh, others)
  • orchestration
    • stagger kernel updates
  • multi-master
  • performance issues
    • where does the system spend most of its time?
    • high load on master
    • addressed by batching updates with pepper scripts
    • the monitoring box will go to 500+ states as we add more systems

Lessons learned

  • New system is a lot of work.
  • Organisation of data is more important than mechanics.
  • Tradeoff between flexibility in prototyping and control in production.
  • No truly bad choices, but many secondary factors to consider.
  • Look at the specific needs of the team; it's better to find a good match than to just go with the most popular system.