Notes on private cloud bursting with public providers

Dennis van Dok

Helix Nebula Science Cloud meeting 2018-09-11 – Amsterdam

Nikhef local resources

Nikhef is the National Institute for Subatomic Physics in the Netherlands.

  • we're operating a high-throughput grid infrastructure in several international frameworks
  • we're also running a small-scale local cluster for our own users: stoomboot.

Cloud bursting?

We were given the opportunity to use public cloud resources. The prime candidate for this was extending stoomboot.

Limitations of cloud resources

There were two providers available: T-Systems and Exoscale. Both offer a similar type of cloud hosting service, based on Linux KVM or Xen.

The number of public IP addresses is limited; this means that most of the created nodes will not receive a public IP address from the cloud provider.

Fast network connections seemed to require special arrangements, although a simple speed test from one node (on Exoscale) yielded a throughput of about 350 MB/s (2800 Mbit/s).
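Such a test can be run with a tool like iperf3 (the hostname below is a placeholder; this is not necessarily the exact test we performed):

    # on a host at Nikhef
    iperf3 -s
    # on the cloud instance: four parallel streams for 30 seconds
    iperf3 -c iperf.example.nikhef.nl -P 4 -t 30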

Challenges

The additional resources would be welcomed by our users, but if they had to do anything special in their workflows, the extension would probably not be taken up.

Networking

Our users rely heavily on the availability of several NFS-mounted file systems, both for reading and writing.
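For illustration, a worker node mounts these shares with entries roughly of this form (server name, export path and options are made up):

    # hypothetical fstab entry; the actual servers, exports and options differ
    nfs.example.nikhef.nl:/project/data  /project/data  nfs  rw,hard,vers=3  0 0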

Private data in a public cloud

T-Systems could offer us a 'direct connect' to our own network, but this turned out to be not so direct and not particularly private.

Neither of the providers could offer us a truly private connection (such as a lightpath).

Because we needed to extend our NFS network, we really needed a secured path.

Experiment #1: layer 2 VPN

We set up an OpenVPN server with a tap interface to extend a local LAN to the cloud resources. The idea was to bridge our LAN across the OpenVPN connection.
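A minimal sketch of the server side (interface names, port and file paths are assumptions, not our actual configuration):

    # server.conf: layer 2 mode with a tap device
    dev tap0
    proto udp
    port 1194
    ca ca.crt
    cert server.crt
    key server.key
    dh dh.pem
    server-bridge        # bridge mode; clients get addresses from the LAN's own DHCP
    keepalive 10 60

    # enslave the tap device to the bridge that also holds the LAN interface
    # (tap0 can be pre-created with: ip tuntap add dev tap0 mode tap)
    ip link add name br0 type bridge
    ip link set eth1 master br0
    ip link set tap0 master br0
    ip link set br0 up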

The performance for a single OpenVPN connection would not be great, but we figured that we could multiplex this across a number of OpenVPN instances.

Results of layer 2 bridging

We tried this with both T-Systems and Exoscale. The VPN connection came up, the bridge worked, and broadcast/multicast packets travelled across fine, but unicast traffic did not.

It turns out that the cloud providers' switches refuse to learn MAC addresses that do not originate from their own hardware.
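The symptom can be observed by watching link-level traffic at the cloud end of the bridged segment (interface name is illustrative):

    # watch link-level headers on the tunnel's tap interface
    tcpdump -e -n -i tap0
    # broadcast/multicast frames from the other side of the bridge show up here,
    # but unicast frames do not make it across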

Experiment #2: layer 3 VPN

We changed the tap interface to a tun interface and defined routes on the OpenVPN box: we now assign one of our own network blocks to the cloud resources, and set up routing on our own router to send traffic for that block through the VPN server.
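In configuration terms this amounts to roughly the following (all addresses are placeholder documentation prefixes, not our real blocks):

    # server.conf: layer 3 mode with a tun device
    dev tun0
    proto udp
    port 1194
    ca ca.crt
    cert server.crt
    key server.key
    dh dh.pem
    topology subnet
    server 192.0.2.0 255.255.255.0              # block assigned to the cloud instances
    push "route 198.51.100.0 255.255.255.0"     # local networks reachable through the tunnel

    # on our own router: send traffic for the cloud block through the VPN server
    ip route add 192.0.2.0/24 via 198.51.100.5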

This works and has been tested on Exoscale but not yet on T-Systems.

Schematic

Scaling

To increase bandwidth, multiple such OpenVPN connections need to be set up in parallel; routing through this array of OpenVPN tunnels is then arranged by ECMP or multipath BGP. The BIRD routing daemon will be installed on each cloud instance, and the OpenVPN servers will act as route reflectors.
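As a sketch, the BIRD side on a cloud instance might look something like this (AS number, addresses and the session layout are assumptions; the corresponding sessions on the OpenVPN servers would carry rr client):

    # bird.conf on a cloud instance: iBGP sessions over the parallel tunnels
    router id 192.0.2.10;

    protocol kernel {
        import none;
        export all;          # install learned routes into the kernel table
        merge paths on;      # merge equal-cost next hops into one ECMP route
    }

    protocol device { }

    protocol bgp vpn1 {
        local as 64512;
        neighbor 192.0.2.1 as 64512;   # OpenVPN server 1 (route reflector)
        import all;
        export all;
    }

    protocol bgp vpn2 {
        local as 64512;
        neighbor 192.0.2.2 as 64512;   # OpenVPN server 2 (route reflector)
        import all;
        export all;
    }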

Conclusions

Work is ongoing but shows promise. The solution is rather dirty.