OpenVPN scaling

Previously: OpenVPN in a container

Also previously: Split tunnels considered harmful

I have this running in a staging environment and a test environment. It properly isolates OpenVPN from the underlying operating system, and it performs rather well. However, its biggest limitation is that it runs on a single host, so it provides neither high availability in the case of Amazon funkiness nor the scalability to handle lots of clients.

Getting this to be more reliable was a matter of tracking down the right systemd configuration. I’ll save systemd commentary for a separate post, saying only that it’s new, it’s not necessarily better for managing single machines, but I can see the advantages for consolidated management of big collections of machines.

Here are some alternative OpenVPN configurations worth pursuing that look like they might do the right thing:

CloudCoreo has a setup based on the Amazon Elastic Load Balancer. They say:

Getting an OpenVPN config up and running in AWS is difficult. We’ve already done it for you in an HA, durable, self-healing way. This blog post details how.

The blog post is in two sections. Part one details how they configured autoscaling, elastic load balancing, name service, S3 storage, and security and IAM policies to set things up the way you would want them within AWS. Part two goes on to describe the particular OpenVPN policy and routing setup. There’s a fair amount of policy there that you want to understand, particularly around the use of split tunnels.

Their claim that they have “already done it for you” reflects a complete configuration available through their CloudCoreo system.
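The general idea behind putting a load balancer in front of several OpenVPN hosts can be sketched as a health probe plus a selection step. What follows is a minimal illustration of that idea, not CloudCoreo's implementation. The hostnames are hypothetical, and the sketch assumes the servers listen on TCP (OpenVPN defaults to UDP, which a plain TCP connect cannot probe).

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]  # (hostname, port)


def tcp_reachable(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened.

    Assumption: the OpenVPN server runs in TCP mode. This only checks that
    the port accepts connections, not that a VPN session would succeed.
    """
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def first_healthy(
    endpoints: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Optional[Endpoint]:
    """Return the first endpoint that passes the probe, or None if none do."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical hosts; 1194 is the default OpenVPN port.
    candidates = [
        ("vpn-a.example.internal", 1194),
        ("vpn-b.example.internal", 1194),
    ]
    print(first_healthy(candidates))
```

With a single host, the candidate list has one entry and any failure leaves nothing to fall back to, which is the limitation described earlier.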

A second approach is detailed by Zalando, a Berlin-based retail fashion platform. Their article, “Building an OpenVPN Cluster, Zalando-Style”, uses dynamic routing to distribute routes, with the Quagga system as the software-based router. This setup does not particularly depend on AWS features, and they claim that “With this set up, we have about 800 users (150 in parallel) and haven’t faced any performance issues at anytime, from anywhere!” It’s a bit more complex to set up. The writeup makes it clear, though, that the flexibility of dynamic routing gives you the opportunity to construct firewall policies that follow the user, quite independent of how they are connecting to the network.
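To see why distributing routes dynamically lets traffic follow a user across a cluster, here is a toy routing table with longest-prefix matching. It is only a sketch of the general technique, not Zalando's configuration or anything Quagga-specific. It assumes each VPN node advertises a host route for every client connected to it; the node names and addresses are made up for illustration.

```python
import ipaddress
from typing import Dict, Optional, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


class RouteTable:
    """Toy routing table: prefix -> next hop, resolved by longest prefix."""

    def __init__(self) -> None:
        self._routes: Dict[Network, str] = {}

    def advertise(self, prefix: str, next_hop: str) -> None:
        """Add or replace the route for a prefix."""
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str) -> None:
        """Remove the route for a prefix, if present."""
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[str]:
        """Return the next hop of the most specific matching route."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


if __name__ == "__main__":
    table = RouteTable()
    # Hypothetical: the whole client pool defaults to node vpn-a.
    table.advertise("10.8.0.0/16", "vpn-a")
    # A client connects through vpn-b, which advertises a host route.
    table.advertise("10.8.0.42/32", "vpn-b")
    print(table.lookup("10.8.0.42"))
    # The client disconnects and the host route is withdrawn.
    table.withdraw("10.8.0.42/32")
    print(table.lookup("10.8.0.42"))
```

Because the more specific route wins, the client's address stays the same no matter which node it connects through, which is what would allow a firewall policy keyed on that address to follow the user.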

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system. (John Gall, Systemantics)