The Hitchhiker's Guide to Mesh VPNs

Thursday, the seventeenth of March, A.D. 2022

Recently at work we’ve been moving to a new VPN, and naturally as part of that process we did a bunch of research into the available options before settling on one. Mostly I want to document that for my own future reference, so that if this question comes up again I don’t have to go redo it all, but if it ends up being helpful to someone else someday then that’s great too. (If I ever get this blog site launched, that is. Currently it’s not looking too good.)

TL;DR: We ended up going with Tailscale, because it looked the most user-friendly, had the security features we wanted, and was something I had already used personally so it was more of a known quantity than some of the others.

A Brief History of VPNing

There is a lot of different VPN software out there. Traditionally there were two main types: site-to-site and client-server. Site-to-site VPNs were for connecting geographically separated LANs into one big super-LAN, useful if you had one company with two offices in different cities or something. Client-server VPNs were for hooking individual users outside the office into your corporate network so that they could access the fileshare, locally-hosted whatevers, and so on. Maybe you could even enforce traffic filtering policies by forcing all of their traffic to go through the VPN first, where it could be inspected and potentially blocked if it were determined to be non-kosher. Seems a bit control-freaky to me, but maybe if I were responsible for the network administration of thousands of users I’d feel differently.

More recently, things have started to change in the VPN world. A new power is rising; its victory is at hand. This night, the land will be stained with the blood of IPSec. Erm. Ahem. The new breed is “mesh VPNs,” and they’re really starting to take hold.

1 To be fair, they’re not new exactly; the oldest one of which I am aware has been around since 1998. It’s just that for some reason nobody paid them much attention until more recently.

The main difference is that instead of being site-to-site or client-server (also known as hub-and-spoke), mesh VPNs establish a direct network transit between any pair of devices that want to communicate. Which is great; it means you can send a packet straight to any other machine on your network. You can extend your LAN across any geographical boundaries and (almost) any network conditions, while still remaining secure in the knowledge that your communication is totally encrypted and eavesdropper-proof.

Actually, it’s even better than a LAN, because you can enforce access control rules on packets flowing between any two nodes, rather than just packets that cross a network boundary. This is a Big Deal, because it means that your virtual “LAN” is no longer the soft underbelly of your network security. In the olden days, someone who managed to get a foothold in your network was pretty much at liberty to talk to anyone and anything, because what was your firewall going to do about it? ARP-spoof every client on the network so it can inspect the traffic? Sounds like a fast track to a flaky and congested network to me. With a mesh VPN, on the other hand, since every packet between hosts is passing “through” the VPN, it’s free to enforce whatever access controls your heart desires.

2 You may point out that the only way to do this is to leave it up to the individual nodes to enforce these ACLs, and you’d be right. But that’s not really a problem, either. Yes, two nodes could collaborate to twiddle with their local copy of the ACL and pass traffic that you haven’t permitted. But you know what else two collaborators could do? Send each other emails. Or chat on Discord. Or mail USB sticks across the country. Your firewall isn’t there to prevent communication between two consenting parties, it’s there to prevent communication between one consenting and one unconsenting party.
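The node-enforced model described above can be sketched in a few lines of Python. This is a toy illustration, not how any particular mesh VPN implements it: the Rule and Node names, the peer names, and the ports are all made up, and a real implementation would filter packets in the tunnel layer rather than answer yes/no questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    src: str   # peer allowed to initiate the connection
    dst: str   # peer that accepts it
    port: int  # destination port


# Hypothetical central policy; anything not listed is denied.
POLICY = {
    Rule("laptop", "fileserver", 445),
    Rule("laptop", "homepi", 22),
}


class Node:
    """One peer holding its own local copy of the policy."""

    def __init__(self, name, policy):
        self.name = name
        self.policy = set(policy)

    def accepts(self, src, port):
        # Each node only decides about traffic addressed to itself.
        return Rule(src, self.name, port) in self.policy


fileserver = Node("fileserver", POLICY)

laptop_ok = fileserver.accepts("laptop", 445)    # listed in the policy
printer_ok = fileserver.accepts("printer", 445)  # not listed, denied by default

# A compromised printer can rewrite its *own* copy of the policy,
# but the fileserver still consults the copy it holds itself.
printer = Node("printer", POLICY | {Rule("printer", "fileserver", 445)})
printer_still_blocked = not fileserver.accepts("printer", 445)

print(laptop_ok, printer_ok, printer_still_blocked)
```

The point of the last few lines is the one made in the footnote: tampering with a local copy only helps if the node on the other end has tampered with its copy too.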

If you’ve worked with cloud services much you’ll notice this is more or less exactly what “security groups” do, and that’s no accident. The big public clouds have been using software-defined networking since before everybody else, because you kind of have to when you sell virtual servers. You’re already halfway there because if the servers are virtual, then so are their network interfaces, right? And you don’t want to just dump them onto a physical LAN because that’s just asking for any Tom, Dick or Harry with a credit card to come along and sniff your network traffic. So it’s security groups and “virtual private clouds” all the way.

Table Stakes

All of which is to say, in a somewhat meandering way, that we decided pretty early on that we wanted a mesh VPN solution to replace our existing hub-and-spoke architecture. For us the security implications (as discussed above) were the main draw, but a mesh VPN has other advantages over the more classical type. For one thing, it’s a lot easier to scale your VPN up when all the network has to do is route packets, and individual hosts are responsible for the encryption/decryption part. Also, mesh VPNs can have better latency because they’re a lot more flexible with routing - you’re able to take full advantage of the internet’s existing mechanisms for minimizing transit time, instead of having to make detours through a small set of required nodes. Also, NAT holepunching. Technically not required for a mesh VPN, but pretty useless without it, since the majority of internet-connected devices in the world tend to be behind NATs.

3 I haven’t checked this. Don’t quote me on it.
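For anyone who hasn’t run into NAT holepunching before, the basic trick can be shown with a toy simulation in Python. It rests on two loudly-stated assumptions: both NATs only accept packets from a remote endpoint that the inside host has already sent to, and both peers have already learned each other’s public endpoint from some coordination service. The Nat class, the addresses (taken from the documentation ranges), and the port are invented for illustration; no real sockets are involved.

```python
class Nat:
    """Toy NAT that drops unsolicited inbound packets."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.allowed = set()  # remote (ip, port) pairs the inside host has sent to

    def outbound(self, remote):
        # Sending out punches a "hole": packets from `remote` are now accepted.
        self.allowed.add(remote)

    def inbound(self, remote):
        return remote in self.allowed


def send(src_nat, src_endpoint, dst_nat, dst_endpoint):
    """Send one packet from behind src_nat; return True if it is delivered."""
    src_nat.outbound(dst_endpoint)
    return dst_nat.inbound(src_endpoint)


nat_a, nat_b = Nat("198.51.100.1"), Nat("203.0.113.1")
a_pub = (nat_a.public_ip, 40000)
b_pub = (nat_b.public_ip, 40000)

first = send(nat_a, a_pub, nat_b, b_pub)   # A -> B: dropped, but opens A's NAT for B
second = send(nat_b, b_pub, nat_a, a_pub)  # B -> A: delivered, and opens B's NAT for A
third = send(nat_a, a_pub, nat_b, b_pub)   # A -> B: now delivered as well

print(first, second, third)
```

The first packet is sacrificed on purpose: it never arrives, but it is what makes the reply acceptable. Real NATs are messier than this (port remapping, timeouts, stricter filtering), which is where the harder techniques come in.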

So for us, the boxes that a VPN needed to tick were:

  • Mesh topology
  • NAT holepunching
  • With ACLs
  • User-friendly enough that we could feasibly expect people to install it on their own machines

Interlude: Wireguard

If you’ve been following the state of the art in VPNery for the last few years, then you’ve heard of Wireguard. It first started making serious waves (to my knowledge) in 2018, when Linus Torvalds referred to it as a “work of art” (as compared to OpenVPN and IPSec) on the Linux kernel mailing list. Given Torvalds’ reputation for acerbic comments regarding code quality, the fact that he was referring to someone else’s code as a “work of art” raised a few eyebrows. One thing led to another, eventually Wireguard was adopted into the mainline Linux kernel, and Jason A. Donenfeld became the herald of the new Golden Age of Networking.

Wireguard is relevant to our discussion for being an encrypted tunnel protocol that Works Really Well, which is why at least three of the options I’ve looked at are based on it. I say “based on”, however, because Wireguard is not a mesh VPN on its own. By itself, Wireguard gives you nothing more than an encrypted tunnel between two points. It’s fast and low-latency and (can be) in-kernel so it’s very low-overhead, and the connections are all secured with public/private keypairs like SSH. Also like SSH, however, it gives you exactly zero help when it comes to distributing those keys, and if you’re looking for some form of automatic peer discovery you’re barking up the wrong tree.
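To get a feel for what “zero help” means in practice, here is a sketch in Python that renders hand-maintained configs for a tiny three-node full mesh. The [Interface] and [Peer] sections and their keys follow the config format that wg-quick reads; everything else is an assumption for illustration: the node names, addresses, and endpoints are invented, and the key values are placeholders, since real keys would be generated on each machine (for example with wg genkey).

```python
# Hypothetical nodes; names, IPs and endpoints are made up.
nodes = [
    {"name": "laptop", "ip": "10.0.0.1", "endpoint": "laptop.example.net:51820"},
    {"name": "homepi", "ip": "10.0.0.2", "endpoint": "homepi.example.net:51820"},
    {"name": "cloudvm", "ip": "10.0.0.3", "endpoint": "cloudvm.example.net:51820"},
]


def render_config(me, everyone):
    """Render one node's config with a [Peer] stanza for every other node."""
    lines = [
        "[Interface]",
        f"PrivateKey = <{me['name']} private key>",  # placeholder, never shared
        f"Address = {me['ip']}/24",
        "ListenPort = 51820",
    ]
    for peer in everyone:
        if peer is me:
            continue
        lines += [
            "",
            "[Peer]",
            f"PublicKey = <{peer['name']} public key>",  # must be copied by hand
            f"AllowedIPs = {peer['ip']}/32",
            f"Endpoint = {peer['endpoint']}",
        ]
    return "\n".join(lines)


configs = {node["name"]: render_config(node, nodes) for node in nodes}
peer_stanzas = sum(text.count("[Peer]") for text in configs.values())

print(configs["laptop"])
print("total [Peer] stanzas to keep in sync:", peer_stanzas)
```

Every new node means touching every existing config, and the static Endpoint lines assume each peer has a reachable, stable address. Those are exactly the gaps (key distribution, peer discovery, NAT traversal) that the tools below try to fill.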

The Field

That’s ok, though, because there are a lot of mesh VPNs out there that do all those things, some of them built on Wireguard and some not, so let’s talk about them!

ZeroTier

I’m starting with this one because it’s one of the most well-established players (been around since 2011, in fact) and was the one I personally discovered first. ZeroTier is a mesh VPN that provides ACLs and NAT holepunching, like everything that we’re interested in. Unlike any of the others, though, it actually emulates at layer 2 rather than layer 3, meaning that it can have broadcast traffic. This immediately makes it interesting from a user-friendliness standpoint, since how great would it be if your fileshare automatically showed up on your VPN via its built-in mDNS (or whatever) advertisement features?

Another nice feature of ZeroTier is that connecting to a network requires a lot less ceremony than some of the other options. Just enter the 16-digit network ID, then wait for the network admin to approve your join request. Or, if it’s a public ZeroTier network, you get in immediately.

4 Yes, a “public virtual private network”. No, it doesn’t have to make sense.

That’s the theory, at least. In practice - well, in practice I haven’t tried it with broadcast traffic. I have, however, tried it to connect my own personal network of devices (desktops, laptop, Raspberry Pi, a server or two, and some cloud VMs). Short story: It didn’t work all that well for me. To be fair, I could usually get some kind of connectivity, but it was very unpredictable in both bandwidth and latency. In a particularly frustrating twist, the two nodes that I had the most trouble connecting were cloud VMs from different providers, which makes no sense because the main thing that kills these sorts of mesh VPNs is NAT, and the VMs all had public IPv4 addresses. This should have been easy!

Anyway, although I no longer use it, I do retain a soft spot in my heart for ZeroTier, and it has some characteristics (the aforementioned virtual-LAN properties) that really set it apart from the rest. If I were trying to set up a virtual LAN party with a group of friends to play a local-network-only game, I’d probably try ZeroTier first.

Also you can self-host the network controller, although I think you lose the shiny web interface if you do that and have to use the API to configure it.

Nebula

Nebula is one of the newer crop of mesh VPNs that seem to be popping up like weeds lately. It ticks most of our boxes (mesh, ACLs, NAT holepunching) but does so in ways that all seem just ever so slightly sub-optimal (for us, at least). It’s based on the Noise protocol framework, on which Wireguard is also based, making them… sibling protocols, I guess?

5 Which I understand at only the most basic level. Something something ChaCha Poly1305 elliptic curves?

Nebula was developed by Slack to support their… somewhat interesting architecture, and seems like a pretty solid piece of work. It’s completely self-hostable, which I consider a plus, it uses modern cryptography, and it probably works very well for the use case for which it was designed. Unfortunately for our use case, it’s not really designed to be used directly by end-users, e.g. the only way to configure it seems to be through its main config file, and the only way to operate it is through the CLI. Not a problem when all you need to do is hook together a bunch of cloud VMs and the odd dev machine or two, but not great if you want Janice over in HR to be able to talk to the network share.

6 Look, I don’t work at Slack, I’m not terribly familiar with their requirements… but is it really the simplest solution to use hundreds of AWS accounts to manage your resources? At that scale, can’t you just… rent a bunch of bare metal servers and hook them into a big cluster with, like, Nomad and Consul or something? I dunno. Maybe it’s all justified, I’m just not convinced.

The other thing I’m not a huge fan of is that as far as I can tell, firewall rules are configured individually on each host. Again, not a problem when you’re spinning up VMs from some kind of master image that has the rules all baked in, but not something I want to repeat 50 times on everybody’s laptop (or worse, walk them through writing YAML over screen-sharing or something.) I’m sure it wouldn’t be too hard to build some kind of automation to work around that, but if we were looking to build our own thing we would have just started with vanilla Wireguard and built up from there.

Innernet

Which leads us to Innernet, which is pretty much just exactly that. The introductory blog post says it better than I can:

In the beginning, we had a shared manually-edited WireGuard config file and many sighs were heard whenever we needed to add a new peer to the network.
In the middle ages, there were bash scripts and a weird Vault backend with questionable-at-best maintainability that got new machines on the network and coordinated things like IP allocation. Many groans could be heard whenever these flimsy scripts broke for any reason. In the end, we decided to sit down, sigh one long and hopefully final time, and write innernet.

So, great! What’s more, it’s self-hosted, built in Rust (with ♥, no doubt) and uses kernel-mode Wireguard (actually I think it uses “whatever Wireguard is available on the host system”, which is kernel-mode if you’re on Linux and not otherwise). Unfortunately, it’s still a fairly immature project, so it’s lacking things like (again) user-friendliness, which may or may not be a dealbreaker depending on your wants and needs.

Even more unfortunately, it bases its security model around CIDR network segments, just like old-skool corporate networks, which to my mind is a huge step backwards from the more flexible “security group” model that the other candidates use. The critical difference is that a given device has only one “targetable attribute” with which to specify it in your firewall rules. This tends to lead to over-proliferation of access because Device A is in Group Z but needs access to Thing Q, which the rest of Group Z doesn’t really need but you also don’t want to move Device A into its own special group because now you have to duplicate the access rules for Group Z, and then if they change you have to remember to update the new group too, and who wants to deal with that? So you give all of Group Z access to Thing Q, and before you know it you’re back to having a “soft underbelly” of a LAN where an attacker who gets in can talk to virtually anything they want to if they jump through a few hoops.

The Innernet documentation points out that CIDRs can be nested, which is true, so I guess you can have an engineering CIDR and then within that an engineering-managers CIDR that has all the access of engineering plus a few. But what happens when you have a sales CIDR with a sales-manager who needs the managery bits to match engineering-managers, but not the engineering bits, and oh no you’re back to duplicating firewall rules because you’ve locked yourself into an arbitrary limit of one “role” per device?

In theory you could solve this by allowing a single device to have multiple IPs in multiple different CIDRs, but it’s apparently a core principle of Innernet’s design that “Peers always have only one assigned IP address, and that address is permanently associated with them.” So that’s out.
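The one-role-per-device problem is easier to see side by side. Below is a rough Python sketch of the two models; it is not Innernet’s (or anyone else’s) actual rule format, and every name in it (the CIDR blocks, devices, tags, and resources) is invented for illustration.

```python
import ipaddress

# CIDR model: a device has one address, so it sits in one chain of nested blocks.
cidrs = {
    "engineering": ipaddress.ip_network("10.1.0.0/16"),
    "engineering-managers": ipaddress.ip_network("10.1.1.0/24"),
    "sales": ipaddress.ip_network("10.2.0.0/16"),
}


def cidr_roles(address):
    """Return every named block that contains the address."""
    ip = ipaddress.ip_address(address)
    return {name for name, block in cidrs.items() if ip in block}


# Tag model: a device carries any combination of tags.
device_tags = {
    "alice-laptop": {"engineering", "manager"},
    "bob-laptop": {"sales", "manager"},
    "carol-laptop": {"sales"},
}

# Each rule is written once, against a tag.
rules = {
    "engineering": {"git-server"},
    "sales": {"crm"},
    "manager": {"hr-portal"},
}


def allowed_resources(device):
    """Union of everything granted by any of the device's tags."""
    return set().union(*(rules.get(tag, set()) for tag in device_tags[device]))


print(cidr_roles("10.1.1.5"))           # nested blocks: engineering + engineering-managers
print(cidr_roles("10.2.0.9"))           # sales only; no way to also be a "manager"
print(allowed_resources("bob-laptop"))  # sales + manager, with no duplicated rule
```

In the CIDR half, giving a sales manager the managery bits means carving out a new sales-managers block and copying the manager rules into it. In the tag half, the manager rule exists exactly once.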

(I’m also less than entirely comfortable with fixed-size address spaces in an environment where they’re not really necessary, because what happens when the /24 you’ve allocated for doodad-watchers needs its 257th member? But that’s an ancillary concern and could probably be managed fairly easily by careful allocation of address blocks.)
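The arithmetic behind that worry is quick to check with Python’s standard ipaddress module. The block below is a hypothetical allocation, not anything from Innernet’s docs, and whether an overlay network actually reserves the network and broadcast addresses is an assumption borrowed from ordinary IPv4 practice.

```python
import ipaddress

block = ipaddress.ip_network("10.42.7.0/24")  # hypothetical doodad-watchers block

total = block.num_addresses             # every address in the block
usable = sum(1 for _ in block.hosts())  # minus network and broadcast addresses
bigger = block.supernet()               # the next size up, one prefix bit shorter

print(total, usable)
print(bigger, bigger.num_addresses)
```

Note what growing the block costs: the enclosing /23 also covers the neighbouring 10.42.6.0/24, so it only works if nobody else was allocated that range. Hence the “careful allocation of address blocks.”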

In conclusion, I’m conflicted. There’s a lot to like about Innernet, and I’m interested to see where they take it as time goes on, but I find myself disagreeing just a little too much with some of the fundamental design choices. I may still end up trying it out some day, since setting up a new VPN for my personal fleet of network-connected thingies is my idea of a fun weekend, but I doubt I’ll ever use it seriously unless there’s some significant change in how access control works.

Oh yeah, and there’s no Windows client as yet. Hard to sell switching your whole workforce to Linux just so you can use a cool VPN thingy.

Cloudflare One

Ok, I’m cheating a little bit. Cloudflare One technically isn’t a mesh VPN, because it always routes your traffic through a Cloudflare gateway, rather than establishing direct links between devices and letting them do the communicating. I’m including it here anyway, because the result is pretty comparable to what you get from these mesh VPNs: A logically “flat” network in which any node can communicate with any other node, subject to centrally-administered access control rules. It even gets you most of the latency and throughput advantages you’d get from a true mesh VPN, because Cloudflare’s edge is basically everywhere and its capacity is effectively infinite, as far as the lowly user is concerned.

It’s surprisingly inexpensive, as well, with a free tier for up to 50 users, a $7/user/month tier for intermediate cases, and a “call us for pricing” option if you tend to use scientific notation when you talk about your company’s market cap. We ended up deciding against it anyway, largely because of some anecdotal claims about its user-friendliness being not-so-great, and the fact that… well, Cloudflare already gets their greasy paws on something like 15% of internet traffic as it stands, and do we really want to contribute to that?

7 He said, on the blog site hosted behind Cloudflare’s CDN.

8 Not that I have anything against Cloudflare, mind. They seem great so far. They just give me the same feeling as 2010-era Google, and look how that turned out.

Also, the one place where you’d feel the lack of true mesh-ness would be LAN communication, which was actually a concern for us. Proper mesh VPNs can detect when two clients are on the same LAN and route their traffic accordingly, so lower latency, higher throughput, yadda yadda. As far as I can tell, Cloudflare needs every packet to pass through the Cloudflare edge (aka “the internet”), meaning it turns LAN hops into WAN hops. Probably not a big deal for their customers, since this product is pretty clearly targeting Proper Enterprise types, and they undoubtedly have built-up layers of LAN cruft that you couldn’t dig your way out of with a backhoe and so wouldn’t be using it within their LAN anyway. A slightly bigger deal for us, since “route even LAN traffic through the VPN so we can enforce ACLs” was one of our stated goals.

Netmaker

Netmaker is a newcomer to this space; the first commit in their GitHub repo is from March of 2021. It looks to be quite functional, though, with the whole nine yards - full mesh, NAT holepunching, ACLs, and traffic relays for those stubborn NATs that just can’t be punched. Pretty impressive for a year and change, which is probably why they got funded by Y Combinator.

It’s fully self-hostable, with some fancy options for HA cluster-type setups if you want to do that. (The Netmaker docs also introduced me to rqlite, which looks like quite an interesting project.) We probably came closer to settling on this one than any others in this list (other than the one we did settle on), and I’d still really like to play with it at some point.

It seems to use kernel-mode Wireguard, which is a big plus in my book. Presumably that’s platform-dependent, e.g. I don’t think macOS (and maybe Windows) has kernel-mode Wireguard yet, but it should be easy to slot in once it does arrive on a given platform.

My one gripe is with the way it does ACLs. It looks like the ACL configuration is just a simple yes/no to every distinct pair of peers in your network, the question being “can these two peers communicate directly?” No mention of ports, either source or destination. Also no mention of groups/roles/tags/etc, which means that the number of buttons to click is going to scale with the square of your network size. Not my idea of fun. On the other hand, ACLs are a very new feature (just added in the last release), so maybe they will improve over time.

9 To be fair, the concept of the “source port” is largely irrelevant when dealing with software-defined networking. In my experience you tend to think about flows more than individual packets (ZeroTier being the exception), so the source port is just whatever ephemeral port gets assigned to the connection.
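To put a number on “scale with the square”: with one yes/no decision per distinct pair of peers, n peers need n(n-1)/2 decisions. A quick Python check, where the network sizes are arbitrary examples:

```python
from math import comb


def pairwise_toggles(peers):
    """Number of distinct peer pairs, i.e. yes/no decisions to make."""
    return comb(peers, 2)


for peers in (10, 50, 200):
    print(f"{peers} peers -> {pairwise_toggles(peers)} pairs")
```

Going from 50 peers to 200 multiplies the headcount by four and the number of pairs by roughly sixteen, which is why some form of grouping matters long before a network gets large.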

Regardless, Netmaker looks like an extremely interesting project and I’d very much like to try it out at some point.

Tailscale

Obviously, this is the one we settled on. The Cadillac of the bunch. Although not the oldest, I’d probably call Tailscale the most well-established of the candidates in this list. It didn’t take them very long (I think they started in 2018 or 2019?) because their product is just really damn good. It slices, it dices, it meshes, it firewalls, and it even twiddles with your DNS settings so that you can type ping homepi and homepi will resolve to the Tailscale-internal IP of the raspberry pi that’s hanging out with the dust bunnies next to your cable modem.

So why did we like it? Well, for one I had been using it for about a year and a half to connect my personal devices, so I knew it would get the job done. That’s not the only reason, though. A few of the others:

User-friendliness: Installing Tailscale is basically just downloading the app and logging in. There’s practically nothing to it. After that it just hums along quietly in the background, and your things are magically connected to your other things whenever you want them to be. This is what networking should feel like. Too bad script kiddies with DDoS botnets have ruined it all for us over the last 20 years.

The Best NAT holepunching: I don’t think I’m exaggerating here. As they explain, Tailscale goes a lot further than “try sending packets both ways and give up if it doesn’t work.” Among the various tricks it pulls is sending a whole bunch of packets and hoping the birthday paradox kicks in and one of them gets through, which I think is pretty clever.
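The birthday-paradox trick is easier to believe with numbers attached. Here is a back-of-the-envelope Python sketch under assumed figures: one side holds 256 ports open, the other side fires probes at random ports out of 65,535, and each probe is treated as an independent draw (a simplification). The function name and all three numbers are illustrative choices, not values taken from Tailscale’s implementation.

```python
def hit_probability(open_ports, probes, port_space=65535):
    """Chance that at least one random probe lands on an open port."""
    miss_one = 1 - open_ports / port_space  # a single probe misses
    return 1 - miss_one ** probes           # not every probe misses


for probes in (64, 174, 1024):
    chance = hit_probability(256, probes)
    print(f"{probes:5d} probes -> {chance:.1%} chance of a hit")
```

Under these assumptions a couple of hundred probes gets you to roughly a coin flip, and about a thousand makes a hit very likely, which is a lot cheaper than scanning the whole port range.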

Magic DNS: To be fair, I haven’t looked super deeply into what all of the competitors do for this, but it’s a pretty big quality-of-life feature. Admittedly Tailscale IPs are stable (as long as you don’t clear the device’s local state), so you could just stick a public DNS record somewhere that points devicename.yourdomain.net to a Tailscale IP. You could even automate it, if you really felt like it. Still, not having to do that is worth something, especially given how much of a pain it is to manage split-horizon DNS (it’s even worse on other platforms, from what I hear.)

10 Which is why this is the Achilles heel of Magic DNS. Immediately upon starting to set up Tailscale we spent an entire morning trying to debug why DNS queries for single-label names on Windows were taking 2+ seconds to resolve. However, since Magic DNS is still officially in beta, I’ll give it a pass on that for the time being.

Looking back over these I realize that I might be slightly underselling it: it’s hard to overemphasize how well Tailscale just works. You kind of have to use it to appreciate it - Tailscale discussions are chock-full of people saying variations on “I never understood why everyone was so crazy about it, I mean it’s just a mesh VPN right? There’s a bunch of those. But then I tried it and OMG THIS IS THE BEST THING EVER TELL EVERYONE!!!” The attention paid to the little details at every level is just phenomenal. If Apple (old Apple, under Steve Jobs) had decided to go after networking rather than laptops and phones, they might have come up with something like Tailscale.

Of course, it’s not perfect. What ever is? I have a few (minor) nitpicks:

Cost: This is probably the one that comes up the most. Tailscale plans start at $5/user/month (except for the free tier, which is only suitable for a single user) and go up from there. Any reasonably-complex network will need the $15/user/month plan, which is (I think) more than any other VPN on this list. You get what you pay for, of course, but that doesn’t change the fact that you do pay for it. Absolutely worth it, in my opinion, but it does make it a harder sell to a lot of people.

Usermode Wireguard: Obviously this currently only applies to Linux (and maybe BSD?) as far as I’m aware. Still, it would be nice if Tailscale could make use of kernel-mode Wireguard where available, since otherwise you’re leaving throughput on the table. For example, between two fairly beefy machines I get about 680 Mb/s throughput when testing with iPerf. Between one beefy machine and one Synology NAS with a wimpy CPU, I get about 300. Obviously the extent to which this matters depends on what you’re trying to do, and it’s more than fast enough for most use cases. It just bugs me that it could be better.

Data Sovereignty: (Network sovereignty?) Different people will weight this one differently, but at the end of the day it’s true that Tailscale runs a coordination server that is responsible for telling your network who’s in it and what kind of access they get. If they decide to add an invisible node that can talk to any of your devices on any port, there’s not really anything you can do about it. It’s not quite as much control over your infrastructure as a third-party SSO service gets, but it’s up there. Oh, and I don’t think it’s officially mentioned on their site, but I’ve seen comments from Tailscale employees that they can do an on-premise control server for big enough enterprise installs.

11 Note that this still doesn’t mean they can eavesdrop on network traffic between two nodes you do control. Even if you can’t make NAT traversal work and end up using a relay, the actual network flows are encrypted with Wireguard. Effectively, each packet is encrypted with its destination’s public key. And since private keys are generated on the client, the control server has no ability to decrypt them.

Headscale

No discussion of Tailscale would be complete without mentioning Headscale, a community-driven re-implementation of the Tailscale control plane. You can point the official Tailscale clients at it, although they may require a bit of hackery to work properly. And the Tailscale people have said that although it’s not officially supported, they are personally in favor of its existence, which I take to mean that they probably won’t intentionally break its functionality with an update within the immediate future.

It solves the cost issue of Tailscale, although it introduces the cost of having to maintain it yourself, which may or may not be something you’d worry about. It does introduce a UX penalty, and I doubt that’s going to change any time soon - the Tailscale people don’t seem to mind its existence, but I can’t see them going very far out of their way to make it easier for something that exists specifically so that people can avoid paying for their service. Still, if you really really want Tailscale, but you simply can’t justify the cost, or you’re especially paranoid about the control plane, it’s worth a shot.

The Rest of the Iceberg

The above options are what I’ve researched in depth, but they’re far from the only mesh VPN solutions out there. I’ve come across others, but didn’t look into them closely for one reason or another - they were either missing some critical component of what we needed, or I didn’t discover them until too late, or I just got a weird feeling from them for whatever reason. Still, I’ll mention them here in case they happen to be what anybody else is looking for:

Tinc

Tinc is the OG. It’s been around since 1998 and still has a community of dedicated users to this day. It does full-mesh, NAT traversal, and even (apparently) some LAN stuff, like ZeroTier.

12 I don’t get the impression it fully emulates Layer 2 the way ZeroTier does, rather it just has the ability to “bridge” LANs together, which I assume just means “forward broadcast traffic over the tunnel.” Probably works ok for small LANs, but I’d hate to see how it scales.

It doesn’t do ACLs, as far as I am aware, which made it a non-starter for us, so that’s why it’s down here rather than up in the previous section. Moreover, I can’t help wondering - if Tinc has been doing this so long, why is it still so niche? Mesh VPNs are obviously great, so why hasn’t Tinc eaten the world?

One possibility (borne out by a few anecdotes that I’ve seen online) is that Tinc just doesn’t perform very well. And I don’t just mean in terms of raw bandwidth, I mean everything. How often does NAT traversal fail? How long does it take state changes to propagate through the network? How often does it randomly disconnect without saying anything?

13 Although its bandwidth doesn’t seem to be great, from the few benchmarks I’ve seen.

From a brief glance at its documentation it also seems that it might be a bit of a pain to manage. E.g. the documentation recommends manually distributing configuration by sending config files back and forth, which doesn’t sound terribly pleasant.

PeerVPN

I don’t really know too much about this one, it just popped up when I was Googling around. It looks like it has the basics, i.e. peer discovery and NAT traversal, and probably not any kind of access control, but the site is extremely minimal so I can’t get much of a read on it.

FreeLAN

Much like the above, just something that showed up while I was looking around. It looks to be a bigger project than PeerVPN, or at least the website is a little more fleshed out. I honestly can’t quite parse out all of its features - I don’t think it does NAT traversal? I can’t quite tell for sure, though. The documentation is a little light. Although it does mention that it uses X.509 certificates, which is an instant turnoff for me because messing with X.509 is a pain.

VPNCloud

VPNCloud is a little more fully-featured, like the bigger players I’ve mentioned. It doesn’t seem to do access control, so it’s not a true contender for our use-case, but it does look like it works fairly well for what it does do. Their site claims that they’ve gotten multiple gigabits of throughput between m5.large AWS instances (so, not terribly beefy) which is better than pretty much anything else I’ve seen other than vanilla Wireguard.

Netbird

The first time I ran across this one, it was called “Wiretrustee”. A change for the better, I think. It looks to be pretty much exactly “open-source Tailscale”, so my guess is it will entirely live or die by how well it executes on that. Obviously Tailscale is great, and Headscale proves that there are people who would like to run the control plane themselves, so there’s a market for them. Unfortunately it looks like their monetization scheme is “be Tailscale” (i.e. run a hosted version and charge for anything over a single user), at which point why wouldn’t you just use Tailscale?

And More

There’s a handy list on GitHub of Wireguard mesh things, some of which I’ve already mentioned. And I’m sure even more will continue to pop up like weeds, since everybody seems to want one and a surprisingly large number of people are happy to just sit down and write their own. I guess that’s proof that Wireguard made good choices about what problems to address and what to ignore - not an easy task, especially the latter.

Where Do We Go From Here

It’s an exciting time in the world of networking. The Tailscale people talk a lot about this on their blog, because of course they do, but the advent of high-performance, low-overhead VPNery has opened up some pretty interesting possibilities in the world of how we interact with computers. Most excitingly it promises something of a return to the Good Old LAN Days, where every device on the network was trusted by default and no one ever worried about things like authentication and encryption, because why would anyone want to do anything unpleasant to your computer? The Internet made that position untenable, but Tailscale and its ilk hope to bring it back again, with some added benefits from modern cryptography. I can’t say whether they’ll succeed, but if nothing else it’s looking like a fun ride.