The Hitchhiker's Guide to Mesh VPNs
Thursday, the seventeenth of March, A.D. 2022
Recently at work we’ve been moving to a new VPN, and naturally as part of that process we did a bunch of research into the available options before settling on one. Mostly I want to document that for my own future reference, so that if this question comes up again I don’t have to go redo it all, but if it ends up being helpful to someone else someday then that’s great too. (If I ever get this blog site launched, that is. Currently it’s not looking too good.)
TL;DR: We ended up going with Tailscale, because it looked the most user-friendly, had the security features we wanted, and was something I had already used personally so it was more of a known quantity than some of the others.
A Brief History of VPNing
There are a lot of different VPN softwares out there. Traditionally there were two main types: site-to-site and client-server. Site-to-site VPNs were for connecting geographically separated LANs into one big super-LAN, useful if you had one company with two offices in different cities or something. Client-server VPNs were for hooking individual users outside the office into your corporate network so that they could access the fileshare, locally-hosted whatevers, and so on. Maybe you could even enforce traffic filtering policies by forcing all of their traffic to go through the VPN first, where it could be inspected and potentially blocked if it were determined to be non-kosher. Seems a bit control-freaky to me, but maybe if I were responsible for the network administration of thousands of users I’d feel differently.
More recently, things have started to change in the VPN world. A new power is rising; its victory is at hand. This night, the land will be stained with the blood of IPSec. Erm. Ahem. The new breed is “mesh VPNs,” and they’re really starting to take hold.
Actually, it’s even better than a LAN, because you can enforce access control rules on packets flowing between any two nodes, rather than just packets that cross a network boundary. This is a Big Deal, because it means that your virtual “LAN” is no longer the soft underbelly of your network security. In the olden days, someone who managed to get a foothold in your network was pretty much at liberty to talk to anyone and anything, because what was your firewall going to do about it? ARP-spoof every client on the network so it can inspect the traffic? Sounds like a fast track to a flaky and congested network to me. With a mesh VPN, on the other hand, since every packet between hosts is passing “through” the VPN, it’s free to enforce whatever access controls your heart desires.
If you’ve worked with cloud services much you’ll notice this is more or less exactly what “security groups” do, and that’s no accident. The big public clouds have been using software-defined networking since before everybody else, because you kind of have to when you sell virtual servers. You’re already halfway there because if the servers are virtual, then so are their network interfaces, right? And you don’t want to just dump them onto a physical LAN because that’s just asking for any Tom, Dick or Harry with a credit card to come along and sniff your network traffic. So it’s security groups and “virtual private clouds” all the way.
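To make the "access control between any two nodes" idea concrete, here's a minimal sketch of a default-deny, tag-based policy check, roughly in the spirit of security groups. The node names, tags, and rules are all made up for illustration, and this isn't how any particular VPN actually implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    src_tag: str
    dst_tag: str
    port: int


# Hypothetical nodes and their tags; a node can carry more than one tag.
TAGS = {
    "alice-laptop": {"engineering"},
    "bob-laptop": {"sales"},
    "git-server": {"dev-infra"},
    "crm": {"sales-infra"},
}

# Hypothetical policy: anything not listed here is denied.
RULES = [
    Rule("engineering", "dev-infra", 22),
    Rule("sales", "sales-infra", 443),
]


def allowed(src, dst, port):
    """Default deny: a packet passes only if some rule matches both ends' tags and the port."""
    return any(
        r.src_tag in TAGS[src] and r.dst_tag in TAGS[dst] and r.port == port
        for r in RULES
    )


print(allowed("alice-laptop", "git-server", 22))   # engineering -> dev-infra on 22
print(allowed("bob-laptop", "git-server", 22))     # sales has no rule for dev-infra
print(allowed("alice-laptop", "bob-laptop", 445))  # laptop-to-laptop traffic is denied too
```

The last case is the interesting one: on a plain LAN, two laptops on the same switch can talk freely, whereas here even "neighbors" only get what the policy grants.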
Table Stakes
All of which is to say, in a somewhat meandering way, that we decided pretty early on that we wanted a mesh VPN solution to replace our existing hub-and-spoke architecture. For us the security implications (as discussed above) were the main draw, but a mesh VPN has other advantages over the more classical type. For one thing, it’s a lot easier to scale your VPN up when all the network has to do is route packets, and individual hosts are responsible for the encryption/decryption part. Also, mesh VPNs can have better latency because they’re a lot more flexible with routing - you’re able to take full advantage of the internet’s existing mechanisms for minimizing transit time, instead of having to make detours through a small set of required nodes. Also, NAT holepunching. Technically not required for a mesh VPN, but pretty useless without it, since the majority of internet-connected devices in the world tend to be behind NATs.
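For anyone who hasn't run into NAT holepunching before, the basic move is that both peers send to each other's public endpoint, so that each NAT sees outbound traffic first and then lets the "reply" back in. Here's a toy simulation of that choreography. It assumes a very simple NAT model (inbound traffic is admitted only from endpoints we've already sent to), and the addresses and ports are placeholders; real NATs are a good deal messier.

```python
class Nat:
    """Toy NAT: admits inbound packets only from endpoints it has already sent to."""

    def __init__(self, public_addr):
        self.public_addr = public_addr
        self.allowed = set()  # remote endpoints with an open mapping

    def outbound(self, remote_addr):
        # Sending a packet out opens a "hole" for traffic coming back from that remote.
        self.allowed.add(remote_addr)

    def inbound(self, remote_addr):
        # Unsolicited traffic is dropped; traffic from a known remote passes.
        return remote_addr in self.allowed


def send(src_nat, dst_nat):
    """Send one packet from behind src_nat to dst_nat's public address.

    Returns True if the destination NAT lets the packet through.
    """
    src_nat.outbound(dst_nat.public_addr)
    return dst_nat.inbound(src_nat.public_addr)


# Placeholder public endpoints (documentation address ranges).
alice = Nat(("203.0.113.10", 41641))
bob = Nat(("198.51.100.20", 41641))

first = send(alice, bob)   # dropped: Bob hasn't sent anything to Alice yet
second = send(bob, alice)  # delivered: Alice's earlier packet opened her side
third = send(alice, bob)   # delivered: Bob's packet opened his side
print(first, second, third)
```

The part this glosses over is how each side learns the other's public endpoint in the first place, which is where some kind of coordination service comes in.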
So for us, the boxes that a VPN needed to tick were:
- Mesh topology
- NAT holepunching
- ACLs
- User-friendly enough that we could feasibly expect people to install it on their own machines
Interlude: Wireguard
If you’ve been following the state of the art in VPNery for the last few years, then you’ve heard of Wireguard. It first started making serious waves (to my knowledge) in 2018, when Linus Torvalds referred to it as a “work of art” (as compared to OpenVPN and IPSec) on the Linux kernel mailing list. Given Torvalds’ reputation for acerbic comments regarding code quality, the fact that he was referring to someone else’s code as a “work of art” raised a few eyebrows. One thing led to another, eventually Wireguard was adopted into the mainline Linux kernel, and Jason A. Donenfeld became the herald of the new Golden Age of Networking.
Wireguard is relevant to our discussion for being an encrypted tunnel protocol that Works Really Well, which is why at least three of the options I’ve looked at are based on it. I say “based on”, however, because Wireguard is not a mesh VPN on its own. By itself, Wireguard gives you nothing more than an encrypted tunnel between two points. It’s fast and low-latency and (can be) in-kernel so it’s very low-overhead, and the connections are all secured with public/private keypairs like SSH. Also like SSH, however, it gives you exactly zero help when it comes to distributing those keys, and if you’re looking for some form of automatic peer discovery you’re barking up the wrong tree.
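To get a feel for what "zero help distributing keys" means in practice, here's a rough sketch of the bookkeeping you'd do by hand for a full mesh on vanilla Wireguard: every node's config has to list every other node's public key, address, and endpoint. The peer names, key placeholders, addresses, and hostnames are all hypothetical, and the output is only meant to resemble a wg-quick style config.

```python
# Hypothetical peers: name -> (public key placeholder, tunnel IP, endpoint)
PEERS = {
    "desktop": ("<desktop-public-key>", "10.0.0.1", "desktop.example.com:51820"),
    "laptop": ("<laptop-public-key>", "10.0.0.2", "laptop.example.com:51820"),
    "server": ("<server-public-key>", "10.0.0.3", "server.example.com:51820"),
}


def render_config(me):
    """Render a wg-quick style config for one node, listing every other node as a peer."""
    _, my_ip, _ = PEERS[me]
    lines = [
        "[Interface]",
        f"Address = {my_ip}/24",
        f"PrivateKey = <{me}-private-key>",
        "ListenPort = 51820",
    ]
    for name, (pubkey, ip, endpoint) in PEERS.items():
        if name == me:
            continue
        lines += [
            "",
            "[Peer]",
            f"# {name}",
            f"PublicKey = {pubkey}",
            f"AllowedIPs = {ip}/32",
            f"Endpoint = {endpoint}",
        ]
    return "\n".join(lines)


def tunnel_count(n):
    """Number of distinct peer-to-peer tunnels in a full mesh of n nodes."""
    return n * (n - 1) // 2


print(render_config("laptop"))
print(tunnel_count(50))  # pairs to keep in sync for a 50-device mesh
```

Adding one device means touching every other device's config, and this sketch assumes every node has a stable, reachable endpoint, which is exactly what you don't have behind NAT. That gap is what the mesh VPNs below fill.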
The Field
That’s ok, though, because there are a lot of mesh VPNs out there that do all those things, some of them built on Wireguard and some not, so let’s talk about them!
ZeroTier
I’m starting with this one because it’s one of the most well-established players (been around since 2011, in fact) and was the one I personally discovered first. ZeroTier is a mesh VPN that provides ACLs and NAT holepunching, like everything that we’re interested in. Unlike any of the others, though, it actually emulates at layer 2 rather than layer 3, meaning that it can have broadcast traffic. This immediately makes it interesting from a user-friendliness standpoint, since how great would it be if your fileshare automatically showed up on your VPN via its built-in mDNS (or whatever) advertisement features?
Another nice feature of ZeroTier is that connecting to a network requires a lot less ceremony than some of the other options. Just enter the 16-digit network id, then wait for the network admin to approve your join request. Or, if it’s a public
That’s the theory, at least. In practice - well, in practice I haven’t tried it with broadcast traffic. I have, however, tried it to connect my own personal network of devices (desktops, laptop, Raspberry Pi, a server or two, and some cloud VMs). Short story: It didn’t work all that well for me. To be fair, I could usually get some kind of connectivity, but it was very unpredictable in both bandwidth and latency. In a particularly frustrating twist, the two nodes that I had the most trouble connecting were cloud VMs from different providers, which makes no sense because the main thing that kills these sorts of mesh VPNs is NAT, and the VMs all had public IPv4 addresses. This should have been easy!
Anyway, although I no longer use it, I do retain a soft spot in my heart for ZeroTier, and it has some characteristics (the aforementioned VLAN properties) that really set it apart from the rest. If I were trying to set up a virtual LAN party with a group of friends to play a local-network-only game, I’d probably try ZeroTier first.
Also you can self-host the network controller, although I think you lose the shiny web interface if you do that and have to use the API to configure it.
Nebula
Nebula is one of the newer crop of mesh VPNs that seem to be popping up like weeds lately. It ticks most of our boxes (mesh, ACLs, NAT holepunching) but does so in ways that all seem just ever so slightly sub-optimal (for us, at least). It’s based on the Noise protocol framework
Nebula was developed by Slack to support their… somewhat interesting architecture,
The other thing I’m not a huge fan of is that as far as I can tell, firewall rules are configured individually on each host. Again, not a problem when you’re spinning up VMs from some kind of master image that has the rules all baked in, but not something I want to repeat 50 times on everybody’s laptop (or worse, walk them through writing YAML over screen-sharing or something.) I’m sure it wouldn’t be too hard to build some kind of automation to work around that, but if we were looking to build our own thing we would have just started with vanilla Wireguard and built up from there.
Innernet
Which leads us to Innernet, which is pretty much just exactly that. The introductory blog post says it better than I can:
In the beginning, we had a shared manually-edited WireGuard config file and many sighs were heard whenever we needed to add a new peer to the network.
In the middle ages, there were bash scripts and a weird Vault backend with questionable-at-best maintainability that got new machines on the network and coordinated things like IP allocation. Many groans could be heard whenever these flimsy scripts broke for any reason. In the end, we decided to sit down, sigh one long and hopefully final time, and write innernet.
So, great! What’s more, it’s self-hosted, built in Rust (with ♥, no doubt) and uses kernel-mode Wireguard (actually I think it uses “whatever Wireguard is available on the host system”, which is kernel-mode if you’re on Linux and not otherwise). Unfortunately, it’s still a fairly immature project, so it’s lacking things like (again) user-friendliness, which may or may not be a dealbreaker depending on your wants and needs.
Even more unfortunately, it bases its security model around CIDR network segments, just like old-skool corporate networks, which to my mind is a huge step backwards from the more flexible “security group” model that the other candidates use. The critical difference is that a given device has only one “targetable attribute” with which to specify it in your firewall rules. This tends to lead to over-proliferation of access because Device A is in Group Z but needs access to Thing Q, which the rest of Group Z doesn’t really need but you also don’t want to move Device A into its own special group because now you have to duplicate the access rules for Group Z, and then if they change you have to remember to update the new group too, and who wants to deal with that? So you give all of Group Z access to Thing Q, and before you know it you’re back to having a “soft underbelly” of a LAN where an attacker who gets in can talk to virtually anything they want to if they jump through a few hoops.
The Innernet documentation points out that CIDRs can be nested, which is true, so I guess you can have an engineering CIDR and then within that an engineering-managers CIDR that has all the access of engineering plus a few. But what happens when you have a sales CIDR with a sales-manager who needs the managery bits to match engineering-managers, but not the engineering bits, and oh no you’re back to duplicating firewall rules because you’ve locked yourself into an arbitrary limit of one “role” per device?
In theory you could solve this by allowing a single device to have multiple IPs in multiple different CIDRs, but it’s apparently a core principle of Innernet’s design that “Peers always have only one assigned IP address, and that address is permanently associated with them.” So that’s out.
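The nesting problem is easy to see with Python's standard ipaddress module. This is just a sketch with made-up address ranges, not Innernet's actual configuration: a single address can sit inside engineering and engineering-managers at once because one contains the other, but it can never be in both sales and engineering-managers, because those ranges don't overlap.

```python
import ipaddress

# Hypothetical CIDR layout; the ranges are invented for illustration.
CIDRS = {
    "engineering": ipaddress.ip_network("10.42.1.0/24"),
    "engineering-managers": ipaddress.ip_network("10.42.1.128/25"),
    "sales": ipaddress.ip_network("10.42.2.0/24"),
}


def groups_for(ip):
    """Every named CIDR that contains this one address."""
    addr = ipaddress.ip_address(ip)
    return sorted(name for name, net in CIDRS.items() if addr in net)


# Nesting works along a single branch of the tree...
print(groups_for("10.42.1.130"))
# ...but a sales address only ever lands in sales.
print(groups_for("10.42.2.7"))
# No single IP can be in both, since the two ranges are disjoint.
print(CIDRS["sales"].overlaps(CIDRS["engineering-managers"]))
```

With one IP per peer, the sales manager either gets a copy of the manager rules or gets moved under engineering, and neither is what you actually wanted.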
(I’m also less than entirely comfortable with fixed-size address spaces in an environment where they’re not really necessary, because what happens when the /24 you’ve allocated for doodad-watchers needs its 257th member? But that’s an ancillary concern and could probably be managed fairly easily by careful allocation of address blocks.)
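For the record, the arithmetic on that /24 looks like this, again using the standard ipaddress module with an invented range. Growing the block means widening the prefix, and the wider block swallows whatever was allocated next door.

```python
import ipaddress

# Hypothetical allocation for the doodad-watchers group.
net = ipaddress.ip_network("10.42.3.0/24")

total = net.num_addresses        # every address in the block
usable = len(list(net.hosts()))  # minus the network and broadcast addresses
print(total, usable)

# Making room means moving to a /23, which also covers the neighboring /24.
bigger = net.supernet(new_prefix=23)
print(bigger, bigger.num_addresses)
```

So "careful allocation" here mostly means leaving gaps between groups up front, so that a block can double in size without colliding with its neighbor.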
In conclusion, I’m conflicted. There’s a lot to like about Innernet, and I’m interested to see where they take it as time goes on, but I find myself disagreeing just a little too much with some of the fundamental design choices. I may still end up trying it out some day, since setting up a new VPN for my personal fleet of network-connected thingies is my idea of a fun weekend, but I doubt I’ll ever use it seriously unless there’s some significant change in how access control works.
Oh yeah, and there’s no Windows client as yet. Hard to sell switching your whole workforce to Linux just so you can use a cool VPN thingy.
Cloudflare One
Ok, I’m cheating a little bit. Cloudflare One technically isn’t a mesh VPN, because it always routes your traffic through a Cloudflare gateway, rather than establishing direct links between devices and letting them do the communicating. I’m including it here anyway, because the result is pretty comparable to what you get from these mesh VPNs: A logically “flat” network in which any node can communicate with any other node, subject to centrally-administered access control rules. It even gets you most of the latency and throughput advantages you’d get from a true mesh VPN, because Cloudflare’s edge is basically everywhere and its capacity is effectively infinite, as far as the lowly user is concerned.
It’s surprisingly inexpensive, as well, with a free tier for up to 50 users, a $7/user/month tier for intermediate cases, and a “call us for pricing” option if you tend to use scientific notation when you talk about your company’s market cap. We ended up deciding against it anyway, largely because of some anecdotal claims about its user-friendliness being not-so-great, and the fact that… well, Cloudflare already gets their greasy paws
Also, the one place where you’d feel the lack of true mesh-ness would be LAN communication, which was actually a concern for us. Proper mesh VPNs can detect when two clients are on the same LAN and route their traffic accordingly, so lower latency, higher throughput, yadda yadda. As far as I can tell, Cloudflare needs every packet to pass through the Cloudflare edge (aka “the internet”), meaning it turns LAN hops into WAN hops. Probably not a big deal for their customers, since this product is pretty clearly targeting Proper Enterprise types, and they undoubtedly have built-up layers of LAN cruft that you couldn’t dig your way out of with a backhoe and so wouldn’t be using it within their LAN anyway. A slightly bigger deal for us, since “route even LAN traffic through the VPN so we can enforce ACLs” was one of our stated goals.
Netmaker
Netmaker is a newcomer to this space; the first commit in their GitHub repo is from March of 2021. It looks to be quite functional, though, with the whole nine yards - full mesh, NAT holepunching, ACLs, and traffic relays for those stubborn NATs that just can’t be punched. Pretty impressive for a year and change, which is probably why they got funded by Y Combinator.
It’s fully self-hostable, with some fancy options for HA cluster-type setups if you want to do that. (The Netmaker docs also introduced me to rqlite, which looks like quite an interesting project.) We probably came closer to settling on this one than any others in this list (other than the one we did settle on), and I’d still really like to play with it at some point.
It seems to use kernel-mode Wireguard, which is a big plus in my book. That’s presumably platform-dependent (I don’t think macOS, and maybe Windows, have kernel-mode Wireguard yet), but it should be easy to slot in once it does arrive on a given platform.
My one gripe is with the way it does ACLs. It looks like the ACL configuration is just a simple yes/no to every distinct pair of peers in your network, the question being “can these two peers communicate directly?” No mention of ports, either source
Regardless, Netmaker looks like an extremely interesting project and I’d very much like to try it out at some point.
Tailscale
Obviously, this is the one we settled on. The Cadillac of the bunch. Although not the oldest, I’d probably call Tailscale the most well-established of the candidates in this list. It didn’t take them very long (I think they started in 2018 or 2019?) because their product is just really damn good. It slices, it dices, it meshes, it firewalls, and it even twiddles with your DNS settings so that you can type ping homepi and homepi will resolve to the Tailscale-internal IP of the raspberry pi that’s hanging out with the dust bunnies next to your cable modem.
So why did we like it? Well, for one I had been using it for about a year and a half to connect my personal devices, so I knew it would get the job done. That’s not the only reason, though. A few of the others:
User-friendliness: Installing Tailscale is basically just downloading the app and logging in. There’s practically nothing to it. After that it just hums along quietly in the background, and your things are magically connected to your other things whenever you want them to be. This is what networking should feel like. Too bad script kiddies with DDoS botnets have ruined it all for us over the last 20 years.
The Best NAT holepunching: I don’t think I’m exaggerating here. As they explain, Tailscale goes a lot further than “try sending packets both ways and give up if it doesn’t work.” Among the various tricks it pulls is sending a whole bunch of packets and hoping the birthday paradox kicks in and one of them gets through, which I think is pretty clever.
Magic DNS: To be fair, I haven’t looked super deeply into what all of the competitors do for this, but it’s a pretty big quality-of-life feature. Admittedly Tailscale IPs are stable (as long as you don’t clear the device’s local state), so you could just stick a public DNS record somewhere that points devicename.yourdomain.net to a Tailscale IP. You could even automate it, if you really felt like it. Still, not having to do that is worth something, especially given how much of a pain it is to manage split-horizon DNS
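As a footnote to the birthday-paradox trick mentioned under NAT holepunching above, here's a back-of-the-envelope sketch of why spraying packets can work. The numbers are my own assumptions for illustration (one side holds 256 ports open, the other guesses uniformly at random out of 65535), not Tailscale's actual parameters, and the model ignores everything else that can go wrong.

```python
def hit_probability(probes, open_ports=256, port_space=65535):
    """Chance that at least one of `probes` random guesses lands on an open port.

    Assumes each guess is independent and uniform over the port space.
    """
    miss_one = 1 - open_ports / port_space
    return 1 - miss_one ** probes


for probes in (64, 174, 1024):
    print(probes, round(hit_probability(probes), 2))
```

Under those assumptions you're at roughly even odds after a couple hundred probes and close to certain after a thousand or so, which is a lot better than it sounds when you first hear "guess the port."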
Looking back over these I realize that I might be slightly underselling it: it’s hard to overemphasize how well Tailscale just works. You kind of have to use it to appreciate it - Tailscale discussions are chock-full of people saying variations on “I never understood why everyone was so crazy about it, I mean it’s just a mesh VPN right? There’s a bunch of those. But then I tried it and OMG THIS IS THE BEST THING EVER TELL EVERYONE!!!” The attention paid to the little details at every level is just phenomenal. If Apple (old Apple, under Steve Jobs) had decided to go after networking rather than laptops and phones, they might have come up with something like Tailscale.
Of course, it’s not perfect. What ever is? I have a few (minor) nitpicks:
Cost: This is probably the one that comes up the most. Tailscale plans start at $5/user/month (except for the free tier, which is only suitable for a single user) and go up from there. Any reasonably-complex network will need the $15/user/month plan, which is (I think) more than any other VPN on this list. You get what you pay for, of course, but that doesn’t change the fact that you do pay for it. Absolutely worth it, in my opinion, but it does make it a harder sell to a lot of people.
Usermode Wireguard: Obviously this currently only applies to Linux (and maybe BSD?) as far as I’m aware. Still, it would be nice if Tailscale could make use of kernel-mode Wireguard where available, since otherwise you’re leaving throughput on the table. For example, between two fairly beefy machines I get about 680 Mb/s throughput when testing with iPerf. Between one beefy machine and one Synology NAS with a wimpy CPU, I get about 300. Obviously the extent to which this matters depends on what you’re trying to do, and it’s more than fast enough for most use cases. It just bugs me that it could be better.
Data Sovereignty: (Network sovereignty?) Different people will weight this one differently, but at the end of the day it’s true that Tailscale runs a coordination server that is responsible for telling your network who’s in it and what kind of access they get. If they decide to add an invisible node that can talk to any of your devices on any port, there’s not really anything you can do about it.
Headscale
No discussion of Tailscale would be complete without mentioning Headscale, a community-driven re-implementation of the Tailscale control plane. You can point the official Tailscale clients at it, although they may require a bit of hackery to work properly. And the Tailscale people have said that although it’s not officially supported, they are personally in favor of its existence, which I take to mean that they probably won’t intentionally break its functionality with an update within the immediate future.
It solves the cost issue of Tailscale, although it introduces the cost of having to maintain it yourself, which may or may not be something you’d worry about. It does introduce a UX penalty, and I doubt that’s going to change any time soon - the Tailscale people don’t seem to mind its existence, but I can’t see them going very far out of their way to make it easier for something that exists specifically so that people can avoid paying for their service. Still, if you really really want Tailscale, but you simply can’t justify the cost, or you’re especially paranoid about the control plane, it’s worth a shot.
The Rest of the Iceberg
The above options are what I’ve researched in depth, but they’re far from the only mesh VPN solutions out there. I’ve come across others, but didn’t look into them closely for one reason or another - they were either missing some critical component of what we needed, or I didn’t discover them until too late, or I just got a weird feeling from them for whatever reason. Still, I’ll mention them here in case they happen to be what anybody else is looking for:
Tinc
Tinc is the OG. It’s been around since 1998 and still has a community of dedicated users to this day. It does full-mesh, NAT traversal, and even (apparently) some LAN stuff, like ZeroTier.
It doesn’t do ACLs, as far as I am aware, which made it a non-starter for us, so that’s why it’s down here rather than up in the previous section. Moreover, I can’t help wondering - if Tinc has been doing this so long, why is it still so niche? Mesh VPNs are obviously great, so why hasn’t Tinc eaten the world?
One possibility (borne out by a few anecdotes that I’ve seen online) is that Tinc just doesn’t perform very well. And I don’t just mean in terms of raw bandwidth
From a brief glance at its documentation it also seems that it might be a bit of a pain to manage. E.g. the documentation recommends manually distributing configuration by sending config files back and forth, which doesn’t sound terribly pleasant.
PeerVPN
I don’t really know too much about this one, it just popped up when I was Googling around. It looks like it has the basics, i.e. peer discovery and NAT traversal, and probably not any kind of access control, but the site is extremely minimal so I can’t get much of a read on it.
FreeLAN
Much like the above, just something that showed up while I was looking around. It looks to be a bigger project than PeerVPN, or at least the website is a little more fleshed out. I honestly can’t quite parse out all of its features - I don’t think it does NAT traversal? I can’t quite tell for sure, though. The documentation is a little light. Although it does mention that it uses X.509 certificates, which is an instant turnoff for me because messing with X.509 is a pain.
VPNCloud
VPNCloud is a little more fully-featured, like the bigger players I’ve mentioned. It doesn’t seem to do access control, so it’s not a true contender for our use-case, but it does look like it works fairly well for what it does do. Their site claims that they’ve gotten multiple gigabits of throughput between m5.large AWS instances (so, not terribly beefy) which is better than pretty much anything else I’ve seen other than vanilla Wireguard.
Netbird
The first time I ran across this one, it was called “Wiretrustee”. A change for the better, I think. It looks to be pretty much exactly “open-source Tailscale”, so my guess is it will entirely live or die by how well it executes on that. Obviously Tailscale is great, and Headscale proves that there are people who would like to run the control plane themselves, so there’s a market for them. Unfortunately it looks like their monetization scheme is “be Tailscale” (i.e. run a hosted version and charge for anything over a single user), at which point why wouldn’t you just use Tailscale?
And More
There’s a handy list on GitHub of Wireguard mesh things, some of which I’ve already mentioned. And I’m sure even more will continue to pop up like weeds, since everybody seems to want one and a surprisingly large number of people are happy to just sit down and write their own. I guess that’s proof that Wireguard made good choices about what problems to address and what to ignore - not an easy task, especially the latter.
Where Do We Go From Here
It’s an exciting time in the world of networking. The Tailscale people talk a lot about this on their blog, because of course they do, but the advent of high-performance, low-overhead VPNery has opened up some pretty interesting possibilities in the world of how we interact with computers. Most excitingly it promises something of a return to the Good Old LAN Days, where every device on the network was trusted by default and no one ever worried about things like authentication and encryption, because why would anyone want to do anything unpleasant to your computer? The Internet made that position untenable, but Tailscale and its ilk hope to bring it back again, with some added benefits from modern cryptography. I can’t say whether they’ll succeed, but if nothing else it’s looking like a fun ride.