FOSDEM and the unrealistic IPv6-only network

Most of you know FOSDEM already; for those who don’t, it’s the largest conference focused on Free and Open Source Software in Europe (if not the world). If you haven’t been, I definitely suggest going, particularly because admission is free and there is always something interesting to discuss.

Even though there is no ticket and no badge, the conference does have free WiFi Internet access, which is how the number of attendees is usually estimated. In the past few years, their network has also been pushing the envelope on IPv6 support, first providing a dual-stack network when IPv6 was fairly rare, and in the last (three?) years providing an IPv6-only network as the default.

I can see the reason to do this, in the sense that a lot of Free Software developers are physically at the conference, which means they can see their tools suffer in an IPv6 environment and fix them. But at the same time, this has generated lots of complaints about Android not working in this setup. While part of that noise was useful, I got the impression this year that the complaints are repeated only for the sake of complaining.

Full disclosure, of course: I do happen to work for the company behind Android. On the other hand, I don’t work on anything related at all. So this post is as usual my own personal opinion.

The complaints about Android started off quite healthy: devices couldn’t actually connect to an IPv6 dual-stack network, and then they couldn’t connect to an IPv6-only network. Both are valid complaints to begin with, though there is a bit more to it. This year in particular the complaints were not so healthy, because current versions of Android (6.0) actually do support IPv6-only networks, though most of the Android devices out there are not running this version, either because their hardware is too old or because the manufacturer has not released a new build yet.

What does tick me off, though, has really nothing to do with Android, but rather with the idea people have that the current IPv6-only setup used by FOSDEM is a realistic approach to IPv6 networking. It really is not. It is a nice setup to test things out and stress the need for proper support for IPv6 in tools, but it’s very unlikely to be used in production by anybody as is.

The technique used (at least this year) by FOSDEM is NAT64, paired with DNS64. To oversimplify how this works, the DNS replies are modified when resolving hostnames so that they always provide an IPv6 address, even for hosts that only have A records (IPv4 addresses). The synthesised IPv6 addresses map back to IPv4, and the edge router then “translates” between the two connections.
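To make the address mapping concrete, here is a minimal Python sketch of what the resolver side of such a setup (usually called DNS64) and the translating router could do with addresses. It assumes the well-known prefix 64:ff9b::/96 from RFC 6052; the FOSDEM network may well have used a different, network-specific prefix, and the function names are mine.

```python
import ipaddress

# Well-known NAT64 prefix from RFC 6052. Assumption: a real deployment
# may be configured with a network-specific prefix instead.
NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_aaaa(ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of the NAT64 prefix."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(NAT64_PREFIX.network_address) | int(v4))


def extract_ipv4(ipv6: str) -> ipaddress.IPv4Address:
    """Recover the IPv4 destination, as the translating router would."""
    v6 = ipaddress.IPv6Address(ipv6)
    if v6 not in NAT64_PREFIX:
        raise ValueError(f"{v6} is not inside {NAT64_PREFIX}")
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


if __name__ == "__main__":
    mapped = synthesize_aaaa("192.0.2.33")  # documentation address
    print(mapped)
    print(extract_ipv4(str(mapped)))
```

The point of the sketch is only that the mapping is mechanical and reversible: the client never sees an IPv4 address, and the router can always tell where the traffic is really meant to go.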

Unlike classic NAT, this technique requires user-space components, as the kernel uses separate stacks for IPv4 and IPv6 which do not allow direct message passing between the two. This makes it complicated and significantly slower (you have to copy the data from kernel to userspace and back all the time), unless you use one of the hardware routers that are designed to deal with this (I know both Juniper and Cisco have those).

NAT64 is a very useful testbed, if your target is figuring out what in your stack is not ready for IPv6. It is not, though, a realistic approach for consumer networks. If your client application does not have IPv6 support, it’ll just fail to connect. If for whatever reason you rely on IPv4 literals, they won’t work. Even worse, if the code allows a connection to be established over IPv6, but relies on IPv4 semantics for things like logging, or (worse) access control, then you now have bugs, crashes or worse, vulnerabilities.
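As an illustration of the last failure mode, here is a hedged sketch of an access check that does not assume IPv4 semantics. Code that splits a peer address on dots, or compares it as a string prefix, stops working the moment the peer shows up over IPv6; parsing with the standard `ipaddress` module handles both families. The allow-list and the function name are hypothetical, and the networks are documentation ranges.

```python
import ipaddress

# Hypothetical allow-list mixing both address families.
ALLOWED = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8:1::/48"),
]


def is_allowed(peer: str) -> bool:
    """Return True if the peer address falls inside an allowed network."""
    addr = ipaddress.ip_address(peer)  # accepts IPv4 and IPv6 text forms
    # Only compare against networks of the same family.
    return any(
        addr.version == net.version and addr in net for net in ALLOWED
    )


if __name__ == "__main__":
    for peer in ("192.0.2.10", "2001:db8:1::5", "2001:db8:2::5"):
        print(peer, is_allowed(peer))
```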

And while fuzzing and stress-testing are great for development environments, they are not good for final users. In the same way -Werror is a great tool to fix your code, but uselessly disrupts your users.

In a similar fashion, while IPv6-only datacenters are not that uncommon – Facebook (the company) talked about them two years ago already – they serve a definitely different purpose from a customer network. You don’t want, after all, your database cluster to connect to random external services that you don’t control, and if you do control the services, you just need to make sure they are all available over IPv6. In such a system, having a single stack to worry about simplifies, rather than complicates, things. I do something similar for the server I divide into containers: some of them, the ones that are only backends, get no IPv4 at all, not even in NAT. If they ever have to go fetch something to build on the Internet at large, they go through a proxy instead.

I’m not saying that FOSDEM setting up such a network is not useful. It actually hugely is, as it clearly highlights the problems of applications not supporting IPv6 properly. And for Free Software developers, setting up a network like this might indeed be too expensive in time or money, so it is a chance to try things out and iron out bugs. But at the same time it does not reflect a realistic environment. Which is why adding more and more rants to the Android tracking bug (which I’m not even going to link here) is not going to be useful: the limitation was known for a while and has been addressed in newer versions, but it would be useless to try backporting it.

For what it’s worth, what is more likely to happen as IPv6 adoption progresses is that providers will move towards solutions like DS-Lite (nothing to do with Nintendo), which couples native IPv6 with carrier-grade NAT. While this has limitations, depending on the size of the ISP pools, it is still easier to set up than NAT64, and is essentially transparent for customers if their systems don’t support IPv6 at all. My ISP here in Ireland (Virgin Media) already has such a setup.

Predictable persistently (non-)mnemonic names

This is part two of a series of articles looking into the new udev “predictable” names. Part one is here and talks about the path-based names.

As Steve also asked in the comments to the last post, isn’t it possible to just use the MAC address of an interface to point at it? Sure it’s possible! You just need to enable the MAC-based name generator. But what does that mean? It means that your new interface names will be enx0026b9d7bf1f and wlx0023148f1cc8. Do you see yourself typing them?
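For the curious, the shape of those names is easy to reproduce: a two-letter type prefix, an `x`, and the twelve hexadecimal digits of the MAC address. The following Python sketch only mimics that format as seen in the two names above; the function name is mine, and it is not a reimplementation of what udev does internally.

```python
import re


def mac_based_name(prefix: str, mac: str) -> str:
    """Build an interface name of the form <prefix>x<12 hex digits>."""
    digits = re.sub(r"[:-]", "", mac).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return f"{prefix}x{digits}"


if __name__ == "__main__":
    print(mac_based_name("en", "00:26:b9:d7:bf:1f"))
    print(mac_based_name("wl", "00:23:14:8f:1c:c8"))
```

At fifteen characters, such a name is as long as a Linux interface name is allowed to be, which does not help with the typing.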

Myself, I’m not going to type them. My favourite suggestion to solve the issue is to rely on rules similar to the previous persistent naming, but not re-using the eth prefix to avoid collisions (which will no longer be resolved by future versions of udev). I instead use the names wan0 and lan0 (and so on), when the two interfaces sit straddling between a private and a public network. How do I achieve that? Simple:

SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:17:31:c6:4a:ca", NAME="lan0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:07:e9:12:07:36", NAME="wan0"

Yes, these simple rules do all the work you need if you just want to make sure not to mix up the two interfaces by mistake. If your server or vserver only has one interface, and you want to have it as wan0 no matter what its MAC address is (easier to clone, for instance), then you can go for

SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="*", NAME="wan0"

As long as you only have a single network interface, that will work just fine. For those who use Puppet, I also published a module that you can use to create the file, and ensure that the other methods to achieve “sticky” names are not present.
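If you don’t use Puppet, generating the file yourself is trivial. This is a minimal Python sketch (not the module mentioned above) that renders rules in the same shape as the hand-written ones from a name-to-MAC mapping; the function names are mine, and where you write the result (some file under /etc/udev/rules.d/) is left to you.

```python
def udev_rule(mac: str, name: str) -> str:
    """Format one rule in the same shape as the hand-written ones."""
    return (
        'SUBSYSTEM=="net", ACTION=="add", '
        f'ATTR{{address}}=="{mac.lower()}", NAME="{name}"'
    )


def render_rules(interfaces: dict[str, str]) -> str:
    """Render a rules file body from a {name: mac} mapping."""
    macs = [mac.lower() for mac in interfaces.values()]
    if len(set(macs)) != len(macs):
        # Two names for one card would make the result depend on rule order.
        raise ValueError("the same MAC address is assigned to two names")
    lines = [udev_rule(mac, name) for name, mac in interfaces.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_rules({
        "lan0": "00:17:31:c6:4a:ca",
        "wan0": "00:07:e9:12:07:36",
    }), end="")
```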

My reasoning for actually using this kind of name is relatively simple: the rare places where I do need to specify the interface name are usually in ACLs, the firewall, and so on. In these, the most important part to me is knowing whether the interface is public or not, so the wan/lan distinction is the most useful. I don’t intend trying to remember whether enp5s24k1f345totheright4nextothebaker is the public or private interface.

Speaking of which, one of the things that appears obvious even from Lennart’s comment to the previous post is that there is no real assurance that the names are set in stone: he says that a udev upgrade won’t change them, but I guess most people would be sceptical, remembering the track record that udev and systemd have had over the past few months alone. In this situation my personal, informed opinion is that all this work on “predictable” names is a huge waste of time for almost everybody.

If you do care about stable interface names, you most definitely expect them to be more meaningful than 10-digit strings of paths or MAC addresses, so you almost certainly want to go through with custom naming, so that at least you attach some sense to the names themselves.

On the other hand, if you do not care about interface names themselves, for instance because instead of running commands or scripts, you just use NetworkManager… well what the heck are you doing playing around with paths? If it doesn’t bother you that the interface for a USB device changes considerably between one port and another, how can it matter to you whether it’s called wwan0 or wwan123? And if the name of the interface does not matter to you, why are you spending useless time trying to get these “predictable” names working?

All in all, I think this is just a useless nice trick that will cause more headaches than it can possibly solve. Bah, humbug!

Predictably non-persistent names

This is going to be fun. The Gentoo “udev team”, in the person of Samuli – who seems to suffer from 0-day bump syndrome – decided to now enable by default the new predictable names feature that is supposed to make things so much nicer in Linux land where, especially for people coming from FreeBSD, things have been pretty much messed up. This replaces the old “persistent” names, which were often enough too fragile to work, as they did in-place renaming of interfaces, and would way too often cause conflicts at boot time, since swapping two devices’ names is not an atomic operation for obvious reasons.

So what’s this predictable naming all about? Well, it’s mostly a merge of the previous persistent naming system and the BIOS label naming project, which Red Hat has been developing for a few years already so that the names of interfaces for server hardware in the operating system match the documentation of said server, so that you can be sure that if you’re connecting the port marked with “1” on the chassis, out of four on the motherboard, it will bring up eth2.

But why were those two technologies needed? Let’s start first with explaining how (more or less) the kernel naming scheme works: unlike the BSD systems, where the interfaces are named after the kernel driver (en0, dc0, etc.), the Linux kernel uses generic names, mostly eth, wlan and wwan, and maybe a couple more, for tunnels and so on. This causes the first problem: if you have multiple devices of the same class (ethernet, wlan, wwan) coming from different drivers, the order of the interfaces may very well vary between reboots, either because of changes in the kernel, if the drivers are built-in, or simply because of locking and the order in which modules get loaded (which is much more common for binary distributions).

The reason why changes in the kernel can change the order is that the order in which drivers are initialized has changed before and might change again in the future. A driver could also decide to change the order with which its devices are initialized (PCI tree scanning order, PCI ID order, MAC address order, …) and so on, causing it to change the order of interfaces even for the same driver. More about this later.

But here my first doubt arises: how common is it for people to have more than one interface of the same class from vendors different enough to use different drivers? Well, it depends on the class of device; on a laptop you’d have to search hard for a model with more than one Ethernet or wireless interface, unless you add an ExpressCard or PCMCIA expansion card (and even those are not that common). On a desktop, I’ve seen a few very recent motherboards with more than one network port, and I have yet to see one with different chips for the two. Servers, that’s a different story.

Indeed, it’s not that uncommon to have multiple on-board and expansion card ports on a server. For instance you could use the two onboard ports as public and private interfaces for the host… and then add a 4-port card to split between virtual machines. In this situation, having a persistent naming of the interfaces is indeed something you would be glad of. How can you tell which one of eth{0..5} is your onboard port #2, otherwise? This would be problem number two.

Another situation in which having a persistent naming of interfaces is almost a requirement is if you’re setting up a router: you definitely don’t want to switch the LAN and WAN interface names around, especially where the firewall is involved.

This background is why the persistent-net rules were devised quite a few years ago for udev. Unfortunately almost everybody got at least one nasty experience with them. Sometimes the in-place rename would fail, and you’d end up with the temporary names at the end of boot. In a few cases the name was not persistent at all: if the kernel driver for the device would change, or change name at least, the rules wouldn’t match and your eth0 would become eth1 (this was the case when Intel split the e1000 and e1000e drivers, but it’s definitely more common with wireless drivers, especially if they move from staging to main).

So the old persistent net rules were flawed. What about the new predictable rules? Well, not only do they incorporate the BIOS naming scheme (which is actually awesome when it works; SuperMicro servers such as Excelsior do not expose the label, and my Dell laptop only exposes a label for the Ethernet port but not for either the wireless adapter or the 3G one), but they also have two “fallbacks” that are supposed to be used when the labels fail: one based on the MAC address of the interface, and the other based on the “path”, which for most PCI, PCI-E, onboard and ExpressCard ports is basically the PCI address; for USB… we’ll see in a moment.

So let’s see, from my laptop:

# lspci | grep 'Network controller'
03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6200 (rev 35)
# ifconfig | grep wlp3
wlp3s0: flags=4163  mtu 1500

Why “wlp3s0”? It’s the Wireless adapter (wl) PCI (p) card at bus 3, slot 0 (s0): 03:00.0. Matches lspci properly. But let’s see the WWAN interface on the same laptop:
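The decoding can be done mechanically. Below is a rough Python sketch that splits a path-based name into its parts and prints the PCI address the way lspci would, in hexadecimal. It only covers the `p<bus>s<slot>` form with optional `f<function>`, `u<port>` and `i<interface>` parts, as seen in the names from my laptop; other forms, such as names derived from hotplug slot information, are deliberately rejected. The class and function names are mine.

```python
import re
from dataclasses import dataclass, field

# Two-letter type prefixes that show up in this post.
KINDS = {"en": "ethernet", "wl": "wlan", "ww": "wwan"}

NAME_RE = re.compile(
    r"^(?P<kind>en|wl|ww)"
    r"p(?P<bus>\d+)s(?P<slot>\d+)"      # PCI bus and slot, in decimal
    r"(?:f(?P<function>\d+))?"          # optional PCI function
    r"(?P<ports>(?:u\d+)*)"             # optional chain of USB ports
    r"(?:i(?P<interface>\d+))?$"        # optional USB interface number
)


@dataclass
class PathName:
    kind: str
    bus: int
    slot: int
    function: int = 0
    usb_ports: list = field(default_factory=list)
    usb_interface: int = 0

    def pci_address(self) -> str:
        """Format the PCI part the way lspci prints it (hexadecimal)."""
        return f"{self.bus:02x}:{self.slot:02x}.{self.function:x}"


def parse_path_name(name: str) -> PathName:
    """Split a path-based interface name into its components."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unsupported interface name: {name!r}")
    return PathName(
        kind=KINDS[m["kind"]],
        bus=int(m["bus"]),
        slot=int(m["slot"]),
        function=int(m["function"] or 0),
        usb_ports=[int(p) for p in re.findall(r"u(\d+)", m["ports"])],
        usb_interface=int(m["interface"] or 0),
    )


if __name__ == "__main__":
    print(parse_path_name("wlp3s0").pci_address())
```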

# ifconfig -a | grep ww
wwp0s29u1u6i6: flags=4098  mtu 1500

Much longer name! What’s going on then? Let’s see: it’s reporting a card at bus 0, slot 29 (0x1d); lspci uses hexadecimal numbers for the addresses:

# lspci | grep '00:1d'
00:1d.0 USB controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)

Okay, so it’s a USB device, even though the physical form factor is a mini-PCIe card. That’s common. Does it match lsusb?

# lsusb | grep Broadband
Bus 002 Device 004: ID 413c:8184 Dell Computer Corp. F3607gw v2 Mobile Broadband Module

Not the Bus/Device specification there, which is good: the device number will increase every time you pop something in or out of the port, so it’s not persistent across reboots at all. What it uses is the path to the device, given by the USB ports it hangs off, which is a tad more complex, but basically means it matches /sys/bus/usb/devices/2-1.6:1.6/ (I don’t pretend to know how the thing works exactly, but it describes which physical port the device is connected to).
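As far as I can tell, the relation between that sysfs name and the tail of the interface name is mechanical: every port in the chain gets a `u` prefix, and the interface number gets an `i`. The Python sketch below reproduces the `u1u6i6` seen above from `2-1.6:1.6`. It rests on an assumption I am labelling loudly: that the USB configuration number is left out when it is 1 and the interface number when it is 0, which matches my understanding of udev’s behaviour but is not something I have verified against every version. The function name is mine.

```python
def usb_suffix(sysfs_name: str) -> str:
    """Turn a sysfs USB interface name like '2-1.6:1.6' into 'u1u6i6'."""
    device, _, config_iface = sysfs_name.partition(":")
    _bus, _, port_chain = device.partition("-")
    config, _, interface = config_iface.partition(".")
    if not port_chain or not config or not interface:
        raise ValueError(f"not a USB interface path: {sysfs_name!r}")

    # Every hop in the port chain is prefixed with "u".
    suffix = "".join(f"u{port}" for port in port_chain.split("."))
    # Assumption: the common defaults (configuration 1, interface 0)
    # are suppressed to keep the name short.
    if config != "1":
        suffix += f"c{config}"
    if interface != "0":
        suffix += f"i{interface}"
    return suffix


if __name__ == "__main__":
    print(usb_suffix("2-1.6:1.6"))
```

Note that the USB bus number (the leading `2`) does not appear in the result at all; the controller is identified by its PCI address in the first half of the name instead.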

In my laptop’s case, the situation is actually quite nice: I cannot move either the WLAN or WWAN device to a different slot, so the name assigned by the slot is persistent as well as predictable. But what if you’re on a desktop with an add-on WLAN card? What happens if you decide to replace your video card with a more powerful one that occupies the space of two slots, one of which happens to be the place where your WLAN card is? You move it, reboot and… you just changed the interface name! If you’ve been using NetworkManager, you’ll just have to reconfigure the network, I suppose.

Let’s take a different example. My laptop, with its integrated WWAN card, is a rare example; most people I know use USB “keys”, as the providers give them away for free, at least in Italy. I happen to have one as well, so let me try to plug it in one of the ports of my laptop:

# lsusb | grep modem
Bus 002 Device 014: ID 12d1:1436 Huawei Technologies Co., Ltd. E173 3G Modem (modem-mode)
# ifconfig -a | grep ww
wwp0s29u1u2i1: flags=4098  mtu 1500
wwp0s29u1u6i6: flags=4098  mtu 1500

Okay great this is a different USB device, connected to the same USB controller as the onboard one, but at different ports, neat. Now, what if I had all my usual ports busy, and I decided to connect it to the USB3 add-on ExpressCard I got on the laptop?

# lsusb | grep modem
Bus 003 Device 004: ID 12d1:1436 Huawei Technologies Co., Ltd. E173 3G Modem (modem-mode)
# ifconfig -a | grep ww
wwp0s29u1u6i6: flags=4098  mtu 1500
wws1u1i1: flags=4098  mtu 1500

What’s this? Well, the USB3 controller provides slot information, so udev magically uses that to rename the interface, so it avoids using the otherwise longer wwp6s0u1i1 name (the USB3 controller is on the PCI bus 6).

Let’s go back to the on-board ports:

# lsusb | grep modem
Bus 002 Device 016: ID 12d1:1436 Huawei Technologies Co., Ltd. E173 3G Modem (modem-mode)
# ifconfig -a | grep ww
wwp0s29u1u3i1: flags=4098  mtu 1500
wwp0s29u1u6i6: flags=4098  mtu 1500

Seems the same, but it’s not. Now it’s u3 not u2. Why? I used a different port on the laptop. And the interface name changed. Yes, any port change will produce a different interface name, predictably. But what happens if the kernel decides to change the way the ports are enumerated? What happens if the USB 2 driver is buggy and is supposed to provide slot information, and they fix it? You got it, even in these cases, the interface names are changed.

I’m not saying that the kernel naming scheme is perfect. But if you’re expected to always only have an Ethernet port, a WLAN card and a WWAN USB stick, with it you’ll be sure to have eth0, wlan0 and wwan0, as long as the drivers are not completely broken as they are now (like if the WLAN is appearing as eth1), and as long as you don’t muck with the interface names in userspace.

Next up, I’ll talk about MAC address-based naming and my personal preference when setting up servers and routers. Have fun in the meantime figuring out what your interface names will be.

Gentoo Linux-based network routing, again

It seems like I’m specializing in setting up Gentoo-based routers. In my work here in California (for the short time I’ll be here, as it looks like my next destination is London by the end of the year), there was the need to change the previous network setup from the previous router (a Juniper ScreenOS device) to something more apt to work with FiOS as the uplink — in particular, we just got our 150Mbit down, 65Mbit up link and the router we had, from Juniper, is only rated up to a very optimistic 40Mbps in either direction.

After trying, and failing, to get the FiOS router/access-point and the VPN provided by the Juniper router to play nice together, I picked up one of the (extremely old) HPs we had around (a desktop, not a server), ordered a couple of PCI gigabit network cards, and simply set up Gentoo on it. Actually, since the cards took a couple of days to arrive, I first set everything up “dry” and then got the network cards in. The bright side is that the cards arrived at 11am, and by 4pm the whole thing was running better than before; by the end of the day I also got an IPv6 tunnel, and we finally have support for IPv6 here in the office, which is important for me because of how my Excelsior is set up (I’ll write more about that later on).

Getting Linux to play nice with the Juniper router and its VPN has been the most bothersome part of the whole. Luckily this wasn’t Juniper’s “SSL VPN”, which requires their Java-based tool to run as root to work as a client on Linux — instead the VPN, completely unmarked, is using IPsec. It’s a bit of a burden to know what to tweak between the kernel and the userland, and everything is up.. unfortunately it seems like the racoon init script is a bit of a pain in the butt, as it failed to work properly for me, while my improvements fail to work for others — if you’re using it and feel like testing it, I’m pretty sure Anthony would be happy to have more hands on deck.

I have yet to set up OpenVPN to be honest, and there is another problem with VPN Tracker behind this router as there is no IPsec connection tracking helper, which means that the UDP packets required for negotiation are not working (the client does not support UPnP/IGD for port forwarding which is a definite pain). In general though it’s much easier for me to deal with a Gentoo Linux-based router than it is dealing with the stupid Juniper ScreenOS.

I’ve been doing some reading around on which parameters to tweak, but since I haven’t had much time to experiment with it yet, and on the other hand the office is now basically running with three people in at any time, there’s very little that doesn’t work out of the box. The one thing that I noticed, though, is that somehow IPv6 (over the tunnel) feels “snappier” than IPv4. Maybe it’s the NAT that has to be done, or the fact that the iptables rules are more complex for v4 than v6 (as they have DNAT as well) — the ping times are also quite good: they are halved for IPv6: 3ms vs 6ms over v4, to Google’s homepage; similar (but much higher) results happen for Yahoo! but they are reversed for Facebook.

A good reason not to use network bridges

So one of the things I’m working on for my job is setting up Linux Containers to separate some applications. Yes, I know I’m the one who said that they are not ready for prime time, but please note that what I was saying is that I wouldn’t give root inside a container to anybody I wouldn’t trust, which is not the same as saying that they are not extremely useful to limit the resource consumption of various applications.

Anyway, there is one thing that has to be considered, which I already quickly wrote about: networking. The simplest way to set up an LXC host, if your network is a private one, with a DHCP server or something along those lines, is to create one single bridge between your public network interface and the host side of the virtual Ethernet pairs. This has one unfortunate side effect: to make it work, it puts the network interface in promiscuous mode, which means that it receives all the packets directed to any other interface, which slows it down quite a bit.

So how do you solve the issue? Well, I’m honestly not sure whether macvlan improves the situation; I’m afraid not. What I decided for Excelsior, since it is not on a private network, was to set up an internal bridge, and have static IP addresses set to internal IPs. When I need to jump into one of the containers, I simply use the main public IP as an SSH jumphost and then connect to the correct address. I described the setup before, although I have since made a further change so that I don’t have to bother with the private IP addresses in the configuration file: I use the public IPv6 AAAA records for the containers, which simply resolve as usual once inside my jumphost.

Of course with the exception of jumphosts, that kind of settings, which involve using NAT on iptables, has no way to receive connections from the outside.

So what other options are there? One thing I’ve been thinking about was to use a level-3 managed switch and set it to route a subnet to the LXC host, but that wouldn’t fly too much. So in the end the question would be “what is it on the containers that I need access to from the outside?” and the answer is simply “the websites”. The containers provide a number of services, but only the websites are mapped to the outside. So, do I need IPs that are even partially public? Not really.

The solution I’m planning right now is that I’ll set up a box with either an Apache reverse-proxy or some other reverse proxy (depending on how much we want to handle on the proxy itself), and have that contact the internal containers, the same way it would be if you had one reverse proxy on the Internet, and the servers on the internal network.

I guess at some point I should overhaul the LXC wiki page for what concerns networking; I already spent some time to remove some duplicated content and actually sync it with what’s going on on the ebuild…

My problem with networking

After my two-parter on networking, IPv6 and wireless, I got a few questions on why I don’t just use a cable connection rather than dealing with wireless bridges. The answer is, unfortunately, that I don’t have a clean way to run a cable from the point where my ADSL is to where my office is, on the floor above.

This is mostly due to bad wiring in the house: too little space to get cables through, and too many cables already in there. One of the projects we have going on in the house now (we’ve been working on a relatively long list of chores that have to be done, since neither my mother nor I foresee leaving this house soon) is to rewire the burglar alarm system, in which case I should get more space for my cables: modern burglar alarms do not require the equivalent of four Ethernet cables running throughout the house.

Unfortunately that is not going to be the end of the trouble. While I might be able to get the one cable running from my office to the basement (where the cable distribution ties up) and from there to the hallway (where the ADSL is), I’m not sure of how many metres of cables that would be. When I wired with cat5e cable between my office and bedroom (for the AppleTV to stream cleanly), I already had to sacrifice Gigabit speed. And I’m not even sure if passing the cable through there will allow the signal to pass cleanly, as it’ll be running together with the mains’ wires — the house is almost thirty years old, I don’t have a chance to get separate connection for the data cable and the power; I’m lucky enough that the satellite cable fits. And I should shorten that.

To be honest, I knew a way around my house if I wanted to pass a cable to reach here already. But the problem with that is that it would require me to go the widest route possible: while my office is stacked on top of the hallway (without a direct connection, that would have been too easy), to get from one to the other, without the alarm rewiring, I would have to get to the opposite side of the house, bring the cable upstairs and then back, using a mixture of passageways designed for telephone, power and aerial wiring; and crawling outside the wall for a few metres as well.

The problem with that solution, beside the huge amount of time that it would require me to invest in it, is that the total cable length is almost certainly over a hundred metres, which is the official physical limit of cat5e Ethernet cables. Of course many people would insist on telling me that “it’s okay, there’s a high chance it would still work”… sure, and what if it doesn’t? I mean, I have to actually make a hole in the wall at one place, then spend more than a day (I’m sure I wouldn’t be able to do this in just a day, having already had to deal with my wiring before), with the risk of not getting a clear enough signal for the connection to be established. No thanks.

I also considered the option of going fibre optic. I have no clue about the cabling itself, and I know it requires even more specific tools than the RJ45 plugs to be wired, but I have looked at the prices of the hardware capable of converting the signal between fibre and good old RJ45 cabling… and it’s way out of my range.

Anyway, back on topic of the current plan for getting the cable running. As I said the current “cable hub” is in the basement, which is mostly used as a storage room for my mother’s stuff. She’s also trying to clean that up, so in a (realistically, remote) future I might actually move most of my hardware down there rather than in the office — namely Yamato, the router itself (forwarding the ADSL connection rather than the whole network) and Archer, the NAS. Our basement is not prone to floods, and is generally cool in the summer, definitely cooler than my office is. Unfortunately for that to work out, I’ll probably need a real-life rack, and rackmount chassis, neither of which is really cheap.

Unfortunately, with that being, as I said, in the future, if I were to pass the cable from there next month, and the signal weren’t strong enough, the only option I’d have would be to add a repeater. Adding a repeater there, though, is troublesome. As I said in the other posts, and before as well, my area is plagued with a very bad power supply situation. To the point that I have four UPS units in the house, for a total of 3750 VA (which is, technically, probably more than the power provided by the supplier). I don’t really like the idea of having to make room for yet another UPS unit just for a repeater; even less so considering that the cables would end up being over my head, on the stairs’ passage (yes, it is a stupid position to add a control panel in the first place), and while most repeaters seem to be wall-mountable, UPS units are a different story.

So the only solution I can think of for such a situation would be to add a PoE repeater there, if needed, and then relay its power through a switch, either in my office (unlikely) or in the hallway near the router (most likely), behind the UPS. Once again here, the factor is the cost.

Honestly, even though I decided not to get an office after seeing costs jumping higher and higher – having an office would increase my deductibles of course, but between renting the office, daily transportation, twice the power bill, and so on so forth, it’s not the taxes that worry me – I wonder if it is really as cheap as I prospected it to be, to keep working at home.

Sigh. I guess it’s more paid work, less free time next year as well.

The problem with wireless bridging

I want to pick up where I left with my previous post and expand a bit upon the issue with wireless bridging, and why “just use dd-wrt” is not an answer to the problem.

As I said, I learnt a number of issues the hard way, by trying to get them to work… and failing. In particular, there is a limitation in 802.11 that even the dd-wrt documentation notes:

Client Bridge mode will only recognize one mac address on the bridged setup, due a limitation in the 802.11 protocol, even if there are multiple clients (with multiple mac addresses) connected to the client router. If you want to bridge a full LAN you must use WDS. The problem is that the 802.11 protocol just supports one MAC address, but in a LAN there is the possibility for more than one MAC address. It may cause ARP table problems, if you connect more than one computer on the far end of a Client Bridge mode setup. You will not be able to, for example, block mac addresses of client of the bridged routers or set access restrictions based on mac addresses in the bridged router

This is actually putting it more brightly than it is. Anything relying on proper MAC address communication will fail. Indeed, if you wish to use a single DHCP server, your only choice is to run dhrelay on the bridge itself. And that’s not a good idea.

Due to the fact that 802.11 decides where to send the packets depending on the mac address, you only have two choices for this to work: you either go with what OpenRG/Linksys do, and translate addresses at second level (with probably a dhrelay to make sure that dhcp still works), or you do what D-Link did with the DAP-1160 and create a custom work mode, which I guess encapsulates the packets to preserve their addresses (I could probably have tried AP+Bridge mode and sniffed the traffic to find that out but I didn’t care), probably something along the lines of a generic Ethernet-in-Ethernet encapsulation.
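To give an idea of what the first option involves, here is a toy Python model of a client bridge that rewrites MAC addresses so that the access point only ever sees one. This is purely my own illustration of the general idea; I have no knowledge of how the OpenRG/Linksys firmware actually implements it, and every name in the sketch is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str


class TranslatingBridge:
    """Toy client bridge that shows a single MAC address to the AP."""

    def __init__(self, own_mac: str) -> None:
        self.own_mac = own_mac
        self.ip_to_mac: dict[str, str] = {}

    def to_wireless(self, frame: Frame) -> Frame:
        # Remember which wired client owns this IP, then hide its MAC.
        self.ip_to_mac[frame.src_ip] = frame.src_mac
        return Frame(self.own_mac, frame.dst_mac, frame.src_ip, frame.dst_ip)

    def to_wired(self, frame: Frame) -> Frame:
        # Replies arrive addressed to the bridge; restore the client MAC.
        real_mac = self.ip_to_mac.get(frame.dst_ip)
        if real_mac is None:
            raise LookupError(f"no wired client known for {frame.dst_ip}")
        return Frame(frame.src_mac, real_mac, frame.src_ip, frame.dst_ip)


if __name__ == "__main__":
    bridge = TranslatingBridge("02:00:00:00:00:01")
    out = bridge.to_wireless(
        Frame("02:00:00:00:00:aa", "02:00:00:00:00:ff", "192.0.2.10", "192.0.2.1")
    )
    print(out)
```

Even in this toy version the table is keyed on the IP address, so a client that does not have an address yet cannot be found on the way back, which hints at why DHCP needs special treatment in this kind of setup.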

Interestingly enough, there is an RFC describing Ethernet-in-IP encapsulation, and there is a patch for Linux 2.6.10 that implements it. It would be quite an interesting approach to have the router listen to an EtherIP device, and have another EtherIP device here to encapsulate the packets. Unfortunately this would still require a very shallow router up here, which is what I’m trying to avoid altogether. And as it happens, it looks like the patch never made it into the kernel, and the author’s website seems to be gone as well (the domain does not have an answering webserver, even though the whois data confirms its registration; I should check whether the email address is still valid, since there is at least a valid MX record and an answering mail server).

I guess I can add this to the long list of projects I’ll work on once I’ve made enough money not to have to work twelve hours a day to pay the bills…
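For reference, the encapsulation itself is tiny. The sketch below assumes the RFC in question is RFC 3378 (EtherIP), where, as I read it, the payload of an IP packet with protocol number 97 is a 16-bit header (version 3 in the top four bits, the rest reserved and zero) followed by the untouched Ethernet frame. The function names are mine, and the outer IP header is left to whatever sends the packet.

```python
import struct

ETHERIP_PROTO = 97                          # IP protocol number used by EtherIP
ETHERIP_HEADER = struct.pack("!H", 0x3000)  # version 3, twelve reserved zero bits
ETHERNET_HEADER_LEN = 14                    # dst MAC + src MAC + EtherType


def encapsulate(frame: bytes) -> bytes:
    """Prefix a complete Ethernet frame with the EtherIP header."""
    if len(frame) < ETHERNET_HEADER_LEN:
        raise ValueError("not a complete Ethernet frame")
    return ETHERIP_HEADER + frame


def decapsulate(payload: bytes) -> bytes:
    """Validate the EtherIP header and return the inner Ethernet frame."""
    if len(payload) < 2:
        raise ValueError("payload too short for an EtherIP header")
    (header,) = struct.unpack("!H", payload[:2])
    if header >> 12 != 3 or header & 0x0FFF:
        raise ValueError("unsupported EtherIP header")
    return payload[2:]
```

Because the inner frame keeps both of its MAC addresses, the wireless hop in the middle only ever sees IP traffic between the two tunnel endpoints.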

IPv6 and networking pain

I’m honestly reconsidering my scepticism towards curses, mostly because the past two months don’t make much sense without taking them into consideration. I’ve had a long list of hardware, network and power issues, and jobs ended up being bottled up because of that.

Not the latest, and not the worst (but up there near the top of the list), of said issues happened with the DAP-1160 bridges/access points I used to connect the network segment in my office to the router downstairs. The problem there is that, for a long series of reasons, I can’t reach the router with either an Ethernet cable or a powerline adapter, and so I decided to use gigabit within the office, and jump to the router over wireless.

I’ve had those two bridges for about two years now, and they worked mostly well. Mostly, not perfectly. In the past month, though, they started acting up, requiring a reboot far too often… the problem is likely tied to them running continuously for a few months and then being turned on and off repeatedly due to the power company blacking me out (14 hours in 14 days: two lumps of 5 hours, plus a number of on-and-off spikes).

My original implementation for getting this setup to work involved an OpenWRT-powered router, and subnetting the office, but the subnetting easily became a bother, as it added one more router for me to manage, and I didn’t intend to proceed that way. I then replaced said router with Enterprise/Yamato with a WLAN card, but that had its share of troubles as well. In the end I went with the two D-Link devices that created a seamless Ethernet bridge between the two segments, yay!

And now they started failing, so I had to replace them. And since I was out to replace them I wanted to use 11n hardware to run on the 5GHz band rather than 2.4, to avoid most of the interference otherwise present. So after a bit of googling around I ended up buying two Cisco Linksys devices, a WAP610N access point and a WET610N bridge. They are designed to work together, and thus they should have been perfect. Should being the keyword.

What happens with these? Well, the throughput is nice indeed, it’s much faster to connect to the router now. But at the same time… I lost all IPv6 capabilities.

Now, I learnt the hard way at the time that the 802.11 specifications do not include provisions for transparent wireless-to-Ethernet bridges, and all implementations of those are custom to each manufacturer. I thought Linksys had solved that at the same level as well, but it turns out it didn’t. It actually did something a tad smarter, for the kind of usage they foresaw for their hardware. They parse the third-level packets; in particular, it seems they parse the ARP packets to tell the access point which addresses to send their way… a sort of Network Address Translation at the second level.

Unfortunately, they do not do the same for IPv6 NDP, so IPv6 is simply broken here. To be honest, IPv6 partly works in the network segment, because the router advertisement is sent as a broadcast (strictly speaking, to a multicast address) and is thus presumably received, but all the unicast IPv6 traffic from the router to the bridge (not the other way around, by the way) is dropped.

I’m not sure if I should just live with it or if I should find a more proper replacement for the 1160 devices. If somebody knows of hardware capable of doing such a transparent bridge between wireless and Ethernet on the 5GHz band, pointers would definitely be welcome. In that case, the Linksys bridge would be relegated to my bedroom (where it would connect just the consoles and TV, none of which is IPv6 compatible anyway), and the access point would replace the current 11g public network I use for the devices outside of my office.

In the meantime I have more issues to solve. Sigh.
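My guess at what the Linksys bridge does can be modelled in a few lines. This is only a sketch of the behaviour described above, not the firmware’s actual logic: the class, its methods and the addresses in the usage note are all invented for illustration.

```python
from typing import Dict, Optional


class Layer2Nat:
    """Toy model of a client bridge that hides wired clients behind its own MAC."""

    def __init__(self, bridge_mac: str) -> None:
        self.bridge_mac = bridge_mac
        self.ip_to_mac: Dict[str, str] = {}  # learnt only from ARP traffic

    def outbound(self, src_mac: str, src_ip: str, is_arp: bool) -> str:
        """Frame going from a wired client towards the access point.

        The mapping is learnt only when the frame is ARP; whatever the
        frame is, it leaves with the bridge's own MAC as its source.
        """
        if is_arp:
            self.ip_to_mac[src_ip] = src_mac
        return self.bridge_mac

    def inbound(self, dst_ip: str) -> Optional[str]:
        """Frame arriving from the access point, addressed to the bridge MAC.

        Returns the real client MAC to restore, or None when the address
        was never learnt, in which case the frame can only be dropped.
        """
        return self.ip_to_mac.get(dst_ip)
```

With this model, a client that announced 172.28.1.10 through ARP gets its unicast traffic back, while the same client’s IPv6 address, announced through NDP (which is ICMPv6, not ARP), is never learnt, so `inbound` returns `None` for it. That would match the symptom of router-to-bridge unicast IPv6 being dropped while traffic in the other direction gets through.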

LXC and why it’s not prime-time yet

Lately I got a number of new requests about the status of LXC (Linux Containers) support in Gentoo; I guess this is natural given that I have blogged a bit about it and my own tinderbox system relies on it heavily to avoid polluting my main workstation’s processes with the services used by the compile – and especially test – phases. Since a new version was released on Sunday, I guess I should write again on the subject.

I said before that in my opinion LXC is not ready yet for production use, and I maintain that opinion today. I would also rephrase it as something that might make it easier to understand what I think: I would never trust root on a container to somebody I wouldn’t trust with root on the host. While it helps a great deal to reduce the nasty effects of an application mistakenly going rogue, it neither removes that possibility entirely, nor does it strengthen security against intentional meddling with the system. Not alone at least. Not as it is.

The first problem is something I have already complained about: LXC shares the same kernel, obviously and by design; this is good because you don’t have to replicate drivers, resources, additional layers for filesystem and all the stuff, so you get real native performance out of it; on the other hand, this also means that wherever the kernel does not provide namespace/cgroup isolation, you cannot make distinct changes between the host system and the container. For instance, the kernel log buffer is still shared between the two, which causes no small trouble when running a logger from within the container (you can do so, but you have to remember to stop it from accessing the kernel’s log). You also can’t set different sysctl values for the host and the container, for instance to disable the brk() randomizer that causes trouble with a few LISP implementations.

But there are more notes that make the whole situation pretty interesting. For instance, with the latest release (0.7.0), networking seems to have slowed down slightly; I’m not sure what the problem is exactly, but for some reason it takes quite a bit longer to connect to the container than it used to; nothing major, so I don’t have to pay excessive attention to it. On the other hand, I took the chance to try again to make it work with the macvlan network rather than the virtual Ethernet network, this time even googling around to find a solution to my problem.

Now, Virtual Ethernet (veth) is not too bad; it creates a peer-to-peer connection between the host and the container, which you can then manage as you see fit: you can set up your system as a router, or use Linux’s ability to work as a bridge to join the container’s network with your base network. I usually do the latter, since it reduces the number of hops I need to add to reach the Internet. Of course, while all the management is done in-kernel, I guess there are a lot of internal hops that have to be passed, and for a moment I thought that might have been slowing down the connection. Given that the tinderbox accesses the network quite a bit (I use SSH to control it), I thought macvlan would be simpler: in that case, the kernel directs the packets coming toward a specific MAC address through the virtual connection of the container.

But the way LXC does it, it’s one-way. By default, actually, each macvlan interface you create isolates the various containers from one another as well; you can change the mode to “bridge”, in which case the containers can chat with one another, but even then, the containers are isolated from the host. I guess the problem is that when they send packets, these get sent out from the interface they are bound to, but the kernel will ignore them if they are directed back in. No, there is currently no way to deal with that, that I know of.

Actually upstream has stated that there is no way to deal with that right now at all. Sigh.

An additional problem with LXC is that even when you do blacklist all the devices so that the container’s users don’t have access to the actual underlying hardware, it can mess up your host system quite a bit. For instance, if you were to start and stop the nfs init script inside the container.. you’d be disabling the host’s NFS server.

And yes, I know I have promised multiple times to add an init script to the ebuild; I’ll try to update it soonish.

The routed network broadcast problem

Fragment of my topology

You might remember my network diagram, which showed you the absurd setup I have at home to connect all the rooms where computers are located. Since then, something was simplified, and indeed now the network section between my bedroom and the office runs over the usual Ethernet (should be Gigabit, but something doesn’t look right) cable. This should also reduce the power consumption at home, since the old Powerline adaptors were still an extra powered appliance; the main reason why I replaced them, though, was that the green LEDs definitely bothered me while trying to sleep, and at the same time, speed was quite an issue when streaming some files.

The result is that only two media are used here: WiFi and cabled Ethernet; unfortunately, I still lack a way to connect Yamato and Deep Space 9 (the router) via Ethernet directly, so they are connected via a standard infrastructure WiFi. This is not really great, in the sense that the connection between them is not very stable (I use an ath9k card on Yamato, with a 2.6.32-rc7 kernel), and when I’m downloading stuff with bittorrent or similar, I need to restart the network connections about once every five minutes to keep it going properly, which you can guess is not that fun.

Now unfortunately there is one problem here, which I ignored for quite a while but I cannot ignore any longer (because I finally got the table I needed to play with Unreal Tournament 3 with my PlayStation 3!): the cabled ethernet segments fail to get UPnP support.

The whole network lives inside a single Class B IP range (172.28.0.0/16), split into four main subnetworks (direct and behind Yamato, known and unknown computers; they have different filters on the firewall) by the DHCP server running on Deep Space 9 (for simplicity, Yamato is the only box in the network, besides the router/DHCP server itself, to have a static IP address, in an otherwise unused subnetwork range together with Deep Space 9). Yamato has two interfaces enabled: wlan0, which connects to the AP and then to DS9, and br0, which is the bridge of the remaining interfaces (eth0 and eth1 for the cabled network segments, the latter of which I only bring up when I need more ports for work devices, and vde0 for the virtual networks). Here starts the problem: while a WiFi network is usually akin to a switched network, and of course my cabled segment is also switched, the two together are not switched but routed, by Yamato, which is a second router in the network.

Of course I built DS9 to reduce the load on Yamato (even though my original plan involved linking the cabled segment to it through another, very long, cable), so the services are currently mostly running on DS9 rather than Yamato: DHCP server, DNS server, UPnP server and so on. The problem is that almost all the “zeroconf” kind of services, which include not only Apple’s Bonjour protocol, but UPnP and DHCP as well, use UDP and a broadcast (or multicast) address to look for the servers. And UDP broadcast, like multicast on a router that is not explicitly set up for it, only works within switched networks, not routed ones.

The obvious solution in these cases, which is more or less the only solution you’ll ever read proposed when people ask about broadcast repeaters, is to use bridging instead of routing to merge the two networks together; a switch is, after all, just a multi-port bridge, so the result is again a switched network. Unfortunately this brings two issues with it: the first is that you effectively lose the boundary between the two networks, and even when that boundary is meant to be very transparent, as I’d like it to be, the filtering can still be useful for some things; the second is that bridging WLAN interfaces is complex and pretty much suboptimal.
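To put numbers on it, here is a small sketch using Python’s ipaddress module. The split into four equal /18 slices is purely an assumption for the sake of the example (the actual prefixes are not what matters here), and the function name is made up.

```python
import ipaddress

HOME = ipaddress.ip_network("172.28.0.0/16")

# Assumption for illustration only: four equal /18 subnets.
SUBNETS = list(HOME.subnets(prefixlen_diff=2))


def broadcast_for(host: str) -> str:
    """Return the directed broadcast address of the subnet a host lives in."""
    addr = ipaddress.ip_address(host)
    for net in SUBNETS:
        if addr in net:
            return str(net.broadcast_address)
    raise ValueError(f"{host} is not inside {HOME}")
```

Under that assumption, a host at 172.28.0.10 broadcasts to 172.28.63.255 while one at 172.28.64.10 broadcasts to 172.28.127.255, so a discovery packet sent on one side of a router is simply not addressed to the hosts on the other side, and the limited broadcast address 255.255.255.255 is never forwarded by a router at all.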

The problem with bridging WLAN is that putting the network card in promiscuous mode is not enough: the access point by default only sends over the air the PDUs whose destination is an associated mac address. And telling the access point to send all the PDUs might not be good either; while in my setup the problem is relatively small (the only two devices connected via Ethernet to DS9 are the AP and the Siemens VoIP phone — the Linux bridge software will still understand to only send the VoIP phone data to the connected network card and the rest to the AP), it doesn’t look like a very good long-term solution.

To solve part of the problem, at least the most common part of it, both ISC DHCP and Avahi provide support for transparently joining two routed networks that would otherwise be isolated: dhcrelay and Avahi’s reflector. The former is not just a simple repeater of DHCP requests: it also adds a “circuit-id” to the requests, so that requests coming from behind it are tagged and can be treated differently (this is how I handle the clients behind Yamato differently; of course those have to get addresses in a subnet that is routed through Yamato). The latter just picks up the service broadcasts and copies them to the various interfaces it listens on… but neither is perfect.

With dhcrelay the problem is deep inside the way it has been implemented: it has to listen on both the interface the requests will come from, and the one the responses come from… and it doesn’t discriminate between them. In the case of Yamato this means that I have to listen on both br0 and wlan0, but then the requests sent by the clients on WiFi will still reach the relay and be sent back to DS9 through it; for this reason, since the “circuit-id” contains the interface the request came from, I check for that id to be br0 instead of just checking whether it exists, before deciding how to divide the clients. The alternative is using iptables to filter the requests from the wlan0 interface, but let’s leave that for a moment.
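The check described above can be sketched in Python. To be clear, this is not how I actually do it (the real decision is taken inside the DHCP server configuration); it only shows the logic. The sub-option layout is the relay agent information option (option 82) as I understand it: a sequence of code, length, value triplets, with code 1 being the circuit-id. The function names are hypothetical.

```python
from typing import Dict, Optional

CIRCUIT_ID = 1  # sub-option code of the agent circuit-id
REMOTE_ID = 2   # sub-option code of the agent remote-id (unused here)


def parse_relay_agent_info(data: bytes) -> Dict[int, bytes]:
    """Split the payload of option 82 into its sub-options."""
    subopts: Dict[int, bytes] = {}
    i = 0
    while i + 2 <= len(data):
        code, length = data[i], data[i + 1]
        value = data[i + 2:i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated sub-option")
        subopts[code] = value
        i += 2 + length
    if i != len(data):
        raise ValueError("trailing garbage after the last sub-option")
    return subopts


def is_behind_yamato(option82: Optional[bytes]) -> bool:
    """True only for requests the relay picked up on br0.

    Checking that a circuit-id merely exists is not enough: requests
    relayed from wlan0 carry one as well.
    """
    if option82 is None:
        return False
    return parse_relay_agent_info(option82).get(CIRCUIT_ID) == b"br0"
```

A request tagged with `wlan0` and a request with no option 82 at all both end up in the “direct” group, which is the behaviour I want from the division.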

The problem with Avahi seems to be more of a bug, or rather an untested corner case; I have found no way to stop Linux from issuing link-local IPv6 addresses to the interfaces that are “up”; this unfortunately means that eth0, vde0 and br0 all have their own IPv6 address… so the broadcasts coming from wlan0 are reflected on all three of them, and all the clients connected to the cabled (or virtual) segment will receive the broadcast more than once. This wouldn’t be much of an issue if Apple’s computers didn’t decide to rename themselves to “Whatever (2)” when they feel somebody else is using their hostname in the network. I should speak with Lennart about it but I haven’t had time to deal with that just yet.

There remains a third protocol that I have found no solution for yet: UPnP. With UPnP the problem is relatively easy: SSDP uses UDP multicast (to the group 239.255.255.250) on port 1900 to find the router, before talking directly with it, so the only thing that I’d need is a repeater on that particular port. The best solution for me would have been using iptables directly, but since that’s not implemented as far as I can see, I guess I’ll end up either writing my own UDP repeater, or looking for something that works and is properly written. If somebody has a clue about that, I’d be happy to hear the solutions.

Interestingly enough, during my analysis UPnP proved to be the only protocol I’m interested in that could actually just be re-broadcast with a generic repeater; for DHCP, I need to discern proxied requests to assign them to properly routed subnetworks; for Bonjour, the port wouldn’t be free for a repeater since Avahi itself would be using it to begin with.

So, bottom line, I have three needs that somebody might want to help me with: get a better dhcrelay, since the current implementation sucks in more ways than a few, starting with not being able to specify which is the input and which the output interface, or the lack of a configurable circuit-id string; fix the Avahi IPv6 reflector over a bridged network, although I have no idea how (alternative: find a way to tell Linux/OpenRC not to issue a link-local IPv6 address to the interfaces); write a generic UDP broadcast repeater so that UPnP can work with a routed network. The last one is what I’ll probably work on tomorrow, so I can get the PS3 to pass through the ports with DS9.
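As a starting point for that last item, here is a rough sketch of what such a repeater could look like in Python, using only the standard socket and ipaddress modules. Everything in it is an assumption rather than a tested tool: the function names, the idea of describing each segment by its network prefix, and the loop-avoidance rule are mine. It also only handles plain broadcasts; SSDP discovery is normally addressed to a multicast group, so a real repeater would additionally have to join that group on each interface, which is left out here.

```python
import ipaddress
import socket
from typing import List

SSDP_PORT = 1900


def relay_targets(src_ip: str, segments: List[str], own_ips: List[str]) -> List[str]:
    """Broadcast addresses of every known segment except the sender's own.

    Datagrams that come from one of our own addresses are ignored, as
    they are our retransmissions and would otherwise loop forever.
    """
    if src_ip in own_ips:
        return []
    src = ipaddress.ip_address(src_ip)
    nets = [ipaddress.ip_network(s) for s in segments]
    if not any(src in net for net in nets):
        return []  # sender is not on a segment we know about
    return [str(net.broadcast_address) for net in nets if src not in net]


def run_repeater(segments: List[str], own_ips: List[str], port: int = SSDP_PORT) -> None:
    """Receive datagrams on the port and repeat them to the other segments."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.bind(("", port))
    while True:
        data, (src_ip, _src_port) = sock.recvfrom(65535)
        for target in relay_targets(src_ip, segments, own_ips):
            sock.sendto(data, (target, port))
```

One known weakness of this approach is that the repeated datagram leaves with the repeater’s own source address, so any reply goes back to the repeater rather than to the original client; preserving the source would need raw sockets or some proxying of the replies as well.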