Random quality

RFC 1149.5 specifies 4 as the standard IEEE-vetted random number.

xkcd’s Random Number comic © Randall Munroe

We all know that random numbers might not be very random unless you are very careful. Indeed, as the (now old) Debian OpenSSL debacle showed, a not-random-enough random number generator can be a huge breach in your defences. The other problem is that if you want really random numbers you need a big pool of entropy; otherwise, code requiring a huge chunk of random bytes will stall until enough data is available.

Luckily there are a number of ways to deal with this: one is to use the EntropyKey, while others involve either internal sources of entropy (which is what timer_entropyd and haveged do) or external ones (audio_entropyd, but a number of custom circuits and software packages exist as well). These fill the entropy pool, hopefully at a higher rate than it is depleted, providing random data that is still of high quality (there are other options such as prngd, but as far as I can tell those are slightly worse in terms of quality).

So, the other day I was speaking with Jaervosz, who’s also an EntropyKey user, and we were reflecting on whether, if there is not enough entropy during crypto operations, the process would stall or cause the generation to be less secure. In most cases, this shouldn’t be a problem: any half-decent crypto software will make sure not to process pseudo-random numbers (this is why OpenSSL key generation tells you to move your mouse or something).

What we ended up wondering about was how much software uses /dev/urandom (which re-uses the entropy when it’s starving) rather than /dev/random (which blocks on entropy starvation). Turns out quite a lot does. For instance on my systems, I know that Samba uses /dev/urandom, and so does netatalk, neither of which makes me very happy.

A few ebuilds allow you to choose which one you want to use through the (enabled-by-default) urandom USE flag… but the ones I noted above aren’t among those. I suppose one thing we could do is go over a few ebuilds and see if we can make it configurable which one to use. For those of us who make sure to have a stable source of entropy, this change should be a very good way to be safe.

Are you wondering if any of your mission-critical services are using /dev/urandom? Try this:

# fuser -v /dev/{,u}random
                     USER        PID ACCESS COMMAND
/dev/random:         root      12527 F.... ekey-egd-linux
/dev/urandom:        root      10129 f.... smbd
                     root      10141 f.... smbd
                     root      10166 f.... afpd
                     flame     12356 f.... afpd
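
Where fuser is not installed, roughly the same information can be gathered by walking the file-descriptor symlinks under /proc. This is a minimal sketch rather than a replacement for fuser: the function name holders_of and its optional second argument (an alternative /proc root, there only so the function can be tried on a fake directory tree) are made up for illustration, and without root privileges only your own processes will show up.

```sh
# List the PIDs that hold a given path open, by reading /proc/<pid>/fd symlinks.
# $1: path to look for (e.g. /dev/urandom)
# $2: proc root, defaults to /proc (hypothetical parameter, for testing only)
holders_of() {
    target=$1
    proc_root=${2:-/proc}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Skip unmatched globs and anything that is not a symlink.
        [ -L "$fd" ] || continue
        if [ "$(readlink "$fd")" = "$target" ]; then
            pid=${fd#"$proc_root"/}   # strip the proc root prefix
            pid=${pid%%/*}            # keep only the PID component
            echo "$pid"
        fi
    done | sort -un                   # one line per PID, numerically sorted
}
```

Calling holders_of /dev/urandom as root prints one PID per line, which can then be fed to ps to get the command names.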

Also, if you want to make sure that any given service is started only after the entropy services, you can simply make it depend on the virtual service entropy (provided by haveged, or ekeyd if set to kernel output, or ekey-egd-linux if set to EGD output). A quick way to do so without having to edit the init script yourself, is to add the following line to /etc/conf.d/$SERVICENAME:

rc_after="entropy"

The entropy factor

I have been dedicating my last few posts to the EntropyKey limitations and to shooting down Entropy Broker so I guess it is time for me to bring good news to my users, or at least some more usable news.

Let’s start with some positive news about EntropyKey; I have already suggested in the latter post that the problem I was getting with the key not working after a reboot could have been linked to the use of the userland USB access method. I can now confirm this: my router reboots fine with the CDC-based access. Unfortunately this brings up a different point: Linux’s CDC-ACM driver is far from perfect; as I pointed out, Simtec has a compatibility table for the kernel versions that are known to be defective. And even the latest release (2.6.38) is only partially fine: there are fixes in 2.6.38.2 that relate to possible memory corruption and null pointer dereferences in the drivers. But it definitely makes the EntropyKey much more useful than my post there might have suggested.

I also wanted to explain a bit better what I’m trying to look at with entropy, process creation, and load. One of the two EntropyKey devices I own is connected to Yamato, which is the same box that runs the tinderbox. The entropy on Yamato tends to be quite low in general, and when I was using it as my main workstation, it felt sluggish, which is why I got the Key. I started monitoring it lately because I was noticing the Key not behaving properly, and besides the already noted issues with the userland access method, I have noticed a different pattern: when the load (and the number of running processes) spikes, the entropy falls for a while, without being replenished in time. I’m still not sure if it’s simply the process creation eating away the entropy (for the ASLR mostly, since Yamato, unlike my router, doesn’t use SSP and thus doesn’t require canaries), or if the build process for ChromiumOS (which is what triggers those spikes) eats up the entropy by itself.

Thanks to upstream, as I said before, I know that the key gathers somewhere around 4KiB of entropy per second, which means that if something drains more than 4KiB/second, /dev/urandom will start re-using entropy. I have discarded the notion that it might have been a priority issue, since I made sure to decrease the niceness of the ekeyd daemon and it didn’t make any significant difference; on the other hand I’m now experimenting with using the same interface designed for vservers that Jaervosz helped get into Gentoo. This method splits the process reading data from the key from the one sending it to the kernel. A quick check shows me that it does have better results, which would suggest that there are indeed performance-related problems, more than a high level of crypto consumption, at least on basic usage. I have yet to run a full rebuild of ChromiumOS in this configuration, since it’s past 6am and I’ll probably not sleep today.

Actually, moving to the split EGD server/client interface should hopefully also help me by providing more entropy for Raven (my frontend system) without using timer_entropyd, which has a nasty, completely load-linked dip in entropy generated when the load spikes. As Alex points out in the comments of the latter post, the quality of entropy also matters.

In parallel with the entropy issues in Yamato, I have entropy issues with my two vservers, one of which runs this blog, the other runs xine’s bugzilla — and both of which seem to have huge trouble with missing entropy, although in slightly different ways. The former, Vanguard, never reached 1KiB of entropy during the last four days when I monitored it, and its daily range is usually below 300 bytes. The latter, Midas, has a spike in entropy availability during the night, when maintenance tasks run (mostly Bugzilla’s); I’m not sure if it’s because Apache then throttles HTTPS access or something else, but otherwise it also has the same range of entropy.

Since both Vanguard and Midas have an average amount of accesses over HTTPS, it made sense to me to consider this lack of entropy as a problem. Unfortunately running timer_entropyd there is not an option: the virtualisation software used does not let me inject further entropy into the system, which is quite a pain in the lower backside. When I first inspected Entropy Broker, I was thrilled to read that OpenSSL supports accessing EGD sockets directly rather than through the kernel-exposed /dev/urandom device, since the EGD sockets are handled entirely in userland, and thus wouldn’t hit the limitation of the virtualised environment. With Entropy Broker being… broken, I was still able to track down an alternative to that: PRNGD, a pseudo-random number generation daemon that works entirely in user-space by using a number of software sources to provide fast and decent random numbers.

OpenSSL, OpenSSH and other software explicitly provide support for accessing EGD sockets, and explicitly refer to PRNGD as an alternative to the OS-provided interfaces. Unfortunately, neither would work out of the box for Gentoo: the former will look into EGD sockets only if the /dev/urandom and /dev/random devices don’t have enough data to provide as many random bytes as requested (which is never going to happen with /dev/urandom), and has to have its sources edited to actually have a chance to use EGD; even so it would (rightly) privilege the /dev/random device. OpenSSH instead requires an extra parameter to be passed to the ./configure call. For those interested, the ekey-egd-linux package can inject this data back into the kernel, like Entropy Broker would have done if it worked.

A tentative ebuild for prngd is in my overlay but I’m not sure if and when I’ll add it to the main tree. First of all, I haven’t compared the entropy’s quality yet, and it might well be that the PRNGD method is going to give very bad results, but I’m also quite concerned about the sources’ license: it seems to be a mess of as-is, MIT and other licenses altogether. While Gentoo is not as strict as Debian in the license handling, it doesn’t sound like something I’d love to keep around with my name tagged on it.

Sigh, 7am, I’ll probably be awakened in two hours again. I really need to get myself some kind of vacation, somewhere I can spend time reading and/or listening to music all day long for a week. Sigh. I wish!

Entropy Broken

In my previous post I noted the presence of Entropy Broker — software designed to gather entropy from a number of sources, and then concentrate it to be sent to a number of consumers. My main interest in this was to make use of the EGD interface to feed new entropy to OpenSSL so that it wouldn’t deplete the little one available on my two vservers — where I cannot simply push it via timer_entropyd.

Unfortunately it turned out not to be as simple as building it and writing init scripts for it. The software seems abandoned, with the last release back in 2009, but most importantly it doesn’t work.

I have already hinted in the other post that the website lies about the language the software is written in. In particular, the package declares itself as being written in C, when instead it is composed of a number of C++ source files, and uses the C++ compiler to build. Looking at the code, all of it is written in C style; a quick glance at the compiled results shows that nothing is used from the STL; the only two C++ language symbols that the compiled binaries rely on are the generic new and delete operators (_Znwm and _ZdlPv in mangled form). This alone spells trouble.

After building, and setting up the basic services needed by Entropy Broker (the eb hub, server_timers – which takes the place of timer_entropyd – and, in a virtual machine, client_linux_kernel), the results aren’t promising. The entropy pool is not replenished on the virtual machine, ever; network traffic is very, very limited. More or less the same goes when using the EGD client (which is actually an EGD server acting as an eb-client). It gets even worse with server_audio, which seems to exit with an error after reading a few data points. server_video doesn’t even build, since it relies on V4L1, which has been dropped from Linux 2.6.38 and later.

Returning for a moment to my EntropyKey problems with the entropy not staying full, I’ve spoken with the EntropyKey developers briefly today. If those downward spikes happen, it usually is because something is consuming entropy faster than EntropyKey can replenish it, and since the EntropyKey can produce around 4KiB/s of entropy, that means a fast consumption of random data.

As an example, I was told that spawning a process eats 8 bytes of entropy, so something around 500 processes spawned in a second would be enough to beat the Key’s ability to replenish the entropy. This might sound like a lot but it really isn’t, especially when doing parallel builds: just think that a straight gcc invocation in Gentoo spawns about five processes (the gcc-config wrapper, gcc as the real frontend, cpp to preprocess, cc1 as the real compiler, and as, which is the assembler), and that libtool definitely calls many more for handling inputs and outputs. And remember that the tinderbox builds with make -j12 whenever it can.
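
The break-even point quoted above can be checked with shell arithmetic. This is only a back-of-the-envelope sketch: the two constants are the second-hand figures from the text, the function names are invented for the example, and the spawn rate of 600 per second is an arbitrary illustrative value, not a measurement.

```sh
# Figures quoted in the text; both are estimates, not measurements.
KEY_RATE=$((4 * 1024))  # bytes of entropy per second supplied by the Key
PER_SPAWN=8             # bytes of entropy consumed per process spawn

# Spawns per second at which consumption equals the Key's supply.
# $1: supply in bytes/s, $2: bytes consumed per spawn
break_even_spawns() {
    echo $(( $1 / $2 ))
}

# Net change of the pool in bytes per second for a given spawn rate;
# a negative result means the pool is draining.
# $1: supply in bytes/s, $2: bytes per spawn, $3: spawns per second
net_entropy_rate() {
    echo $(( $1 - $2 * $3 ))
}

break_even_spawns "$KEY_RATE" "$PER_SPAWN"      # prints 512
net_entropy_rate "$KEY_RATE" "$PER_SPAWN" 600   # prints -704
```

With 4KiB/s and 8 bytes per spawn the break-even is 512 spawns per second, which is where the "around 500" figure comes from.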

This seems to match the results I see from the Munin graphs, where entropy is depleted when load spikes, for instance when kicking off a build for ChromiumOS. But now I’m also wondering if the problem is that the ekeyd daemon gets too low a priority when trying to replenish it, which leaves it uncovered. I guess my next step is to add Munin monitoring for the ekeyd data as well as the entropy, to see if I can link the two of them. Do note that the load on Yamato can easily reach 60 and over…
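
For reference, graphing the kernel's entropy estimate takes very little: a Munin plugin is just an executable that prints graph metadata when called with config and a field value otherwise. The sketch below assumes the estimate is read from /proc/sys/kernel/random/entropy_avail; the function name entropy_plugin and the ENTROPY_FILE override are hypothetical, the latter only there so the function can be pointed at a test file. It does not cover the ekeyd side, since how to query that daemon is not described here.

```sh
# Minimal Munin-style plugin body, wrapped in a function for illustration.
# ENTROPY_FILE is a made-up override; by default the kernel's estimate is read.
entropy_plugin() {
    file=${ENTROPY_FILE:-/proc/sys/kernel/random/entropy_avail}
    case "$1" in
        config)
            # Graph metadata, requested by Munin when setting up the graph.
            echo "graph_title Available entropy"
            echo "graph_category system"
            echo "entropy.label entropy"
            ;;
        *)
            # Regular poll: a single "<field>.value <number>" line.
            echo "entropy.value $(cat "$file")"
            ;;
    esac
}
```

To use it as an actual plugin, the script would end with a call such as entropy_plugin "$@" and be linked into the node's plugin directory.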

And a final word about timer_entropyd… a quick check seems to suggest that it only works correctly on systems that are mostly idle… my frontend system seems to be just fine in such a context, and indeed it does seem to do a good job there (load during the day never reached 1). It doesn’t seem to be a good idea for Yamato with its high load.

What’s up with entropy? (lower-case)

You might remember me writing about the EntropyKey, a relatively cheap hardware device sold by Simtec that is designed to help the kernel generate random numbers, and for which I maintain the main software package (app-crypt/ekeyd).

While my personal crypto knowledge is pretty limited, I have noticed before how badly a depleted entropy pool can hit ssh sessions and other secure protocols that rely on good random numbers. This is why I got interested in the EntropyKey when I read about it, and ordered it as soon as it was possible, especially as I had decided to set up a Gentoo-based x86 router for my home/office network, which would lack most of the usual entropy-gathering sources (keyboard, mouse, disk seeks).

But that’s not been all I cared about with entropy, anyway: thanks to Pavel, Gentoo gained support for timer_entropyd as well, which uses the system timers to provide further entropy for the pool. This doesn’t require any special hardware to work (unlike the audio_entropyd and video_entropyd software), even though it does make use of slightly more CPU. This makes it a very good choice for those situations where you don’t want to care about further devices, including the EntropyKey.

Although I wonder if it wouldn’t have made more sense to have a single entropyd software and modules for ALSA, V4L and timer sources. Speaking about which, video_entropyd needs a bit of love, since it fails to build with recent kernel headers because of the V4L1 removal.

Over a year after having started using the EntropyKey, I’m pretty satisfied, although I have a couple of doubts about it, which are mostly related to the software side of the equation.

The first problem seems to be a mix of hardware and software: both on my router and on Yamato, when I reboot the system the key is not properly re-initialised, leaving it in “Unknown” state until I unplug and re-plug it; I haven’t found a way to do this entirely in software, which, I’ll be honest, sucks. Not only does it mean that I have to have physical access to the system after it reboots, but the router is supposed to work out of the box without me having to do anything, and that doesn’t seem to happen with the EntropyKey.

I can take care of that as long as I live in the same house where the keys are, but that’s not feasible for the systems that are at my customers’. And since neither my home’s gateway (for now) nor (even more) my customers’ are running VPNs or other secure connections requiring huge entropy pools, timer_entropyd seems a more friendly solution indeed — even on an Atom box the amount of CPU that it consumes is negligible. It might be more of an issue once I find a half-decent solution to keep all the backups’ content encrypted.

Another problem relates to interfacing the key with Linux itself; while the kernel’s entropy access works just fine for sending more entropy to the pool, the key needs to be driven by a userland daemon (ekeyd); that daemon can drive the key in either of two ways: one is by using the CDC-ACM interface (USB serial interface, used by old-school modems and cellphones), and the other is by relying on a separate userland component using libusb. The reason for two options to exist is not that they provide different advantages, but rather that neither always works correctly.

For instance the CDC-ACM option depends heavily on the version of the Linux kernel that one is running, as can be seen from the compatibility table that is on the Key’s website. On the other hand, the userland method seems to be failing on my system, as the userland process seems to be dying from time to time. To be fair, the userland USB method is considered a nasty workaround and is not recommended; even a quick test here graphing entropy data shows that moving from the libusb method to CDC-ACM provides a faster replenishing of the pool.

Thanks to Jeremy I’ve now started using Munin and keeping an eye on the entropy available on the boxes I admin, which includes Yamato itself, the router, the two vservers, the customer’s boxes but, most importantly here, the frontend system I’ve been using daily. The reason why I consider it important to graph the entropy on Raven is that it doesn’t have a Key, and it’s using timer_entropyd instead. On the other hand, it is also a desktop system, and thanks to the “KB SSL” extension for Chrome, most of my network connections go through HTTPS, which should be depleting the entropy quite often. Indeed, what I want to do, once I have a full day graph of Raven’s entropy without interruptions, is compare that with the sawtooth pattern that you can find on Simtec’s website.

For those wondering why I said “without interruptions”: I have no intention of letting Munin node data be available to any connection, and I’ve not yet looked into binding it up with TLS and verification of certificates, so for now the two vservers are accessed only through SSH port forwarding (and scponly). Also, it looks like Munin doesn’t support IPv6, and my long-time customer is behind a double-NAT, so the only connection point I have is creating an SSH connection and forwarding ports with that. Unfortunately, this method relies on my previously noted tricks, which leave processes running in the background when the connection to a box needs to be re-established (for instance because routing to my vservers went away or because the IPv6 address of my customer changed). When that happens, fcron considers the script still running and refuses to start a new job, since I didn’t set the exesev option to allow starting multiple parallel jobs.

And to finish it off, by looking around I found a link to Entropy Broker, a software package designed to do more or less what I said above (handle multiple sources of entropy) and distribute them to multiple clients with a secure transport, whereas the Simtec-provided ekey-egd application allows sending data over an already secured channel. I’ll probably look into it, for my local network but more importantly for the vservers, which are very low on entropy, and for which I cannot use timer_entropyd (I don’t have privileges to add data to the pool). As it happens, OpenSSL, starting with version 0.9.7, tries accessing data through the EGD protocol before asking the kernel for it, and that library is the main consumer of entropy for my servers.

Unfortunately packaging it doesn’t look too easy, and thus will probably be delayed for the next few days, especially since my day job is eating away a number of hours. Plus, the package doesn’t use a standard build system and lies about the language it is written in (“EntropyBroker is written in C”, when it’s written in C++), but most importantly, there are a number of daemon services for which init scripts need to be written…

At any rate, that’s a topic for a different post once I can get it to work.

On Virtual Entropy

Last week Jaervosz blogged about using the extras provided by the ekeyd package (which contains the EntropyKey drivers) — the good news is that as soon as I have time to take a breath I’ll be committing this egd ebuild into the tree for general availability. But before doing so, I wanted to explain a few more details about the entropy problem, since I’m pretty sure it’s not that easy to follow for most people.

First problem: what the heck is entropy, and why is it useful to have an EntropyKey? Without going into many of the details, which escape me as well, for proper cryptography you need a good source of random data; and by random data we mean data that cannot be predicted given a seed. To produce such good random data, kernels like Linux gather some basic unpredictable data and condition it to transform it into a source of good random data; that unpredictable data is, basically, entropy.

Now, Linux gathers entropy from a number of sources; these include the changes in seek time on a standard hard disk, the typing rate of the user at the keyboard, the mouse movements, and so on. For old-style servers and most desktops, even modern ones, this is quite feasible; on the other hand, for systems like embedded routers, headless servers, and so on, you start to lack many of these sources: CompactFlash cards and Solid State Disks have mostly-predictable seek time; headless systems don’t have a user typing on them; I don’t remember whether modern Linux uses the network as a source of entropy, but even if it did, it would be questionable as a source of entropy since the data is predictable if you’re sniffing it out, not random.

Taking this into consideration, you reach a point where entropy-focused attacks become interesting; especially with the modern focus on providing SSL- or TLS-protected protocols, which need good random sources, you can create a denial of service situation simply by forcing the software to quickly deplete the entropy reserve of the kernel. When the kernel does not have enough entropy, reading from /dev/random is no longer immediate and becomes blocking (do not confuse this with /dev/urandom, which is the pseudo-random number generator from the kernel, a totally different beast!). If it takes too much time to fetch the random data, requests will start timing out, and you have a DoS served.
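
The difference between the two devices is easy to observe from a shell. The helper below is a sketch under a few assumptions: the name read_hex is invented, it relies on timeout and head -c from coreutils, and the blocking behaviour it is meant to show is that of kernels of the period discussed here; more recent kernels changed how /dev/random behaves, so the read may return immediately there.

```sh
# Read N bytes from a device and print them as hex digits, giving up after
# a time limit. If the limit is hit, whatever was read so far is printed,
# which may be nothing at all.
# $1: device, $2: number of bytes, $3: limit in seconds (default 5)
read_hex() {
    dev=$1
    n=$2
    limit=${3:-5}
    timeout "$limit" head -c "$n" "$dev" | od -An -tx1 | tr -d ' \n'
}
```

Running read_hex /dev/urandom 16 returns 32 hex digits promptly, while read_hex /dev/random 16 5 on a starved system of that era would sit for the full five seconds and print fewer digits, which is exactly the stall that a network service would turn into a timeout.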

To overcome this problem, we have a number of options: audio_entropyd, video_entropyd, timer_entropyd and the EntropyKey. The first two, as the names say, gather the data from the audio and video input; they condition the sound and video read from there into a suitable source of entropy; I sincerely admit I was unable to get the first working, and the second requires a video source that is not commonly found on servers and embedded systems (and drains battery power on laptops). On the other hand, timer_entropyd does not require hardware directly; rather, it uses the information on the timers that various software adds to the system, such as timeouts, read callbacks, and so forth. Quite often, these are not really predictable, so it’s a decent source of entropy. EntropyKey is instead designed to be a hardware device whose only aim is that of providing the kernel with high-quality entropy to use for random-number generation.

Obviously, this is not the first device that is designed to do this; high-end servers and dedicated embedded systems have had support for a very long time for the so-called hardware RNGs: random number generators that are totally separated from the kernel itself, and provide it with a string of random data. What might not be known here is that, just like with the EntropyKey, there is need for a daemon that translates the data coming from the device into the form the kernel is going to serve to the users connecting to /dev/random; as it is, I cannot say there is a unified interface to do so (you can note the fact that the rngd init scripts have to go through a number of different device files to find which one applies to the current system).

At any rate, EntropyKey or dedicated RNG hardware tend to hide the problem quite nicely; with full-on Kerberos enabled on my system, I could feel the difference between having and not having the EntropyKey running. But how does this pair itself with the modern virtualisation trend? Not so well as it is, as Sune said. While the EntropyKey can provide by itself a good random pool for the host system, as it is it doesn’t cover KVM guests; for that you have to go down one of two different paths (in theory; in practice, you only have one). But why is it that I haven’t worked on this myself before? Well, since LXC shares almost all of the kernel, including the random device and the entropy state, pushing the host’s entropy means pushing the LXC guests’ entropy as well: they are the same pool.

Thankfully, EntropyKey was actually designed keeping these problems in mind; it provides a modified Entropy Gathering Daemon (ekeyd-egd) that allows sending the entropy coming from the EntropyKey to the virtual machines, to use to produce their own random numbers. What Jaervosz was sad about was the need to set up and run another daemon on each of the KVM guests; indeed there should have been a different way to solve the problem, since recent kernels support a device called virtio-rng that, as the name implies, provides a random number generator through the virtio interface that is used to reduce the abstraction between KVM virtual devices and the host’s kernel. Unfortunately, it seems like no current version of QEMU, even patched with KVM support, has a way to define a virtio-rng device, so for now it’s unusable. Further, as I said above, you still have to run rngd to fetch the data from the hardware RNG and feed it to the kernel, so we’re still back at setting up and running a new service on all the KVM guests.

At any rate, I hope I’ll be able to commit Sune’s ebuild tomorrow; I’ll also probably be committing a few more things. I also have to say that running rngtest, as he noted in his post, on my laptop running timer_entropyd definitely takes too much time to be useful, and actually scares me quite a bit because it would mean it cannot keep up with the EntropyKey at all; but using an external EntropyKey on the laptop is… not exactly handy. I wonder if Simtec is planning on an EntropyKey with SDIO (to use in the SD card readers on most laptops), ExpressCard or PCMCIA interfaces. I’d love to have the EntropyKey effect hidden within the laptop!

Random Gentoo fixes

In the past week or so, I’ve been working in parallel on two Gentoo-related projects; one is the one I wrote about, for which I need to bypass NATs and allow external IPv6 access, while the other will probably be better known in the next few days, when I deploy in Gentoo the code I’m developing.

Both jobs are something I’m paid to do, even though for the former I’m not paid nearly as much as I should be, and interestingly, both seem to require me to make some seemingly random, but to my aim quite important, changes to Gentoo. Since they are unlikely to show up on anybody’s radar as they are, but some might be of interest to other people doing similar things, I thought I could give you a couple of heads-ups on these:

  • first of all, speaking about Quagga I changed the init scripts to make use of logger if available; this should depend on the configuration files (you can decide whether to use syslog or not) but since that only works with OpenRC and I’m not sure how to test for OpenRC within an init script, I decided to simply add a use requirement on all of them; as it is, it won’t require the logger, but if it’s in the runlevel, it’ll run after it;
  • again on the topic of init scripts, I tweaked the OpenSSH sshd init script so that it doesn’t forcefully regenerate the RSA1 host key, as that’s only used by version 1 of the SSH protocol, which is not enabled by default; if you do enable it, then the RSA1 host key is regenerated, but that no longer happens if you don’t request it in sshd_config; I wanted to make it possible to disable RSA/DSA keys for SSH2 but it’s unclear how that would work; this reduces the amount of time needed to start up a one-off instance of Gentoo, such as a SysRescueCD live, or EC2, and reduces the entropy consumption at boot time;
  • tying it up, I’ve tried running audio-entropyd on the two boxes I have to deploy, hoping that it would make it possible for me to replenish the entropy without having to buy two more EntropyKeys (which would erode my already narrow profit margin); unfortunately it didn’t work at all, turning up a number of I/O errors, but the good news is that the timer_entropyd software – that Pavel is proxy-maintaining through me – works pretty nicely for this usage, and I’ve now opened a stable request for it;
  • also while trying to debug audio-entropyd, I’ve noted that strace didn’t really help when calls to ioctl() are made with ALSA-related operations; the problem is that by default, the upstream package is sent down with a prebuilt list of ioctl definitions; I’ve found a way to remove this limitation, which is now in tree as 4.5.20-r1, although I did break build with Estonian language — because the upstream code is not compatible in the first place; I’ll be fixing the Estonian problem tomorrow as soon as I have time;
  • I have started looking into the pwgen init script that is used by our live CDs and by SysRescueCD; again, there are a few useful things with that approach, but I really want to look more into it because I think there is room for improvement;
  • tomorrow and in the next days, depending on how much time I’m left with, I’ll start trying the PKCS#11 authentication again: last time I ended up with a system that let me in as root with any password; now I think I know how to solve it, but it’ll require me to rewrite pambase almost from scratch;
  • not really related to my work project, but I helped our Hardened team a bit to fix a suhosin bug and sent the patch upstream; I’ll be writing in more detail about this – again as I find time – since it actually motivates me to resume my work on Ruby-ELF to solve the problem at the root.

On a different note, I switched my router from using pdnsd to unbound; while the documentation leaves a lot to be desired, unbound seems to perform much better, and also works with both IPv4 and IPv6 socket listening, which doesn’t seem to be the case at all for pdnsd. I have also taken down some notes about using the bind ddns protocol, since the documentation robbat2 (http://robbat2.livejournal.com/) pointed me at is probably out of date now.