Everybody heard about the KISS principle I guess — the idea is the less complex a moving part is, the better. This is true in software as much as mechanics. Unix in particular, and all the Unix-like projects including GNU, also tended to follow that principle as it can be shown by the huge amount of small utilities that only do one particular text or file editing functions — that is until you introduce
Now we all know that the main sophistication that is afoot in the Linux world nowadays is Lennart’s systemd. I have no intention to discuss it now, or at any later time I’d say. I really don’t care as long as I have a choice not to use it, and judging from a given thread I think we’ll always have an alternative, no matter what some people said before and keep saying.
No, my problem today is not with udev deciding it’s time to stop using the same persistent rules that people had to fight with for years and that now are no longer usable, and instead it’s a problem with util-linux, and in particular with the
losetup utility that manages the loop devices. See, the loop devices have been quite a big deal in the past, mostly because they started as a fixed amount, then the kernel let you decide how many, and then finally code was enabled that would let you change dynamically the amount of loop devices you want to have available. Great, but it required a newer version of util-linux, and at the time when it was introduced, there wasn’t one that actually worked as intended.
Anyway, in the past week I’ve been working on building a new firmware image for the device I’m working on, and when it comes down to run the script that generates the image to burn on the SSD, it locked up with 100% CPU usage (luckily the system is multicore so I could get in to kill it). The problem was to be found in
losetup, so today with enough time on my hands, I went to check it out. Turns out that the reason why it failed was a joint issue between my setup, OpenRC updates, and util-linux updates, but let’s proceed with order.
The build happen on a container for which I was not mounting
/sys — or at least so I intended, although it is possible that OpenRC mounted it on its own; this has changed recently, but I don’t think those changes hit stable yet, so I’m not sure that’s the case. I had created static nodes for the loop devices and for
/dev/loop-control — but this latter was not to be found at first today. Maybe I deleted it by mistake or something along those lines. But the point is it worked before, and nothing changed beside an
So, what happens is that the script is running something along the lines of
losetup --find --show file which is intended to find the first available loop device, set up the file, and then print the loop device that was found. It’s a bit more complex than this as I’m explicitly setting up only the partition on the loop device (getting partitioned loop devices to play cool with LXC is a pain), but the point stands. Unfortunately, when both
/sys are unreachable, the looping around that should give us the first available device is looping over the same device over and over and over again, never trying the next. This causes the problem noted above, of
losetup locking at 100% CPU usage.
And it’s definitely not the only problem! If you just execute
losetup --find, which should give you the first available device, it provides you
/dev/loop0 even if that device is already in use. Not content enough with these problems?
losetup -a lists no device, even when they are present, and still returns with a valid, zero exit status. Which is definitely not the case!
Okay you can say that
losetup is already trying its best by using not one but three different sources (the third one is
/proc/partitions) to find the data to use, but when the primary two are not usable, you shouldn’t expect it to give you proper information, should you? Well, that’s not the point. The big problem is that it should tell me “man, I can’t get you the data you requested because I need more sources, give me the sources!” instead of trying its best, failing, and locking up.
The next question is obviously “why are you ranting, instead of fixing it?” — the answer is that I tried, but the code I was reading made me cry. The problem is that nowadays,
losetup is just a shallow interface to some shared code in util-linux .. and the design of said code makes it very difficult to make it clear whether a non-zero return value from a function is a “we reached the end of the list” or “I couldn’t see anything because I lack my sources”. And it really didn’t feel like a good idea for me to start throwing away that code to replace it with something more KISS-compliant.
So at the end of the day, I fixed my container to mount
/sys and everything works, but util-linux is still broken upstream.