Beurer GL50, Linux and Debug Interfaces

In the previous post, where I reviewed the Beurer GL50, I said that on Windows the meter appears as a CD-Rom with the installer and portable software to download the data off it. This is actually quite handy for the users, but of course it leaves behind users of Linux and macOS, unless they want to use the Bluetooth interface.

I did note that on Linux, the whole device does not work correctly. Indeed, when you connect this to a modern Linux kernel, it’ll fail to mount at all. But because of the way udev senses a new CD-Rom being inserted, it also causes an infinite loop in the userspace, making udev use most of a single core for hours and hours, trying to process CD in, CD out events.

When I noticed it I thought it would be a problem in the USB Mass Storage implementation, but at the end of the day the problem turned out to be one layer below that, in the SCSI command implementation. Because yes, of course USB Mass Storage virtual CD-Rom devices still mostly point at SCSI implementations below.

To provide enough context, and to remind myself how I went around this if I ever forget, the Beurer device appears to use a virtual CD-Rom interface on a chip developed by either Cygnal or Silicon Labs (the latter bought the former in 2003). I only know the Product ID of the device as 0x85ED, and I failed to track down the Silicon Labs model to figure out why and how.

To find my way around the Linux kernel, and try to get the device to connect at all, I ended up taking a page out of marcan’s book, and used qemu’s ability to launch a Linux kernel directly, with a minimal initramfs that only contains the bare minimum of files. In my case, I used the busybox-static binary that came with openSUSE as the base, since I didn’t need any particular reproduction case beside trying to mount the device.

The next problem was figuring out how to get the right debug information. At first I needed to inspect at least four separate parts of the kernel: USB Mass Storage, the Uniform (sic) CD-Rom driver, the SCSI layer, and the ISO9660 filesystem support. None of those seemed a clear culprit at the very beginning, so debugging time it was. Each of those appears to have its own idea of how to do debugging at all, at least up to version 5.3, which is the one I’ve been hacking on.

The USB Mass Storage layer has its own configuration option (CONFIG_USB_STORAGE_DEBUG), and once enabled in the kernel config, a ton of information on the USB Mass Storage is output on the kernel console. SCSI comes with its own logging support (CONFIG_SCSI_LOGGING) but as I found a few days of hacking later, you also need to enable it within /proc/sys/dev/scsi/logging_level, and to do so you need to calculate an annoying bitmask — thankfully there’s a tool in sg3_utils called scsi_logging_level… but it says a lot that it’s needed, in my opinion. The block layer in turn has its own CONFIG_BLK_DEBUG_FS option, but I didn’t even manage to look at how that’s configured.
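To give an idea of why that bitmask is annoying, here is a minimal Python sketch of the calculation. It assumes the layout I understand the kernel to use (ten facilities, three bits each, in the order listed below); the facility names and the helper function are mine for illustration, so check them against your kernel's scsi_logging.h before trusting the result.

```python
# Facility order and 3-bit field width are assumptions based on my reading of
# the kernel's drivers/scsi/scsi_logging.h; verify against your kernel version.
SCSI_LOG_FACILITIES = (
    "error", "timeout", "scan", "mlqueue", "mlcomplete",
    "llqueue", "llcomplete", "hlqueue", "hlcomplete", "ioctl",
)
BITS_PER_FACILITY = 3


def scsi_logging_level(**levels: int) -> int:
    """Pack per-facility log levels (0-7) into a single bitmask value."""
    mask = 0
    for name, level in levels.items():
        if not 0 <= level <= 7:
            raise ValueError(f"level for {name} must be between 0 and 7")
        # Each facility owns a 3-bit field, placed according to its position.
        shift = SCSI_LOG_FACILITIES.index(name) * BITS_PER_FACILITY
        mask |= level << shift
    return mask


if __name__ == "__main__":
    # The printed value is what you would write to
    # /proc/sys/dev/scsi/logging_level (as root).
    print(scsi_logging_level(error=3, hlqueue=2))
```

In practice scsi_logging_level from sg3_utils is still the tool to reach for; the sketch is only there to show what it has to compute.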

The SCSI CD driver (sr), has a few debug outputs that need to be enabled by removing manual #if conditions in the code, while the cdrom driver comes with its own log level configuration, a module parameter to enable the logging, and overall a complicated set of debug knobs. And just enabling them is not useful — at some point the debug output in the cdrom driver was migrated to the modern dynamic debug support, which means you need to enable the debugging specifically for the driver, and then you need to enable the dynamic debug. I sent a patch to just remove the driver-specific knobs.

Funnily enough, when I sent the first version of the patch, I was told about the ftrace interface, which turned out to be perfect to continue sorting out the calls that I needed to tweak. This turned into another patch, that removes all the debug output that is redundant with ftrace.

So after all of this, what was the problem? Well, there’s a patch for that, too. The chip used by this meter does not actually implement all the MMC commands, or all of the audio CD commands. Some of those missing features are okay, and an error returned from the device will be properly ignored. Others cause further SCSI commands to fail, and that’s why I ended up having to implement vendor-specific support to mask away quite a few features, and to gate usage in a few functions. It appears to me that as CD-Rom, CD-RW, and DVDs became more standard, the driver stopped properly gating feature usage.

Well, I don’t have more details of what I did to share, beside what is already in the patches. But I think if there’s a lesson here, it is that if you want to sink your teeth into the Linux kernel’s code, you can definitely take a peek at a random old driver, and figure out if it was over-engineered in a past that did not come with nice trimmings such as ftrace, or dynamic debug support, or generally the idea that the kernel is one big common project.

Introducing usbmon-tools

A couple of weeks ago I wrote some notes about my work in progress to implement usbmon captures handling code, and pre-announced I was going to publish more of my extraction/inspection scripts.

The good news is that the project is now released, and you can find it on GitHub as usbmon-tools with an Apache 2.0 license, and open to contributions (with a CLA, sorry about that part). This is the first open source project I release using my employer’s releasing process (for other projects, I used the IARC process instead), and I have to say I’m fairly pleased with the results.

This blog post is meant mostly as a way to explain what’s going on in my head regarding this project, with the hope that contributors can help it become reality. Or that they can contribute other ideas to it, even when they are not part of my particular plans.

I want to start with a consideration on the choice of language. usbmon-tools is written in Python 3. And in particular it is restricted to Python 3.7, because I wanted to have access to type annotations, which I found extremely addictive at work. I even set up Travis CI to run mypy as part of the integration tests for the repository.

For other projects I tend to be more conservative, and wait for Debian stable to have a certain version before requiring that as a minimum, but as this is primarily a toolset for developers, I’m going to expect its audience to be able to deal with Python 3.7 as the requirement. This version was released nearly a year ago, and that should be plenty of time for people to have one at hand.

As for what the project should achieve, in my view it is an easy way for developers to dissect a USB snooping trace. I started by building a simplistic tool that recreates a text format trace from the pcapng file, based on the official documentation of usbmon in the kernel (I have some patches to improve on that, too, but that will probably become a post by itself next week). It’s missing isochronous support, and it’s not totally tested, but it at least gave me a few important insights on the format itself, including the big caveat that the “id” (or tag) of the URBs is not unique.

Indeed, I think that alone is one of the most important pieces of the puzzle in the library: in addition to parsing the pcapng file itself, the library can re-tag the events so that they get a real unique identifier (UUID), making it significantly easier to analyze the traces.
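To make the idea concrete, here is a minimal sketch of the re-tagging. This is not the usbmon-tools API: the Event class and the retag function are hypothetical, and the only thing taken from usbmon is the text-format convention of "S" for a submission and "C" for its callback.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Event:
    """Hypothetical, stripped-down capture event (not the library's class)."""
    tag: int                                 # URB "id" from the capture, reused over time
    kind: str                                # "S" for submission, "C" for callback
    unique_id: Optional[uuid.UUID] = None    # filled in by retag()


def retag(events: Iterable[Event]) -> List[Event]:
    """Give each submission/callback pair its own UUID."""
    pending: Dict[int, uuid.UUID] = {}
    result = []
    for event in events:
        if event.kind == "S":
            # A new submission always starts a new logical URB.
            event.unique_id = uuid.uuid4()
            pending[event.tag] = event.unique_id
        else:
            # A callback closes the most recent submission with the same tag.
            # If the capture started mid-transfer, there is none to match.
            event.unique_id = pending.pop(event.tag, None) or uuid.uuid4()
        result.append(event)
    return result
```

With this, two transfers that happen to reuse the same tag end up with different identifiers, while a submission and its callback stay paired.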

My next steps on the project are to write a more generic tool to convert a USB capture into what I call my “chatter format” (similar to the one I used to discuss serial protocols), and a more specific one that converts HID traces (because HID is a more defined protocol, and we can go a level deeper in exposing this into a human-readable source). I’m also considering if it would be within reach to provide the tool a HID descriptor blob, parse it and have it used to parse the HID traffic based on it. It would make some debugging particularly easier, for instance the stuff I did when I was fixing the ELECOM DEFT trackball.

I would also love to be able to play with a trace in a more interactive manner, for instance by loading this into Jupyter notebook, so that I could try parsing the blobs interactively, but unless someone with more experience with those contributes the code, I don’t expect I’ll have much time for it.

Pull requests are more than welcome!

Yak Shaving: Silicon Labs CP2110 and Linux

One of my favourite pastimes in the past years has been reverse engineering glucometers for the sake of writing a utility package to export data from them. Sometimes, in the quest of just getting data out of a meter, I end up embarking on yak shaves that are particularly bothersome, as they are useful only for me and no one else.

One of these yak shaves might be more useful to others, but it will have to be seen. I got my hands on a new meter, which I will review later on. This meter has software for Windows to download the readings, so it’s a good target for reverse engineering. What surprised me, though, was that once I connected the device to my Linux laptop first, it came up as an HID device, described as an “USB HID to UART adapter”: the device uses a CP2110 adapter chip by Silicon Labs, and it’s the first time I saw this particular chip (or even class of chip) in my life.

Effectively, this device piggybacks the HID interface, which allows vendor-specified protocols to be implemented in user space without needing in-kernel drivers. I’m not sure if I should be impressed by the cleverness or disgusted by the workaround. In either case, it means that you end up with a stacked protocol design: the glucometer protocol itself is serial-based, implemented on top of a serial-like software interface, which converts it to the CP2110 protocol, which is encapsulated into HID packets, which are then sent over USB…
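As a rough illustration of the innermost layers of that stack, here is a sketch of wrapping a serial byte stream into HID reports and unwrapping it again. It assumes, based on my reading of the vendor documentation, that UART data travels in reports whose report ID equals the number of payload bytes, up to 63; treat both the framing and the limit as assumptions to verify, and the function names as made up for this example.

```python
from typing import Iterable, List

# Assumed maximum number of UART bytes carried by a single HID report.
MAX_PAYLOAD = 63


def encapsulate(serial_data: bytes) -> List[bytes]:
    """Split a serial stream into reports shaped as [length][payload...]."""
    reports = []
    for offset in range(0, len(serial_data), MAX_PAYLOAD):
        chunk = serial_data[offset:offset + MAX_PAYLOAD]
        # The first byte doubles as report ID and payload length (assumption).
        reports.append(bytes([len(chunk)]) + chunk)
    return reports


def decapsulate(reports: Iterable[bytes]) -> bytes:
    """Rebuild the serial stream from a sequence of data reports."""
    stream = bytearray()
    for report in reports:
        length = report[0]
        if not 1 <= length <= MAX_PAYLOAD:
            raise ValueError(f"not a data report: ID 0x{length:02x}")
        stream += report[1:1 + length]
    return bytes(stream)
```

The glucometer protocol would then sit on top of the byte stream that decapsulate() returns, without needing to know about HID at all.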

The good thing is that, as the datasheet reports, the protocol is available: “Open access to interface specification”. And indeed in the download page for the device, there’s a big archive of just-about-everything, including a number of precompiled binary libraries and a bunch of documents, among which figures AN434, which describes the full interface of the device. Source code is also available, but having spot checked it, it appears it has no license specification and as such is to be considered proprietary, and possibly virulent.

So now I’m warming up to the idea of doing a bit more of yak shaving and for once trying not to just help myself. I need to understand this protocol for two purposes: one is obviously having the ability to communicate with the meter that uses that chip; the other is being able to understand what the software is telling the device and vice-versa.

This means I need to have generators for the host side, but parsers for both. Luckily, construct should make that part relatively painless, and make it very easy to write (if not maintain, given the amount of API breakages) such a parser/generator library. And of course this has to be in Python because that’s the language my utility is written in.

The other thing that I realized as I was toying with the idea of writing this is that, done right, it can be used together with facedancer, to implement the gadget side purely in Python. Which sounds like a fun project for those of us into that kind of thing.

But since this time this is going to be something more widely useful, and not restricted to my glucometer work, I’m now looking to release this using a different process, as that would allow me to respond to issues and codereviews from my office as well as during the (relatively little) spare time I have at home. So expect this to take quite a bit longer to be released.

At the end of the day, what I hope to have is an Apache 2 licensed Python library that can parse both host-to-controller and controller-to-host packets, and also implement it well enough on the client side (based on the hidapi library, likely) so that I can just import the module and use it for a new driver. Bonus points if I can use this to implement a test fake framework to implement the tests for the glucometer.

In all of this, I want to make sure to thank Silicon Labs for releasing the specification of the protocol. It’s not always that you can just google up the device name to find the relevant protocol documentation, and even when you do it’s hard to figure out if it’s enough to implement a driver. The fact that this is possible surprised me pleasantly. On the other hand I wish they actually released their code with a license attached, and possibly a widely-usable one such as MIT or Apache 2, to allow users to use the code directly. But I can see why that wouldn’t be particularly high in their requirements.

Let’s just hope this time around I can do something for even more people.

Linux desktop and geek supremacists

I wrote about those I refer to as geek supremacists a few months ago, when discussing the dangerous prank at FOSDEM. As it turns out, they overlap with the “Free Software Fundamentalists” I wrote about eight years ago. I found another one of them at the conference I was at on the day this draft was written. I’m not going to name the conference, because it does not deserve to be associated with my negative sentiment here.

The geek supremacist in this case was the speaker of one of the talks. I did not sit through the whole talk (which also ran over its allotted time), because after the basic introduction, I was so ticked off by so many alarm bells that I just had to leave and find something more interesting and useful to do. The last straw for me was when the speaker insisted that “Western values didn’t apply to [them]” and thus they felt they could “liberate” hardware by mixing leaked sources of the proprietary OS with the pure Free (obsolete) OS of it. Not only is this clearly illegal (as they know and admitted), but it’s unethical (free software relies on licenses that are based on copyright law!) and toxic to the community.

But that’s not what I want to complain about here. The problem was a bit earlier than that. The speaker defined themselves as a “freedom fighter” (their words, not mine!), and insisted they can’t see why people are still using Windows and macOS despite Linux and FreeBSD being perfectly good options. I take a big issue with this.

Now, having spent about half my life using, writing and contributing to FLOSS, you can’t possibly expect me to just say that Linux on the desktop is a waste of time. But at the same time I’m not delusional, and I know there are plenty of reasons to not use Linux on the desktop.

While there have been huge improvements in the past fifteen years, and SuSE or Ubuntu are somewhat usable as desktop environments, there is still no comparison with using macOS or Windows, particularly in terms of applications working out of the box, and support from third parties. There are plenty of applications that don’t work on Linux, and even if you can replace them, sometimes that is not acceptable, because you depend on some external ecosystem.

For instance, when I was working as a sysadmin for hire, none of my customers could possibly have used a pure-Linux environment. Most of them were Windows-only companies, but even the one that was a mixed environment (the print shop I wrote about before) could not do without macOS and Windows. For one thing, the macOS environment was their primary workspace: Adobe software is not available for Linux, nor is QuarkXpress, nor the Xerox print queue software (ironic, since it interfaces with a Linux system on board the printer, of course). The accounting software, which handled everything from ordering to invoicing to tax reports, was developed by a local company – and they had no intention to build a version for Linux – and because tax regulations in Italy are… peculiar, no off-the-shelf open source software is available for that. As it happens, they also needed a Mandriva workstation – no other distribution would do – because the software for their large-format inkjet printer was only available for either that, or PPC Mac OS X, and getting it running on a modern PC with the former is significantly less expensive than trying to recover the latter.

(To make my life more complicated, the software they used for that printer was developed by Caldera. No, not the company acquired by SCO, but Caldera Graphics, a French company completely unrelated to the other tree of companies, which was recently acquired again. It was very confusing when the people at the shop told me that they had a “Caldera box running Linux”.)

Of course, there are people who can run a Linux-only shop, or can only run Linux on their systems, personal or not, because they do not need to depend on external ecosystems. More power to them, and thank you for their support on improving desktop features (because they are helping, right?). But they are clearly not the majority of the population, as is clear from the fact that people overwhelmingly use Windows, macOS, Android and iOS.

Now, this does not mean that Linux on the desktop is dead, or will never happen. It just means that it’ll take quite a while longer, and in the mean time, all the work of Linux on the desktop is likely going to profit other endeavours too. LibreOffice and KDE are clearly examples of “Linux on the desktop”, but at the same time they provide Free Software with the visibility (and energy, to a point) even when being used by people on Windows. The same goes for VLC, Firefox, Chrome, and a long list of other FLOSS software that many people rely upon, sometimes realising it is Free Software. But even that, is not why I’m particularly angry after encountering this geek supremacist.

The problem is that, again in the introduction to the talk, which was about mobile phones, they said they don’t think things have changed significantly in proprietary phones over the past ten years. Ten years is forever in computing, let alone mobile! Ten years ago, the iPhone had just launched, and it still did not have an SDK or apps! Ten years ago the state of the art in smartphones you could develop apps for was Symbian! And this is not the first time I have heard something like this.

A lot of people in the FLOSS community appear to have closed their eyes to what the proprietary software environment has been doing, in any area. Because «Free Software works for me, so it has to be working for everyone!» And that is dangerous from multiple points of view. Not only is this shortsightedness what, in my opinion, is making distributions irrelevant, but it’s also making Linux on the desktop worse than Windows, and it is why I don’t expect the FSF will come up with a usable mobile phone any time soon.

Free desktop environments (KDE and GNOME, effectively) have spent a lot of time in the past ten (and more) years, first trying to catch up to Windows, then to Mac, then trying to build new paradigms, with mixed results. Some people loved them, some people hated them, but at least they tried and, ignoring most of the breakages, or the fact that they still try to have semantics nobody really cares about (like KDE’s “Activities” — or the fact that KDE-the-desktop is no more, and KDE is a community that includes stuff that has nothing to do with desktops or even barely Linux, but let’s not go there), a modern KDE system is fairly close in usability to Windows… 7. There is still a lot of catch up to do, particularly around security, but I would at least say that for the most part, the direction is still valid.

But to keep going, and to catch up, and if possible to go beyond those limits, you also need to accept that there are reasons why people are using proprietary software, and it’s not just a matter of lock-in, or the (disappointingly horrible) idea that people using Windows are “sheeple” and you hold the universal truth. Which is what pissed me off during that talk.

I could also add another note here about the idea that non-smart phones are a perfectly valid option nowadays. As I wrote already, there are plenty of reasons why a smartphone should not be considered a luxury. For many people, a smartphone is the only access they have to email, and the Internet at large. Or the only safe way to access their bank account, or other fundamental services that they rely upon. Being able to use a different device for those services, and only having a ten-year-old dumbphone, is a privilege, not a demonstration that there is no need for smartphones.

Also, I sometimes really wonder if these people have any friends at all. I don’t have many friends myself, but if I was stuck on a dumbphone only able to receive calls or SMS, I would probably have lost those few I have as well. Because even with European, non-silly tariffs on SMS, sending SMS is still inconvenient, and most people will use WhatsApp, Messenger, Telegram or Viber to communicate with their friends (and most of these applications are also more secure than SMS). That may be perfectly fine, I mean if you don’t want to be easily reachable by people, that is a very easy way to do so, but it’s once again a privilege, because it means you either don’t have people who would want to contact you in different ways, or you can afford to limit your social contacts to people who accept your quirk. And once again, a freelancer could never do that.

ELECOM DEFT and the broken descriptor

Update (2017-05-26): Jiri merged the patch, which may land in 4.12 or 4.13.

In my previous post reviewing the ELECOM DEFT I noted that I had to do some work to get the three function buttons on the mouse to work on Linux correctly. Let me try to dig a bit into this one so it can be useful to others in the future.

The symptoms: the three top buttons (Fn1, Fn2, Fn3) of the device are unresponsive on Linux; they do not show up in xev or evtest.

My first guess was that they were using the same technique they do for gaming mice, by configuring on the device itself what codes to send when the buttons are pressed. That looked likely because the receiver is appearing as a big composite device. But that was not the case. After installing the Windows driver and app on my “sacrificial laptop”, and using USBlyzer to figure out what was going on, I couldn’t see the app doing anything to the device. Which meant they instead remapped the behaviour of the buttons on the software side.

This left open only the option that the receiver needs a “quirk driver” to do something. Actually, since I have looked into HID (the protocol used for USB mice and keyboards, among others), I already knew the problem was the HID Report Descriptor is reporting something broken and the Linux kernel is ignoring it. I’m not sure if Windows is just ignoring the descriptor, or if there is a custom driver to implement the quirk there. I did not look too much into this.

But what is this descriptor? If you have not looked into HID before, you have to understand that the HID protocol in USB only specifies very little information by itself, and is mainly a common format for both sending “reports” and to describe said reports. The HID Report Descriptor is effectively bytecode representing the schema that those reports should follow. As it happens, sometimes it’s not the case at all, and the descriptor itself can even be completely broken and unparsable. But that is not the case here.

The descriptor is fetched by the operating system when you connect the device, and is then used to parse the reports coming in as interrupt transfers. The first byte of each transfer identifies the report used, and that is looked up in the descriptor to understand it. In most mice, your reports will all look vastly the same: state of the buttons, relative X and Y displacement, wheel (one or two dimensional) displacement. But, since the presence of one or more wheels is not a given, and the number of buttons to expect can be significantly high, even without going to the ludicrous extent of the OpenOffice mouse, the report descriptor will tell you the size of each field in the structure.

So, looking at USBlyzer, I could tell that the report number 1 was clearly the one that gives the mouse data, and even without knowing much about HID and having not seen the report descriptor, I can tell what’s going on:

button1: 01 01 00 00 00 00 00 00
button2: 01 02 00 00 00 00 00 00
button3: 01 04 00 00 00 00 00 00
fn1:     01 20 00 00 00 00 00 00
fn2:     01 40 00 00 00 00 00 00
fn3:     01 80 00 00 00 00 00 00

So quite obviously, the second byte is a bitmask of which button is being pressed. Note that this is the first of two reports you receive every time you click on the button (and everything is zero because on a trackball you can click the buttons without even touching the ball, and so there is no movement indication in the report).
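As a quick sanity check of that reading, this is a small Python sketch that decodes the second byte of report 1 into button names. The names and bit values come straight from the capture above; bits 0x08 and 0x10 never showed up in my capture, so the sketch leaves them unnamed.

```python
from typing import List

# Bit values observed in the capture above; 0x08 and 0x10 were not seen.
BUTTON_BITS = {
    0x01: "button1",
    0x02: "button2",
    0x04: "button3",
    0x20: "fn1",
    0x40: "fn2",
    0x80: "fn3",
}


def pressed_buttons(report: bytes) -> List[str]:
    """Return the names of the buttons set in a raw report with ID 1."""
    if report[0] != 0x01:
        raise ValueError("expected report ID 1")
    mask = report[1]
    return [name for bit, name in BUTTON_BITS.items() if mask & bit]


if __name__ == "__main__":
    print(pressed_buttons(bytes.fromhex("0120000000000000")))
```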

But, when I looked at the Analysis tab, I found out that USBlyzer is going to parse the reports based on the descriptor as well, showing the button number from the bitmask, the X and Y displacement and so on. For the bitmasks of the three buttons at the top of the device, no button is listed in the analysis. Bingo, we have a problem.

The quest items. Thinking of it like a quest in a JRPG, I now needed two items to complete the quest: a way to figure out what the report descriptor of the device is and what it means. Let’s start from the first item.

There are a number of ways that you find documented for dumping a USB HID report descriptor on Linux. Most of them rely on you unbinding the device from the usbhid driver and then fetching it by sending the right HID commands. usbhid-dump does that and it does well, but I’m going to ignore that. Instead I’m going to read the report descriptor as is presented by sysfs. This may not be the one reported by the hardware, but rather the one that the quirk may have already “fixed” somehow.

So how can you tell where to find the report descriptor? If you look when you plug in a device:

% dmesg | tail -n 3
[13358.058915] elecom 0003:056E:00FF.000C: Fixing up Elecom DEFT Fn buttons
[13358.059721] input: ELECOM ELECOM TrackBall Mouse as /devices/pci0000:00/0000:00:14.0/usb3/3-2/3-2.1/3-2.1:1.0/0003:056E:00FF.000C/input/input45
[13358.111673] elecom 0003:056E:00FF.000C: input,hiddev0,hidraw1: USB HID v1.11 Mouse [ELECOM ELECOM TrackBall Mouse] on usb-0000:00:14.0-2.1/input0
% cp /sys/devices/pci0000:00/0000:00:14.0/usb3/3-2/3-2.1/3-2.1:1.0/0003:056E:00FF.000C/report_descriptor my_report_descriptor.bin

You can tell from this dmesg that I’m cheating, and I’m looking at it after the device has been fixed already. Otherwise it would probably be saying hid-generic rather than elecom.

I have made a copy of the original report descriptor of course, so I can look at it even now, but the binary file is not going to be very useful by itself. But, from the same author as the tool listed above, hidrd makes it significantly easier to understand what’s going on. The full spec output includes a number of report pages that are vendor specific, and may be interesting to either fuzz or figure out if they are used for reporting things such as low battery. But let’s ignore that for the immediate and let’s look at the “Desktop, Mouse” page:

Usage Page (Desktop),               ; Generic desktop controls (01h)
Usage (Mouse),                      ; Mouse (02h, application collection)
Collection (Application),
    Usage (Pointer),                ; Pointer (01h, physical collection)
    Collection (Physical),
        Report ID (1),
        Report Count (5),
        Report Size (1),
        Usage Page (Button),        ; Button (09h)
        Usage Minimum (01h),
        Usage Maximum (05h),
        Logical Minimum (0),
        Logical Maximum (1),
        Input (Variable),
        Report Count (1),
        Report Size (3),
        Input (Constant),
        Report Size (16),
        Report Count (2),
        Usage Page (Desktop),       ; Generic desktop controls (01h)
        Usage (X),                  ; X (30h, dynamic value)
        Usage (Y),                  ; Y (31h, dynamic value)
        Logical Minimum (-32768),
        Logical Maximum (32767),
        Input (Variable, Relative),
    End Collection,
    Collection (Physical),
        Report Count (1),
        Report Size (8),
        Usage Page (Desktop),       ; Generic desktop controls (01h)
        Usage (Wheel),              ; Wheel (38h, dynamic value)
        Logical Minimum (-127),
        Logical Maximum (127),
        Input (Variable, Relative),
    End Collection,
    Collection (Physical),
        Report Count (1),
        Report Size (8),
        Usage Page (Consumer),      ; Consumer (0Ch)
        Usage (AC Pan),             ; AC pan (0238h, linear control)
        Logical Minimum (-127),
        Logical Maximum (127),
        Input (Variable, Relative),
    End Collection,
End Collection,

This is effectively a description of the structure of the report I showed earlier, starting from the buttons and X/Y displacement, followed by the wheel and the “AC pan” (which I assume is the left/right wheel). All the sizes are given in bits, and the way the language works is a bit strange. The part that interests us is at the start of the first block. Refer to this tutorial for the nitty gritty details, but I’ll try to give a human-readable example.

Report ID is the constant we already know about, and the first byte of the message. Following that we can see it declaring five (Count = 5) bits (Size = 1) used for Buttons between 1 and 5. Ignore the logical maximum/minimum in this case, as they are of course either on or off. The Input (Variable) instruction is effectively saying “These are the useful parts”. Following that it declares one (Count = 1) 3-bit (Size = 3) constant value. Since it’s constant, the HID driver will just ignore it. Unfortunately those three bits are actually the three bits needed for the top buttons.
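If you want to see how the listing maps to the raw bytes, here is a minimal decoder for HID short items, sketched under a few assumptions: it only knows the handful of item names that appear in this stanza, it does not handle long items, and it reads values as unsigned (so a negative logical minimum would need sign extension). The function and table names are made up for this example.

```python
from typing import List, Tuple

# (item type, tag) -> name, limited to the items used in this stanza.
# Types: 0 = main, 1 = global, 2 = local.
ITEM_NAMES = {
    (0, 0x8): "Input",
    (0, 0xA): "Collection",
    (0, 0xC): "End Collection",
    (1, 0x0): "Usage Page",
    (1, 0x1): "Logical Minimum",
    (1, 0x2): "Logical Maximum",
    (1, 0x7): "Report Size",
    (1, 0x8): "Report ID",
    (1, 0x9): "Report Count",
    (2, 0x0): "Usage",
    (2, 0x1): "Usage Minimum",
    (2, 0x2): "Usage Maximum",
}


def decode_items(descriptor: bytes) -> List[Tuple[str, int]]:
    """Decode short items into (name, unsigned value) pairs."""
    items = []
    pos = 0
    while pos < len(descriptor):
        prefix = descriptor[pos]
        size = (0, 1, 2, 4)[prefix & 0x03]   # low two bits encode the data size
        item_type = (prefix >> 2) & 0x03     # next two bits: main/global/local
        tag = prefix >> 4                    # high nibble: which item it is
        data = descriptor[pos + 1:pos + 1 + size]
        name = ITEM_NAMES.get((item_type, tag), f"unknown (0x{prefix:02x})")
        items.append((name, int.from_bytes(data, "little")))
        pos += 1 + size
    return items
```

Feeding it the bytes 0x95 0x05 0x75 0x01 yields Report Count 5 followed by Report Size 1, which is the "five one-bit fields" declaration described above.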

The obvious answer is to change the descriptor so that it describes eight one-bit entries for eight buttons, and no constant bits (if you forget to remove the constant bits, the whole message gets misparsed and moving the mouse is taken as clicks, ask me how I know!). How do you do that? Well, you need a quirk driver in the Linux kernel to intercept the device, and rewrite the descriptor on the fly. This is not hard, and I know of plenty of other drivers doing so. As it happens Linux already has a hid-elecom driver, which was fixing a Bluetooth mouse that also had a wrong descriptor; I extended that to fix the descriptor. But how do you fix a descriptor exactly?

Some of the drivers check for the size of the descriptor, and for some anchor values (usually the ones they are going to change), others replace the descriptor entirely. I prefer the former, as they make it clear that they are trying to just fix something rather than discard whatever the manufacturer is doing. Particularly because in this case the fix is quite trivial, just three bytes need to be changed: change the Count and Maximum for the Buttons input to 8, and make the Count of the constant input zero. hidrd has a mode where it outputs the whole descriptor as a valid C array that you can just embed in the kernel source, with comments on what each byte combination does. I used that during testing, before changing my code to do the patching instead. The actual diff, in code format, is:

@@ -4,15 +4,15 @@
 0x09, 0x01,         /*      Usage (Pointer),                */
 0xA1, 0x00,         /*      Collection (Physical),          */
 0x85, 0x01,         /*          Report ID (1),              */
-0x95, 0x05,         /*          Report Count (5),           */
+0x95, 0x08,         /*          Report Count (8),           */
 0x75, 0x01,         /*          Report Size (1),            */
 0x05, 0x09,         /*          Usage Page (Button),        */
 0x19, 0x01,         /*          Usage Minimum (01h),        */
-0x29, 0x05,         /*          Usage Maximum (05h),        */
+0x29, 0x08,         /*          Usage Maximum (08h),        */
 0x15, 0x00,         /*          Logical Minimum (0),        */
 0x25, 0x01,         /*          Logical Maximum (1),        */
 0x81, 0x02,         /*          Input (Variable),           */
-0x95, 0x01,         /*          Report Count (1),           */
+0x95, 0x00,         /*          Report Count (0),           */
 0x75, 0x03,         /*          Report Size (3),            */
 0x81, 0x01,         /*          Input (Constant),           */
 0x75, 0x10,         /*          Report Size (16),           */
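
To make the "check the size and the anchor values, then patch" approach concrete, here is a minimal sketch in Python (the real fix is C code in the kernel's hid-elecom driver, which this does not reproduce). It only works on the fragment shown in the diff above, so the offsets are relative to that fragment rather than to the full descriptor, and the names ORIGINAL_FRAGMENT, PATCHES and fix_fragment are made up for the illustration.

```python
# The bytes of the *original* descriptor fragment, exactly as shown in the
# diff above. This is not the full descriptor of the device.
ORIGINAL_FRAGMENT = bytes([
    0x09, 0x01,  # Usage (Pointer)
    0xA1, 0x00,  # Collection (Physical)
    0x85, 0x01,  # Report ID (1)
    0x95, 0x05,  # Report Count (5)
    0x75, 0x01,  # Report Size (1)
    0x05, 0x09,  # Usage Page (Button)
    0x19, 0x01,  # Usage Minimum (01h)
    0x29, 0x05,  # Usage Maximum (05h)
    0x15, 0x00,  # Logical Minimum (0)
    0x25, 0x01,  # Logical Maximum (1)
    0x81, 0x02,  # Input (Variable)
    0x95, 0x01,  # Report Count (1)
    0x75, 0x03,  # Report Size (3)
    0x81, 0x01,  # Input (Constant)
    0x75, 0x10,  # Report Size (16)
])

# offset within the fragment -> (expected old value, new value)
# (offsets are hypothetical: they index the fragment, not the real descriptor)
PATCHES = {
    7: (0x05, 0x08),   # Report Count for the buttons: 5 -> 8
    15: (0x05, 0x08),  # Usage Maximum for the buttons: 5 -> 8
    23: (0x01, 0x00),  # Report Count for the constant input: 1 -> 0
}


def fix_fragment(rdesc: bytes) -> bytes:
    """Patch the three bytes, but only if size and anchor values match."""
    if len(rdesc) != len(ORIGINAL_FRAGMENT):
        return rdesc  # not the descriptor we know about, leave it alone
    if any(rdesc[off] != old for off, (old, _new) in PATCHES.items()):
        return rdesc  # anchors differ (maybe already fixed), leave it alone
    fixed = bytearray(rdesc)
    for off, (_old, new) in PATCHES.items():
        fixed[off] = new
    return bytes(fixed)
```

Anything that does not look exactly like the known-broken descriptor is returned untouched, which is the whole point of preferring anchors over wholesale replacement.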

And that’s enough to make all the buttons work just fine. Yay! So I sent the first patch to the linux-input mailing list… and then I had a doubt “Am I the first ever Linux user of this device?” As it happens, I’m not, and after sending the patch I searched and found that there was already a patch by Yuxuan Shui sent last year that does effectively the same thing, except with a new module altogether (rather than extending the one already there) and by removing the Constant input declaration altogether, which requires a memmove() of the rest of the input. It also contains the USB ID for the wired version of the DEFT, adding the same fix.

So I went and sent another (or three) revision of the patch, including the other ID. Of course I would argue that mine is cleaner by reusing the other module, but in general I’ll leave it to the maintainers to decide which one to use. One thing that I can say at least for mine is that I tried to make it very explicit what’s going on, in particular by adding as a comment the side-by-side diff of the Collection stanza that I change in the driver. Because I always find it bothersome when I have to look into one of those HID drivers and they seem to just come up with magical constants to save the day. Sigh!

Updating firmware on a Crucial M4 drive, with Linux, no luck

When you think of SSD manufacturers, it might be obvious to think of them as Linux friendly, given they target power users, and Linux users are for the most part power users. Seems like this is not that true for Crucial. My main personal laptop has, since last year, a 64GB Crucial M4 SSD – given I’ve not been using a desktop computer for a while it does start to feel small, but that’s a different point – which has undergone a couple of firmware updates since then. In particular, there is a new release 070H that is supposed to fix some nasty power saving issues.

Crucial only provide firmware update utilities for Windows 7 and 8 (two different versions of them), and then they have a utility for “Windows and Mac”; the latter is actually a ZIP file that contains an ISO file… well, I don’t have a CD to burn with me, so my first option was to run the Windows 7 file from my Windows 7 install, which resides on the external SATA harddrive. No luck with that: from what I read on the forums, what the upgrader does is simply set up an EFI application to boot, and then reboot. Unfortunately in my case there are two EFI partitions, because of the two bootable drives, and that most likely messes with the upgrader.

Okay strike one, let’s look at the ISO file. The ISO is very simple and very small: it basically is just an ISOLINUX tree that uses memdisk to launch a 2.88MB image file (2.88MB is a semi-standard floppy disk size, which never really became popular for real disks, but has been leveraged by most virtual floppy disk images in bootable CD-Roms for the expanded size). Okay what’s in the image file then? Nothing surprising, it’s a FreeDOS image, with Crucial’s own utility and the firmware.
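
For reference, the 2.88MB figure comes straight out of the floppy geometry. The short Python sketch below does the arithmetic, assuming the usual extra-high density layout of 80 cylinders, 2 heads, 36 sectors per track and 512-byte sectors; I have not checked that Crucial's image uses exactly this geometry.

```python
# Assumed geometry of a 2.88MB ("extra-high density") 3.5" floppy.
CYLINDERS = 80
HEADS = 2
SECTORS_PER_TRACK = 36
BYTES_PER_SECTOR = 512

# Total size of a raw image with that geometry, in bytes.
image_size = CYLINDERS * HEADS * SECTORS_PER_TRACK * BYTES_PER_SECTOR

print(image_size)          # size in bytes
print(image_size // 1024)  # size in KiB; the "2.88MB" name is 2880 of these
```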

So if you remember, I had some experience with trying to update BIOS through FreeDOS images, and I have my trusty USB stick with FreeDOS arriving on Tuesday with most of my personal effects that were in Italy waiting for me to find a final place to move my stuff to. But I wanted to see if I could try to boot the image file without needing the FreeDOS stick, so I checked. Grub2 does not include a direct way to load an image: the reference to memdisk in their manual refers to the situation where you’re loading a standalone or rescue Grub image, nothing to do with what we care about.

The most obvious way to run the image is through SYSLINUX’s memdisk loader. Debian even has a package that allows you to just drop the images in /boot/images and adds them to the grub menu. Quick and easy, no? Well, no. The problem is that memdisk needs to be loaded with linux16 — and, well, Grub 2 does not support it if you’re booting via EFI, like I am.

I guess I’ll wait until Tuesday night, when my BIOS disk will be here, and I’ll just use it to update the SSD. It should solve the issue once and for all.

Me and a RaspberryPi: Don’t Open That I/O Port

The article’s title is a play on the phrase “don’t open that door”, and makes more sense in Italian as we use the same word for ‘door’ and ‘port’…

So you left your hero (me) working on setting up a Raspberry Pi with at least a partial base of cross-compilation. The whole thing worked to a decent extent, but it wasn’t really as feasible as I hoped. Too many things, including Python, cannot cross-compile without further tricks, and the time it takes to figure out how to cross-compile them tends to be more than the time needed to just wait for them to build on the board itself. I guess this is why there is so little interest in getting cross-compilation supported.

But after getting a decent root, or stage4 if you prefer to call it that, I needed to get a kernel to boot the device. This wasn’t easy: there is no official configuration file published; what they tell you, if you want to build a new custom kernel, is to zcat /proc/config.gz from Raspbian. I didn’t want to use Raspbian, so I looked further. The next step was to check out the defconfig settings that the kernel repository includes; a few different ones exist.

You’d expect them to be actually thought out to enable exactly what the RaspberryPi provides, and nothing more or less. Some leeway can be expected for things like network options, but at least the “cutdown” version should not include all of IrDA, Amateur Radio, Wireless, Bluetooth, USB network, PPP, … After disabling a bunch of options, since the system I need to run will have very few devices connected – in particular, only the Davis Vantage Pro station, maybe a printer – I built the kernel and copied it over the SD card. It booted, it crashed. Kernel panicked right away, due to a pointer dereference.

After some rebuild-copy-test cycles I was able to find out what the problem is. It’s a problem that is not unique to the RPi actually, as I found the same trace from an OMAP3 user reporting it somewhere else. The trick was disabling the (default-enabled) in-kernel debugger – which I couldn’t access anyway, as I don’t have a USB keyboard at hand right now – so that it would print the full trace of the error. That pointed at the l4_init function, which is the initialization of the Lightning 4 gameport controller, an old-style MIDI game port.

My hunch is that this expansion card is an old-style ISA card, since it does not rely on PCI structures to probe for the device — I cannot confirm it because googling for “lightning 4” only comes up with images of iPad and accessories. What it does, is simply poking at the 0x201 address, and the moment when it does, you get a bad dereference from the kernel exactly at that address. I’ve sent a (broken, unfortunately) patch to the LKML to see if there is an easy way to solve this.

To be honest and clear, if you just take a defconfig and build it exactly as-is, you won’t be hitting that problem. The problem happens to me because in this kernel, like in almost every other one I built, I do one particular thing: I disable modules so that I end up with a single, statically built kernel. This in turn means that all the drivers are initialized when you start the kernel, and the moment when the L4 driver is started, it crashes the kernel. Possibly it’s not the only one.

This is most likely not strictly limited to the RaspberryPi but it doesn’t help that there is no working minimal configuration – mine is, by the way, available here – and I’m pretty sure there are other similar situations even when the arch is x86… I guess it’s just a matter of reporting them when you encounter them.

Predictable persistently (non-)mnemonic names

This is part two of a series of articles looking into the new udev “predictable” names. Part one is here and talks about the path-based names.

As Steve also asked on the comments from last post, isn’t it possible to just use the MAC address of an interface to point at it? Sure it’s possible! You just need to enable the mac-based name generator. But what does that mean? It means that your new interface names will be enx0026b9d7bf1f and wlx0023148f1cc8 — do you see yourself typing them?
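
The shape of those names is simple enough to sketch in a few lines of Python: a two-letter class prefix, an x, and the MAC address as twelve lowercase hex digits. The function name mac_based_name is made up, and this is only my reading of the two example names above, not udev's actual code.

```python
def mac_based_name(prefix: str, mac: str) -> str:
    """Build a MAC-based interface name such as enx0026b9d7bf1f.

    prefix is the two-letter class ("en", "wl", ...); mac is the usual
    colon- or dash-separated hardware address.
    """
    digits = mac.replace(":", "").replace("-", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return f"{prefix}x{digits}"


print(mac_based_name("en", "00:26:b9:d7:bf:1f"))
print(mac_based_name("wl", "00:23:14:8f:1c:c8"))
```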

Myself, I’m not going to type them. My favourite suggestion to solve the issue is to rely on rules similar to the previous persistent naming, but not re-using the eth prefix to avoid collisions (which will no longer be resolved by future versions of udev). I instead use the names wan0 and lan0 (and so on), when the two interfaces sit straddling between a private and a public network. How do I achieve that? Simple:

SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:17:31:c6:4a:ca", NAME="lan0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:07:e9:12:07:36", NAME="wan0"

Yes these simple rules are doing all the work you need if you just want to make sure not to mix the two interfaces by mistake. If your server or vserver only has one interface, and you want to have it as wan0 no matter what its mac address is (easier to clone, for instance), then you can go for

SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="*", NAME="wan0"

As long as you only have a single network interface, that will work just fine. For those who use Puppet, I also published a module that you can use to create the file, and ensure that the other methods to achieve “sticky” names are not present.
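
If you would rather generate the rules than type them, a small Python sketch along these lines produces the same text as the rules above from a mapping of MAC addresses to names. The function name is hypothetical, and where the result gets written is left to you, as the file name depends on your setup.

```python
def persistent_net_rules(names: dict) -> str:
    """Render one udev rule per MAC address -> interface name pair."""
    lines = []
    for mac, name in names.items():
        # Doubled braces produce the literal ATTR{address} in the output.
        lines.append(
            f'SUBSYSTEM=="net", ACTION=="add", '
            f'ATTR{{address}}=="{mac.lower()}", NAME="{name}"'
        )
    return "\n".join(lines) + "\n"


print(persistent_net_rules({
    "00:17:31:c6:4a:ca": "lan0",
    "00:07:e9:12:07:36": "wan0",
}), end="")
```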

My reasoning for actually using this kind of name is relatively simple: the rare places where I do need to specify the interface name are usually in ACLs, the firewall, and so on. In these, the most important part to me is knowing whether the interface is public or not, so the wan/lan distinction is the most useful. I don’t intend trying to remember whether enp5s24k1f345totheright4nextothebaker is the public or private interface.

Speaking about which, one of the things that appears obvious even from Lennart’s comment to the previous post, is that there is no real assurance that the names are set in stone: he says that a udev upgrade won’t change them, but I guess most people would be sceptical, remembering the track record that udev and systemd have had over the past few months alone. In this situation my personal, informed opinion is that all this work on “predictable” names is a huge waste of time for almost everybody.

If you do care about stable interface names, you most definitely expect them to be more meaningful than 10-digit strings of paths or mac addresses, so you almost certainly want to go through with custom naming, so that at least you attach some sense to the names themselves.

On the other hand, if you do not care about interface names themselves, for instance because instead of running commands or scripts, you just use NetworkManager… well what the heck are you doing playing around with paths? If it doesn’t bother you that the interface for a USB device changes considerably between one port and another, how can it matter to you whether it’s called wwan0 or wwan123? And if the name of the interface does not matter to you, why are you spending useless time trying to get these “predictable” names working?

All in all, I think this is just a useless nice trick, one that will cause more headaches than it can possibly solve. Bahumbug!

Predictably non-persistent names

This is going to be fun. The Gentoo “udev team”, in the person of Samuli – who seems to suffer from 0-day bump syndrome – decided to now enable by default the new predictable names feature that is supposed to make things so much nicer in Linux land where, especially for people coming from FreeBSD, things have been pretty much messed up. This replaces the old “persistent” names, that were often enough too fragile to work, as they did in-place renaming of interfaces, and would cause way too often conflicts at boot time, since swapping two devices’ names is not an atomic operation for obvious reasons.

So what’s this predictable name all about? Well, it’s mostly a merge of the previous persistent naming system, and the BIOS label naming project which was developed by RedHat for a few years already so that the names of interfaces for server hardware in the operating system match the documentation of said server, so that you can be sure that if you’re connecting the port marked with “1” on the chassis, out of four on the motherboard, it will bring up eth2.

But why were those two technologies needed? Let’s start first with explaining how (more or less) the kernel naming scheme works: unlike the BSD systems, where the interfaces are named after the kernel driver (en0, dc0, etc.), the Linux kernel uses generic names, mostly eth, wlan and wwan, and maybe a couple more, for tunnels and so on. This causes the first problem: if you have multiple devices of the same class (ethernet, wlan, wwan) coming from different drivers, the order of the interfaces may very well vary between reboots, either because of changes in the kernel, if the drivers are built-in, or simply because of locking and the order in which modules get loaded (which is much more common for binary distributions).

The reason why changes in the kernel can change the order is that the order in which drivers are initialized has changed before and might change again in the future. A driver could also decide to change the order with which its devices are initialized (PCI tree scanning order, PCI ID order, MAC address order, …) and so on, causing it to change the order of interfaces even for the same driver. More about this later.

But here my first doubt arises: how common is it for people to have more than one interface of the same class from vendors different enough to use different drivers? Well it depends on the class of device; on a laptop you’d have to search hard for a model with more than one Ethernet or wireless interface, unless you add an ExpressCard or PCMCIA expansion card (and even those are not that common). On a desktop, I’ve seen a few very recent motherboards with more than one network port, and I have yet to see one with different chips for the two. Servers, that’s a different story.

Indeed, it’s not that uncommon to have multiple on-board and expansion card ports on a server. For instance you could use the two onboard ports as public and private interfaces for the host… and then add a 4-port card to split between virtual machines. In this situation, having a persistent naming of the interfaces is indeed something you would be glad of. How can you tell which one of eth{0..5} is your onboard port #2, otherwise? This would be problem number two.

Another situation in which having a persistent naming of interfaces is almost a requirement is if you’re setting up a router: you definitely don’t want to switch the LAN and WAN interface names around, especially where the firewall is involved.

This background is why the persistent-net rules were devised quite a few years ago for udev. Unfortunately almost everybody got at least one nasty experience with them. Sometimes the in-place rename would fail, and you’d end up with the temporary names at the end of boot. In a few cases the name was not persistent at all: if the kernel driver for the device would change, or change name at least, the rules wouldn’t match and your eth0 would become eth1 (this was the case when Intel split the e1000 and e1000e drivers, but it’s definitely more common with wireless drivers, especially if they move from staging to main).

So the old persistent net rules were flawed. What about the new predictable rules? Well, not only do they include the BIOS naming scheme (which is actually awesome when it works, though SuperMicro servers such as Excelsior do not expose the label, and my Dell laptop only exposes a label for the Ethernet port but not for either the wireless adapter or the 3G one), but they also have two “fallbacks” that are supposed to be used when the labels fail, one based on the MAC address of the interface, and the other based on the “path”, which for most PCI, PCI-E, onboard, ExpressCard ports is basically the PCI address; for USB… we’ll see in a moment.

So let’s see, from my laptop:

# lspci | grep 'Network controller'
03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6200 (rev 35)
# ifconfig | grep wlp3
wlp3s0: flags=4163  mtu 1500

Why “wlp3s0”? It’s the Wireless adapter (wl) PCI (p) card at bus 3, slot 0 (s0): 03:00.0. Matches lspci properly. But let’s see the WWAN interface on the same laptop:

# ifconfig -a | grep ww
wwp0s29u1u6i6: flags=4098  mtu 1500

Much longer name! What’s going on then? Let’s see, it’s reporting a card at bus 0, slot 29 (0x1d); lspci uses hexadecimal numbers for the addresses:

# lspci | grep '00:1d'
00:1d.0 USB controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
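
Since the interface name carries decimal numbers and lspci prints hexadecimal ones, matching the two by eye is error-prone. A minimal Python sketch of the conversion follows; the function name is made up, and it deliberately ignores the PCI domain and the function number, neither of which shows up in the names above.

```python
import re


def pci_address_from_name(ifname: str) -> str:
    """Turn the pXsY part of a path-based name into an lspci-style bus:slot."""
    match = re.match(r"^[a-z]{2}p(\d+)s(\d+)", ifname)
    if match is None:
        raise ValueError(f"no PCI bus/slot part in {ifname!r}")
    bus, slot = int(match.group(1)), int(match.group(2))
    # The name uses decimal numbers, lspci prints two hex digits each.
    return f"{bus:02x}:{slot:02x}"


print(pci_address_from_name("wlp3s0"))         # compare with 03:00.0
print(pci_address_from_name("wwp0s29u1u6i6"))  # compare with 00:1d.0
```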

Okay so it’s a USB device, even though the physical form factor is a mini-PCIE card. It’s common. Does it match lsusb?

# lsusb | grep Broadband
Bus 002 Device 004: ID 413c:8184 Dell Computer Corp. F3607gw v2 Mobile Broadband Module

It does not use the Bus/Device specification there, which is good: the device number will increase every time you pop something in/out of the port, so it’s not persistent across reboots at all. What it uses is the path to the device in terms of USB ports, which is a tad more complex, but basically means it matches /sys/bus/usb/devices/2-1.6:1.6/ (I don’t pretend to know how the thing works exactly, but it describes to which physical port the device is connected).
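
As far as I can tell, the u1u6i6 suffix is derived from that sysfs name like this: each port in the chain gets a u prefix, then the interface number gets an i prefix. The Python sketch below encodes that reading; the rule that configuration 1 and interface 0 are left out of the name is my understanding of udev's behaviour rather than something verified here, so treat it as an assumption, and the function name is made up.

```python
def usb_suffix_from_sysfs(sysname: str) -> str:
    """Map a sysfs USB interface name like 2-1.6:1.6 to a suffix like u1u6i6."""
    # "2-1.6:1.6" -> bus "2", port chain "1.6", config "1", interface "6"
    bus_ports, _, conf_intf = sysname.partition(":")
    _bus, _, ports = bus_ports.partition("-")
    config, _, interface = conf_intf.partition(".")
    if not ports or not config or not interface:
        raise ValueError(f"unexpected sysfs name: {sysname!r}")

    suffix = "".join(f"u{port}" for port in ports.split("."))
    # Assumption: the default configuration (1) and interface (0) are omitted.
    if config != "1":
        suffix += f"c{config}"
    if interface != "0":
        suffix += f"i{interface}"
    return suffix


print(usb_suffix_from_sysfs("2-1.6:1.6"))
```

Note that the USB bus number (the 2 in front) does not appear in the result at all; the PCI address of the controller takes its place in the full interface name.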

In my laptop’s case, the situation is actually quite nice: I cannot move either the WLAN or WWAN device to a different slot so the name assigned by the slot is persistent as well as predictable. But what if you’re on a desktop with an add-on WLAN card? What happens if you decide to change your video card, with a more powerful one that occupies the space of two slots, one of which happens to be the place where your WLAN card is? You move it, reboot and… you just changed the interface name! If you’ve been using Network Manager, you’ll just have to reconfigure the network I suppose.

Let’s take a different example. My laptop, with its integrated WWAN card, is a rare example; most people I know use USB “keys”, as the providers give them away for free, at least in Italy. I happen to have one as well, so let me try to plug it in one of the ports of my laptop:

# lsusb | grep modem
Bus 002 Device 014: ID 12d1:1436 Huawei Technologies Co., Ltd. E173 3G Modem (modem-mode)
# ifconfig -a | grep ww
wwp0s29u1u2i1: flags=4098  mtu 1500
wwp0s29u1u6i6: flags=4098  mtu 1500

Okay great this is a different USB device, connected to the same USB controller as the onboard one, but at different ports, neat. Now, what if I had all my usual ports busy, and I decided to connect it to the USB3 add-on ExpressCard I got on the laptop?

# lsusb | grep modem
Bus 003 Device 004: ID 12d1:1436 Huawei Technologies Co., Ltd. E173 3G Modem (modem-mode)
# ifconfig -a | grep ww
wwp0s29u1u6i6: flags=4098  mtu 1500
wws1u1i1: flags=4098  mtu 1500

What’s this? Well, the USB3 controller provides slot information, so udev magically uses that to rename the interface, so it avoids using the otherwise longer wwp6s0u1i1 name (the USB3 controller is on the PCI bus 6).

Let’s go back to the on-board ports:

# lsusb | grep modem
Bus 002 Device 016: ID 12d1:1436 Huawei Technologies Co., Ltd. E173 3G Modem (modem-mode)
# ifconfig -a | grep ww
wwp0s29u1u3i1: flags=4098  mtu 1500
wwp0s29u1u6i6: flags=4098  mtu 1500

Seems the same, but it’s not. Now it’s u3 not u2. Why? I used a different port on the laptop. And the interface name changed. Yes, any port change will produce a different interface name, predictably. But what happens if the kernel decides to change the way the ports are enumerated? What happens if the USB 2 driver is buggy and is supposed to provide slot information, and they fix it? You got it, even in these cases, the interface names are changed.

I’m not saying that the kernel naming scheme is perfect. But if you’re expected to always only have an Ethernet port, a WLAN card and a WWAN USB stick, with it you’ll be sure to have eth0, wlan0 and wwan0, as long as the drivers are not completely broken as they are now (like if the WLAN is appearing as eth1), and as long as you don’t muck with the interface names in userspace.

Next up, I’ll talk about the MAC addresses based naming and my personal preference when setting up servers and routers. Have fun in the meantime figuring out what your interface names will be.

Sophistication can be bad

Everybody has heard about the KISS principle I guess: the idea is that the less complex a moving part is, the better. This is true in software as much as in mechanics. Unix in particular, and all the Unix-like projects including GNU, also tended to follow that principle, as shown by the huge number of small utilities that each do only one particular text or file editing function (that is, until you introduce sed, awk and find).

Now we all know that the main sophistication that is afoot in the Linux world nowadays is Lennart’s systemd. I have no intention to discuss it now, or at any later time I’d say. I really don’t care as long as I have a choice not to use it, and judging from a given thread I think we’ll always have an alternative, no matter what some people said before and keep saying.

No, my problem today is not with udev deciding it’s time to stop using the same persistent rules that people had to fight with for years and that now are no longer usable, and instead it’s a problem with util-linux, and in particular with the losetup utility that manages the loop devices. See, the loop devices have been quite a big deal in the past, mostly because they started as a fixed amount, then the kernel let you decide how many, and then finally code was enabled that would let you change dynamically the amount of loop devices you want to have available. Great, but it required a newer version of util-linux, and at the time when it was introduced, there wasn’t one that actually worked as intended.

Anyway, in the past week I’ve been working on building a new firmware image for the device I’m working on, and when it came down to running the script that generates the image to burn on the SSD, it locked up with 100% CPU usage (luckily the system is multicore so I could get in to kill it). The problem was to be found in losetup, so today with enough time on my hands, I went to check it out. Turns out that the reason why it failed was a joint issue between my setup, OpenRC updates, and util-linux updates, but let’s proceed in order.

The build happens in a container for which I was not mounting /sys, or at least so I intended, although it is possible that OpenRC mounted it on its own; this has changed recently, but I don’t think those changes hit stable yet, so I’m not sure that’s the case. I had created static nodes for the loop devices and for /dev/loop-control, but the latter was not to be found at first today. Maybe I deleted it by mistake or something along those lines. But the point is it worked before, and nothing changed beside an emerge -avuDN.

So, what happens is that the script is running something along the lines of losetup --find --show file which is intended to find the first available loop device, set up the file, and then print the loop device that was found. It’s a bit more complex than this as I’m explicitly setting up only the partition on the loop device (getting partitioned loop devices to play cool with LXC is a pain), but the point stands. Unfortunately, when both /dev/loop-control and /sys are unreachable, the looping around that should give us the first available device is looping over the same device over and over and over again, never trying the next. This causes the problem noted above, of losetup locking at 100% CPU usage.

And it’s definitely not the only problem! If you just execute losetup --find, which should give you the first available device, it provides you /dev/loop0 even if that device is already in use. Not content enough with these problems? losetup -a lists no device, even when they are present, and still returns with a valid, zero exit status. Which is definitely not the case!

Okay you can say that losetup is already trying its best by using not one but three different sources (the third one is /proc/partitions) to find the data to use, but when the primary two are not usable, you shouldn’t expect it to give you proper information, should you? Well, that’s not the point. The big problem is that it should tell me “man, I can’t get you the data you requested because I need more sources, give me the sources!” instead of trying its best, failing, and locking up.

The next question is obviously “why are you ranting, instead of fixing it?”: the answer is that I tried, but the code I was reading made me cry. The problem is that nowadays, losetup is just a shallow interface to some shared code in util-linux… and the design of said code makes it very difficult to tell whether a non-zero return value from a function means “we reached the end of the list” or “I couldn’t see anything because I lack my sources”. And it really didn’t feel like a good idea for me to start throwing away that code to replace it with something more KISS-compliant.
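
To show what I mean by keeping those two conditions apart, here is a minimal Python sketch. It is not util-linux's code and does not touch any real device; the names SourcesUnavailable and first_free_loop are made up, and the two arguments stand in for whatever /sys, /dev/loop-control or /proc/partitions would have told us.

```python
class SourcesUnavailable(Exception):
    """None of the sources needed to inspect the loop devices can be read."""


def first_free_loop(devices, in_use):
    """Return the first free loop device, or None at the end of the list.

    devices: candidate device paths, or None if they could not be listed.
    in_use:  set of busy device paths, or None if usage is unknown.
    """
    if devices is None or in_use is None:
        # Say so loudly instead of guessing (or spinning on the same device).
        raise SourcesUnavailable("cannot inspect loop devices: missing sources")
    for dev in devices:
        if dev not in in_use:
            return dev
    return None  # genuinely reached the end of the list


print(first_free_loop(["/dev/loop0", "/dev/loop1"], {"/dev/loop0"}))
```

With this split, "every device is busy" is an ordinary return value the caller can act on, while "I have no way of knowing" is an error that cannot be mistaken for it.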

So at the end of the day, I fixed my container to mount /sys and everything works, but util-linux is still broken upstream.