Windows 10, NVMe SSDs, VROC

It sounded like an easy task: my main SSD was running out of space and I decided to upgrade to a 2TB Samsung 970 NVMe drive. It would usually be an easy task, but clearly I shouldn’t expect things to be easy, given the way I use a computer, even 20 years after I started doing incredibly rare stuff.

It ended up with me reinstalling Windows 10 three times, testing the Acronis backup restore procedure, buying more adapters than I ever thought I would need, and cursing my laziness when I set up a bunch of stuff in the past.

Let’s start with a bit of setup information: I’m talking about the gamestation, which I bought after moving to London because someone among the moving companies (AMC Removals in Ireland, and Simpsons Removals in London) stole it. It uses an MSI X299 SLI PLUS motherboard, and when I bought it, I bought two Crucial M.2 SSDs, for 1TB each — one dedicated to the operating system and applications, and the other to store the ever-expanding photos library.

At some point a year or so ago, the amount of pictures I took crossed the 1TB mark, and I needed more space for the photos. So thanks to the fact that NVMe SSDs became more affordable, and that you can pretty much turn any PCIe 3.0 x4 slot into an NVMe slot with a passive adapter, I decided to do just that, and bought a Samsung 970 EVO Plus 1TB, copied the operating system to it, and made the two older Crucial SSDs into a single “Dynamic Volume” to have more space for pictures.

At first I used a random passive adapter that I bought on Amazon, and while that worked perfectly well for connecting the device, it had some trouble keeping the temperature down: Samsung’s software reported a temperature between 68°C and 75°C, which it considers “too high”. I spent a lot of time trying to find a way around this, ended up replacing all the fans on the machine and adding more, and managed to bring it down to a constant 60°C or so. Phew.

A few months later, I found an advertisement for the ASUS Hyper M.2 card, a pretty much passive card that lets you use up to four NVMe SSDs on a PCI-E x16 slot, as long as your CPU supports “bifurcation”, which I checked that both my CPU and motherboard support. In addition to letting you add a ton of SSDs to a motherboard, the Hyper M.2 has a big aluminium heatsink and a fan, which makes it useful for keeping the SSD temperature under control. Although I’ll be honest and say that I’m surprised that Asus didn’t even bother adding PWM fan control: it has an on/off switch that pokes out of the chassis and that’s about it.

Now fast forward a few more months: my main drive is also full, and Microsoft has deprecated Dynamic Volumes in favour of Storage Spaces. I decided that I would buy a new, bigger SSD for the main drive, and then use this chance to migrate the photos to a storage space bundling together all three of the remaining SSDs. Since I already had the Hyper M.2 and I knew my CPU supported bifurcation, I thought it wouldn’t be too difficult to have all four SSDs connected together…

Bifurcation and VROC

The first thing to know is that the Hyper M.2 card, when loaded with a single NVMe SSD, behaves pretty much the same way as a normal PCI-E-to-M.2 adapter: the single SSD gets the four lanes, and is seen as a normal PCI-E device by the firmware and operating system. If you connect two or more SSDs, now things are different, and you need bifurcation support.

PCI-E bifurcation allows splitting an x8 or x16 slot (8 or 16 PCI-E lanes) into two or four x4 slots, which are needed for NVMe. It requires support from the CPU (because that’s where PCI-E lanes terminate), and from the BIOS (to configure the bifurcation), and from the operating system, for some reason that is not entirely clear to me, not being a PCI-E expert.

So the first problem I found when trying to get the second SSD to work on the Hyper M.2 is that I didn’t realise how complicated the selection of which PCI-E slot gets how many lanes is on modern motherboards. Some slots are connected to the chipset (PCH) rather than directly to the CPU, but you want the video card and the NVMe drives to go to the CPU. When you’re using the M.2 slots, they take some of the lanes away, and which lanes they take depends on whether you’re using them in SATA or NVMe mode. And how many lanes you have available in the first place depends on your CPU.

Pretty much, you will need to do some planning, and maybe a pen-and-paper diagram, to follow through. In particular, you need to remember that the lane assignment is static. Even though you do have a full x16 slot at the bottom of your motherboard, and you have 16 free lanes, that doesn’t mean those 16 lanes are routed to that slot. Indeed it turned out that the bottom slot only gets x8 at best with my CPU, and instead I needed to move the Hyper M.2 two slots up. Oops.

The next problem was that despite Ubuntu Live being able to access both NVMe drives transparently, and the firmware able to boot out of them, Windows refused to boot complaining about inaccessible boot device. The answer for this one is to be found in VROC: Virtual RAID on CPU. It’s Intel’s way to implement bifurcation support for NVMe drives, and despite the name it’s not only there if you plan on using your drives in a RAID configuration. Although, let me warn here, from what I understand, bifurcation should work fine without VROC, but it looks like most firmware just enables the two together, so at least on my board you can’t use bifurcated slots without VROC enabled.

The problem with VROC is that while Ubuntu seems to handle it natively, Windows 10 doesn’t. Even 20H1 (the most recent release at the time of writing) doesn’t recognize SSDs sitting behind a bifurcated slot with VROC enabled unless you provide it with a driver, which is why you end up with the inaccessible boot device. It’s the equivalent of building your own Linux kernel and forgetting the disk controller driver or the SCSI disk driver. I realized that when I tried doing a clean install (hey, I do have a backup for a reason!), and the installer didn’t even see the drives, at all.

This is probably the closest I’m getting to retrocomputing, by reminding me of installing Windows XP for a bunch of clients and friends, back when AHCI became common, and having to provide a custom driver disk. Thankfully, Windows 10 can take that from USB, rather than having to fiddle around with installation media or CD swap. And indeed, the Intel drivers for VROC include a VMD (Volume Management Device) driver that allows Windows 10 to see the drives and even boot from them!

A Compromising Solution

So after that I managed to get Windows 10 installed and set up, and one of my biggest worries went away: back when my computer was stolen and I reinstalled Windows 10, the license was still attached to the old machine and I had to call tech support to get it activated, so I wasn’t sure if it would let me re-activate it again; it did.

Now, the next step for me was to make sure that the SSD had the latest firmware, was genuine and was correctly set up, so I installed the Samsung Magician tools, and… it didn’t let me do any of that, because it reported Intel as the provider of the NVMe driver, despite Windows reporting the drive as using Microsoft’s own NVMe driver. I guess what they mean is that the VROC driver interferes with direct access to the devices. But it means you lose access to all SMART counters from Samsung’s own software (I expect other software might still be able to access them), with no authenticity checks and in particular no temperature warning. Given I knew that this had been an issue in the past, this worried me.

As far as I could tell, when using the Hyper M.2, you not only lose access to the SSD manufacturer tooling (like Magician), but I’m not even sure if Windows can still access the TRIM facilities — I didn’t manage to confirm for good, I got an error when I tried using it, but it might have been related to another issue that will become apparent later.

And to top it all off, if you do decide to move the drives out of the Hyper M.2 card, say to bring them back to the motherboard, you are back to square one with the boot device being inaccessible, because Windows will look for the VROC VMD, which will be gone.

At that point I pretty much decided that the Hyper M.2 card and the whole VROC feature wouldn’t work out for me: too many compromises. I decided to take a different approach, and instead of moving the NVMe drives away from the M.2 slots, I planned to move the SATA drives away from the M.2 slots.

You see, the M.2 slots can carry either NVMe drives, using PCI-E directly, or the still common SATA SSDs. The connector is keyed, although I’m not entirely sure why, as there’s nothing preventing you from trying to connect a SATA M.2 SSD to a connector that only supports NVMe (such as the Hyper M.2), but that’s a different topic that I don’t care to research myself. What matters is that you can buy passive adapters that convert an M.2 SATA SSD into a normal 2.5″ SATA one. You can find those on AliExpress, obviously, but I needed them quickly, so I ordered them from Amazon instead: I got Sabrent ones because they were available for immediate dispatch. Be careful, though, because they sell both M.2 and mSATA converters; the drives speak the same SATA protocol and only need a passive adapter, but the connectors are different.

Storage Space and the return of the Hyper M.2

After reinstalling with the two Samsung SSDs on the motherboard’s M.2 slots I finally managed to get Samsung Magician working, which confirmed not only that the drive is genuine, but also that it already has the latest firmware (good). Unfortunately it also told me that the temperature of the SSD was “too high”, at around 65°C.

The reason for that is that the motherboard predates the more common NVMe drives, and unlike LGR’s, it doesn’t have full aluminium heatsinks to bolt on top of the SSDs to keep the temperature down. It came instead with a silly “shield” that might be worse than not having one at all, and it positions the first M.2 slot… right underneath the video card. Oops! Thankfully I do have an adapter with a heatsink that allows me to connect a single SSD to a PCI-E slot without needing VROC: the Hyper M.2 card. So I settled on re-opening the computer, moving the 2TB SSD to the Hyper M.2, and being done with it. Easy peasy, and since I already had the card this is probably worth it.

Honestly if I didn’t have the card I would probably have gone for one of those “cards” that have both a passive NVMe adapter and a passive SATA adapter (needing the SATA data cable, but not the power), since at that point I would have been able to keep one SATA SSD on the motherboard (they don’t get as hot it seems), but again, I worked with what I had at hand.

Then, as I said above, I also wanted to take this chance to migrate my Dynamic Volumes to the new Storage Spaces, which are supposed to be better supported and include more modern features for SSDs. So once I got everything reinstalled, I tried creating a new pool and setting it up… to no avail. The UI didn’t let me create the pool. Instead I ended up using the command line via PowerShell, and that worked fine.

Though do note that the commands on Windows 10 2004/20H1 are different from those for older Server versions, which makes looking for solutions on ServerFault and similar very difficult. Also, it turns out that between deleting the Dynamic Volumes from two disks and adding them to a Storage Spaces pool, you need to reboot your computer. And the default way to select a disk (by the “Friendly Name”, as Windows 10 calls it) is to use the model number, which makes things interesting when you have two pairs of SSDs with the same name (Samsung doesn’t bother adding the size to the model name as reported by Windows).
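As an added example (these are not the commands from the original post), this is how the pool creation can be done from PowerShell while sidestepping the Friendly Name collision: the member disks are picked by serial number rather than by model string. It assumes an elevated session, that the Dynamic Volume has already been deleted from the two disks and the machine rebooted, and that the two serial numbers are placeholders to be replaced with your own.

```powershell
#Requires -RunAsAdministrator
# Added example, not the post's own commands. Build a Storage Spaces pool from two
# disks that share a FriendlyName (same model string), selecting them by serial.
# The serial numbers below are placeholders; replace them with yours.

# 1. Show every disk Windows considers poolable. FriendlyName is the model
#    string, so only SerialNumber/UniqueId tells two identical drives apart.
Get-PhysicalDisk -CanPool $true |
    Format-Table FriendlyName, SerialNumber, UniqueId, BusType,
        @{n='SizeGB'; e={[math]::Round($_.Size / 1GB)}}

# 2. Select the members by serial number and fail loudly if any is missing
#    (a disk still carrying a Dynamic Volume, or not rebooted yet, is not poolable).
$serials = @('S1234567890ABCD', 'S0987654321ZYXW')   # placeholders
$members = @(Get-PhysicalDisk -CanPool $true |
    Where-Object { $serials -contains "$($_.SerialNumber)".Trim() })
if ($members.Count -ne $serials.Count) {
    throw "Expected $($serials.Count) poolable disks, found $($members.Count). Did you reboot after deleting the Dynamic Volume?"
}

# 3. Create the pool and a 'Simple' (striped, no redundancy) space using all the
#    capacity, then partition and format it. Simple is the closest match to the
#    old spanned Dynamic Volume: losing one SSD loses the whole volume.
$subsystem = Get-StorageSubSystem | Where-Object FriendlyName -like 'Windows Storage*'
$pool = New-StoragePool -FriendlyName 'Photos' `
    -StorageSubSystemUniqueId $subsystem.UniqueId -PhysicalDisks $members
$space = $pool | New-VirtualDisk -FriendlyName 'PhotosSpace' `
    -ResiliencySettingName Simple -ProvisioningType Fixed -UseMaximumSize
$space | Get-Disk | Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -UseMaximumSize -AssignDriveLetter |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel 'Photos' -Confirm:$false

# 4. Later, to pull one member out: mark it retired, let the space move its data
#    onto the remaining disks, then remove it from the pool (needs free space).
# Set-PhysicalDisk -UniqueId $members[0].UniqueId -Usage Retired
# Repair-VirtualDisk -FriendlyName 'PhotosSpace'
# Remove-PhysicalDisk -StoragePoolFriendlyName 'Photos' -PhysicalDisks (Get-PhysicalDisk -UniqueId $members[0].UniqueId)
```

The disk listing in step 1 is worth keeping around: once the disks are in the pool they no longer show up in Disk Management, so the serial numbers and UniqueIds printed there are the easiest way to map a physical drive back to a pool member later.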

And then there’s the kicker, which honestly got me less angry than everything else that went on, but did annoy me more than I let on: Samsung Magician lost access to all the disks connected to the Storage Spaces pool! I assume this is because the moment they are added to the pool, Windows 10 does not show them in the Disk Management interface either, and Magician is not updated to identify disks at a lower level. It’s probably a temporary situation, but Storage Spaces are also fairly uncommon, so maybe they will not bother fixing that.

The worst part is that even the new SSD disappeared, probably for the reason noted above: it has the same name as the disk that is in the Storage Spaces Pool. Which is what made me facepalm — given I once again lost access to Samsung’s diagnostics, although I confirmed the temperature is fine, the firmware has not changed, and the drive is genuine. I guess VROC would have done just as well, if I confirmed the genuineness before going on with the reinstalling multiple times.


Originally, I was going to say that the Hyper M.2 is a waste of time on Windows. The fact that you can’t actually monitor the device with the Samsung software is more than just annoying — I probably should have looked for alternative monitoring software to see if I could get to the SMART counters over VROC. On Linux of course there’s no issue with that given that Magician doesn’t exist.

But if you’re going to install that many SSDs on Windows, you’re likely going to need to use Storage Spaces, in which case the fact that Magician doesn’t work is also moot, as it wouldn’t work either way. The only thing you need to do is make sure that you have the drivers to install this correctly in the first place. Using the Hyper M.2, particularly on slightly older motherboards that don’t have good enough heatsinks for their M.2 slots, turns out to be fairly useful.

Also Storage Spaces, despite being a major pain in the neck to set up on Windows 10, appear to do a fairly good job. Unlike Dynamic Volumes they do appear to balance the writing to multiple SSDs, they support TRIM, and there’s even support for preparing a disk to be removed from the pool, moving everything onto the remaining disks (assuming there’s enough space), and freeing up the drive.

If I’m not getting a new computer any time soon (and I would hope I won’t have to), I have a feeling I’ll go back to use the Hyper M.2 for VROC mode, even if it means reinstalling Windows again. Adding another 2TB or so of space for pictures wouldn’t be the cheapest idea, but it would allow expansion at a decent rate until whatever next technology arrives.

Windows 10, OpenSSH and YubiKey

You may remember that a few months ago I suggested that Windows 10 is an interesting FLOSS development platform now, and that I decided to start using Windows 10 on my Dell XPS laptop (also in the hope that the problem I had with the battery was caused by Linux; the answer to that is “nope”, the laptop’s battery really is terrible.) One of the things I realised while setting all of that up is that I found myself unable to use my usual OpenPGP-based token, and I thought I would try using a YubiKey 5 instead.

Now, between me and Yubico there’s not much love lost, but I thought I would try to make my life easier by using a smartcard that seemed to have a company interested in this kind of usage behind it. Turns out that this was only partially working, unfortunately.

The plan was to set up the PIV mode of the YubiKey 5 to provide the authentication certificate, rather than trying to use the OpenPGP mode. The reason for that is to be found on Yubico’s own website:

GPG4Win’s smart card support is not rock solid; occasionally you might get error messages when trying to access the YubiKey. It might happen after removing and re-inserting the YubiKey, or after your computer has been in sleep mode, etc. This can be resolved by restarting gpg-agent [snip]

Given that GnuPG’s own smartcard support is kind of terrible already, and not wanting to get into the yak shaving of getting that to work on Windows, I was hoping to use the more common (on Windows) interface, PKCS#11, which OpenSSH supports natively (sort of). To give a very quick and oversimplified summary, PKCS#11 is the definition of an API/ABI that end user software, such as OpenSSH, can use to interface with middleware that provides access to PKI-related functions. Many smartcard manufacturers provide ready-made middleware implementing a PKCS#11 interface, which I thought Windows supported directly, but I may be wrong. Mozilla browsers rely on this particular interface to handle CA certificates as well, to the point that the NSS library that Mozilla uses is pretty much a two-part component with a PKCS#11 provider and a PKCS#11 client.

As it turns out, Yubico develops a PKCS#11 middleware for YubiKey as part of yubico-piv-tool, and provides documentation on how to use it for SSH authentication. Unfortunately the instructions don’t really expand to include the information needed for using this on Windows, as they explicitly say at the top of the page. But that would never stop me, after all. Most of the setup described in that document is perfectly applicable to Windows, by the way… until you get to the first issue.

The first issue with setting this up is that while Windows 10 does ship with an OpenSSH client (and server), it does not ship with PKCS#11 support enabled. Indeed, the version provided even with 20H1 (the current most recent non-Insider build) is 7.7p1, while the current upstream release would be 8.3p1. Thankfully, Microsoft is providing a more up-to-date build, although that’s also still stuck at 8.1p1. The important part is that these binaries do include PKCS#11 support.

For this whole thing to work, you need to have both the OpenSSH binaries provided by Microsoft and the Yubico libraries (DLLs) in folders that are part of the PATH environment variable. They also need to match in ABI: if you’re setting this up on an x64 system and used the 64-bit OpenSSH build, you should install the 64-bit Yubico PIV Tool, and vice-versa for 32-bit installs.

Now, despite the installer warning you that to use the PKCS#11 provider you need to have the bin folder in the PATH variable, and that loading the provider with its full path will not be enough… the installer does not offer to modify the PATH itself, unlike the Git installer, which does, to make it easy to use globally. This is not too terrible, because you also need to add the new OpenSSH to the PATH. For myself, I decided to use a simple OpenSSH folder in my home directory.

Modifying the environment variables in (English) Windows 10 is fairly straightforward: hit the Search function, and type Environment — it’ll come up with the right control panel, and you can then edit the PATH variable and just browse for the right folder.

There is one more thing you need to do, and that is to create a .ssh/config file in your home directory with the following content:

PKCS11Provider libykcs11.dll

This instructs OpenSSH to look for the Yubico PKCS#11 provider automatically instead of having to specify it on the command line. Note once again that while you could provide the full path to the DLL file, if you didn’t add it to the PATH, it would likely not load — Windows 10 is stricter in where to look for dependencies when dynamically loading a DLL. And also, you’ll get a “not a valid win32 application” error if you installed/configured the wrong version of the Yubico tool (32-bit vs 64-bit).
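The steps above (matching ABIs, adding both folders to the PATH, writing the config line) as one PowerShell script. This is an added example, not the post’s or Yubico’s own instructions; it assumes the Microsoft OpenSSH build was unpacked into an OpenSSH folder in your profile and that the Yubico PIV Tool installer used its default folder, so both paths are assumptions to adjust. The PE header check is there because the “not a valid Win32 application” error mentioned above is exactly a 32-bit/64-bit mismatch, and it is cheaper to detect it before ssh tries to load the DLL.

```powershell
# Added example, not code from the post or from Yubico. Both folder paths are
# assumptions: adjust them to where you unpacked OpenSSH and installed the PIV Tool.
$openssh = Join-Path $HOME 'OpenSSH'
$ykcs11  = 'C:\Program Files\Yubico\Yubico PIV Tool\bin'

function Get-PeMachine {
    # Read the Machine field of a PE header: 0x014c = x86, 0x8664 = x64.
    param([Parameter(Mandatory)][string]$Path)
    $bytes = [IO.File]::ReadAllBytes($Path)
    if ($bytes.Length -lt 0x40 -or $bytes[0] -ne 0x4D -or $bytes[1] -ne 0x5A) {
        throw "$Path is not a PE file (no MZ header)"
    }
    $peOffset = [BitConverter]::ToInt32($bytes, 0x3C)          # e_lfanew
    if ([BitConverter]::ToUInt32($bytes, $peOffset) -ne 0x4550) { # 'PE\0\0'
        throw "$Path has no PE signature"
    }
    switch ([BitConverter]::ToUInt16($bytes, $peOffset + 4)) {
        0x014c  { 'x86' }
        0x8664  { 'x64' }
        default { 'other' }
    }
}

# 1. Refuse to continue if ssh.exe and libykcs11.dll are built for different
#    architectures; OpenSSH would otherwise fail at load time with a cryptic error.
$sshArch = Get-PeMachine (Join-Path $openssh 'ssh.exe')
$dllArch = Get-PeMachine (Join-Path $ykcs11 'libykcs11.dll')
if ($sshArch -ne $dllArch) {
    throw "ssh.exe is $sshArch but libykcs11.dll is $dllArch; install matching 32/64-bit builds."
}

# 2. Put both folders on the user PATH (persisted) and on this session's PATH.
$userPath = [Environment]::GetEnvironmentVariable('Path', 'User')
foreach ($dir in @($openssh, $ykcs11)) {
    if (($userPath -split ';') -notcontains $dir) { $userPath = "$dir;$userPath" }
}
[Environment]::SetEnvironmentVariable('Path', $userPath, 'User')
$env:Path = "$openssh;$ykcs11;$env:Path"

# 3. Make ssh load the provider by default, without duplicating the line.
$sshDir = Join-Path $HOME '.ssh'
New-Item -ItemType Directory -Force -Path $sshDir | Out-Null
$config = Join-Path $sshDir 'config'
if (-not (Test-Path $config) -or
    -not (Select-String -Path $config -SimpleMatch 'PKCS11Provider' -Quiet)) {
    Add-Content -Path $config -Value 'PKCS11Provider libykcs11.dll'
}

# 4. The machine PATH (which includes the 7.7p1 build in System32) is searched
#    before the user PATH, so check which ssh.exe a fresh shell would actually run.
$resolved = (Get-Command ssh.exe -ErrorAction Stop).Source
if (-not $resolved.StartsWith($openssh, [StringComparison]::OrdinalIgnoreCase)) {
    Write-Warning "ssh.exe resolves to $resolved; invoke $openssh\ssh.exe explicitly or reorder the machine PATH."
}

# 5. List the public keys the token exposes through PKCS#11; the line for the PIV
#    authentication slot goes into ~/.ssh/authorized_keys on the server.
ssh-keygen -D libykcs11.dll
```

Two caveats worth keeping in mind. First, because the provider is located through the PATH rather than by full path, every folder you add should be writable only by you and administrators; a writable folder ahead of the Yubico one on the PATH is a DLL-planting opportunity for anything running as your user. Second, the step 4 warning is not theoretical: on a default install `ssh` in a new shell still resolves to the 7.7p1 binary in System32, which has no PKCS#11 support, so either call the newer build by full path or move its folder into the machine-wide PATH.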

After that is done, ta-dah! It should work fine!

Screenshot of Windows PowerShell using a YubiKey 5 to authenticate to a Gentoo Linux system.

This works, when using PowerShell. You get asked to enter the PIN for the YubiKey, and you login just fine. Working exactly as intended there.

Unfortunately, the next step I wanted to use this for is to use VSCode to connect to my NUC, and work on things like usbmon-tools remotely, so for that to work, I needed to be able to use this authentication method through the Visual Studio Code remote host mode… and that’s not working at the time of writing. The prompt comes up, but VSCode does not appear to proxy it to anything in its UI for me to answer it.

I’m surprised, because as far as I can tell, the code responsible for the prompt uses the correct read_passphrase() function call for it to be a prompt proxied to the askpass implementation, which I thought was already taken care of by VSCode. I have not spent too much time debugging this problem yet, but if someone is more familiar than me with VSCode and can track down what should happen there, I’d be very happy to hear about it. For now, I filed an issue.

Update 2020-08-04: Rob Lourens from Microsoft moved the issue to the right repository and pointed to another issue (filed later but in the right place).

The workaround to use this from VSCode is to make sure that "remote.SSH.useLocalServer": true is set, and to click on the Details link at the bottom-right corner when it’s trying to connect, to type in the PIN. At that point everything seems to work fine, and it even uses the connection multiplexer to avoid requesting the PIN all the time.

Screenshot of Visual Studio Code showing a remote SSH connection to my Linux NUC with usbmon-tool open.

Was Acronis True Image 2020 a mistake?

You may remember that a few months ago I complained about Acronis True Image 2020. I have since been mostly happy with the software, despite it being still fairly slow when uploading a sizable amount of changed files, such as after shooting a bunch of pictures at home. This would have been significantly more noticeable if we had actually left the country since I started using it, as I usually shoot at least 32GB of new pictures on a trip (and sometimes twice as much), but with lockdown and all, it didn’t really happen.

But, besides that, the software worked well enough. Backup happened regularly, both on the external drive and the Cloud options, and I felt generally safe with using it. Until a couple of weeks ago, when suddenly it stopped working, and failed with Connection Timeout errors. They didn’t correlate with anything: I did upgrade to Windows 10 20H1, but that was a couple of weeks before, and backups went through fine until then. There was no change in network, there was no change from my ISP, and so on.

So what gives? None of the tools available from Acronis reported errors, ports were not marked as blocked, and I was running the latest version of everything. I filed a ticket and was called on the phone by one of their support people, who actually seemed to know what he was doing: TeamViewer at hand, he checked once again for connectivity, and once again found that everything was alright; the only thing he found to change was disabling the True Image Mounter service, which is used to get quick access to the image files and thus is not involved in the backup process. I had to disable that one because, years after Microsoft introduced WSL, enabling it breaks WSL filesystem access altogether, so you can’t actually install any Linux distro, change passwords in the ones you already installed, or run apt update on Debian.

This was a week ago. In the meantime support asked me to scan the disks for errors because their system report showed one of the partitions as having issues (if I read their log correctly, that’s one of the recovery images, so it’s not at all related to the backup), and more recently to give them a Process Monitor log while running the backup. Since they don’t actually give you a list of processes to limit it to, I ended up having to kill most of the other running applications to take the log, as I didn’t want to leak more information than I was required to. It still provided a lot of information I’m not totally comfortable with having provided. And I still have no answer, at the time of writing.

And that’s not all: the way you provide all these details to them is fairly clunky. You can’t just mail them, or attach them through their web support interface, as even their (compressed) system report is more than 25MB for my system. Instead, what they instruct you to do is to take the compressed files and upload them through FTP with the username/password pair they provide to you.

Let me repeat that. You upload compressed files, that include at the very least most of the filenames you’re backing up, and possibly even more details of your computer, with FTP. Unencrypted. Not SFTP, not FTPS, not HTTPS. FTP. In 2020.

This is probably the part that makes my blood boil. Acronis has clearly figured out that the easiest way for people to get support is to use something that they can use very quickly. Indeed you can still put an FTP URL in the location bar of your Windows 10 File Explorer, and it will allow you to upload and download files over it. But it does that in a totally unencrypted, plain-text manner. I wonder how much more complicated it would be to use at least FTPS, or to have an inbound-only password-protected file upload system, like Google Drive or Dropbox; after all they are a cloud storage solution provider!

As for myself, I found a temporary workaround waiting for the support folks to figure out what they likely have screwed up on their London datacenter: I’m backing up my Lightroom pictures to the datacenter they provide in Germany. It took three days to complete, but it at least gives me peace of mind that, if something goes horribly wrong, at least the most dear part of my backup is saved somewhere else.

And honestly, using a different backup policy than the rest of the system just for the photos is probably a good idea: I set it to “continuous backup”, because generally speaking it usually stays the same all the time, until I go and prepare another set to publish, then a lot of things change quickly and then nothing until the next time I can do it.

Also, I do have the local backup — that part is still working perfectly fine. I might actually want to use it soon, as I’m of two minds between trying to copy over my main OS drive from a 1TB SSD to a 2TB SSD, and just getting a 2TB SSD, and installing everything anew onto it. If I do go that route, I also will reuse the 1TB SSD onto my NUC instead, which right now is running with half SATA and half NVMe storage.

Conclusions? Well, compared to the Amazon Glacier + FastGlacier (that has not been updated in just over two years now, and still sports a Google+ logo and +1 button!), it’s still good value for money. I’m spending a fraction of what I used to spend with Amazon, and even in the half-broken state it’s backing up more data and has significantly faster access. The fact that you can set different policies for different parts of the backup is also a significant plus. I just wish there was a way to go from a “From Folders to Cloud” backup to a tiered “From Folders to External, plus Cloud” — or maybe I’ll bite the bullet and, if it’s really this broken, just go and re-configure also the Lightroom backup to use the tiered option.

But Acronis, consider cleaning up your support act. It’s 2020, you can’t expect your customers to throw you all their information via unencrypted protocols, for safety’s sake!

Update 2020-06-30: the case is now being escalated to the “development and cloud department” — and if this is at all in the same ballpark as the companies I worked for it means that something is totally messed up in their datacenter connectivity and I’m the first one to notice enough to report to them. We’ll see.

Update 2020-07-16: well, the problem is “solved”. In the sense that after I asked them, they moved my data out of the UK (London) datacenter into the German one. Which works fine and has no issues. They had also said that they would extend my subscription by the month during which I didn’t have the backup working. But yeah, it turns out that nobody on their side seems to have a very clear idea of what was going on, but the UK datacenter just disappeared off my dashboard. I wonder how many had this problem.

Paperless home, sorted home

You probably don’t remember, but I have been chasing the paperless office for many years. At first it was a matter of survival, as running my own business in Italy meant tons of paperwork, and sorting it all out while being able to access it was impossible. By scanning and archiving the invoices and other documents, the whole thing got much better.

I continued to follow the paperless path when I stopped running a company and just worked, but by then the world started following me and most services started insisting on paperless billing anyway, which was nice. In Dublin I received just a few pieces of paper a month, and it was easy to scan them, and then bring them to the office to dispose of in the secure shredding facilities. I kept this up after moving to London, despite the movers stealing my scanner, using a Brother ADS-1100W instead.

But since the days in Italy, my scanning process changed significantly: in Dublin I never had a Linux workstation, so the scanner ended up connected to my Gamestation using Windows, using PaperPort, which was at the time marketed by Nuance. The bright side of this was that PaperPort applies most of the same post-processing as Unpaper while at the same time running OCR over the scanned image, making it searchable on Google Drive, Dropbox and so on.

Unfortunately, it seems like something changed recently, either in Windows 10, the WIA subsystem or something else altogether, and from time to time after scanning a page, PaperPort or the scanner freeze, and don’t terminate the processing, requiring a full reboot of the OS. Yes I tried powercycling the scanner, yes I tried disconnecting the USB and reconnecting, none seem to work except a full reboot, which is why I’m wondering if it might be a problem with the WIA subsystem.

The current workaround I have is to use the TWAIN system, which is the same that I used with my scanner on Windows 98, which is surprising and annoying — in particular I need to remember to turn on the scanner before I open PaperPort, otherwise it fails to scan and the process will need to be killed with the Task Manager. So I’m actually considering switching the scanning to Linux again.

My old scan2pdf command-line tool would help, but it does not include the OCR capabilities. Paperless seems more interesting, and it uses Unpaper itself. But it assumes you want the document stored on the host, as well as scanned and processed. I would have to see if it has integration with Google Drive, or otherwise figure out how to get that integration going with something like rclone. But, well, that would be quite a bit of work that I’m not sure I want to do right now.

Speaking of work, and organizing stuff — I released some hacky code which I wrote to sort through the downloaded PDF bills from various organizations. As I said on Twitter when I released it, it is not a work of engineering, or a properly-cleaned-up tool. But it works for most of the bills I care about right now, and it makes my life (and my wife’s) easier by having all of our bank statements and bills named and sorted (particularly when just downloading a bunch of PDFs from different companies once a month, and sorting them all.)

Funnily enough, writing that tool also had some surprises. You may remember that a few years ago I leaked my credit card number by tweeting a screenshot of what I thought was uninitialized memory in Dolphin. Unlike Irish credit card statements, British card statements don’t include the full PAN on any of the visible pages of the PDF. So you could think it’s safe to provide a downloaded PDF as proof of address to other companies. Well, turns out it isn’t, at least for Santander: there’s an invisible (but searchable and highlightable) full 16-digit PAN at the top of the first page of the document. You can tell it’s there when you run the file through pdftotext or similar tools (there’s a similar invisible number on bank statements, but that one is also provided visibly: it’s the sort code and account number).
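The check described above, as an added example that is not part of the original post nor of the bill-sorting tool it mentions: extract a PDF’s text layer and look for anything that has the shape of a card number and passes the Luhn check, so a statement can be vetted before it is sent to a third party. Invisible text (white, zero-size, or drawn off the visible area) is still part of the text layer, which is why copy, search and highlight find it. The PDF step assumes pdftotext from Poppler is on the PATH; the scan itself runs on any string, and the sample text at the bottom is constructed for the example.

```powershell
# Added example, not the post's tool. Finds Luhn-valid card numbers in text and,
# via pdftotext (assumed to be on PATH), in a PDF's text layer.

function Test-Luhn {
    # Standard Luhn mod-10 check over a string of digits.
    param([Parameter(Mandatory)][string]$Digits)
    $sum = 0; $double = $false
    for ($i = $Digits.Length - 1; $i -ge 0; $i--) {
        $d = [int][char]$Digits[$i] - 48
        if ($double) { $d *= 2; if ($d -gt 9) { $d -= 9 } }
        $sum += $d
        $double = -not $double
    }
    return ($sum % 10) -eq 0
}

function Find-Pan {
    # Returns every run of 13-19 digits (spaces or dashes allowed between them)
    # that passes Luhn, masked to the last four digits, with its offset in $Text.
    # Sort codes and account numbers are shorter and are skipped by the length bound.
    param([Parameter(Mandatory)][AllowEmptyString()][string]$Text)
    $pattern = '(?<!\d)(?:\d[ -]?){12,18}\d(?!\d)'
    foreach ($m in [regex]::Matches($Text, $pattern)) {
        $digits = $m.Value -replace '[ -]'
        if (Test-Luhn $digits) {
            [pscustomobject]@{
                Offset = $m.Index
                Masked = ('*' * ($digits.Length - 4)) + $digits.Substring($digits.Length - 4)
            }
        }
    }
}

function Test-PdfForPan {
    # pdftotext with '-' as the output file writes the text layer to stdout,
    # invisible text included.
    param([Parameter(Mandatory)][string]$Path)
    $text = & pdftotext -layout $Path - 2>$null | Out-String
    Find-Pan -Text $text
}

# Constructed sample standing in for an extracted statement: one line mimics the
# hidden PAN. 4111 1111 1111 1111 is the well-known Visa test number (Luhn-valid).
$sample = @"
Statement for card ending 1111
4111 1111 1111 1111
Sort code 12-34-56  Account 12345678
"@
Find-Pan -Text $sample        # one hit: ************1111 at offset 31
# Test-PdfForPan -Path "$HOME\Downloads\statement.pdf"
```

Anything this reports is reason to redact the PDF (or, better, print the visible page to a new PDF) rather than forward the download as is. The masking is deliberate: a check like this tends to get pasted into tickets and chat, and the tool should not be the next place the full PAN leaks.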

Oh and it looks like most Italian bills don’t use easily-scrapeable layouts, which is why there’s none of them right now in the tool. If someone knows of a Python library that can extract text from pages using “Figure” objects, I’m all ears.

Don’t Ignore Windows 10 as a Development Platform for FLOSS

Important Preface: This blog post was written originally on 2020-05-12, and scheduled for later publication, inspired by this short Twitter thread. As such it well predates Microsoft’s announcement of expanding support of WSL2 to graphical apps. I considered trashing, or seriously re-editing the blog post in the light of the announcement, but I honestly lack the energy to do that now. It left a bad taste in my mouth to know that it will likely get drowned out in the noise of the new WSL2 features announcement.

Given the topic of this post I guess I need to add a preface to point out my “FLOSS creds” — because I have seen already too many attacks on people who even use Windows at all. I have been an opensource developer for over fifteen years now, and part of the reason why I left my last bubble was because it made it difficult for me to contribute to various opensource projects. I say this because I’m clearly a supporter of Free Software and Open Source, wherever possible. I also think that different people have different needs, and that ignoring that is a failure of the FLOSS movement as a whole.

The “Year of Linux on the Desktop” is now a meme that has been running its course to the point of being annoying. Despite what FLOSS advocates keep saying, “Linux on the Desktop” is not really moving, and while I do have some strong opinions on this, that’s for another day. Most users, and in particular newcomers to FLOSS (both as users and developers), are probably using a more “user friendly” platform — if you leave a comment with the joke on UNIX being selective with its friends, you’ll end up on a plonkfile, be warned.

About ten years ago, it seemed like the trend was for FLOSS developers to use MacBooks as their daily laptops. I did that for a while myself — a UNIX-based platform with all the tools of the trade, which allowed quite a bit of work to be done without having access to a Linux platform. SSH, Emacs, GCC, Ruby, and so on. And at the same time, you had the stability of Mac OS X, with the battery life, and all the hardware worked great out of the box. But then more recently, Apple’s move towards “walled gardens” seemed to be taking away from this feasibility.

But back to the main topic. Over the past many years, I’ve been using a “mixed setup” — using a Linux laptop (or more recently desktop) for development, and a Windows (7, then 10) desktop for playing games, editing photos, designing PCBs, and for logic analysis. The latter is because Saleae Logic takes a significant amount of RAM when analysing high-frequency signals, and I have been giving my gamestations as much RAM as I can just for Lightroom, so it makes sense to run it on the machine with 128GB of RAM.

But more recently I have been exploring the ability of using Windows 10 as a development platform. In part because my wife has been learning Python, and since also learning a new operating system and paradigm at the same time would have been a bloody mess, she’s doing so on Windows 10 using Visual Studio Code and Python 3 as distributed through the Microsoft Store. While helping her, I had exposure to Windows as a Python development platform, so I gave it a try when working on my hack to rename PDF files, which turned out to be quite okay for a relatively simple workflow. And the work on the Python extension keeps making it more and more interesting — I’m not afraid to say that Visual Studio Code is better integrated with Python than Emacs, and I’m a long-time user of Emacs!

In the last week I have actually stepped up further how much development I’m doing on Windows 10 itself. I have been using Hyper-V virtual machines for Ghidra, to make use of the bigger screen (although admittedly I’m just using RDP to connect to the VM so it doesn’t really matter that much where it’s running), and in my last dive into the Libre 2 code I felt the need to have a fast and responsive editor to go through executing part of the disassembled code to figure out what it’s trying to do — so once again, Visual Studio Code to the rescue.

Indeed, Windows 10 now comes with an SSH client, and Visual Studio Code integrates very well with it, which meant I could just edit the files saved in the virtual machine and have the IDE also build them with GCC and execute them to get myself an answer.

Then while I was trying to use packetdiag to prepare some diagrams (for a future post on the Libre 2 again), I found myself wondering how to share files between computers (to use the bigger screen for drawing)… until I realised I could just install the Python module on Windows, and do all the work there. Except for needing sed to remove an incorrect field generated in the SVG. At which point I just opened my Debian shell running in WSL, and edited the files without having to share them with anything. Uh, score?

So I have been wondering, what’s really stopping me from giving up my Linux workstation for most of the time? Well, there’s hardware access — glucometerutils wouldn’t really work on WSL unless Microsoft is planning a significant amount of compatibility interfaces to be integrated. Similar for using hardware SSH tokens — despite PC/SC being a Windows technology to begin with. Screen and tabulated shells are definitely easier to run on Linux right now, but I’ve seen tweets about modern terminals being developed by Microsoft and even released FLOSS!

Ironically, I think it’s editing this blog that is the most miserable experience for me on Windows. And not just because of the different keyboard (as I share the gamestation with my wife, the keyboard is physically a UK keyboard — even though I type US International), but also because I miss my compose key. You may have noticed already that this post is full of em-dashes and en-dashes. Yes, I have been told about WinCompose, but last time I tried using it, it didn’t work and even screwed up my keyboard altogether. I’m now trying it again, at least on one of my computers, and if it doesn’t explode in my face again, I may just give it another try later.

And of course it’s probably still not as easy to set up a build environment for things like unpaper (although at that point, you can definitely run it in WSL!), or to have a development environment for actual Windows applications. But this is all a matter of different set of compromises.

Honestly speaking, it’s very possible that I could survive with a Windows 10 laptop for my on-the-go opensource work, rather than the Linux one I’ve been using. With the added benefit of being able to play Settlers 3 without having to jump through all the hoops from the last time I tried. Which is why I decided that the pandemic lockdown is the perfect time to try this out, as I barely use my Linux laptop anyway, since I have a working Linux workstation all the time. I have indeed reinstalled my Dell XPS 9360 with Windows 10 Pro, and installed both a whole set of development tools (Visual Studio Code, Mu Editor, Git, …) and a bunch of “simple” games (Settlers, Caesar 3, Pharaoh, Age of Empires II HD); Discord ended up in the middle of both, since it’s actually what I use to interact with the Adafruit folks.

This doesn’t mean I’ll give up on Linux as an operating system — but I’m a strong supporter of “software biodiversity”, so the same way I try to keep my software working on FreeBSD, I don’t see why it shouldn’t work on Windows. And in particular, I always found providing FLOSS software on Windows a great way to introduce new users to the concept of FLOSS — focusing more on providing FLOSS development tools means giving an even bigger chance for people to build more FLOSS tools.

So is everything ready and working fine? Far from it. There’s a lot of rough edges that I found myself, which is why I’m experimenting with developing more on Windows 10, to see what can be improved. For instance, I know that the reuse-tool has some rough edges with encoding of input arguments, since PowerShell appears to still not default to UTF-8. And I failed to use pre-commit for one of my projects — although I have not yet taken much notice of what failed, to start fixing it.

Another rough edge is in documentation. Too much of it assumes only a UNIX environment, and a lot of it, if it has any support for Windows documentation at all, assumes “old school” batch files are in use (for instance for Python virtualenv support), rather than the more modern PowerShell. This is not new — a lot of times modern documentation is only valid on bash, and if you were to use an older operating system such as Solaris you would find yourself lost with the tcsh differences. You can probably see similar concerns back in the days when bash was not standard, and maybe we’ll have to go back to that kind of deal. Or maybe we’ll end up with some “standardization” of documentation that can be translated between different shells. Who knows.

But to wrap this up, I want to give a heads-up to all my fellow FLOSS developers that Windows 10 shouldn’t be underestimated as a development platform. And that if they intend to be widely open to contributions, they should probably give a thought to how their code works on Windows. I know I’ll have to keep this in mind for my future.

Windows Backup Solutions: trying out Acronis True Image Backup 2020

One of my computers is my Gamestation, which to be honest has not run a game in many months now, and which runs Windows out of necessity, but also because honestly sometimes I just need something that works out of the box. The main usage of that computer nowadays is Lightroom and Photoshop for my photography hobby.

Because of the photography usage, backups are a huge concern to me (particularly after movers stole my previous gamestation), and so I have been using a Windows tool called FastGlacier to store a copy of most of the important stuff to the Amazon Glacier service, in addition to letting Windows 10 do its FileHistory magic on an external hard drive. Not a cheap option, but (I thought) a safe and stable one. Unfortunately the software appears to no longer be developed, and with one of the more recent Windows 10 updates it stopped working (and since I had set it up as a scheduled operation, it failed silently, which is the worst thing that can happen!)

My original plan for last week (at the time of writing) was to work on pictures, as I have shots from a trip over three years ago that I have still not wandered through, rather than working on reverse engineering. But when I noticed the lacking backups, I decided to put that on hold until the backup problem was solved. The first problem was finding a backup solution that would actually work, and that wouldn’t cost an arm and a leg. The second problem was that of course most of the people I know are tinkerers that like Rube Goldberg solutions such as using rclone on Windows with the task scheduler (no thanks, that’s how I failed the Glacier backups).

I didn’t have particularly high requirements: I wanted a backup solution that would do both local and cloud backups — because Microsoft has been reducing the featureset of their FileHistory solution, and so relying on it feels a bit flaky. And the ability to store more than a couple of terabytes on the cloud solution (I have over 1TB of RAW shots!), even at a premium. I was not too picky on price, as I know features and storage are expensive. And I wanted something that would just work out of the box. A few review reads later, I found myself trying Acronis True Image Backup. A week later, I regret it.

I guess my best lesson learnt from this is that Daniel is right, and it’s not just about VPNs: most review sites seem to give higher scores to the software they get more money from via affiliate links (you’ll notice that in this blog post there won’t be any!) So while a number of sites had great words for Acronis’s software, I found it sufficiently lacking that I’m ranting about it here.

So what’s going on with the Acronis software? First of all, while it does support both “full image” and “selected folders” modes, you need to be definitely aware that the backup is not usable as-is: you need the software to recover the data. Which is why it comes with bootable media, “survival kits”, and similar amenities. This is not a huge deal to me, but it’s still a bit annoying, when FileHistory used to allow direct access to the files. It also locks you into accessing the backup with the software, although Acronis makes the restore option available even after you let your subscription expire, which is at least honest.

Then the next thing that was clear to me was that the speed of the cloud backup is not the strongest suit of Acronis. The original estimate for backing up the 2.2TB of data that I expected to back up was on the mark at nearly six days. To be fair to Acronis, the process went extremely smoothly, it never got caught up, looped, crashed, or slowed down. The estimate was very accurate, and indeed, running this for about 144 hours was enough to have the full data backed up. Their backup status also shows the average speed of the processes, that matched my estimate while the backup was running, of 50Mbps.

The speed is the first focus of my regret. 50Mbps is not terribly slow, and for most people this might be enough to saturate their Internet uplink. But not for me. At home, my line is provided by Hyperoptic, with a 1Gbps line that can sustain at least 900Mbps upload. So seeing the backup bottlenecked by this was more than a bit annoying. And as far as I can tell, there’s no documentation of this limit on the Acronis website at the time of writing.

When I complained on Twitter about this, it was mostly in frustration at having to wait, and I was considering the 50Mbps speed at least reasonable (although I would have considered paying a premium for faster uploads!); the replies I got from support have gotten me more upset than before. Their Twitter support people insisted that the problem was with my ISP and sent me to their knowledgebase article on using the “Acronis Cloud Connection Verification Tool” — except that following the instructions showed I was supposed to be using their “EU4” datacenter, for which there is no tool. I was then advised to file a ticket about it. Since then, I appear to have moved back to “EU3” — maybe EU4 was not ready yet.

The reply to the ticket was even more of an absurdist mess. Besides a lot of words to explain “speed is not our fault, your ISP may be limiting your upload” (fair, but I had already noted to them that I knew that was not the case), one of the steps they request you to follow is to go to one of their speedtest apps — which returns a 504 error from nginx, oops! Oh yeah, and you need to upload the logs via FTP. In 2020. Maybe I should call up Foone to help. (Windows 10, as it happens, still supports FTP write-access via File Explorer, but it’s not very discoverable.)

Both support people also kept reminding me that the backup is incremental. So after the first cloud backup, everything else should be a relatively small amount of data to be copied. Except that I’m not sold onto that either, still: 128GB of data (which is the amount of pictures I came back from Budapest with), would take nearly six hours to back up.

When I finally managed to get a reply that was not directly from a support script, they told me to run the speedtest on a different datacenter, EU2. As it turns out, this is their “Germany” datacenter. This was very clear by tracerouting the IP addresses for the two hosts: EU3 is connected directly to LINX, EU2 goes back to AMS, then FRA (Frankfurt). The speedtest came out fairly reasonable (around 250Mbps download, 220Mbps upload), so I shared the data they requested in the ticket… and then wondered.

Since you can’t change the datacenter you back up to once you started a backup, I tried something different: I used their “Archive” feature, and tried to archive a multi-gigabyte file, but to their Germany datacenter, rather than the United Kingdom one (against their recommendation of «select the country that is nearest to your current location»). Instead of a 50Mbps peak, I got a 90Mbps peak, with a sustained rate of 67Mbps. Now this is still not particularly impressive, but it would have cut down the six days to three, and the five hours to around two. And clearly it sounds like their EU3 datacenter is… not good.

Anyway, let’s move on and look at local backups, which Acronis is supposed to take care of by itself. For this one I at first wanted to use the full image backup, rather than selecting folders like I did for the cloud copy, since it would be much cheaper, and I have a 9TB external hard drive anyway… and when you do that, Acronis also suggests you create what they call the “Acronis Survival Kit” — which basically means making the external hard drive bootable, so that you can start up and restore the image straight from it.

The first time I tried setting it up that way, it formatted the drive, but it didn’t even manage to get Windows to connect the new filesystem. I got an error message linking me to a knowledgebase article that… did not exist. This is more than a bit annoying, but I decided to run a full SMART check on the drive to be safe (no error to be found), and then try again after a reboot. Then it finally seemed to work, but here’s where things got even more hairy.

You see, I’ve been wanting to use my 9TB external drive for the backup. A full image of my system was estimated at 2.6TB. But after the Acronis Survival Kit got created, the amount of space available for the backup on that disk was… 2TB. Why? It turned out that the Kit creation caused the disk to be repartitioned as MBR, rather than the more modern GPT. And in MBR you can’t have a (boot) partition bigger than 2TB. Which means that the creation of the Survival Kit silently decreased my available space to nearly 1/5th!
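As an added check for the repartitioning surprise described above (example code, not Acronis’ or the author’s): it takes the first two sectors of a drive, tells MBR from GPT by the protective 0xEE entry plus the “EFI PART” header at LBA 1, and computes the 2^32-sector cap that an MBR layout puts on a single partition. The test image is constructed in build_image; read_disk_head shows where a real drive’s sectors would come from.

```python
"""Check whether a disk is MBR or GPT and what that caps a partition at."""
import struct
from dataclasses import dataclass
from typing import List

SECTOR = 512                        # logical sector size of most USB drives
GPT_PROTECTIVE_TYPE = 0xEE          # type id of the protective MBR entry
GPT_HEADER_SIGNATURE = b"EFI PART"  # first 8 bytes of LBA 1 on a GPT disk


@dataclass
class MbrEntry:
    type_id: int
    start_lba: int
    sectors: int


def parse_mbr(sector0: bytes) -> List[MbrEntry]:
    """Parse the four 16-byte partition entries at offset 446 of sector 0."""
    if len(sector0) < SECTOR:
        raise ValueError("need at least one 512-byte sector")
    if struct.unpack_from("<H", sector0, 510)[0] != 0xAA55:
        raise ValueError("missing 0x55 0xAA boot signature: not an MBR")
    entries = []
    for i in range(4):
        off = 446 + 16 * i
        type_id = sector0[off + 4]
        start_lba, sectors = struct.unpack_from("<II", sector0, off + 8)
        if type_id != 0:
            entries.append(MbrEntry(type_id, start_lba, sectors))
    return entries


def partition_scheme(image: bytes) -> str:
    """Return 'GPT', 'MBR' or 'none' given the first two sectors of a disk."""
    entries = parse_mbr(image)
    # A GPT disk keeps a protective MBR (one type-0xEE entry) so legacy
    # tools see the space as taken, and puts its real header at LBA 1.
    if (len(image) >= 2 * SECTOR
            and image[SECTOR:SECTOR + 8] == GPT_HEADER_SIGNATURE
            and any(e.type_id == GPT_PROTECTIVE_TYPE for e in entries)):
        return "GPT"
    return "MBR" if entries else "none"


def mbr_max_partition_bytes(sector_size: int = SECTOR) -> int:
    """MBR entries hold start LBA and length as 32-bit sector counts, so one
    partition cannot exceed 2**32 sectors: 2 TiB with 512-byte sectors."""
    return 2 ** 32 * sector_size


def backup_capacity_bytes(image: bytes, disk_sectors: int,
                          sector_size: int = SECTOR) -> int:
    """Largest single data partition this disk's scheme allows, in bytes."""
    disk_bytes = disk_sectors * sector_size
    if partition_scheme(image) == "GPT":
        return disk_bytes
    return min(disk_bytes, mbr_max_partition_bytes(sector_size))


def build_image(gpt: bool, disk_sectors: int) -> bytes:
    """Constructed two-sector image for testing; not read from a real disk."""
    sector0 = bytearray(SECTOR)
    entry = bytearray(16)
    if gpt:
        entry[4] = GPT_PROTECTIVE_TYPE          # protective entry from LBA 1
        struct.pack_into("<II", entry, 8, 1, min(disk_sectors - 1, 0xFFFFFFFF))
    else:
        entry[4] = 0x07                         # NTFS/exFAT data partition
        struct.pack_into("<II", entry, 8, 2048,
                         min(disk_sectors - 2048, 0xFFFFFFFF))
    sector0[446:462] = entry
    struct.pack_into("<H", sector0, 510, 0xAA55)
    sector1 = bytearray(SECTOR)
    if gpt:
        sector1[0:8] = GPT_HEADER_SIGNATURE
    return bytes(sector0 + sector1)


def read_disk_head(path: str) -> bytes:
    r"""Read LBA 0 and 1 from a block device, e.g. r'\\.\PhysicalDrive1' on
    Windows (administrator rights needed) or /dev/sdb on Linux."""
    with open(path, "rb") as dev:
        return dev.read(2 * SECTOR)


if __name__ == "__main__":
    # A nominal 9 TB drive: 9 * 10**12 bytes, as the box advertises it.
    sectors = 9 * 10 ** 12 // SECTOR
    for gpt in (False, True):
        img = build_image(gpt, sectors)
        cap = backup_capacity_bytes(img, sectors)
        print(f"{partition_scheme(img):>4}: largest partition {cap / 10**12:.2f} TB")
```

Run against read_disk_head() of the backup drive after any tool has “prepared” it; a result of MBR on a drive larger than 2 TiB means exactly the silent loss of space described above.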

The reply from Acronis on Twitter? According to them my Windows 10 was started in “BIOS mode”. Except it didn’t. It’s set up with UEFI and Secure Boot. And unfortunately it doesn’t seem like there’s an easy way to figure out why the Acronis software thinks it’s that way. But worse than that, the knowledgebase article says that I should have gotten a warning, which I never did.

So what is it going to be at the end of the day? I tested the restore from Acronis Cloud, and it works fine. Acronis has been in business for many years, so I don’t expect them to disappear next year. So the likelihood of me losing access to these backups is fairly low. I think I may just stick to them for the time being, and hope that someone in the Acronis engineering or product management teams can read this feedback and think about that speed issue, and maybe start considering the idea of asking support people to refrain from engaging with other engineers on Twitter with fairly ridiculous scripts.

But to paraphrase a recent video by Techmoan, these are the types of imperfections (particularly the mis-detected “BIOS booting” and the phantom warning) that I could excuse in a £50 software package, but that are much harder to excuse in a £150/yr subscription!

Any suggestions for good alternatives to this would be welcome, particularly before next year, when I might reconsider if this was good enough for me, or a new service is needed. Suggestions that involve scripts, NAS, rclone, task scheduling, self-hosted software will be marked as spam.

MSI X299 SLI PLUS problems and solutions

Last year, I posted about an issue with missing BitLocker and PIN authentication with my replacement Gamestation build. While it does not look like this is a particularly popular post, I did confirm that at least a couple of people managed to get good use out of that blog post.

As usual, my Twitter feed contains spoilers of this blog post, as I have ranted, complained, and asked questions (mostly to Jo) trying to figure out my Windows problems. The reason I’m writing this down is, as usual, as a reference to myself, so I don’t repeat the same mistakes over and over again, and as a reference for others, since particularly one of the error codes I’m going to talk about appears to turn up almost exclusively scammy “PC fixing” websites. And yes, I know that I’m repeating the word BIOS later while this is clearly a UEFI board, but MSI calls it that, and to be honest for most non-technical folks the differences between the two terms don’t exist.

All long help threads should have a sticky globally-editable post at the top saying ‘DEAR PEOPLE FROM THE FUTURE: Here’s what we’ve figured out so far …’

First of all, as noted in the previous post, it looks like nearly all of the settings in the BIOS are lost at any upgrade of the firmware. This is particularly annoying when it looks like a lot of the updates are early boot microcode updates to cover the increasing complexity of mitigating Spectre-style vulnerabilities, and reasonably shouldn’t need to change the semantics or format of settings such as Secure Boot, TPM settings, or smart fan configuration.

So make sure to take good screenshots of all your settings before updating your firmware, as otherwise you’ll fight for hours trying to reconfigure it as you had it before.

Your computer is not resuming from sleep when you press the power button. This appears to be common, I’ve found a bunch of forum posts by people complaining about this behaviour on a number of MSI motherboards. Most of them appear to be in the form of DenverCoder9, although with a little more detail: people claiming they solved the issue by either downgrading or upgrading the motherboard’s BIOS. Not wanting to downgrade my BIOS and having just upgraded it, I wanted to find a better answer, and it turns out I probably did find it. Here’s the solution: disable the GO2BIOS feature.

Some more details, which can be useful for others in the future if they encounter similar issues and the solution I’m providing is not helping them. The GO2BIOS feature by MSI is a shortcut to enter the BIOS configuration screen without using the keyboard, and it’s particularly handy once you enable all the fast-boot options, as the keyboard might not respond at all. To force entering the BIOS configuration, then, you just need to keep the power button pressed for four seconds when you turn on the computer. That’s what clued me in to the connection between the setting and the failure to resume, as they both relate to the power button.

The reason why downgrading or upgrading the BIOS appeared to solve the issue is the one I noted above: all firmware updates on these boards appear to completely reset the settings to defaults, and the GO2BIOS feature is not enabled by default (and probably few people would consider re-enabling it in the hurry.)

Windows 10 bluescreens with WHEA_UNCORRECTABLE_ERROR. This is trickier, mostly because all of the search hits for this particular code appear to point at very dodgy websites, and the only hit I could find on the Microsoft website was for a forum post where it was suggested that the particular code I was seeing was related to AMD CPUs. Since my machine is an i7, that made no sense whatsoever.

The WHEA in the name stands for Windows Hardware Error Architecture, which suggested that the bluescreen is caused by something like a Machine-Check Exception. This was particularly scary because it started happening right after I installed a new NVMe SSD, which appeared to get very warm, leading me to first install two more fans, and then replace the original fans with PWM ones.

During this “ordeal” I also had been installing and updating quite a few pieces of software, related to the CPU, the motherboard, the Kraken cooler, and so on. And since I had just updated the BIOS I also had been tweaking a lot of parameters around, including trying to re-enable the auto-overclock feature that, as I discussed previously, appears to be implemented mostly in firmware.

Eventually, I found that I solved the problem by uninstalling MSI’s Control Center software. I had already previously disabled the OC assistant, but even with that I kept receiving random blue screens when browsing websites, or just opening Lightroom. Since I uninstalled the Control Center software I have not experienced a single one for a few days. And that includes a “torture test” with Prime95 that brought the CPU to 100°C and to thermal throttling.

I’m not sure what the root cause for this is. I can only imagine that there’s some strange interaction between the firmware and the software that was not quite well tested. Or maybe there’s a new update on Windows 10 that caused Control Center to fight for resources. But whatever the reason it seems the right thing to do was to remove MSI’s software, which anyway does not really do anything you can’t do in the BIOS configuration screen.

I hope this post can find its way to those looking for answers for these (or similar enough) issues. And if you find that there are other possible causes for this, feel free to leave a comment on the post.

Windows 10: what to do if BitLocker and PIN stop working after update

I don’t really like the idea of having to write about proprietary software here, but I only found terrible alternative suggestions on the web, so I thought I would at least try to write about it in the hope of avoiding people falling for very bad advice.

The problem: after updating my BIOS, BitLocker asks for the key at boot, and PIN login for Windows 10 (Microsoft Account) fails, requiring me to log in with the full Microsoft account password. Trying to re-enable the PIN fails with the error message “Sorry, there was a problem signing you in”.

The amount of misleading information I found on the Internet is astonishing, including a phrase from what appeared to be a Microsoft support person, stating «The operating system should always go with the BIOS. If the BIOS is freshly updated, then the OS has to be fresh as well.» Facepalms all over the place.

The solution (for me): go back in the BIOS and re-enable the TPM (“Security Module”).

Some background is needed. The 2017 Gamestation I’m using nowadays is built using a MSI X299 SLI PLUS with a plug-in TPM which is a requirement to use BitLocker (and if you think that makes it very safe, think again).

I had just updated the firmware of the motherboard (that at this point we all still call “BIOS” despite being clearly “UEFI” based), and it turns out that MSI just silently drops most of the customization to the BIOS settings after an update. In particular this disabled a few required settings, including the TPM itself (and Secure Boot — I wonder if Matthew Garrett would have some words about the implementation of it in this particular board at this point).

I see reports on this for MSI and Gigabyte boards alike, so I can assume that Gigabyte does the same, and requires re-enabling the TPM in the settings when updating the BIOS version.

I would probably say that the MSI firmware engineering does not quite fully convince me. It’s not just the dropping of all the settings on update (which I still find is a bad thing to do to users), but also the fact that one of the settings is explicitly marked as “crypto miner mode” — I’m sure looking forward to the time after the bubble bursts so that we don’t have to pay a premium for decent hardware just because someone thinks they can make money from nothing. Oh well.