Paperless home, sorted home

You probably don’t remember, but I have been chasing the paperless office for many years. At first it was a matter of survival, as running my own business in Italy meant tons of paperwork, and sorting it all out while being able to access it was impossible. By scanning and archiving the invoices and other documents, the whole thing got much better.

I continued to follow the paperless path when I stopped running a company and was just working, but by then the world had started following me, and most services were insisting on paperless billing anyway, which was nice. In Dublin I received just a few pieces of paper a month, and it was easy to scan them and then bring them to the office to dispose of in the secure shredding facilities. I kept this up after moving to London, despite the movers stealing my scanner, using a Brother ADS-1100W instead.

But since the days in Italy, my scanning process changed significantly: in Dublin I never had a Linux workstation, so the scanner ended up connected to my Gamestation running Windows, with PaperPort, which was at the time marketed by Nuance. The bright side of this was that PaperPort applies most of the same post-processing as Unpaper while at the same time running OCR over the scanned image, making it searchable on Google Drive, Dropbox and so on.

Unfortunately, it seems like something changed recently, either in Windows 10, the WIA subsystem or something else altogether, and from time to time, after scanning a page, PaperPort or the scanner freezes and never finishes the processing, requiring a full reboot of the OS. Yes, I tried power-cycling the scanner; yes, I tried disconnecting and reconnecting the USB cable. Nothing seems to work except a full reboot, which is why I’m wondering if it might be a problem with the WIA subsystem.

The current workaround I have is to use the TWAIN system, which is the same one I used with my scanner on Windows 98, which is surprising and annoying. In particular, I need to remember to turn on the scanner before I open PaperPort, otherwise it fails to scan and the process has to be killed with the Task Manager. So I’m actually considering switching the scanning to Linux again.

My old scan2pdf command-line tool would help, but it does not include the OCR capabilities. Paperless seems more interesting, and it uses Unpaper itself. But it assumes you want the document stored on the host, as well as scanned and processed. I would have to see if it has integration with Google Drive, or otherwise figure out how to get that integration going with something like rclone. But, well, that would be quite a bit of work that I’m not sure I want to do right now.

Speaking of work, and organizing stuff — I released some hacky code which I wrote to sort through the downloaded PDF bills from various organizations. As I said on Twitter when I released it, it is not a work of engineering, or a properly-cleaned-up tool. But it works for most of the bills I care about right now, and it makes my life (and my wife’s) easier by having all of our bank statements and bills named and sorted (particularly when just downloading a bunch of PDFs from different companies once a month, and sorting them all.)

Funnily enough, writing that tool also had some surprises. You may remember that a few years ago I leaked my credit card number by tweeting a screenshot of what I thought was uninitialized memory in Dolphin. Unlike Irish credit card statements, British card statements don’t include the full PAN in any of the pages of a PDF. So you could think it’s safe to provide a downloaded PDF as proof of address to other companies. Well, turns out it isn’t, at least for Santander: there’s an invisible (but searchable and highlightable) full 16-digit PAN at the top of the first page of the document. You can tell it’s there when you run the file through pdftotext or similar tools (there’s a similar invisible number on bank statements, but that’s also provided visible: it’s the sort-code and account number).
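If you want to check your own statements before sending them anywhere, something along these lines could work. It is only a rough sketch in Python, and it assumes you already have the text of the PDF as a string (from whatever extraction tool you prefer). The function names are made up for the example; the Luhn checksum is the standard one used for card numbers, and it is there to weed out random 16-digit sequences.

```python
import re


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


# 16 digits, optionally grouped in fours by spaces or dashes,
# and not embedded in a longer run of digits.
PAN_PATTERN = re.compile(r"(?<!\d)(?:\d{4}[ -]?){3}\d{4}(?!\d)")


def find_pan_candidates(text: str) -> list[str]:
    """Return the 16-digit sequences in the text that pass the Luhn check."""
    candidates = []
    for match in PAN_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            candidates.append(digits)
    return candidates
```

An empty result does not prove the file is clean (the number could be split up or encoded differently), but a non-empty one is a good reason not to send the PDF around as it is.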

Oh, and it looks like most Italian bills don’t use easily-scrapeable layouts, which is why there are none of them in the tool right now. If someone knows of a Python library that can extract text from pages using “Figure” objects, I’m all ears.

Scanning documents with the GT-S50

You might remember that a couple of months ago I bought a high-end scanner for my ongoing mission to make my office as paperless as humanly possible for me.

In due time, I also replaced my previous scan2pdf sh script with something a bit more reliable, written in Ruby. The original reason was that the options supported by the four scanners I have worked with up to now – an Epson Perfection 2480/2580, an HP OfficeJet MFP with a “fake feeder”, an HP LaserJet network MFP, and now the GT-S50 – varied considerably, and I needed something that would actually allow me to choose the options to pass depending on the device itself. The end result is a Ruby script, scan2pdf, which still does not implement all I wished for, but comes much nearer than what I used before.

One of the most recent features came from a requirement I found when trying to get rid of some legal paperwork from some years ago: for whatever reason, in Italy, a number of legally-binding papers seem to still be printed and set in folded sheets (what is usually called foglio protocollo in Italian); this means that you got a single A3-sized sheet, folded in the middle to provide four A4-sized sides to write on. And actually cutting this down to reduce it to two A4 sheets is not an option. How do you scan that?

With a standard flatbed scanner it is relatively easy to scan it: you just scan the four sides one by one. On a sheet-fed scanner like the GT-S50, you can’t do so unless you use the plastic envelope that they give you, which allows you to scan irregularly-shaped sheets. And even doing so, it is a bit of a problem because you either have to run four scans, or you have to interleave the external and internal sides. So the end result was implementing a --folded switch to the script, and there you go.
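To give an idea of what the interleaving involves, here is a small sketch, in Python rather than Ruby and not taken from scan2pdf itself. It assumes each folded sheet goes through the feeder twice in duplex mode: first as it comes, which yields sides 1 and 4, then refolded inside out, which yields sides 2 and 3. The function name and the scan order are assumptions made for the example.

```python
def reorder_folded(scans: list) -> list:
    """Put duplex scans of folded sheets back into reading order.

    Assumed input order per sheet: outer front (side 1), outer back
    (side 4), inner front (side 2), inner back (side 3).
    """
    if len(scans) % 4 != 0:
        raise ValueError("expected four scanned sides per folded sheet")

    ordered = []
    for start in range(0, len(scans), 4):
        outer_front, outer_back, inner_front, inner_back = scans[start:start + 4]
        # Reading order is 1, 2, 3, 4: the inner sides sit between the outer ones.
        ordered.extend([outer_front, inner_front, inner_back, outer_back])
    return ordered
```

With the scans of one sheet coming in as sides 1, 4, 2, 3, the function hands them back as 1, 2, 3, 4; a different feeding order would need a different permutation.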

What I’m missing now is some kind of “presets” system, so that I don’t have to repeat the same options each time when I’m doing common scans: a folded scan requires me to use the envelope, which in turn means that the first 20mm of the scan need to be discarded (as it’s the envelope’s detection zone). And all the scans from the GT-S50 (as well as the Perfection 2580) are to be done in the middle area, rather than starting from the far left or right (as was done, instead, on the HP MFPs).

For this one, I’m actually open to suggestions: does anybody know a decent Ruby library to handle ini-style configuration files? I’m not keen on using YAML here, not only because I can’t stand the format, but also because it seems more natural to keep using key/value pair files when there is no need for anything particularly complicated (multiple sections are welcome for the presets themselves).
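Just to show the kind of file I have in mind, here is a sketch of the idea using Python’s standard configparser module (Python only because it ships an ini parser; the script itself is Ruby). The section and option names are invented for the example and are not scan2pdf options. Values in the DEFAULT section apply to every preset unless the preset overrides them.

```python
import configparser

# Hypothetical presets file: one section per preset.
PRESETS_INI = """
[DEFAULT]
resolution = 300
mode = gray

[folded]
folded = yes
skip_top_mm = 20

[invoice]
mode = lineart
"""


def load_preset(ini_text: str, name: str) -> dict[str, str]:
    """Return the options of one preset, with the defaults filled in."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(name):
        raise KeyError(f"unknown preset: {name}")
    return dict(parser[name])


def mm_to_pixels(mm: float, dpi: int) -> int:
    """Convert a length in millimetres to pixels (25.4 mm per inch)."""
    return round(mm / 25.4 * dpi)
```

Loading the “folded” preset gives back its own two keys plus the default resolution and mode, and the 20mm envelope zone works out to 236 pixels at 300 dpi.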

Gentoo and EPSON scanners

“Harry,” Susan said. “Have you ever heard of the paperless office?”

“Yeah,” I said. “It’s like Bigfoot. Someone says he knows someone who saw him, but you don’t ever actually see him yourself.”

Jim Butcher — Changes

Because of the way Italian bureaucracy is designed, most of the business communications I deal with are still in paper form. And since my business lately has been covering a number of different customers, with different degrees of formality, I’ve been pretty much swamped by paper. I have been looking for ways to reduce the amount of paper I have around, or at least the amount of paper I have to look at often, and the solution was to get a scanner with an automatic document feeder (ADF) and scan all new paper documents before either shredding or archiving them.

I settled first for an HP OfficeJet multi-function printer (which jammed so many times it wasn’t funny, but at least was cheap) and then for a LaserJet one that I was given by a family friend who couldn’t use it via USB any longer. Unfortunately it was an M1522nf, which, it turns out, has a defective formatter card that requires either baking or replacement. HP wouldn’t replace it unless I paid a couple hundred euros, which wouldn’t be worth it at all.

I went to look for a solution, with the understanding that my current laser printer (a Kyocera-Mita FS-1020D) is quite a charm to use, especially now that it is no longer connected to an Apple Airport Extreme, but rather to my Gentoo-based router whose cups instance is directly referenced by both Yamato and Raven so that I don’t have to run it on them any longer. I went to look at more professional, office-oriented scanners, since the only one I could find in “entry level” by HP listed a duty cycle of 100 scans/day (which I’m sure I would bust easily; a bank-related contract is usually longer).

The requirements, besides being able to scan a decent number of sheets (more than 20), were generally the usual ones you’d expect for a person like me: it has to work with Linux, possibly 64-bit, preferably with no prebuilt, proprietary software. To be honest, this doesn’t seem to be feasible at all. And before somebody asks: no, HP is not much better. Sure, HPLIP is open source and Free, but to use the scanner in the aforementioned M1522nf, you have to install their proprietary plugin bits. Canon scanners, which seem to be the cheapest with Linux support for ADF, are supported by their own driver, but it’s totally proprietary and, more importantly, it only works on 32-bit systems.

EPSON, instead, seems to have a more interesting approach. The Avasys-developed epkowa backend for sane is actually quite nice: it encapsulates a decently-sized open-source backend with a number of proprietary plugins (and in some cases firmware files). I already had a bit of experience with that backend, since my flatbed scanner is a Perfection 2480/2580, but I hadn’t been able to make use of it in a long time, simply because the non-basic functionality (film scanning) is supported by a 32-bit only plugin.

After snooping around and poking my hardware supplier about it, I decided to get a GT-S50 (sheet-fed) document scanner, which is shaped more like an inkjet printer or a fax machine than a scanner. This time as well a plugin is needed, so based on the GT-F720 ebuild that was already in tree I wrote one for the required plugin (and also one for the GT-F500 plugin that is used by the Perfection above — but the only task I can use it for is to install the firmware file used by snapscan, so if you want to try it on 32-bit please let me know if it works at all!). You can now find iscan-plugin-gt-s80 (it includes support for both S50 and S80) and iscan-plugin-gt-f500 in tree already.

I got the scanner on Friday evening and set it up; the driver worked on the first try, and the iscan tool provided by Avasys/Epson works just right. On the other hand, my customized script over scanimage (as provided by sane-backends) was not working properly when scanning duplex (i.e. both sides of the sheet at once). In such a configuration, a single pass of the sheet in the ADF causes two pages to be read; the driver (or the firmware, not sure) caches the back side as the second page, and the next time sane is asked for a page, it returns the one already scanned. Whether that second request worked or not was random, which points to a race condition.

I haven’t spent enough time on it yet to know where the problem lies exactly: it might be a bug in the scanimage frontend (given that iscan works fine), or a bug in the drivers. At any rate, for now I worked around it by adding a sleep(1) before calling sane_start() in the file. You can find a hacked 1.0.22 ebuild for sane-backends in my overlay if you happen to have similar problems.
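The workaround boils down to pausing before each page is requested. The following is not the patch itself (that one is a one-line change to C code), just a Python sketch of the same idea; scan_pages and request_page are hypothetical names, with request_page standing in for whatever call actually starts the acquisition of the next page.

```python
import time
from typing import Callable, List


def scan_pages(request_page: Callable[[], bytes],
               page_count: int,
               settle_seconds: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> List[bytes]:
    """Request pages one by one, pausing before each request."""
    pages = []
    for _ in range(page_count):
        # Give the device time to settle before asking for the next page;
        # this only papers over the race, it does not fix it.
        sleep(settle_seconds)
        pages.append(request_page())
    return pages
```

The sleep function is passed in as a parameter only so that the pause can be replaced in tests; a fixed delay like this trades scanning speed for reliability and does nothing about the root cause.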

At the end of the day, I’m pretty happy with the device: it scans very nicely and very fast, and it has a high enough duty cycle for the kind of stress I could put it through. Once I’m able to resolve the above-noted issue with scanimage it’ll work much better. But there are a few issues that still need to be solved, such as:

  • iscan needs to be set to use /var/lib/iscan rather than the current /var/lib/lib/iscan, and that requires plugins to be re-registered;
  • I need to find a way to register the plugin properly when installing in a different ROOT, which right now neither of my ebuilds does;
  • the iscan-plugin-gt-f720 ebuild needs to be brought up to speed with the other two, as right now it seems sub-par (among other things it installs the plugins in /usr — which, albeit being what AVASYS does on RedHat systems, is wrong, as they should go in /opt).

I dream the paperless office

And I know pretty well it’s something almost impossible to have; yet I’d like to have it because I’m drowning in an ocean of paper right now. And paperwork as well.

While I have the Sony Reader to avoid having to deal with tons of dead tree books (although I do have quite a bit still, lots of which are still being consulted), I didn’t try before to clean up my archive of receipts, packaging slips, and stuff like that.

The time has come now, since I have to keep a fuller, cleaner archive of invoices sent and received for my new activity as a self-employed “consultant”; I decided to scan and archive away (in a plastic box in my garage, that is) all of the job papers I had from before, as well as all my medical records, and the remaining parts of the archive. The idea was that by starting anew I could actually start keeping some kind of account of what I receive and spend, both for work and for pleasure. Together with the fact that it’s less stuff to carry around with me, this makes two things that would get me nearer to actually moving out of home.

Sunday and Monday I spent about eight hours a day scanning and organising documents, trashing all the stuff I’m not interested in keeping an original of (that is, stuff I’m glad I can archive, but that isn’t that important even if I lose it), and putting away in the plastic box the important stuff (job and medical records, receipts for stuff that is still under warranty, etc.). I think I got through around 400 pages, on a flatbed scanner, without a document feeder, assigning a name to each, and switching scans between 150 and 300 dpi, and between colour, grayscale and lineart.

I guess I’ll try to keep my archive more up to date from now on by scanning everything as it arrives, instead of waiting for it to pile up for twelve years (yes, I have some receipts dating back twelve years, like the one for my first computer, a Pentium 133) and then trying to crunch through it in a few days. My wrist is aching like it never did before, from the sheer number of sheets put on and removed from the scanner (I sincerely hope it’s not going to give up on me, it would be a bad thing).

Now I’m looking for a way to archive this stuff so that it is quick to search. File-based structures don’t work that well; tagging the stuff would work better, but I have no idea what to use for that. If anybody knows a free-software solution for archiving, I’d be glad to hear about it; being queryable over the network is a bonus, and working on Mac OS X with Spotlight is a huge bonus.

I’m also going to try out some accounting software; I’ve heard good things about gnucash but never tried it before, so I’m merging it right now. For now I don’t have enough invoices to send out to give me a reason to start writing my own software, but if there is something out there customisable enough I’d be glad to bite the bullet and use it. Spending my free time working on software I need in order to work is not my ideal way to solve the problem.

Up to now I only worked very low profile, without having to invoice or keep records; luckily I have an accountant who can tell me what to do, but there are personal matters, including personal debts, credit cards and other expenses, that I finally want to take a good look at, so that I can pay them off as soon as possible, and then start putting some money away to pay for a car and a place to move to. Not easy to do, I guess, but that’s what I hope to be able to do.