You probably don’t remember, but I have been chasing the paperless office for many years. At first it was a matter of survival, as running my own business in Italy meant tons of paperwork, and sorting it all out while being able to access it was impossible. By scanning and archiving the invoices and other documents, the whole thing got much better.
I continued to follow the paperless path when I stopped running a company and went back to just working, but by then the world had started following me, and most services insisted on paperless billing anyway, which was nice. In Dublin I received just a few pieces of paper a month, and it was easy to scan them and then bring them to the office to dispose of in the secure shredding facilities. I kept this up after moving to London, despite the movers stealing my scanner, using a Brother ADS-1100W instead.
But since the days in Italy, my scanning process changed significantly: in Dublin I never had a Linux workstation, so the scanner ended up connected to my Gamestation using Windows — using PaperPort which was at the time marketed by Nuance. The bright side of this was that PaperPort applies most of the same post-processing as Unpaper while at the same time running OCR over the scanned image, making it searchable on Google Drive, Dropbox and so on.
Unfortunately, it seems like something changed recently, either in Windows 10, the WIA subsystem or something else altogether, and from time to time after scanning a page, PaperPort or the scanner freezes and never finishes processing, requiring a full reboot of the OS. Yes, I tried power-cycling the scanner; yes, I tried disconnecting and reconnecting the USB cable. Nothing seems to work except a full reboot, which is why I’m wondering if it might be a problem with the WIA subsystem.
The current workaround I have is to use the TWAIN system, which is the same that I used with my scanner on Windows 98, which is surprising and annoying — in particular I need to remember to turn on the scanner before I open PaperPort, otherwise it fails to scan and the process will need to be killed with the Task Manager. So I’m actually considering switching the scanning to Linux again.
My old scan2pdf command-line tool would help, but it does not include the OCR capabilities. Paperless seems more interesting, and it uses Unpaper itself. But it assumes you want the document stored on the host, as well as scanned and processed. I would have to see if it has integration with Google Drive, or otherwise figure out how to get that integration going with something like rclone. But, well, that would be quite a bit of work that I’m not sure I want to do right now.
Speaking of work, and organizing stuff — I released some hacky code which I wrote to sort through the downloaded PDF bills from various organizations. As I said on Twitter when I released it, it is not a work of engineering, or a properly-cleaned-up tool. But it works for most of the bills I care about right now, and it makes my life (and my wife’s) easier by having all of our bank statements and bills named and sorted (particularly when just downloading a bunch of PDFs from different companies once a month, and sorting them all.)
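To give an idea of what this kind of sorting involves, here is a minimal sketch in Python. It is not the code I released: the issuer names, the `ISSUER_MARKERS` table, the date format and the function names are all made up for illustration, and it assumes the text of the PDF has already been extracted into a string by some other tool.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical markers: a phrase found in the extracted text -> folder name.
ISSUER_MARKERS = {
    "Example Energy Ltd": "example-energy",
    "Example Bank plc": "example-bank",
}

# Assumes dates written like "3 March 2019"; real bills vary a lot.
DATE_PATTERN = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) (\d{4})\b")


def guess_issuer(text: str) -> Optional[str]:
    """Return the folder name of the first known issuer mentioned in the text."""
    for marker, slug in ISSUER_MARKERS.items():
        if marker in text:
            return slug
    return None


def guess_date(text: str) -> Optional[date]:
    """Return the first thing in the text that parses as a date."""
    for match in DATE_PATTERN.finditer(text):
        try:
            # %B is locale-dependent; this assumes English month names.
            return datetime.strptime(" ".join(match.groups()), "%d %B %Y").date()
        except ValueError:
            continue  # e.g. "3 Monkeys 2019" is not a date
    return None


def target_name(text: str) -> Optional[str]:
    """Build a relative path like 'example-energy/2019-03-03.pdf', or None."""
    issuer = guess_issuer(text)
    issued = guess_date(text)
    if issuer is None or issued is None:
        return None
    return f"{issuer}/{issued.isoformat()}.pdf"
```

Anything that returns `None` stays where it is for a human to look at, which is about the level of rigor you want from a tool like this.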
Funnily enough, writing that tool also had some surprises. You may remember that a few years ago I leaked my credit card number by tweeting a screenshot of what I thought was uninitialized memory in Dolphin. Unlike Irish credit card statements, British card statements don’t include the full PAN in any of the pages of a PDF. So you could think it’s safe to provide a downloaded PDF as proof of address to other companies. Well, turns out it isn’t, at least for Santander: there’s an invisible (but searchable and highlightable) full 16-digit PAN at the top of the first page of the document. You can tell it’s there when you run the file over pdf2text or similar tools (there’s a similar invisible number on bank statements, but that’s also provided visible: it’s the sort-code and account number).
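If you want to check your own statements before sending them to anyone, something along these lines might help. It is only a sketch: it assumes you already have the extracted text as a string, the function names are mine, and it uses the standard Luhn checksum to filter out 16-digit numbers that can’t be card numbers.

```python
import re
from typing import List

# 16 digits, optionally grouped in fours with spaces or hyphens,
# and not part of a longer run of digits.
PAN_PATTERN = re.compile(r"(?<!\d)(?:\d{4}[ -]?){3}\d{4}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Check a string of digits against the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return the 16-digit, Luhn-valid numbers found in the extracted text."""
    candidates = (
        re.sub(r"[ -]", "", match.group())
        for match in PAN_PATTERN.finditer(text)
    )
    return [number for number in candidates if luhn_valid(number)]
```

A non-empty result on a statement that shows no full card number on the page is a good hint that there is hidden text you may not want to share. An empty result proves nothing, of course: the number could be split up or laid out in a way the pattern doesn’t catch.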
Oh, and it looks like most Italian bills don’t use easily-scrapeable layouts, which is why there are none of them in the tool right now. If someone knows of a Python library that can extract text from pages using “Figure” objects, I’m all ears.