Documents Management (Searching For Bigfoot, Again)

Flameeyes

8 months ago

Context for the title: a few years ago, I wrote a post about the paperless office, quoting Jim Butcher and his quip about the paperless office being more elusive than Bigfoot (and given that Harry Dresden worked for Sasquatch, that makes it very elusive), so whenever I think of the paperless office, I think of it as Bigfoot.

Of course, a paperless office was something that mattered a lot more for me back in Italy, when I was a freelancer and was running my own business, but as I wrote four years back, even in a more “normal” household, the amount of paperwork that still comes in snail mail is significant. And indeed, the load hasn’t improved significantly in the last four years for me; if anything, it got worse in the last few months for a number of reasons.

So how do you deal with this, when you’re the kind of annoyed engineer who has grown jaded with a lot of solutions, both proprietary and Free? Up until recently, my solution wasn’t particularly sophisticated at all, and depended on whether the document was originally paper or not.

For scanning, I’m still using PaperPort to this day, although at this point I have lost track of which company I bought my license from: every time I write a post about this, it appears to be in the hands of a new brand, even though it has basically no UI changes and no changelog of compatibility improvements. One thing that I changed over the past four years is that instead of dropping the scanned documents into Google Drive, I’m now dropping them into a shared folder on my NAS, which is actually the input folder for Paperless-ngx.

I have mused about moving to Paperless or one of its derivatives (ngx is apparently the only one still out there, from what I can tell) in the previous post too, but one of the things that annoyed me about the idea was having to set up a home server for this. I have indeed gone back and forth with the idea, including running it in the TrueNAS virtualization, but eventually I settled on turning a by-then shelved NUC into a home server, and running it there.

Paperless is actually quite cool, and ironically relies on unpaper, which I haven’t completed refactoring to be less integration-hostile yet, but I still hope to. Scanned PDFs get passed through OCRmyPDF for archival, and while it still has some rough edges (you can’t consume signed PDFs by default, you need to tell it to not worry about the invalidated signatures), its search is decent enough to replace Google Drive’s for me. The whole system maintains not one but two copies of each document: original and archival.

By default, useful metadata is attached in the database, rather than the archive files, which is both a blessing (much easier to change around) and a bit of a curse (sometimes I wish there was a good integration of XMP sidecar files.) And while there are already useful fields in the default schema, it is also very easy to add custom fields for things like account holders, or account number.

I spent probably a year just throwing everything I scanned into Paperless, and only barely organizing it on an ad-hoc basis, and the whole thing has grown into a completely unusable mess. The more organized part of my collection of documents actually stayed in Google Drive, where my pdfrename tool has been renaming the various files I collected from bills, invoices, and so on. This is probably my most useful codebase, despite the fact that it is an absolute mess of a tool with no smarts and just a lot of manually tweaked identification of documents.

But as I said, recently the number of documents that I found myself managing increased significantly, so I decided to spend some time trying to build up my processes to be a bit more sophisticated, but also more scalable. For this to work, I needed to be able to run the same or a better renaming process on the documents in Paperless, so I ended up writing a rough Paperless REST client, which downloads the original PDF of a document and runs it through the same processing but, instead of renaming it, applies the derived metadata as fields to Paperless. That is why, if you look at the history of the project, I have been building up a richer schema for it.
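To make the shape of such a client a bit more concrete, here is a minimal sketch of the two calls involved. This is not the actual client (which is unpublished); it only builds the requests, and it assumes the Paperless-ngx REST API as I understand it: token authentication, the `download` endpoint with `original=true`, and a `custom_fields` list on the document. The host, token, and field ID are hypothetical placeholders.

```python
import json
import urllib.request

API_BASE = "http://paperless.example:8000"  # hypothetical host
TOKEN = "changeme"  # hypothetical API token


def download_original_request(doc_id: int) -> urllib.request.Request:
    """Build the request that fetches the original (not archival) file."""
    url = f"{API_BASE}/api/documents/{doc_id}/download/?original=true"
    return urllib.request.Request(url, headers={"Authorization": f"Token {TOKEN}"})


def set_custom_fields_request(doc_id: int, fields: dict[int, str]) -> urllib.request.Request:
    """Build the PATCH that stores derived metadata as custom field values.

    `fields` maps a custom field ID (as defined in Paperless) to its value.
    """
    payload = {
        "custom_fields": [{"field": fid, "value": val} for fid, val in fields.items()]
    }
    return urllib.request.Request(
        f"{API_BASE}/api/documents/{doc_id}/",
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Token {TOKEN}",
            "Content-Type": "application/json",
        },
    )
```

Either request can then be sent with `urllib.request.urlopen()`. One thing to double-check against the API documentation before relying on this: my understanding is that the `custom_fields` list in a PATCH replaces the existing values rather than merging with them.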

Ideally, I would want to run this “mining” on each document as it is uploaded, which is theoretically possible by using a custom processing script, but it does look like a much larger project than what I would want to venture into. I guess I should rather file a feature request to configure a webhook to be called as part of the existing workflow feature, which then would allow me to run the renamer as a separate Web Application on the same Docker system as Paperless is running right now. For now, I just run the script before doing any administrative task, so that any document I recently added to Paperless is suitably analyzed. And of course, I have a Storage Path naming scheme that approximates my previous file naming when the data was stored in Google Drive.

I also have a pre-check that applies tags to documents that have been produced by either PaperPort or other software I previously used to scan my documents, so that I don’t waste time trying to analyze those. This is almost the kind of thing that Paperless’s own Workflow engine should be able to handle for me, but as it happens it does not allow me to query the document’s metadata as it is. It’s another feature that maybe I should find time to request.
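A pre-check of this kind can be as small as a substring match on the PDF's document information. The sketch below is an illustration only: it assumes the check looks at the standard `/Producer` and `/Creator` entries, that some PDF library has already read them into a plain dictionary, and that the marker strings (hypothetical here) match what the scanning software actually writes.

```python
# Hypothetical marker substrings; extend with the other scanning software used.
SCANNER_MARKERS = ("paperport",)


def is_scanned(metadata: dict[str, str]) -> bool:
    """Return True when Producer or Creator names a known scanning tool."""
    for key in ("/Producer", "/Creator"):
        value = metadata.get(key, "").lower()
        if any(marker in value for marker in SCANNER_MARKERS):
            return True
    return False
```

Documents for which this returns true would get the "scanned" tag and be skipped by the analysis step.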

This handles most of my management pipeline at this point: my analysis script handles identifying the correspondent, the document type, account holder and number (since some of the services I subscribe to, such as O2, send me separate documents for different lines), as well as document number in some cases (because it’s easy, and sometimes it’s easier to look it up through those.)
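For a sense of what "manually tweaked identification" can look like, here is a deliberately naive sketch of one rule. Everything in it is invented for illustration: the correspondent name, the marker text, and the account number layout do not come from any real bill, and this is not how pdfrename is actually structured.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mined:
    correspondent: str
    document_type: str
    account_number: Optional[str] = None


# Hypothetical layout: a label followed by a run of digits.
ACCOUNT_RE = re.compile(r"Account number:\s*(\d+)")


def mine_example_bill(text: str) -> Optional[Mined]:
    """Identify a bill from the made-up "Example Telecom" by its extracted text."""
    if "Example Telecom" not in text:
        return None  # not ours; let another rule try
    match = ACCOUNT_RE.search(text)
    return Mined("Example Telecom", "Bill", match.group(1) if match else None)
```

A real tool would chain many such rules, one per correspondent and document layout, and fall through to "unidentified" when none matches.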

That leaves two parts of the pipeline: ingestion, and consumption. For the latter, I leveraged the fact that I’m keeping all of the data on my NAS — TrueNAS has an option to sync to and from Cloud services, including Google Drive. So I’m sending a copy of the original documents folder to my Drive, where I’m also sharing it with my wife, so that she has access to all of our documents independently, no matter whether we’re home or not.

You may be wondering why I need this, and the answer is that no matter what I think of it, the “cloud” is the safest, most reliable way for us to have access to those documents anywhere. Running Paperless on a server at home barely gives us the ability to access it while at home: I made the mistake of running it over IPv6, and the reliability of it actually being available on any of our workstations is quite low, given UniFi, Android, and Windows all put together. I’m considering finding a way to get Let’s Encrypt certificates via DNS to move it onto a Tailscale IP address, which is definitely a lot more reliable, if a bit (a lot, in some cases) slower, but that also does not mean we can have access to it at all times: for instance, as I’m typing, I’m connected to an in-plane WiFi network that does not let me access Tailscale at all!

So that finally brings me to ingestion. One of the great features that I wanted to leverage with Paperless is its ability to ingest documents attached to email messages! At first, I thought of setting up an application-specific password in Fastmail so that Paperless can ingest my email directly, but I’m paranoid and I didn’t quite like the idea, even though Fastmail’s ASPs are definitely safer than Google’s and would actually allow me to limit what can be done with the password. So instead I’m now running my own mail server (though a very restricted and limited one), and I’ll get to explain the why and how in a different post.

Unfortunately, only a handful of services, and no banks, actually attach their bills to email. Almost everything requires you to go online and download it yourself from their website, which is more than a little bit annoying. While a friend used to have a script to download bills from Virgin Media Ireland through a WebDriver automation, I have not done that for any of our services or banks just yet, and unfortunately most of our banks don’t even let us do that. Which means that for those I still end up downloading and copying the file into the inbox folder manually, which is annoying from a mobile phone, lacking a Paperless app (or even PWA) for me to share the files to.

I have not changed my scan ingestion path either: I’m still scanning with PaperPort and rely on its OCR feature. One of the things I could improve on is scanning directly from the ADS-1100W into Paperless. Unfortunately, the only reasonable protocol one can use with this scanner is FTP, and as it turns out, the FTP integration in TrueNAS is not particularly well done: it uses proftpd, but that software does not integrate IP-based access controls, suggesting that you “just” configure them in your firewall, and then TrueNAS does not have a way to define IP-bound firewall rules, at least in its most basic configuration. I guess an alternative approach would be to run a “fake” FTP server that only receives the scanned PDF and submits it to Paperless; maybe I’ll write that one day.
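The "submit to Paperless" half of such a fake server is the simpler part, and could be sketched as below. This leaves out the FTP-receiving side entirely and only builds the upload request, assuming the Paperless-ngx `post_document` endpoint with a multipart `document` field and token authentication; the base URL and token passed in are placeholders.

```python
import urllib.request
import uuid


def upload_request(base_url: str, token: str, filename: str, pdf: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one scanned PDF."""
    boundary = uuid.uuid4().hex
    body = b"".join(
        [
            f"--{boundary}\r\n".encode(),
            f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'.encode(),
            b"Content-Type: application/pdf\r\n\r\n",
            pdf,
            f"\r\n--{boundary}--\r\n".encode(),
        ]
    )
    return urllib.request.Request(
        f"{base_url}/api/documents/post_document/",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Since the scans already land in a folder that Paperless consumes, an even simpler variant would have the fake server write the received file straight into that folder and skip the API altogether.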

You may notice that I have not published any of my Paperless integration yet. This is because while I tried to keep it relatively generic in its shape, it is very much encoding my own workflow into the code as it is. Before I dump all of this code in public, I want to try to at least make it a bit more useful, such as splitting the Paperless REST client from the renamer integration.

I also need to find a way to make sure that stale emails that didn’t include a PDF, and thus weren’t processed by Paperless, do not pile up in the inbox of the dedicated account, because that is what is happening right now, oops.
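One possible building block for that cleanup is a check for whether a message carries a PDF at all. The sketch below uses only the standard library and assumes the raw message bytes have already been fetched (for instance over IMAP); what to do with the messages that fail the check, and when, is left open.

```python
from email import message_from_bytes, policy


def has_pdf_attachment(raw: bytes) -> bool:
    """Return True if any attachment looks like a PDF by type or filename."""
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.iter_attachments():
        name = (part.get_filename() or "").lower()
        if part.get_content_type() == "application/pdf" or name.endswith(".pdf"):
            return True
    return False
```

Messages for which this returns false are the ones Paperless would have ignored, so they are candidates for deletion or archival by a periodic job.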

Overall, it sounds like my Bigfoot is taking shape, and while my wife and I have a lot of documents to go through to make sure they are correctly tagged and organized, particularly the scanned ones that are not processed by my renamer, I think we’re in a much better organized place than we were four years ago for sure. Or even just a couple of months ago.
