Integrating Paperless-NGX with my own PDF Renamer

You may remember that a couple of years ago I wrote a post about trying to find a better workflow to manage the waterfall of PDF documents that my wife and I receive on a regular basis: bank and credit card statements, electricity (and now gas), water, and phone service bills, invoices, and so on.

This became even more important as we prepared to buy the house we now live in, particularly since we ended up moving out of London to a temporary flat, collecting a large number of contracts, agreements, final bills, first bills, and so on along the way. And that's to say nothing of having to produce years' worth of bank statements to confirm the provenance of funds.

I had already written and published my pdfrenamer tool, which parses the various documents we receive to figure out which service they come from, who the account holders are, and so on. To this day, this tool makes zero use of “AI” — I don’t actually think there’s a sensible way to mix computer-vision-based detection with the current deterministic design, so I’m not going to try.

What I did instead was build an integration between Paperless NGX (which, I apologize, I will likely improperly call Paperless a few times in this post and in the future anyway), which I’m using as my document management system, and the tool above. I called it, with zero creativity, flameeyes-paperless-automation. It started as a way to run the same renaming process on an already classified and archived document, as well as to re-run it on all the stored documents, so that once I fixed a renamer the fixes would propagate to the existing documents. It has grown a bit since then.

Please note that none of the tools I’m writing around this count as either Software Engineering or creativity — I’m releasing them under the most obvious permissive license I could, and I’ll be upfront that I’ve been experimenting with LLMs as CASE tools in both repositories. They are tools purpose-designed for my use case specifically — if they happen to match yours, great! If not, please don’t complain; I will take pull requests for features as long as they don’t affect my workflow.

This turned out to be a learning journey in more ways than one. While the renamer itself was originally developed for, and run on, Windows, I wanted to run the automation closer to the Paperless server, to avoid fetching every PDF twice over the network (NAS to Paperless server to desktop/laptop). Eventually this became even more important once I upgraded the NAS: Paperless is now running as a virtual machine on the same hardware, using NFS over a (separate) virtio network to avoid hitting the physical network layer. While it’s far from zero-copy (the PDF is passed between multiple virtual networks), it never leaves the actual host it’s running on.

Doing this work also allowed me to take a more thoughtful approach to the way pdfrenamer provides the additional details in a schematized format, so that I can streamline the document types in Paperless (document types being a first-class feature of Paperless’s schema) and avoid similar, but not identical, names. It also allowed me to think through a few other details I can extract from PDF files easily, namely the account and document numbers (invoice number, bill number, etc.). This is useful and important because I found that our former mobile provider (O2 UK) would issue multiple bills with the same dates and the same account holder, since we had three lines with them (past tense, because they pulled a bad one, and since our new house is not covered by their network anyway, we just migrated to alternative providers — I’ll write up on that later).

Note, though, that this is something I ended up having to extract myself. Just as I complained seventeen years ago, there is still no provider that seems to include structured metadata in the PDFs. A few providers (particularly Octopus) at least identify the software used for creation in a way that is conducive to recognizing the documents, but there is no account- or document-level metadata provided. If you’re lucky, you can rely on the generation date to match the document date, but unfortunately even that is not a given, as sometimes the documents are generated on the fly. Unless they use iText.
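When the generation date is all you have to go on, it at least comes in the standard PDF date string format. Here is a minimal sketch of parsing it with just the standard library — the function name and the lenient fallbacks are my own, not what pdfrenamer actually does:

```python
import re
from datetime import datetime, timedelta, timezone

# PDF dates (per the PDF spec) look like D:YYYYMMDDHHmmSSOHH'mm',
# where O is +, -, or Z; every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D?:?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(Z|[+-]\d{2}(?:'\d{2}'?)?)?"
)

def parse_pdf_date(raw):
    """Parse a PDF metadata date string into an aware datetime, or None."""
    m = _PDF_DATE.match(raw.strip())
    if not m:
        return None
    year = int(m.group(1))
    month = int(m.group(2) or 1)
    day = int(m.group(3) or 1)
    hour = int(m.group(4) or 0)
    minute = int(m.group(5) or 0)
    second = int(m.group(6) or 0)
    tz = m.group(7)
    if not tz or tz == "Z":
        offset = timezone.utc  # assume UTC when no offset is given
    else:
        sign = 1 if tz[0] == "+" else -1
        hours = int(tz[1:3])
        mins = int(tz[4:6]) if len(tz) >= 6 else 0
        offset = timezone(sign * timedelta(hours=hours, minutes=mins))
    return datetime(year, month, day, hour, minute, second, tzinfo=offset)
```

Whether the resulting date actually matches the document date is, as noted above, provider-dependent.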

Thankfully, Paperless thought this out well — you can add custom named fields as part of the schema, which are then indexed, so you can search by them. This means that instead of using a boatload of tags, like I was doing before, to distinguish whether a document related to me, my mother, or my wife (or a combination thereof), I can now search using the Account Holder field (and I have saved searches for it).
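For illustration, setting such a field through the paperless-ngx REST API looks roughly like this — the base URL, token, and field id are hypothetical placeholders, and this is a sketch of the API shape rather than code from the actual automation tool:

```python
import json
import urllib.request

PAPERLESS_URL = "http://paperless.local:8000"  # hypothetical instance
API_TOKEN = "changeme"                          # hypothetical token
ACCOUNT_HOLDER_FIELD_ID = 1  # look up the real id via /api/custom_fields/

def custom_field_patch(field_id, value):
    """Build the PATCH body paperless-ngx expects for a custom field."""
    return {"custom_fields": [{"field": field_id, "value": value}]}

def set_account_holder(document_id, holder):
    """PATCH a document's Account Holder field (sketch, untested live)."""
    body = json.dumps(
        custom_field_patch(ACCOUNT_HOLDER_FIELD_ID, holder)
    ).encode()
    req = urllib.request.Request(
        f"{PAPERLESS_URL}/api/documents/{document_id}/",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Token {API_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the field is indexed, a saved search on it then replaces the old pile of per-person tags.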

To avoid having to re-run the identification on hundreds of documents – yes, I’m a data hoarder; it’s a side effect of both having run my own business and having had to search through years of paperwork for my parents back in Italy over time – I implemented a checkpointing feature. This meant I could simply set the script to run once an hour to process any new document added to the storage, which worked particularly well once I moved everything to run on the NAS unit, as the virtual network is a lot faster than the physical gigabit network I had before.
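A checkpoint for this kind of job can be as simple as remembering the highest document id already processed. This is a minimal sketch of the idea — the file format and function names are my own invention, not the actual implementation:

```python
import json
from pathlib import Path

# Remember the highest paperless document id we have already processed,
# so an hourly run only touches documents added since the last run.

def load_checkpoint(path):
    """Return the last processed id, or 0 if no checkpoint exists yet."""
    try:
        return json.loads(Path(path).read_text())["last_id"]
    except (FileNotFoundError, KeyError, json.JSONDecodeError):
        return 0

def save_checkpoint(path, last_id):
    Path(path).write_text(json.dumps({"last_id": last_id}))

def select_new(documents, path):
    """Filter to documents newer than the checkpoint and advance it."""
    last = load_checkpoint(path)
    fresh = [d for d in documents if d["id"] > last]
    if fresh:
        save_checkpoint(path, max(d["id"] for d in fresh))
    return fresh
```

Each hourly run then only pays the (virtual) network cost for documents it has never seen.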

This takes a moment to explain, since it’s a fair question why I would want the identification to run automatically, given that I could run it every time I add documents to the storage. Well, I wanted a little more automation: since at least a few of the bills I receive monthly (Octopus Energy, AWS, Hetzner) arrive as email attachments, and Paperless has the ability to fetch documents attached to email messages via IMAP, I set up a few email aliases that forward to a self-hosted IMAP server (in addition to my personal address, or in a couple of cases the address shared with my wife), using Mailu.

Because I’m not quite comfortable with self-hosted products, and in particular I dealt with email servers long enough to know I don’t want to run more than the minimum I need, the IMAP server is not even accessible over the Internet; it’s behind Tailscale. To be honest, a lot of what I do nowadays is behind Tailscale plus authentication, maybe because I’m paranoid, and maybe because it’s so easy that the additional security layer doesn’t cause too much grief. It still does sometimes, but the upside still outweighs the downside.

What this means is that, without me having to do anything, every month a number of documents just come into existence on my Paperless instance — and once a week they get synced over to my Google Drive, at least for the time being. Accessing Paperless on the go is still annoying at times, even with Tailscale working fine, so I’m keeping a copy of the processed, renamed files on Google Drive, managed by TrueNAS directly.

But what about the times I go and download all the various bank and credit card statements myself, and drop them in the Paperless inbox folder (also on the NAS)? Well, it turns out Paperless has a few integration options. While it’s supposedly possible to run a specific script at the time a document is ingested, that didn’t feel particularly practical. Instead, you can set up a Workflow that calls a webhook (i.e. makes a GET or POST request to a specific URL — I still don’t understand why we ended up giving a name to this concept, but, I guess) every time a new document is ingested.

So the automation tool now has an optional web server — which I’m running in Docker on the same machine as Paperless. Whenever a new document is ingested (either through email or the drop folder), the server gets called, and it then fetches the actual PDF from Paperless to see if it can identify it through the usual deterministic extraction — as long as the document is not obviously a scan.
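A minimal sketch of such a webhook receiver, using only the standard library — the payload shape is an assumption on my part (Paperless workflows let you template the request body, so here I assume a JSON body carrying the document id), and the real tool differs in detail:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def extract_document_id(body: bytes):
    """Pull the document id out of the webhook payload.

    Assumed payload shape: {"doc_id": <int>} — configured in the
    Paperless Workflow, not dictated by Paperless itself.
    """
    try:
        return int(json.loads(body)["doc_id"])
    except (ValueError, KeyError, TypeError):
        return None

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        doc_id = extract_document_id(self.rfile.read(length))
        if doc_id is None:
            self.send_response(400)
        else:
            # Here the real tool would fetch the PDF from Paperless
            # and run the deterministic renamer over it.
            self.send_response(200)
        self.end_headers()

# To run: HTTPServer(("127.0.0.1", 8080), WebhookHandler).serve_forever()
```

Fetching the PDF back from Paperless then happens over the API, on the same host, so the webhook body itself can stay tiny.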

That’s another important point. I currently have three main ways to add documents to Paperless: receiving them by email, dropping them on the inbox SMB share (write-only on the network), or… scanning them. Unfortunately, either the software I used for the longest time (PaperPort) or the Brother drivers have started fighting with Windows 11, and I couldn’t get my ADS-1100W scanner to scan through the app — previously, I would scan the document through my computer and immediately drop it onto the SMB share. Nowadays, what I’m doing is choosing on the scanner whether I want a black-and-white or a colour scan, and letting it drop the result… into the FTP upload folder.

Yes, FTP, classic, unencrypted FTP. You have no idea how annoying it was to find a way to set this up in TrueNAS in such a way that the Brother scanner could write to it — the alternative would have been allowing SMB1 connections just for the drop folder, and I didn’t feel like doing that. It’s a working solution for the time being, but I would be lying if I didn’t say I’d love to find myself an AN335W, which should have support for modern protocols and a lot more presets than the two I can currently select from. Maybe this year or next.

For those documents, deterministic extraction is impossible, so I made sure the service first checks, through the creator software metadata, whether the document is a scan, and in that case doesn’t bother trying to process the file at all, instead putting it into the pile of scanned documents I go through every so often to sort.
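That check can be a simple heuristic over the creator/producer metadata strings. Here is a sketch with example marker strings — the actual list my tool uses differs, and these markers are only illustrative of what scanner firmware tends to write:

```python
# Decide from PDF creator/producer metadata whether a document is a raw
# scan (and thus not worth attempting deterministic extraction on).
# The marker strings below are examples, not an exhaustive list.
SCANNER_MARKERS = ("brother", "scansnap", "naps2", "scanner")

def looks_like_scan(creator, producer=""):
    """Return True if the metadata suggests scanner-produced output."""
    haystack = f"{creator or ''} {producer or ''}".lower()
    return any(marker in haystack for marker in SCANNER_MARKERS)
```

Anything flagged this way skips the renamer entirely and lands in the manual-sorting pile.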

And that’s pretty much the state I’m at right now — Paperless NGX has proven itself to be more than a decent document management system. While it does have some issues here and there, particularly its dependency on Ghostscript, which makes it unable to process HMRC’s self-assessment statements (why? I don’t know!), it has plenty of features for organization (including a great integration with Tesseract OCR, which I believe includes unpaper, ironically), and a good set of extension points through its API and Workflows. Had I had this available when I wrote my old scan2pdf tool, I would have 100% wanted to integrate with it.

What does the future hold for my integrations? Almost certainly some Computer Vision model for document classification. While Paperless NGX attempts to extract document dates and learn document types, correspondents, and tags, these mechanisms appear rudimentary and based on the extracted OCR text — in my experience it’s very rare that they match. But I’m fairly sure that a modern Computer Vision approach (which would now be labelled “AI”, even though it’s not an LLM and quite unrelated to one) could be directed at extracting more reliable information.

The questions are: how much refinement would that need, and would I be able to implement it myself? I can tell you already that for the latter the answer is “no” — at least not without an “AI” (sigh) assistant, as even the amount of theory needed to understand how it works is beyond my current skills, and I have enough things to work on and worry about that I wouldn’t be able to learn it. So this is likely to be one of those tasks I’ll throw at Claude Code or something along those lines, and see how far it takes me — if it gets me something usable, yay me; if not, well, I’m no worse off than before trying (setting aside the subscription money, which I’m simply writing down as a “cost of doing business”, or more precisely, the cost of wanting to have a career).
