This Time Self-Hosted

PDFs and Metadata

You might remember I was thinking about archiving data a few months ago. So far, I have only got as far as scanning documents to PDF (trying to keep reasonably current with the inbound flow of paper), so that I can access them more easily and also get rid of the large amount of useless paper around the home.

The experiment seems to be working out decently well so far: the amount of paper around the house has started to shrink, and at the same time I’ve been able to archive most of my stuff in a decent way just by using proper paths. Unfortunately, things are now starting to get more complex.

What I need now is some method to arbitrarily tag PDF files. The archive is all in PDF: while Stuart noted that TIFF would also be a decent way to store the data, TIFF files sometimes don’t display correctly on OS X, and since I mix operating systems I needed something that works on both. Obviously, I also need an easy way to get the data back out by searching for those tags.

I have been told that XMP from Adobe should do what I need. I remembered the technology name, and I’m pretty sure that the way it was designed allows for what I’m looking for. The problem, obviously, is whether there is software that lets me write the kind of metadata I need; I’m not too keen on writing my own right now.
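To make the idea concrete, here is a minimal sketch of what such metadata could look like: an XMP packet carrying free-form tags in the Dublin Core `dc:subject` property and a description in `dc:description`, built with nothing but the Python standard library. The function name `build_xmp` and the sample tags are made up for illustration, and the packet is only produced as a string (for instance to be saved as a sidecar file next to the PDF); embedding it into the PDF itself, and whether a given search tool actually reads it, are not covered here.

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP for Dublin Core metadata.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, name):
    """Return the ElementTree-qualified name for prefix:name."""
    return "{%s}%s" % (NS[prefix], name)


def build_xmp(tags, description):
    """Build an XMP packet (hypothetical helper) as a string.

    tags go into dc:subject (an unordered rdf:Bag),
    description goes into dc:description (an rdf:Alt with a default language).
    """
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    bag = ET.SubElement(ET.SubElement(desc, q("dc", "subject")), q("rdf", "Bag"))
    for tag in tags:
        ET.SubElement(bag, q("rdf", "li")).text = tag

    alt = ET.SubElement(ET.SubElement(desc, q("dc", "description")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = description

    return ET.tostring(xmpmeta, encoding="unicode")


if __name__ == "__main__":
    # Sample values only; the tags are whatever you want to search for later.
    print(build_xmp(["H3G", "bill", "2009-07"], "H3G cellphone bill, July 2009"))
```

Keeping the tags in `dc:subject` rather than a custom property is a deliberate choice in this sketch: it is the standard place for keywords, so generic tools have the best chance of picking them up.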

There is also the other problem of finding the data. I remember that some years ago Beagle could be used for on-disk document search. I also remember, though, that it was tremendously heavy, eating up lots of CPU and RAM, and only partly because of Mono; the rest was quite easily Beagle itself. Does anybody know whether it has improved, or can anybody suggest alternative software that does something similar? I tried merging Tracker, but it doesn’t seem interested in indexing anything on my system, and I have no idea why…

Ideally, I’d like something where searching for “H3G July 2009” finds the correct PDF with the cellphone bill for July 2009, and searching for “Amazon Office 2007” finds the invoice for Office 2007 from Amazon UK. I’m fine with writing my own descriptions for the files to get the right one.
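The behaviour I have in mind can be sketched in a few lines of Python: a file matches when every word of the query appears in its hand-written description. The catalog below, with its paths and descriptions, is entirely made up for illustration, and where the descriptions really come from (XMP, a sidecar file, a database) is left open.

```python
def matches(query, description):
    """True if every word of the query appears in the description (case-insensitive)."""
    terms = query.lower().split()
    words = set(description.lower().split())
    return all(term in words for term in terms)


def search(query, catalog):
    """Return the sorted paths whose description contains all query words."""
    return sorted(path for path, desc in catalog.items() if matches(query, desc))


if __name__ == "__main__":
    # Hypothetical catalog: path -> description written by hand.
    catalog = {
        "bills/h3g-2009-07.pdf": "H3G cellphone bill July 2009",
        "bills/h3g-2009-08.pdf": "H3G cellphone bill August 2009",
        "invoices/amazon-office.pdf": "Amazon UK invoice Office 2007",
    }
    print(search("H3G July 2009", catalog))
    print(search("Amazon Office 2007", catalog))
```

This is whole-word matching only, with no stemming or ranking; a real indexer would do more, but it is enough to show that short, consistent descriptions would already make the two example queries unambiguous.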

If somebody has suggestions, they are definitely welcome. Thanks!

Comments 7
  1. This makes me think about something: While OCR is not able to correctly scan an entire document and produce the same document as output, it may be just nice to use OCR to scan documents and only keep the plain text output as additional data for the original document. It could be possible to scan a paper document, get the file as PDF (or TIFF or whatever), run OCR on it to extract words, and add all of this in a metadata database. In the end, you could search for words in the document, and it would give the PDF file(s) with the word in it.

  2. KnowledgeTree can store and index your PDFs (and other doc formats too): http://www.knowledgetree.com/ I use it to index all my PDF/CHM files (I have a lot of them, mostly IEEE/ISO specs) and it is working well; if you are interested I have an updated ebuild too somewhere. p.s.: RSS feeds are not updated

  3. Obvious question: have you let e.g. tesseract run over the data before creating the PDFs (i.e. do all PDFs have text)? Next, beagle may be worth another look. First, you can configure it to not look all over the place. Second, you can get rid of its daemon and only use its static indexing tool to generate the index manually whenever you feel like it. Lastly, thanks to e.g. ionice I guess you can make it behave much better than in the past. That said, I don’t use it, I just use grep (or for PDFs a quick and dirty hack combining find, ps2ascii and grep probably, haven’t had the need for that so far).

  4. Nope, I didn’t think about running OCR; I need the image original because most of them are bills, invoices and other documents that need to be kept original… although having an invisible index of the content might have been a good idea I guess. A static index for beagle looks like a good idea, I’ll have to try it out, thanks!

  5. A Linux program, gscan2pdf, is quite nice, as it does all the OCR for you (using tesseract) and puts that into the PDF. I have begun tagging PDFs using PDF metadata. That works great, and Beagle seems to effortlessly index the data. So it’s a very quick way to pull up the relevant scanned documents. I’ve tried using knowledgetree and I really like the idea of it. But it just doesn’t seem to actually index the PDF tags 🙁 so searching for a given tag brings up nothing. Tom
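
The idea from the first comment, keeping the OCR output as searchable data next to the untouched scan, can be sketched as a small inverted index. This is only an illustration under stated assumptions: the OCR text is taken as already extracted (by tesseract or anything else), the index lives in memory rather than in a metadata database, and the function names and sample files are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(ocr_text_by_file):
    """Map each word to the set of files whose OCR text contains it."""
    index = defaultdict(set)
    for path, text in ocr_text_by_file.items():
        for word in tokenize(text):
            index[word].add(path)
    return index


def lookup(index, query):
    """Return the files that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


if __name__ == "__main__":
    # Hypothetical OCR output, keyed by the scanned file it came from.
    ocr_text_by_file = {
        "bills/h3g-2009-07.pdf": "H3G invoice for July 2009, total due",
        "invoices/amazon-office.pdf": "Amazon.co.uk order: Microsoft Office 2007",
    }
    index = build_index(ocr_text_by_file)
    print(lookup(index, "office 2007"))
```

Because only the extracted words are stored, the original image is never modified, which matters for documents that have to be kept as-is.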
