7 thoughts on “PDFs and Metadata”

  1. This makes me think about something: while OCR cannot correctly scan an entire document and reproduce the same document as output, it might still be nice to use OCR to scan documents and keep only the plain-text output as additional data for the original document. It could be possible to scan a paper document, get the file as PDF (or TIFF or whatever), run OCR on it to extract words, and add all of this to a metadata database. In the end, you could search for words in the document, and it would return the PDF file(s) containing the word.


  2. KnowledgeTree can store and index your PDFs (and other doc formats too): http://www.knowledgetree.com/ I use it to index all my PDF/CHM files (I have a lot of them, mostly IEEE/ISO specs) and it is working well; if you are interested, I have an updated ebuild somewhere too. P.S.: the RSS feeds are not updated.


  3. Obvious question: have you let e.g. tesseract run over the data before creating the PDFs (i.e. do all PDFs have text)? Next, beagle may be worth another look. First, you can configure it not to look all over the place. Second, you can get rid of its daemon and use only its static indexing tool to generate the index manually whenever you feel like it. Lastly, thanks to e.g. ionice, I guess you can make it behave much better than in the past. That said, I don’t use it; I just use grep (or, for PDFs, probably a quick and dirty hack combining find, ps2ascii and grep; I haven’t had the need for that so far).


  4. Nope, I didn’t think about running OCR; I need the original image because most of them are bills, invoices and other documents that need to be kept in their original form… although having an invisible index of the content might have been a good idea, I guess. A static index for beagle looks like a good idea; I’ll have to try it out, thanks!


  5. A Linux program, gscan2pdf, is quite nice, as it does all the OCR for you (using tesseract) and puts that into the PDF. I have begun tagging PDFs using PDF metadata. That works great, and Beagle seems to effortlessly index the data. So it’s a very quick way to pull up the relevant scanned documents. I’ve tried using knowledgetree and I really like the idea of it. But it just doesn’t seem to actually index the PDF tags :-( so searching for a given tag brings up nothing.
     Tom
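
The idea from the first comment (keep the scanned image as-is, store the OCR words in a separate metadata database, then search by word to get back the matching files) can be sketched roughly as below. This is only a minimal illustration, not what any of the tools mentioned here do internally: it assumes the OCR text has already been extracted by some other program (tesseract, for instance), and the function names, table layout and file paths are all made up for the example.

```python
import re
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a word -> file index in SQLite.

    The table layout is an assumption for this sketch.
    """
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        " word TEXT NOT NULL,"
        " file TEXT NOT NULL,"
        " PRIMARY KEY (word, file))"
    )
    return db


def add_document(db, file, ocr_text):
    """Record every distinct word of the OCR output for one scanned file."""
    # Lowercase so that searches are case-insensitive.
    words = set(re.findall(r"\w+", ocr_text.lower()))
    db.executemany(
        "INSERT OR IGNORE INTO words (word, file) VALUES (?, ?)",
        [(word, file) for word in words],
    )
    db.commit()


def search(db, word):
    """Return the files whose OCR text contains the given word, sorted by name."""
    rows = db.execute(
        "SELECT file FROM words WHERE word = ? ORDER BY file",
        (word.lower(),),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = open_index()
    # Hypothetical file names and OCR output, for illustration only.
    add_document(db, "scans/invoice-001.pdf", "Invoice total 120 EUR")
    add_document(db, "scans/bill-002.pdf", "Electricity bill, total due")
    print(search(db, "total"))
```

The original PDF is never modified here, which fits the requirement in comment 4 that bills and invoices stay untouched; only the side database grows. Since OCR output is imperfect, exact word matching will miss misrecognized words, so a real setup would probably want something more forgiving.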

