This is the story of how I ended up calling my bank at 11pm on a Sunday night to ask them to cancel my credit card. But it started with a complete different problem: I thought I found a bug in some PDF library.
I asked Hanno and Ange since they both have lots more experience with PDF as a format than me (I have nearly zero), as I expected this to be complete garbage either coming from random parts of the file or memory within the process that was generating or reading it, and thought it would be completely inconsequential. As you probably have guessed by the spoiler in both the title of the post and the first paragraph, it was not the case. Instead that string is a representation of my credit card number.
After a few hours, having worked on other tasks, and having just gone back and forth with various PDFs, including finding a possibly misconfigured AGPL library in my bank’s backend (worth of another blog post), I realized that Okular does not actually show a title for this PDF, which suggested a bug in Dolphin (the Plasma file manager). In particular Poppler’s pdfinfo
also didn’t show any title at all, which suggested there’s a problem with a different part of the code. Since the problem was happening with my credit card statements, and the credit card statements include the full 16-digits PAN, I didn’t want to just file a bug attaching a sample, so instead I started asking around for help to figure out which part of the code is involved.
Albert Astals Cid sent me the right direction by telling me the low-level implementation was coming from KFileMetadata, and that quickly pointed me at this interesting piece of heuristics which is designed to guess the title of a document by looking at the first page. The code is quite a bit convoluted, so I couldn’t at first just exclude uninitialized memory access, but I couldn’t figure out where it could be coming from, so I decided to copy the code into a single executable to play around with it. The good news was that it would give me the exact same answer, so it was not uninitialized memory. Instead, the parser was mis-reading something in the file, which by being stable meant it wasn’t likely a security issue, just sub-optimal code.
As there is no current, updated tool for PDF that behaves like mkvinfo
, that is print an element-by-element description of the content of a PDF file, I decided to just play with the code to figure out how it decided what to use as the title. Printing out each of the possible titles being evaluated showed it was considering first my address, then part of the summary information, then this strange string. What is going on there?
The code is a bit difficult to follow, particularly for me at first since I had no idea how PDF works to begin with. But the summary of it is that it goes through the textboxes (I knew already that PDF text is laid out in boxes) of the first page, joining together the text if the box has markers to follow up. Each of these entries is stored into a map of text heights, together with a “watermark” of the biggest text size encountered during this loop. If, when looking at a textbox, the height is lower than the previous maximum height, it gets discarded. At the end, the first biggest textbox content is reported as the title.
Once I disabled the height check and always reported all the considered title textboxes, I noticed something interesting: the string that kept being reported was found together with a number of textboxes that are drawn on top of the bank giro credit system — The Wikipedia page appears to talk only of the UK system. Ireland, as usual, appears to have kept their own version of the same system, and all the credit card statements, and most bills, will have a similar pre-printed “credit cheque” at the bottom. Even when they are direct-debited. The cheque includes a very big barcode… and that’s where I started sweating a bit.
The reason of the sweat is that by then I already guessed I made a huge mistake sharing the string that Dolphin was showing me. The reference to pay up a credit card is universally the full 16-digits number (PAN). Indeed the full number is printed on the cheque, and as the “An Post Ref” (An Post being the Irish postal system), and the account information (10-digits, excluding the 6-digits IIN) is printed on the bottom of the same. All of this is why I didn’t want to share the sample file, and why I always destroy the statements that arrive, in paper form, from the banks. At this point, the likeliness of the barcode containing the same information was seriously high.
My usual Barcode Scanner for Android didn’t manage to understand the barcode though, which made it awkward. Instead I decided to confirm I was actually looking at the content of the barcode in an encoded form with a very advanced PDF inspection tool: strings $file | grep Font
. This did bring up a reference to /BaseFont /Code128ARedA
. And that was the confirmation I needed. Indeed a quick search for that name brings you to a public domain font that implements Code 128 barcodes as a TrueType font. This is not uncommon, particularly as it’s the same method used by most label printers, including the Dymo I used to use for labelling computers.
At that point a quick comparison of the barcode I had in front of me with one generated through an online generator (but only for the IIN because I don’t want to leak it all), confirmed I was looking at my credit card number, and that my tweet just leaked it — in a bit of a strange encoding that may take some work to decode, but still leaked it. I called Ulster Bank and got the card cancelled and replaced.
Which lessons I can learn from this experience? First of all to consider credit card statements even more of a security risk than I ever imagine. It also gave me a practical instance of what Brian Krebs advocates for years regarding barcodes of boarding passes and similar. In particular it looks like both Ulster Bank and Tesco Bank use the same software to generate the credit card statements (which is easily told not to be the same system that generates the normal bank statements), which is developed by Fiserv (their name is in the Author field of the PDF), and they all rely on using the normal full card number for payment.
This is something I don’t really understand. In Italy, you only use the 16-digits number to pay the bank one-off by wire, and instead the statements never had more than the last five digits of the card. Except for the Italian American Express — but that does not surprise me too much as they manage it from London as well.
I’m now looking to see how I can improve on the guessing of the title for the PDFs in the KFileMetadata library — although I’m warming up to the idea of just sending a patch that delete that part of the code altogether, and if the file has no title, no title is displayed. The simplest solutions are, usually, the better.
Title guessing is a nice feature. Do you happen to know how often it gets it right? You can’t really do anything like that from a non-semantic data source like a stylistic PDF document. Non-semantic data sources can’t have nice things.
I honestly have no idea. But having thought of the semantics, I don’t see it as being “often enough”. You can see the code more clearly from my request https://phabricator.kde.org… (I should add it to the post later).In particular it discards title in metadata if it contains the word “Microsoft”, which appears silly, and if there are no spaces, which probably is okay for English but not for other languages. Much as the heuristics would be good if they worked, I can think of too many corner cases with them to be really valid.
> As there is no current, updated tool for PDF that behaves like mkvinfo, that is print an element-by-element description of the content of a PDF file
There is. They just tend to not be F nor OSS. For example, PDFlib’s TET comes to mind, which is a commercially sold product.
Heh, yeah maybe it’s missing an implied “FLOSS” there.
TBH nowadays there’s a few free options too. But most of them appeared after when this post was written (2017.)