Shame Cube, or how I leaked my own credit card number

This is the story of how I ended up calling my bank at 11pm on a Sunday night to ask them to cancel my credit card. But it started with a completely different problem: I thought I had found a bug in some PDF library.

I asked Hanno and Ange, since they both have a lot more experience with PDF as a format than I do (I have nearly zero). I expected the string to be complete garbage, coming either from random parts of the file or from memory within the process that generated or read it, and thought it would be completely inconsequential. As you have probably guessed from the spoiler in both the title of the post and the first paragraph, that was not the case. Instead, that string is a representation of my credit card number.

After a few hours, having worked on other tasks and gone back and forth with various PDFs, including finding a possibly misconfigured AGPL library in my bank’s backend (worthy of another blog post), I realized that Okular does not actually show a title for this PDF, which suggested a bug in Dolphin (the Plasma file manager). Poppler’s pdfinfo also didn’t show any title at all, which suggested the problem was in a different part of the code. Since the problem was happening with my credit card statements, and the credit card statements include the full 16-digit PAN, I didn’t want to just file a bug attaching a sample, so instead I started asking around for help to figure out which part of the code was involved.

Albert Astals Cid sent me in the right direction by telling me the low-level implementation was coming from KFileMetadata, and that quickly pointed me at this interesting piece of heuristics, which is designed to guess the title of a document by looking at the first page. The code is quite convoluted, so at first I couldn’t just rule out an uninitialized memory access, but I also couldn’t figure out where one would come from, so I decided to copy the code into a standalone executable to play around with it. The good news was that it gave me the exact same answer, so it was not uninitialized memory. Instead, the parser was mis-reading something in the file; since the result was stable, it was unlikely to be a security issue, just sub-optimal code.

As there is no current, updated tool for PDF that behaves like mkvinfo, that is, one that prints an element-by-element description of the content of the file, I decided to just play with the code to figure out how it decided what to use as the title. Printing out each of the candidate titles being evaluated showed that it was considering first my address, then part of the summary information, then this strange string. What was going on there?

The code is a bit difficult to follow, particularly for me at first, since I had no idea how PDF works to begin with. But the summary is that it goes through the textboxes of the first page (I already knew that PDF text is laid out in boxes), joining the text together if a box has markers to follow up. Each of these entries is stored in a map keyed by text height, together with a “watermark” of the biggest text size encountered during the loop. If, when looking at a textbox, its height is lower than the previous maximum, it gets discarded. At the end, the content of the first textbox at the biggest height is reported as the title.
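A minimal sketch of that heuristic as I understand it, to make the behaviour concrete (the function name and the `(text, height)` input shape are mine, not KFileMetadata’s, and the real code also joins continuation boxes):

```python
def guess_title(textboxes):
    """Approximation of the title heuristic described above: walk the
    first page's textboxes in order, track the tallest text seen so far
    (the "watermark"), and remember the first box seen at each height.
    `textboxes` is a list of (text, height) pairs in page order."""
    max_height = 0.0
    first_text_at_height = {}
    for text, height in textboxes:
        if height < max_height:
            continue  # shorter than the running maximum: discarded
        max_height = height
        # only the FIRST box at a given height is kept
        first_text_at_height.setdefault(height, text)
    return first_text_at_height.get(max_height)

# A statement-like first page: the barcode textbox is the tallest,
# so its garbage-looking content wins as the "title".
boxes = [("Mr. F. Example, 1 Some Street", 9.0),
         ("Statement summary", 11.0),
         ("barcode glyphs here", 28.0)]
print(guess_title(boxes))  # → 'barcode glyphs here'
```

This also shows why the barcode wins: a Code 128 barcode drawn as text is, by design, very tall, so it beats every legitimate heading on the page.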

Once I disabled the height check and always reported all the considered title textboxes, I noticed something interesting: the string that kept being reported was found together with a number of textboxes that are drawn on top of the bank giro credit slip — the Wikipedia page appears to talk only of the UK system; Ireland, as usual, appears to have kept its own version of the same system, and all credit card statements, and most bills, have a similar pre-printed “credit cheque” at the bottom, even when they are paid by direct debit. The cheque includes a very big barcode… and that’s where I started sweating a bit.

The reason for the sweat is that by then I had already guessed I had made a huge mistake sharing the string that Dolphin was showing me. The reference used to pay a credit card bill is universally the full 16-digit number (PAN). Indeed, the full number is printed on the cheque as the “An Post Ref” (An Post being the Irish postal service), and the account information (10 digits, excluding the 6-digit IIN) is printed at the bottom of the same. All of this is why I didn’t want to share the sample file, and why I always destroy the statements that arrive, in paper form, from the banks. At this point, the likelihood of the barcode containing the same information was seriously high.

My usual Barcode Scanner app for Android didn’t manage to understand the barcode though, which made things awkward. Instead I decided to confirm I was actually looking at the content of the barcode in an encoded form, using a very advanced PDF inspection tool: strings $file | grep Font. This brought up a reference to /BaseFont /Code128ARedA, and that was the confirmation I needed. Indeed, a quick search for that name brings you to a public domain font that implements Code 128 barcodes as a TrueType font. This is not uncommon; it’s the same method used by most label printers, including the Dymo I used to use for labelling computers.
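For the record, the same check can be done without strings and grep; here is a rough Python equivalent, which, like strings, only catches font names that sit uncompressed in the file (names inside compressed object streams won’t show up):

```python
import re

def embedded_font_names(path):
    """Scan a PDF's raw bytes for /BaseFont entries, roughly what
    `strings file.pdf | grep Font` showed above."""
    with open(path, "rb") as f:
        data = f.read()
    # a PDF name token ends at whitespace or a delimiter character
    return [m.decode("ascii", "replace")
            for m in re.findall(rb"/BaseFont\s*/([^\s/<>\[\]()]+)", data)]
```

Running this on the statement would list Code128ARedA among the fonts, which is exactly the giveaway described above.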

At that point, a quick comparison of the barcode I had in front of me with one generated through an online generator (only for the IIN, because I don’t want to leak the rest) confirmed I was looking at my credit card number, and that my tweet had just leaked it — in a bit of a strange encoding that may take some work to decode, but leaked nonetheless. I called Ulster Bank and got the card cancelled and replaced.

Which lessons can I learn from this experience? First of all, to consider credit card statements even more of a security risk than I ever imagined. It also gave me a practical instance of what Brian Krebs has been advocating for years regarding barcodes on boarding passes and the like. In particular, it looks like both Ulster Bank and Tesco Bank use the same software to generate their credit card statements (which is easy to tell is not the same system that generates the normal bank statements), developed by Fiserv (their name is in the Author field of the PDF), and they both rely on using the full card number for payment.

This is something I don’t really understand. In Italy, you only use the 16-digit number to pay the bank one-off by wire; the statements themselves never had more than the last five digits of the card. Except for the Italian American Express, but that does not surprise me too much, as they manage it from London as well.

I’m now looking at how I can improve the title guessing for PDFs in the KFileMetadata library — although I’m warming up to the idea of just sending a patch that deletes that part of the code altogether: if the file has no title, no title is displayed. The simplest solutions are, usually, the best.

Help request: extending Finance::Quote

It’s not very common for me to explicitly ask for help with writing new software, but since this is something I have no experience with, in a language I don’t know, and not mission-critical for any of my jobs, I don’t really feel like working on it myself.

Since right now I not only have a freelancing, registered job, but also have to take care of most, if not all, of the house expenses, I’ve started keeping my money in check through GnuCash, as I said before. This makes it much easier to see how much (actually, how little) money I make, and what I can save away or spend on enjoying myself from time to time (to avoid burning out).

Now, there is one thing that bothers me: to save away the money I owe the government in taxes (both the VAT I have to pay and extra taxes), I subscribed to a security fund, paying in regularly (when I have the money available, of course!); unfortunately, I need to explicitly go look up the data on my bank’s website to know exactly how much money I have stashed in there at any given time.

GnuCash obviously has a way to solve this problem: the Finance::Quote Perl module, which fetches data from a longish list of websites, mostly through scraping. Let’s not even start on the chances that the websites have changed their structure in the months since the 1.17 release of the module (hint: at least one has, since I tried it manually and it only gets a 404 error); but, worse, Yahoo, while accepting the ISIN of the fund, does not give me any data for the current value of the share.

Now, the fund is managed by Pioneer Investments, and they do provide the data, via a very simple, ISIN-based URL! Unfortunately, they provide it only… in PDF. This does not seem too bad: the data is available in text form, since pdftotext extracts it properly, and it’s clearly marked by a fixed string on the previous line; on the other hand, I have no idea how one would scrape a PDF, especially in Perl, and even worse from within Finance::Quote!
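If anyone wants a starting point, this is how I’d naively parse the pdftotext output; a Python sketch (the Perl would be analogous), assuming the value sits on the first non-empty line after the fixed “Valore quota” label and uses Italian number formatting:

```python
def parse_valore_quota(text):
    """Extract the share value that follows the 'Valore quota' label
    in a pdftotext dump. Assumes an Italian decimal comma (and dots
    as thousands separators)."""
    lines = [line.strip() for line in text.splitlines()]
    idx = lines.index("Valore quota")  # raises ValueError if absent
    for line in lines[idx + 1:]:
        if line:  # first non-empty line after the label
            return float(line.replace(".", "").replace(",", "."))
    raise ValueError("no value after 'Valore quota'")
```

One would feed it the output of pdftotext run with - as the output file, e.g. via a pipe from the command shown below.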

If somebody feels like helping me out, the URL for the PDF file with the data is below, and the grep command shows what to look for in the PDF’s text. If you can help me out with this I’ll be very glad. Thanks!

# wget '…&isin=IT0000388204'
# pdftotext pioneer_monetario_euro_a.pdf - | grep 'Valore quota' -A 2
Valore quota


I killed enough trees…

Pathetic as it is, this Saturday evening was spent, by me at least, cleaning up old and new paper. And that’s the kind of trees I’m talking about.

Indeed, since I’m now self-employed I need to keep a paper trail of all the invoices sent and received, all the contracts, all the shipment bills, and so on. While I prefer to handle everything paperless, and thus I scan everything I get (invoices, contracts, shipment bills, …), I have to keep the paper trail for my accountant, at least for now. This also means printing stuff that I otherwise wouldn’t be printing (!), like Apple’s invoices. I was hoping to avoid that, but it turns out my accountant wants the paper.

Interestingly enough, printing from within Firefox here on Linux is a bit of a problem: it sets itself to use Letter, even though my /etc/papersize is properly set to a4 and LC_PAPER is set to it_IT (which is, obviously, A4). It already baffles me that you need libpaper at all, when the locale settings would already support discerning between different paper sizes; but the fact that Firefox also defaults to Letter (which, as far as I know, is basically only used in the US) without an option to change it (yes, I already tried about:config; no change) is definitely stupid.

Luckily, most of the references I’ve been using lately are available in PDF, and thanks to the Sony Reader I don’t have to print them out. What I decided to cut lately, as well, is CDs: most stuff I can get easily on Apple’s iTunes Store (yes, I know it’s not available on Linux, but the music is not DRM’d, it’s in a good format, and it’s not overly expensive); too bad they don’t have an (even more expensive) ALAC store, or I would also be buying my metal music there (AAC does no good to metal).

Games aren’t as easy: I already don’t have space on the PS3, and I bought just a couple from the PlayStation Network store; nor do I have space on the PSP. Additionally, games downloaded with an Italian account cost twice what I can get from Amazon. Sony, if you’re reading, this is the time to fix this! Especially with the PSP Go coming: I don’t think it’s going to sell well among game enthusiasts, and I’m quite sure that those who do get it will probably hate the extra-high prices.

Anyway, since I’m avoiding buying CDs and going with the iTunes Store instead, and I can no longer accept direct donations, you can now also consider their gift cards; they are certainly accepted…

Documentation: remake versus download

One of the things that I like a lot about Gentoo is that you can easily install the whole set of documentation for almost every library out there, whether API references, tutorials, or the like.

This, unfortunately, comes with a price: most of the time you need the time and the tools to build this documentation. And sometimes the tools you need to install are almost overkill for the library that uses them. While most software with generated man pages ships them prebuilt in the tarball (thanks to automake, the whole thing can be done quite neatly), there are packages that don’t, either because they have no clean way to tar them up at release time, or because they are not released at all (ruby-elf is guilty of this too, since it’s only available in the repository for now).

For those, the solution usually is to bring in some extra packages: for the ruby-elf case above, the docbook-ns stylesheets that are used to produce the final man page from the DocBook 5 sources. But other packages might not use the same method: there are quite a lot of different ways to build man pages — perl scripts, compiled tools, custom XML formats, you name it.

And this is just for man pages, which are usually updated explicitly by their authors: API documentation, which is usually extracted directly from the source code, is rarely generated when creating the final release distribution. This goes for C/C++ libraries that use Doxygen or gtk-doc, Java packages that use JavaDoc, and Ruby extensions that use RDoc (indeed, the original idea for this post came to me while I was working on the ruby-ng eclass and noticed that almost all the Ruby extensions I packaged required me to rebuild the API documentation at build time).

Now, when it comes to API documentation, it’s obvious we don’t really want to “waste” time generating it for non-developers: they would never read it in the first place. This is why we have USE flags, after all. But sometimes even this does not seem to be enough control. The first problem is: which format do we use for the documentation? For those of you who don’t know it, Doxygen can generate documentation in many forms, including but not limited to HTML, PDF (through LaTeX), and Microsoft Compressed HTML (CHM). There are packages that build all available formats; some autodiscover the available tools, others try to use the tools even when they are not installed on the system.

We should probably do some kind of selection, but it has to be said it’s not obvious, especially when upstream, while adding proper targets to rebuild the documentation, designs them only for their own usage: to generate and publish the resulting documentation on their site. Since we install the documentation for the system’s users, we should probably focus on what can be displayed on screen, which steers us toward installing the HTML files, because they are browsable and easy to look at. But I’m sure there are people who would rather have the PDFs at hand, so if we focus on just HTML those people will complain. Not that at this point I care about a 100% experience; I’d rather provide a good experience for 90%, maybe 95%, of people.

I do remember there are quite a few packages that try to use LaTeX to rebuild documentation, because there have been quite a few sandbox problems with the font cache being regenerated during the Portage build. Unfortunately, I don’t have any numbers at hand, because – silly me – the tinderbox strips documentation away to save space (maybe I should remove that quirk; the RAID1 volumes have quite a bit of free space by now). I can speak, recently, for Ragel, which I’ve moved away from rebuilding the documentation for: I was inspired first by the FreeBSD port, which downloaded the pre-built PDF version from Ragel’s site (I did the same for version 6.4, under the doc USE flag), and then sidestepped the issue altogether, since upstream now ships the PDF in the source tarball.

But this also bugs me as upstream for a few projects: what is best for my users? Online API documentation is useful when you don’t want to rebuild the documentation locally, and can be indexed by search engines much more easily, but is that enough? What about offline users? Users with restricted bandwidth? Servers with restricted bandwidth? Of course offline users can regenerate the documentation, but is that the best option? Should the API documentation be shipped within the source tarball? That could make the tarball much, much bigger than just the sources; it can even double in size.

Downloadable documentation, Python-style, looks to me like one of the best options. You get the source tarball and the documentation tarball; you install the latter if the doc USE flag is enabled. But how do you generate them? I guess adding one extra target to the Makefiles (or the equivalent for your build system) may very well be an option; I’ll probably work on that for lscube, with a ready recipe showing how to make the tarball during make dist (and of course document it somewhere easier to reach than my blog).
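Something along these lines is what I have in mind; a hypothetical sketch, where PACKAGE, VERSION and the doc/html output directory are placeholders rather than lscube’s actual layout:

```make
# Build the API docs and bundle them in a tarball that can be
# published next to the source tarball produced by `make dist`.
doc-dist: doc
	tar czf $(PACKAGE)-$(VERSION)-apidocs.tar.gz -C doc html

doc:
	doxygen Doxyfile

.PHONY: doc doc-dist
```

The point is that the documentation tarball is versioned together with the sources, so an ebuild can fetch both with the same predictable naming scheme.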

The only problem with this is that it does not take advantage of improved generation by newer versions of the software: if, for instance, Doxygen, JavaDoc, RDoc and the like one day finally agree on a single, compatible XML/XHTML format for documentation, to be accessed with an application integrating a browser and an index system, pre-generated tarballs would not benefit from it. (I’d like to note that both Apple and Microsoft provide applications that seem to be doing just that; I haven’t used them long enough to tell how well they work, but that is what they are designed for.)

But at least let this be a start for a discussion: should we really rebuild PDF documentation when installing packages on Gentoo, even under the doc USE flag, or should we stick with more display-oriented formats?

I dream the paperless office

And I know pretty well it’s something almost impossible to have; yet I’d like to have it, because I’m drowning in an ocean of paper right now. And paperwork as well.

While I have the Sony Reader to avoid having to deal with tons of dead tree books (although I do have quite a bit still, lots of which are still being consulted), I didn’t try before to clean up my archive of receipts, packaging slips, and stuff like that.

The time has come, now that I have to keep a fuller, cleaner archive of invoices sent and received for my new activity as a self-employed “consultant”; I decided to scan and archive away (in a plastic box in my garage, that is) all the job papers I had from before, as well as all my medical records and the remaining parts of the archive. The idea is that by starting anew I can actually keep some kind of accounting of what I receive and spend, both for work and for pleasure. Together with the fact that it’s less stuff to carry around with me, this makes two things that bring me nearer to actually moving out of home.

Sunday and Monday I spent about eight hours a day scanning and organising documents, trashing all the stuff I’m not interested in keeping an original of (that is, stuff I’m glad to have archived, but that wouldn’t matter much if I lost it), and putting away in the plastic box the important stuff (job and medical records, receipts for stuff still under warranty, etc.). I think I got through around 400 pages, on a flatbed scanner without a document feeder, assigning a name to each, and switching between 150 and 300 dpi, and between colour, grayscale and lineart scans.

I guess I’ll try to keep my archive more up to date from now on, by scanning everything as it arrives instead of waiting for it to pile up for twelve years (yes, I have receipts dating back twelve years, like the one for my first computer, a Pentium 133) and then trying to crunch through it in a few days. My wrist aches like it never did before, from the sheer number of sheets put on and removed from the scanner (I sincerely hope it’s not going to give up on me; that would be a bad thing).

Now I’m looking for a way to archive this stuff in a quick and searchable way; file-based structures don’t work that well, and tagging would work better, but I have no idea what to use for that. If anybody has a free-software-based solution for archiving (being queryable over the network is a bonus; working with Spotlight on Mac OS X is a huge bonus), I’d be glad to hear about it.

I’m also going to try out some accounting software; I’ve heard good things about GnuCash but never tried it before, so I’m merging it right now. For now I don’t have enough invoices to send out to justify writing my own software, but if there is something out there customisable enough, I’d be glad to bite the bullet and get to use it. Spending my free time writing software I need for work is not my ideal way to solve the problem.

Up to now I have worked very low profile, without having to invoice or keep records; luckily I have an accountant who can tell me what to do, but there are personal matters, including personal debts, credit cards and other expenses, that I finally want to take a good look at, so that I can extinguish them as soon as possible, and then start putting something away to pay for a car and a place to move to. Not easy, I guess, but that’s what I hope to do.

More about Reader and PDFs

Okay, thanks to Jeff, who commented on my previous post, I finally got the SD card working on Linux. If you ever have this problem, enable CONFIG_SCSI_MULTI_LUN in the kernel. Tomorrow I’ll add a warning to the libprs500 ebuild for when it’s unset.

Tonight I didn’t have much time, but I’ve found that a 9 × 12 cm page size is just right: in LaTeX it produces a page that is perfect for reading on the Reader.
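For reference, this is the kind of preamble I mean; a minimal sketch using the geometry package (the 5 mm margin is just my guess at a comfortable value, tweak to taste):

```latex
\documentclass{article}
% 9cm x 12cm paper with small margins: fills the PRS-505 screen nicely
\usepackage[papersize={9cm,12cm},margin=5mm]{geometry}
\begin{document}
Text laid out for the Reader.
\end{document}
```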

Unfortunately texinfo is not as easy as LaTeX: even if I set the size of the page, it only reduces the size of the text inside it, and I can’t find how to reduce the actual page size of the PDF. While the zoom function of the Reader makes this workable, you have to repeat the zoom for every page, and it gets boring. I’d very much like to crop the PDF file.

Unfortunately, the only tool I found that can crop PDF files is ImageMagick’s convert. But convert acts on images, and that causes two problems: first, it takes a huge amount of memory (gs converts a 15 MB PDF file into a 400 MB PGM file, and back again); second, it creates an image-only PDF file that, well, let’s just say that at 90×120 pixels (okay, I got the unit wrong, it happens!) the file weighs 24 MB, and remember, I started from 15 MB.

Pino (an Okular developer) suggested I try pdftk, but as far as I can see from the documentation available online, it does not allow me to crop pages. I’ve now found an interesting script that adds cropping data to PostScript files; if Ghostscript honours it, that would let me convert the ps back to a cropped PDF. Tomorrow I’ll have to try.

And yes, tomorrow I’ll see about providing a few more photos of a book showing up on the Reader: both a standard PDF and an ad-hoc generated copy of “The Not So Short Introduction to LaTeX2e”, most likely.

On a totally different note, I was watching movie trailers on my Apple TV just now and… yet another movie that makes computers capable of anything: “Untraceable”. And people complain that CSI is unrealistic.

My impressions about the Reader

So I finally received the Sony Reader I ordered almost a month ago. Actually, the shipment itself was quite fast: sent on January 20, received today.

First impressions of the hardware are positive: it’s a bit heavier than I would have thought, but it’s big enough, and the eInk display is really good. Quite a nice item. PRS-505 photos here.

The remaining half-problem is getting the software working. There is libprs500, which takes care of almost everything, but packaging it is becoming a bit of a challenge.

While the author is a Gentoo user, and very helpful in improving the situation, I’ve now run into a bit of a problem with xdg-utils. The post-installation script of libprs500 uses xdg-utils commands to install icons, desktop files and similar. Unfortunately, xdg-utils is… far from perfect: 81 open bugs on FreeDesktop’s Bugzilla, and a lot of gray areas.

First off, xdg-utils doesn’t support DESTDIR (nor does the postinst script, but that I fixed); this means it tries to write directly to the filesystem, which is not good at all for distributions, not only Gentoo. I can work around some of these problems by setting XDG_DATA_DIRS to a modified path, forcing it to use the correct DESTDIR.

Even worse, xdg-mime and xdg-desktop-menu don’t even use the XDG_* variables: they install data for GNOME and KDE separately, and for KDE they use, respectively, kde-config’s output and nothing at all, just a hardcoded path. I was able to fool xdg-mime into working as I need by faking a kde-config script, but for xdg-desktop-menu there is nothing I can do. Besides supporting DESTDIR, I could have fooled them well enough if they at least used the KDEDIR/KDEDIRS variables, as I suppose they should, but they don’t.

Hopefully I’ll be able to get a modified xdg-utils soon so that I can actually complete the ebuild for libprs500, and then add it to portage.

I’m still having one problem with the connection of the PRS-505: the SD card in the slot is not seen by Linux. The card works, because I can see it on OSX, but on Linux it does not appear to be assigned a device at all. I suppose it should get a device like a 50-in-1 flash card reader does, but this does not currently happen, and I don’t know yet why.

Besides that, libprs500 is a nice frontend: it’s complete, not rough at all, and quite appealing. The only problem I have with the software itself is that it uses an SQLite db to store the books. No, not just the books’ metadata: the books themselves; it saves the whole file into the database. As you can guess, this is far from optimal; considering also my pet peeves with SQLite, I’d very much like to try steering upstream toward something different, especially because I want to load something like 300 MB of books onto the Reader.

As for what kind of books to load on it: A4 books work nicely when used landscape and zoomed in; tomorrow night I’ll experiment a bit with paper sizes for texinfo manuals, so that I can generate the GDB, Make and Elisp manuals in a size suitable for the Reader itself. The conversion to the Reader’s own format is not that good when you have complex PDFs from texinfo or LaTeX.

There is one not-so-small problem with O’Reilly’s open books, like LDD3: the PDF has printing guides around the page, and the zoom function of the Reader is thrown off by them (it removes the white borders, but on those books the white border is interrupted by the guides). To read them easily on the Reader, the trick would be to crop them; I should look into tools to handle that, there has to be something able to do it.

From what I’ve seen up to now, it was worth buying.

Looking for PDF books suppliers

So, after wondering about getting a Sony Reader, I actually ordered one today: on eBay (without laser engraving), as Sony’s shop doesn’t have it available anymore, and I’d rather do it sooner rather than later, as you never know what might come up when you plan too far ahead.

The main use will certainly be reading the common PDF reference documentation, as I have plenty of that, and I often end up either printing it or not using it at all. Give me a few months and the amount of paper I’d be saving will be worth the money I spent on the Reader ;)

But there are other books too: the Pragmatic Bookshelf sells its books as PDFs, and they have quite a few interesting titles, so that’s also quite a big improvement.

The only thing missing would be O’Reilly books (I don’t have many, but I’m interested in some from time to time, like the GNU Make book I linked before). Sure, I can live without them as PDFs, but if possible, I’d like that option too :) From what I can see, they don’t sell whole books as PDFs on the standard store; you can buy chapters at $4 each, but that’s quite too much for a whole book.

Luca told me to look for the subscription option, which I suppose is Safari Books Online; it sounds interesting, but considering its cost, I’d rather be sure first that the subscription is what I need. So the question here is for hoosgot: has anybody already got a Safari Books Online subscription? Are books in the library downloadable as PDF? Or are chapters downloadable one by one?

Thanks in advance for the info.

About my break

So, I said before that I wanted to take a one-week break and just relax: watch TV series, movies, Anime, read some books. There’s a huge ocean between what you hope for and what actually happens most of the time; this time I found a galaxy between them.

So, on Monday I was first woken up by my UPSes beeping, because the power company started having problems (as usual in Italy during summer); I slept about four hours, and I couldn’t sleep in the afternoon, as I was called in to have the new job I’m currently doing (data entry, sigh) explained to me. On Tuesday I was waiting to receive my AppleTV (with HDMI cable) – I needed something to give me back control of my laptop, and this was the most straightforward way to watch Anime and TV series on my TV – so I woke up early, but the courier (UPS) did not come until 16:00, just half an hour after I had gone to sleep in the afternoon; I slept about four hours that day too, and my data entry job was paused because of problems with the application I should have been using. On Wednesday I was woken up by my parents, who forgot I was sleeping (just three hours of sleep), and I got an urgent data entry job – a different one, albeit for the same company – to complete in just one day; then twelve hours of sleep. Yesterday the data entry job resumed, so up again with little sleep.

And let me say something about UPS. They are one of the most expensive express couriers; they previously had a perfect record with me, although one of their drivers once whined about having to come to my house (which is outside the city) too often. This time they screwed up quite a bit. When you order an AppleTV with an HDMI cable from the Apple Store, they send you two boxes, one for the AppleTV itself and one for the cable (probably to make the shipments easier for the warehouse to handle); the invoice is also printed and attached outside the box rather than inside. The AppleTV box came to my house on Tuesday afternoon as expected, but the HDMI cable wasn’t there. It was shipped, by mistake, to Madrid, Spain, and came the day after. According to Google Maps, there are 1,829 km between Mestre and Madrid, and the only things they have in common are the letters M and r in their names.

Anyway, the AppleTV is a nice gadget and works quite well, even if the Samsung TV is giving me a headache: the image coming over the HDMI cable at 1280x720p (or 1920x1080i if you prefer) is displayed at a slightly different resolution and not scaled. The result is that a border of the whole image is missing. This is probably not a problem for most users, but it is a problem for fansubbed Anime, where the subtitles can appear too close to the border. And I can’t find a way to contact Samsung Italy without calling them (and of course I haven’t had time to call them).

The data entry job is quite stressful: it’s taking a lot of my time and is making my break hell. In particular, the web application used to type in the data uses PDF forms rather than standard web forms; this wouldn’t be much of a problem if the designers of the PDF forms hadn’t used the wrong tab order for the inputs (what the heck were they thinking? were they drunk?), so I need to use Acrobat Reader 6.0, which does not support the tab order embedded in PDF files and creates its own (correct) order; if I use Adobe Reader 7 or 8, pressing Tab moves you around the page like crazy. And these people get paid a lot more than I am.