It’s not very common for me to explicitly ask for help with writing new software, but since this is something I have no experience with, in a language I don’t know, and not mission-critical for any of my jobs, I don’t really feel like working on it myself.
Since right now I not only have a registered freelancing job but also have to take care of most, if not all, of the house expenses, I’ve started keeping my money in check through Gnucash, as I said before. This makes it much easier to see how much (actually, how little) money I make, and how much I can save away or spend on enjoying myself from time to time (to avoid burning out).
Now, there is one thing that bothers me: to save away the money that I owe the government as taxes (both VAT I have to pay, and extra taxes) I subscribed to a security fund, paying regularly (if I have the money available, of course!); unfortunately I need to explicitly go look up the data on my bank’s website to know exactly how much money I have stashed in there at any time.
Gnucash obviously has a way to solve this problem, by using the Finance::Quote Perl module to fetch the data from a longish series of websites, mostly through scraping. Let’s not even start to talk about the chances that the websites have changed their structure in the months since the 1.17 release of the module (hint: at least one has, since I tried it out manually and it only gets a 404 error), but at least Yahoo, while accepting the ISIN of the fund, does not give me any data for the current value of the share.
Now, the fund is managed by Pioneer Investments, and they do provide the data, via a very simple, ISIN-based URL! Unfortunately, they provide that data only… in PDF. This does not seem to be too bad: the data is available in text form, because pdftotext extracts it properly, and it’s clearly marked by a fixed string on the line before it; on the other hand, I have no idea how it would be possible to scrape a PDF, especially in Perl, and even worse within
If somebody feels like helping me out, the URL for the PDF file with the data is the following, and the grep command will tell you what to look for in the PDF’s text. If you can help me out with this I’ll be very glad. Thanks!
# wget 'http://www.pioneerinvestments.it/it/webservice/pdfDispatcher.jhtml?doccode=ilpunto&from=02008FON&isin=IT0000388204'
# pdftotext pioneer_monetario_euro_a.pdf* - | grep 'Valore quota' -A 2
Valore quota
13,158
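For anyone who wants to experiment before touching Perl, the lookup that the grep command performs can be sketched in Python. This is only an illustration under assumptions, not a Finance::Quote module: the names `parse_quote` and `pdf_to_text` are made up for this sketch, the value is assumed to appear within two lines of the "Valore quota" label (as in the output above), and numbers are assumed to use the Italian format with a decimal comma.

```python
import re
import subprocess
from decimal import Decimal

MARKER = "Valore quota"                # fixed label that precedes the share value
NUMBER = re.compile(r"\d[\d.]*,\d+")   # Italian format, e.g. 13,158 or 1.234,56


def pdf_to_text(path: str) -> str:
    """Run the pdftotext command-line tool and return the text it extracts."""
    result = subprocess.run(
        ["pdftotext", path, "-"],      # "-" sends the text to standard output
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_quote(text: str) -> Decimal:
    """Return the first number on the marker line or the two lines after it."""
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        if not line.startswith(MARKER):
            continue
        # Mirror `grep -A 2`: look at the rest of this line, then two more.
        candidates = [line[len(MARKER):]] + lines[i + 1:i + 3]
        for candidate in candidates:
            match = NUMBER.search(candidate)
            if match:
                # Drop thousands separators, turn the decimal comma into a dot.
                plain = match.group().replace(".", "").replace(",", ".")
                return Decimal(plain)
    raise ValueError(f"no value found after {MARKER!r}")
```

With the PDF already downloaded, `parse_quote(pdf_to_text("pioneer_monetario_euro_a.pdf"))` would be the whole lookup, provided the file was saved under that name. Using `Decimal` instead of `float` keeps the quote exactly as printed, which seems the safer choice for accounting data.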