It’s not very common for me to explicitly ask for help with writing new software, but since this is something I have no experience with, in a language I don’t know, and not mission-critical for any of my jobs, I don’t really feel like working on this myself.
Since right now I not only have a freelancing, registered job, but also have to take care of most, if not all, of the house expenses, I’ve started keeping my money in check through Gnucash, as I said before. This makes it much easier to see how much (actually, how little) money I make, and how much of it I can save or spend on enjoying myself from time to time (to avoid burning out).
Now, there is one thing that bothers me: to set aside the money that I owe the government as taxes (both the VAT I have to pay and other taxes) I subscribed to a security fund, paying into it regularly (if I have the money available, of course!). Unfortunately, I need to explicitly go and look up the data on my bank’s website to know exactly how much money I have stashed in there at any time.
Gnucash obviously has a way to solve this problem: it uses the Finance::Quote Perl module to fetch the data from a longish series of websites, mostly through scraping. Let’s not even start talking about the chances that those websites have changed their structure in the months since the 1.17 release of the module (hint: at least one has, since I tried it out manually and it only gets a 404 error); but Yahoo, at least, accepts the ISIN of the fund and yet does not give me any data for the current value of the share.
Now, the fund is managed by Pioneer Investments, and they do provide the data, via a very simple, ISIN-based URL! Unfortunately, they provide that data only… in PDF. This does not seem too bad: the data is available in text form, since pdftotext extracts it properly, and it is clearly marked, because the line before it is a fixed string. On the other hand, I have no idea how it would be possible to scrape a PDF, especially in Perl, and even worse within Finance::Quote!
If somebody feels like helping me out, the URL for the PDF file with the data is the following, and the grep command will tell you what to look for in the PDF’s text. If you can help me out with this I’ll be very glad. Thanks!
# wget 'http://www.pioneerinvestments.it/it/webservice/pdfDispatcher.jhtml?doccode=ilpunto&from=02008FON&isin=IT0000388204'
# pdftotext pioneer_monetario_euro_a.pdf* - | grep 'Valore quota' -A 2
Valore quota
13,158
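As a rough sketch of how that lookup could be narrowed down to just the number, the pdftotext output can be filtered with awk instead of grep. The function name extract_quota is made up for illustration, and the sketch assumes that the value is always the first line containing a digit after the "Valore quota" line, and that whatever consumes the result wants a dot rather than the Italian decimal comma.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of any tool):
# reads pdftotext output on stdin and prints the first number found
# after the "Valore quota" marker line.
extract_quota() {
    awk '
        /Valore quota/ { seen = 1; next }   # marker line found, skip it
        seen && /[0-9]/ {                   # first later line with a digit
            gsub(/[^0-9,]/, "")             # keep only digits and the comma
            sub(/,/, ".")                   # decimal comma becomes a dot
            print
            exit
        }
    '
}

# Possible usage, with the file name from the commands above:
#   pdftotext pioneer_monetario_euro_a.pdf - | extract_quota
```

With the output shown above, this would print 13.158. It does nothing to integrate with Finance::Quote; it only isolates the value.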
Scraping the numeric value out of the PDF is easy enough once you have the PDF. The following Perl snippet will do the right thing as a freestanding tool. I have not looked at Finance::Quote to see how to integrate it, but since this snippet prints the value to stdout, it should be easy to convert to a subroutine.

#!/usr/bin/perl
use strict;
use warnings;

# Spawn pdftotext as a subprocess and connect its stdout to $pipe
open my $pipe, "/usr/bin/pdftotext pioneer_monetario_euro_a.pdf - |"
    or die "Failed to pipe: $!\n";

# Track whether we have seen the required fixed string yet
my $state = 0;
while (<$pipe>) {
    if (/Valore quota/) {
        $state = 1;
        next;
    }
    # After the required string has been seen, the first line with a
    # decimal digit is considered to be the sought value.
    if ($state == 1 && /[0-9]/) {
        print $_;
        last;
    }
}

If no one gives you a more complete answer within the next couple of days, I will grab Finance::Quote and try to solve the full problem. In the meantime, I hope this gets you (or another reader) on the right track.