Links checking

I started writing a blog just to keep users updated on the development of Gentoo/FreeBSD and the other projects I worked on; it was never my intention to make it my biggest project, but one thing led to another and I’m afraid to say that lately my biggest contribution to free software is this very blog. I’m not proud of this, and it really shouldn’t be this way, but lacking time (and having a job that gets me to work on proprietary rather than free software), this is the best I can come up with.

Still, I think that a contribution is only worth something to the extent that it is properly done; for this reason it bothers me that I cannot go over all the current posts and make sure there aren’t any factual mistakes in them. Usually, if I know I got something wrong and I want to explain and fix the mistake a longish time after publication, I just write a new “correction” entry and link to the older post. Originally this worked out nicely because Typo would handle the internal trackback, so that the two posts would be automatically linked to each other; unfortunately trackbacks don’t seem to work now, even though I did enable them when I started the User-Agent filtering (so that the spam could be reduced to a manageable amount).

In addition, there are quite a few posts that are for now only available on the older blog, which bothers me quite a bit, since that blog is full of spam, gets my name wrong, and forces users to search two places for the first topics I wrote about. Unfortunately migrating the posts out of the b2evolution install is quite cumbersome, and I guess I should try to bribe Steve again about that.

Update (2016-04-29): I actually imported the old blog in 2012. I also started merging every other post I wrote anywhere else in the meantime.

Anyway, besides the factual errors in the content, there are a few other things that I can and should deal with on the blog, and one of these is the validity of the external and internal links. Now, I know this is the sort of stuff that falls into the so-called “Search Engine Optimisation” field, and I don’t care. I dislike the whole idea and I find that calling it “SEO” is just a way for script kiddies to feel important like a “CEO”. I don’t do this for the search engines, I do this for the users: I don’t like finding a broken link on a site, so I’d like my own sites not to have broken links.

Google Webmaster Tools is a very interesting tool in this regard, since it allows you to find broken inbound links. I already commented about OSGalaxy breaking my links (and in the meantime I don’t get published there any longer because they don’t handle Atom feeds); for that and other sites, I keep a whole table of redirections for the blog’s URLs, as well as a series of normalisations for URLs that often have trailing garbage characters (like semicolons and other things).
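A minimal sketch of what such a redirection table plus trailing-garbage normalisation could look like, in Python. The table entry, the set of stripped characters, and the `normalise` name are all made up for illustration; the actual rules behind this blog live in the web server configuration and are not shown here.

```python
import re

# Hypothetical redirect table: old path -> current path.
REDIRECTS = {
    "/articles/2008/01/01/old-title": "/2008/01/old-title",
}

# Characters that often get glued onto the end of a copied link
# (an assumed set; extend it as new garbage shows up in the logs).
TRAILING_GARBAGE = re.compile(r"[;,.)\]]+$")


def normalise(path: str) -> str:
    """Strip trailing garbage, then apply the redirect table."""
    cleaned = TRAILING_GARBAGE.sub("", path)
    return REDIRECTS.get(cleaned, cleaned)
```

Stripping first and redirecting second means that a mangled inbound link to an old URL still ends up on the current one in a single step.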

Unfortunately what GWT lacks is a way to check outbound links, at least as far as I can see; I guess it would be a very useful tool for that, because Google has to index the content anyway, so adding checks for that stuff shouldn’t be much of a problem for them. The nicest thing would be for Typo (the software handling my blog) to check the links before publishing and alert me about errors (an as-you-type check would help, but it would require a proxy to cache requests for at least a few hours, otherwise I would be hitting the same servers many times while writing). Since that does not seem to be happening for now, and I don’t foresee it happening in the near future, I’m trying to find an alternative approach.
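To illustrate the caching idea, here is a rough Python sketch. It swaps the caching proxy for a simple in-process cache with a time-to-live, which is not the same thing but shows the same effect: a link checked once is not fetched again for a few hours. `CachedChecker` and its parameters are hypothetical names, and the actual check is passed in as a function so that the sketch stays independent of how links are fetched.

```python
import time
from typing import Callable, Dict, Tuple


class CachedChecker:
    """Remember the result of a link check for `ttl_seconds`."""

    def __init__(
        self,
        check: Callable[[str], bool],
        ttl_seconds: float = 3 * 3600,  # "a few hours", an assumed default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bool]] = {}

    def __call__(self, url: str) -> bool:
        now = self._clock()
        hit = self._cache.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            # Fresh enough: answer from the cache, no request is made.
            return hit[1]
        result = self._check(url)
        self._cache[url] = (now, result)
        return result
```

An as-you-type check could then call the wrapped checker on every keystroke without hitting the remote servers more than once per link within the time-to-live.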

At the time I’m writing (which is not when you’re going to read this post), I’m running a local copy of the W3C LinkChecker (I should package it for Gentoo, but I don’t have much experience with Perl packaging) over my blog; I have already run it over my site and xine’s, and fixed a few of the entries that it spewed out.

Again, this is not the final solution I need: the problem is that it does not allow me to run an actual incremental scan. While I am currently caching all the pages through polipo, this is not going to work in the long run, just for today’s spree. There are quite a few other problems with the current setup:

  • it does not allow removing the 1-second delay on requests, not even for localhost (when I’m testing my own static site locally I don’t need any delay at all, I can actually pipeline lots of requests together);
  • it does not have a way to provide a “whitelist of unreachable URLs” (like my Amazon wishlist, which does not respond to the HEAD request);
  • while the output is quite suitable to be sent via email (so I can check each day for new entries), I would have preferred it to output XML, with a provided XSL to convert it to something user friendly; that would have allowed me to handle the URL filtering in a more semi-automatic way;
  • finally, it does not support IDN, and I like IDN, which makes me a bit sad.
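The wishlist above can be sketched in a few lines of Python. This is not how the W3C LinkChecker works, just an illustration of the four missing features under some assumptions: the whitelist entry is a placeholder URL, the function names are invented, and the IDN handling relies on Python’s built-in `idna` codec.

```python
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

# Hypothetical whitelist of URLs known to reject HEAD requests.
WHITELIST = {"https://shop.example/wishlist/123"}

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def delay_for(url: str) -> float:
    """No delay for local hosts, one second for everything else."""
    return 0.0 if urlsplit(url).hostname in LOCAL_HOSTS else 1.0


def to_ascii(url: str) -> str:
    """Encode an internationalised host name to its ASCII form.

    Simplification: any user:password@ part of the URL is dropped.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").encode("idna").decode("ascii")
    if ":" in host:  # IPv6 literal, needs its brackets back
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=host))


def check(url: str) -> str:
    """Return "skipped", "ok", or a short error description."""
    if url in WHITELIST:
        return "skipped"
    time.sleep(delay_for(url))
    request = urllib.request.Request(to_ascii(url), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10):
            return "ok"
    except urllib.error.HTTPError as err:
        return f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        return f"error: {err}"


def report(results: dict) -> str:
    """Serialise {url: status} as XML, ready for an XSL stylesheet."""
    root = ET.Element("linkcheck")
    for url, status in results.items():
        ET.SubElement(root, "link", url=url, status=status)
    return ET.tostring(root, encoding="unicode")
```

Keeping the delay, the whitelist and the host encoding as separate small functions makes each policy easy to change without touching the request code.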

For now, what I gathered from the checker output is that my use of Textile for linking causes most of the bad links in the blog (because it keeps semicolons, closing parentheses and so on as part of the link), and I dislike the effect of the workaround of adding spaces (no, the “invisible space” is not a solution, since the parser doesn’t understand that it is whitespace and adds it to the link as well). There are also lots of broken links because, after the Typo update, the amazon: links don’t work any longer. This actually gives me a bit of a chance: they used to be referral links (even though they never made any difference); now, after the change of styles, I don’t need those any longer, so I’ll just replace them database-wide with the direct links.
