Google, code search, and sitemaps

I’m writing this blog post on my birthday; I probably won’t publish it right away, but that depends on how much more stuff I want to queue, and on whether I go out tonight or not. I wanted to take this weekend off the net, but I decided to limit that to staying off IM, email and Facebook instead, since otherwise I would get bored.

Since I’ve been doing some work on Apache, especially thanks to Google Webmaster Tools, which showed me the broken redirections and other problems, I decided to take a look at the structure of my site and see what there was to fix.

Before continuing, let me say that the site you see is generated statically from some custom XML files that use XHTML-compatible syntax, processed with XSLT (through xsltproc). While I know that having custom XML is not the brightest idea, I use this method to separate the site’s actual content from the form you see. In the past it allowed me to move between different themes with little fiddling with the content, which is a plus. Of course it’s not something I’d suggest for most sites, but it’s still not so bad for me. I just commit the changes to git, and once they are pushed they are also converted for the online site.

Anyway, today I started by wanting to kill off the broken links to my site coming from the net; in particular, there have been some broken links to projects’ pages, for which I added redirects (thanks to Apache’s RewriteMap, which makes it a piece of cake). From that I started looking into making sure the output actually validated as XHTML, and ended up rewriting most of the sources to use XML namespaces for my own syntax; this will allow me a more generic approach in the future if I want or need it.
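To make the RewriteMap part a bit more concrete, here is a minimal sketch of how such a map file could be produced. The paths and the `write_rewrite_map` and `render_rewrite_map` helpers are made up for illustration; this is not what my site actually runs. A plain-text map is simply one `key value` pair per line, which Apache can then look up with something along the lines of `RewriteMap projects txt:/path/to/projects.map` and a `RewriteRule` that uses `${projects:$1}`.

```python
import io

# Hypothetical old -> new project paths; the real list would come
# from the site's own sources.
MOVED_PROJECTS = {
    "old-project": "/projects/new-project",
    "some-tool": "/projects/tools/some-tool",
}


def write_rewrite_map(mapping, out):
    """Write a plain-text RewriteMap: one 'key value' pair per line."""
    for old, new in sorted(mapping.items()):
        # Keys and values are whitespace-separated, so neither may contain any.
        if any(c.isspace() for c in old + new):
            raise ValueError(f"whitespace in map entry: {old!r} -> {new!r}")
        out.write(f"{old} {new}\n")


def render_rewrite_map(mapping):
    """Return the map file content as a string."""
    buf = io.StringIO()
    write_rewrite_map(mapping, buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_rewrite_map(MOVED_PROJECTS), end="")
```

Keeping the list of moved pages in a generated file like this means a new redirect is one more line in the map rather than one more rule in the Apache configuration.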

From the same sources I also generate sitemaps, which I originally used to submit URLs to Google through the Webmaster Tools interface. I noticed today that Google has since joined forces with Microsoft and Yahoo to standardise the format, which is very nice. Unfortunately I cannot add support for it to nxml, because the schema files carry no license, and no contact information is listed for reporting issues with the schemas or the protocol itself.

The problem here is that even though the standard sitemap and the code search sitemap now share the same format, Google does not support putting the two in one single file. For my site I currently build a single sitemap, but Google Webmaster Tools reports warnings for the codesearch tags and namespace, which is kinda silly. It also does not allow submitting the same sitemap file as both a general sitemap and a code search sitemap, which is also very silly: it is technically feasible to have just one, and for smaller sites like mine it makes total sense, not least to make caching more effective.
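For the sake of illustration, this is roughly what a single sitemap mixing the two vocabularies looks like when built programmatically. It is only a sketch in Python, not my actual XSLT: the `build_sitemap` helper and the example URLs are invented, and the codesearch namespace URI, the element names and the `archive` and `GPL` values are written down as I remember them from Google’s documentation, so treat them as assumptions.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumption: namespace URI of the code search extension, as I recall it.
CODESEARCH_NS = "http://www.google.com/codesearch/schemas/sitemap/1.0"

# Serialize the sitemap namespace as the default one and the
# extension with an explicit "codesearch" prefix.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("codesearch", CODESEARCH_NS)


def build_sitemap(pages, archives):
    """Build one <urlset> holding plain pages and code search entries.

    pages:    iterable of page URLs
    archives: iterable of (url, filetype, license) tuples
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    for loc, filetype, license_name in archives:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        # The code search data lives in its own namespace inside <url>.
        cs = ET.SubElement(url, f"{{{CODESEARCH_NS}}}codesearch")
        ET.SubElement(cs, f"{{{CODESEARCH_NS}}}filetype").text = filetype
        ET.SubElement(cs, f"{{{CODESEARCH_NS}}}license").text = license_name
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap(
        ["http://example.com/"],
        [("http://example.com/releases/foo-1.0.tar.gz", "archive", "GPL")],
    ))
```

Since the extra elements sit in a separate namespace, a consumer that only understands the base protocol can skip them, which is exactly why a single file ought to be enough.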

Also, the codesearch sitemap format does not let you specify many licenses: the list of supported licenses is very short, and the license tag does not carry much meaning. You cannot really tell whether it’s GPL2 or later, GPL2 only, GPL3, and so on and so forth; it also lacks an AGPL option entirely.

So if somebody from Google is reading this: it would be a very nice feature for all of us to be able to submit a single sitemap and be done with it for all types of content. XML, and the namespaced “sitemap protocol” as you call it, allows for it, and it makes total sense. Why should they be handled differently on the webmaster side?
