Stop inventing a new ontology for each service!

Last month I wrote a post noting who makes use of semantic data for the web, in particular pointing out that Facebook, Google, Readability and Flattr all use a different way to provide context to the content: OpenGraph, Schema.org, hNews and their own version of microformats, respectively.

Well, NewsBlur – which, even though I criticized its HTTP implementation, is still my best suggestion for a Google Reader replacement, if only because it’s open source even though it’s a premium service – seems to have come up with its own way to get semantic data.

The FAQ for publishers states that you can use one of a number of possible selectors to give NewsBlur an idea of how your content is structured — completely ignoring the fact that schema.org already describes all of that structure, and that it would be relatively easy to get the data from it explicitly. Even better, since NewsBlur has a way to show public comments within its interface, it would be possible for it to display the comments on the posts themselves, as they are tagged and structured with the same ontology. I’ve opened an idea about it — hopefully somebody, if not the author, will feel like implementing this.
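
To give an idea of what NewsBlur could read directly, here’s a minimal sketch of the kind of schema.org microdata a post with public comments can carry; the URLs and content are placeholders, and the exact item types and property names may differ between schema.org revisions:

    <article itemscope itemtype="http://schema.org/BlogPosting">
      <h1 itemprop="name">Stop inventing a new ontology for each service!</h1>
      <div itemprop="articleBody">
        <!-- the actual content of the post -->
      </div>
      <div itemprop="comment" itemscope itemtype="http://schema.org/Comment">
        <span itemprop="author">A reader</span>
        <div itemprop="text">A public comment, structured with the same vocabulary.</div>
      </div>
    </article>

A reader like NewsBlur would only need to look for the articleBody and comment properties, instead of asking publishers for ad-hoc selectors.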

But this is far from limited to NewsBlur! While Readability added a special case for my blog so that it actually gets the right data out of it, their content guide still only describes support for the hNews format, even though Schema.org carries all the same data and more. And Flattr, well, still does not seem to care about getting data via semantic information — the best match would be support for a link relation in feeds that can be autodiscovered, but then I don’t really have an idea of where Flattr would find the metadata to create the “thing” on their side.
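
For what it’s worth, the feed-based approach I’m thinking of would look roughly like this in an Atom entry; I’m assuming the rel="payment" relation here, since that is the one Flattr has been pushing for podcast feeds, and the URLs are placeholders:

    <entry>
      <title>Stop inventing a new ontology for each service!</title>
      <link rel="alternate" type="text/html"
            href="https://blog.example.org/posts/ontologies" />
      <link rel="payment" type="text/html"
            href="https://flattr.com/submit/auto?url=https%3A%2F%2Fblog.example.org%2Fposts%2Fontologies" />
    </entry>

Even then, Flattr would still have to fetch the linked page to find title, tags and language for the “thing”, which is exactly where a shared ontology would help.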

Please, all you people who work on services — can we all get behind the same ontology, so that we don’t have to add the same information to our pages four times over, increasing their size for no advantage? Please!

Who consumes the semantic web?

In my previous post I noted that I was adding support for the latest fad in semantic tagging of data on web pages, but it was obviously not clear who actually consumes that data. So let’s see.

In the midst of the changes to Typo that I’ve been sending to support a fully SSL-compatible blog install (mine is not entirely there yet, mostly because most of the internal links from one post to the next are not protocol-relative), I’ve added one commit to provide a bit more OpenGraph insight — OpenGraph being used, almost exclusively, by Facebook. The only metadata I provide through that protocol, though, is an image for the blog – since I don’t have a logo, I’m sending my gravatar – the title of the single page, and the global site title.
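
In practice this boils down to a handful of <meta> elements in the page head; the property names are the standard OpenGraph ones, the values below are placeholders:

    <head>
      <meta property="og:site_name" content="The blog's title" />
      <meta property="og:title" content="The single post's title" />
      <meta property="og:image" content="https://secure.gravatar.com/avatar/…" />
    </head>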

Why that? Well, mostly because this way, if you post a link to my blog on Facebook, it will appear with the title of the post itself instead of the one that is visible on the page. This solves the problem of whether the title of the blog itself should be dropped from the <title> tag.

For what concerns Google, instead, the most important piece of metadata you can provide seems to be authorship tagging, which uses Google+ to connect content by the same author. Is this going to be useful? Not sure yet, but at least it shows up in a less anonymous way in the search results, and that can’t be bad. Unlike what they say on the linked page, it’s possible to use an invisible <link> tag to connect the two, which is why you don’t find a G+ logo anywhere on my blog.
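
The invisible variant is nothing more than a link relation in the page head pointing at the Google+ profile (the profile URL below is obviously a placeholder); as far as I understand, the profile also has to link back to the site for the connection to be accepted:

    <head>
      <link rel="author" href="https://plus.google.com/00000000000000000000" />
    </head>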

What else do search engines do with the remaining semantic data? I’m not sure; their documentation doesn’t seem to explain it, and since I don’t know what happens behind the scenes it’s hard for me to give a proper answer. But I can guess, and hope, that they use it to reduce the redundancy of the current index. For instance, pages that are actually lists of posts, such as the main index, the categories/tags and the archives, now properly state that they are describing blog postings whose URLs are, well, somewhere else. My hope is that search engines will then know to link to the declared post URL instead of the index page, and possibly boost the results for the posts that turn out to be more popular (given that they can then count the comments). What I’m definitely counting on is for descriptions in search results to become more human-readable.
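
To sketch what I mean: on an index page, each entry can declare the canonical URL of the post it summarizes, so the search engine knows where the real content lives. Only the schema.org type and property names below are standard; the rest is illustrative:

    <article itemscope itemtype="http://schema.org/BlogPosting">
      <link itemprop="url" href="https://blog.example.org/posts/who-consumes-the-semantic-web" />
      <h2 itemprop="name">Who consumes the semantic web?</h2>
      <div itemprop="description">A short excerpt of the post…</div>
    </article>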

Now, in the case of Google you can use their Rich Snippet testing tool, which gives you an idea of what it finds. I’m pretty sure they take all this data with a grain of salt, though, seeing how many players there are in the “SEO” world, with people trying to game the system altogether. But at least I can hope that things will move in the right direction.

Interestingly, when I first implemented the new semantic data, Readability did not support it, and would show my blog’s title instead of the post’s title when reading articles from there — after some feedback on their site they added a workaround for my case, so you can enjoy their app with my content just fine. Hopefully, with time, the microformat will be supported in general.

On the other hand, Flattr still shows no improvement in using metadata, as far as I can see. They require you to add a button manually, repeating the kind of metadata (content type, language, tags) that could easily be inferred from the microformat already present. Hereby, I’d like to reiterate my plea to the Flattr developers: listen to OpenGraph and other microformat data, and at least use it to augment the manually-inserted buttons. Supporting the schema.org format, by the way, should make it relatively easy to add per-fragment buttons — i.e., I wouldn’t mind having a per-comment Flattr button to reward constructive comments, like they have on their own blog, but without the overhead of adding them manually.

Right now this is all the semantic data that I have figured out is actually being used. Hopefully things will become more useful in the future.

The issue with the split HTML/XHTML serialization

Not everybody knows that HTML5 has been released in two flavours: HTML5 proper, which uses the old serialization, similarly to HTML 4, and what is often, incorrectly, called XHTML5, which uses the XML serialization, like XHTML 1.0 and XHTML 1.1 did. The two serializations have different degrees of strictness, and the browsers treat them accordingly.

It so happens that DocBook’s default XHTML 1.0 output is compatible with the HTML serialization, which means that even if the files have a .html extension, locally, they will load correctly in Chrome, for instance. The same can’t be said for the XHTML 1.1 or XHTML5 output; one particularly nasty problem is that the generated code contains XML-style self-closing tags such as <a id="foo" />, which throw browsers off entirely unless the page is properly loaded as XHTML … and on the other hand, IE still has trouble when served properly-typed XHTML (i.e. you have to serve it as application/xml rather than application/xhtml+xml).
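
A minimal illustration of the problem: the two fragments below are equivalent in the XML serialization, but an HTML parser ignores the trailing slash and treats the self-closing <a/> as an ordinary start tag, so everything that follows ends up nested inside an anchor that is never closed:

    <!-- works in both serializations -->
    <a id="section.intro"></a>
    <h2>Introduction</h2>

    <!-- only safe as XML: an HTML parser sees an unclosed <a> here -->
    <a id="section.intro" />
    <h2>Introduction</h2>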

So I have two choices: redirect all the .html requests to .xhtml, make the output XHTML5 and work around the limitations of IE8 (and earlier), or forget about XHTML5 altogether. This starts to get tricky! So for the moment I decided not to go with XHTML5, and at the same time I’m going to keep building ePub 2 books and publishing them as they are, instead of using ePub 3 (even though, as I said, O’Reilly got it working for their workflow).

Unfortunately, even if I fixed all of that on the server side, it wouldn’t be enough on its own! I would also have to change the CSS, since many things that used to be <div> elements now use proper semantic elements, including <section> (with the exception of the table of contents on the first landing page, obviously, damn). In one way this actually makes things easier, as it lets me drop the stupid nth-child CSS3 trick I used to style the main div as opposed to the header and footer. Hopefully this should also let me fix the nasty IE3-style bevelled border that Chrome puts around the Flattr button when using XHTML5.
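
This is roughly the kind of simplification I mean; the selectors and properties are illustrative, not the actual stylesheet:

    /* before: the main content was "the div that is neither header nor footer" */
    body > div:nth-child(2) { margin: 0 auto; max-width: 50em; }

    /* after: the semantic elements can be addressed directly */
    body > section { margin: 0 auto; max-width: 50em; }
    body > header, body > footer { text-align: center; }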

In the meantime I have made a few general fixes to the style; now I just need to wait for the cover image to come from my designer friend, and then I can update both the website and the eBook versions across the stores.

To close the post… David, you deserve a public apology: while you were listed as <editor> in the DocBook sources before, and the XSL was supposed to emit that on the homepage, for whatever reason it fails to. I’ve upgraded you to <author> until I can find out why the XSL is misbehaving and fix it properly.
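
For clarity, the change boils down to swapping one element inside the book’s info section; this is a DocBook 5 sketch, not the actual sources:

    <info>
      <!-- previously: <editor><personname>David …</personname></editor>,
           which the XHTML stylesheets failed to render on the landing page -->
      <author>
        <personname>David …</personname>
      </author>
    </info>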

In the meantime, tomorrow I’ll write a few more words about automake, and then…

HTML5: compliance shouldn’t require support

It seems like the whole thing about HTML5 and video/audio formats is still not settled, three years after my cursing at Quassel because Qt-WebKit decided to bring in GStreamer to support HTML5 video.

This time the issue is with Firefox and Thunderbird, both of which come with a webm USE flag that, if disabled, makes them fail to build.

I start to wonder why people insist that, for HTML5 compliance, you have to support playing the video. All you have to do is parse the element and act on it; showing a “This content is not available with your current browser” message is quite fine if I don’t want WebM support!
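
This is all the specification really asks of a browser built without a given codec; the fallback content inside <video> is only shown by browsers that don’t implement the element at all, but a compliant browser can still parse the element, skip the sources it cannot decode, and tell the user so:

    <video controls>
      <source src="example.webm" type="video/webm" />
      <source src="example.mp4" type="video/mp4" />
      <!-- rendered only by browsers without <video> support at all -->
      This content is not available with your current browser.
    </video>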

No technical content for today; it’s Sunday and I’m fighting to get Thunderbird to work.

Updated to Typo 6

While the dependency trouble I wrote about is not entirely solved yet, I’ve been able to update to Typo 6, mostly because I didn’t want to keep running the ancient version I was on, now that I have a clear view of which packages require the most work.

Most importantly, since I wanted to spend some extra time writing a couple of plugins for Typo (in particular something to submit posts directly to Flattr, rather than using the auto-submit URL, which would reduce the amount of work the blog has to do when rendering a single article), I wanted to do so on a modern version of the package, not one still based on Rails 2.3 and so on.
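
For context, the auto-submit approach means every rendered page carries a link along these lines, and Flattr only creates the “thing” the first time somebody clicks it; the parameter names are the ones I remember from Flattr’s documentation, so take them as indicative rather than exact, and the URLs are placeholders:

    <a href="https://flattr.com/submit/auto?user_id=myusername&amp;url=https%3A%2F%2Fblog.example.org%2Fposts%2Ftypo-6&amp;title=Updated+to+Typo+6&amp;category=text&amp;language=en_GB">
      Flattr this
    </a>

A plugin submitting the post through the API once, at publication time, would let the blog serve a plain, pre-built button instead.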

There are a few changes to my theme with this version, by the way: I’m now using a few more HTML5 features, such as the <article> tag. Unfortunately validation still fails right now because the page is not XML-based enough (the validator does not seem to allow you to register new namespaces, such as the Creative Commons or OpenGraph ones), and it does not allow the use of RDFa types, suggesting a different schema instead (which is not available to use). All in all, my answer to this is “oh well”.
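
What trips the validator up is essentially markup along these lines: XML-style namespace prefixes declared on the root element and RDFa attributes further down. The namespace URIs are the usual OpenGraph and Creative Commons ones; everything else is illustrative:

    <html xmlns:og="http://ogp.me/ns#"
          xmlns:cc="http://creativecommons.org/ns#">
      <head>
        <meta property="og:title" content="Updated to Typo 6" />
      </head>
      <body>
        <article>
          <a rel="cc:license" href="http://creativecommons.org/licenses/…">
            the blog's license
          </a>
        </article>
      </body>
    </html>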

The page works with Chrome, Firefox and Safari, and that is usually good enough for me; I’ll try to fix it up if other browsers make a mess of it, but I don’t think they will. I guess it’s going to stay tricky this way for a while.

Anyway, this is it for today.

So, wasn’t HTML5 supposed to make me Flash-free?

Just like Multimedia Mike, I have been quite sceptical about seeing HTML5 as a saviour of the open web. Not only because I dislike Ogg with a passion after having tried to parse it myself without the help of libogg (don’t get me started), but because I can pragmatically expect a huge number of problems related to serving multiple variants of a video file depending on browser and operating system. Lacking common ground, it’s generally a bad situation.

But I had been hoping that Google’s commitment to supporting HTML5 video, especially on YouTube, would have given me a mostly Flash-free environment; unfortunately that doesn’t seem to be the case. There is a post on the YouTube API blog from last month that tries to explain to users why they are still required to use Flash. It has, though, the sour taste that reminds me of Microsoft’s boasting about Windows Genuine Advantage. I guess that a note such as this:

Without content protection, we would not be able to offer videos like this.

which then lands me on a page that says, right at the top, “This rental is currently unavailable in your country.”, without any further notice and without a warning that Your Mileage May Vary, makes it very likely that I’ll have mixed feelings about a post like that.

Now, from that same post, I get the feeling that for now Google is not planning on supporting embedded YouTube videos through HTML5, and relies entirely on Flash for that:

Flash Player’s ability to combine application code and resources into a secure, efficient package has been instrumental in allowing YouTube videos to be embedded in other web sites. Web site owners need to ensure that embedded content is not able to access private user information on the containing page, and we need to ensure that our video player logic travels with the video (for features like captions, annotations, and advertising). While HTML5 adds sandboxing and message-passing functionality, Flash is the only mechanism most web sites allow for embedded content from other sites.

Very unfortunate, given that a number of websites, including one belonging to a friend of mine, actually use YouTube to embed videos; even my blog has a post using it. It’s still a shame, because it means that Google loses the iPad users… or does it, at all? I played around for a minute with an iPad at the local Mediaworld (Mediamarkt) last week, and I looked at my friend’s website with it. The videos load perfectly, through HTML5 I guess, given that the iPad does not support Flash at all.

So what’s the trick? Does Google provide HTML5-enabled embedded videos when it detects the iPhoneOS/iOS Safari identification in the user agent? Or is it Safari that translates the YouTube links into HTML5-compatible ones? In the former case, why does it not do the same when it detects Chrome/Chromium? In the latter, why can’t there be an extension to do the same for Chrome/Chromium?

Once again, my point is that you cannot simply characterize Apple and Google as absolutely evil or absolutely good; there is no “pureness” in our modern world as it is, and I don’t think that striving for it is going to work at all… extremes are not suited to human nature, not even extreme purity.