Free Idea: structured access logs for Apache HTTPD

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement. It’s effectively a very sketched proposal that comes with no design attached, but if you have time you would like to spend learning something new, but no idea what to do, it may be a good fit for you.

I have been commenting on Twitter a bit about the lack of decent tooling to deal with Apache HTTPD’s Combined Logging Format (inherited from NCSA). For those who do not know about it, this is hte format used by standard access_log files, which include information about requests, including the source IP, the time, the requested path, the status code and the User-Agent used.

These logs are useful for debugging but are also consumed by tools such as AWStats to produce useful statistics about the request patterns of a website. I used these extensively when writing my ModSecurity rulesets, and I still keep an eye out on them for instance to report wasteful feed readers.

The files are simple text files, and that makes it easy to act on them: you can use tail and grep, and logrotate needs no special code beside moving the file and reloading Apache to have it re-open the paths. This makes it hard to query for particular entries in fields, such as to get the list of User-Agent strings present in a log. Some of the suggestions I got over Twitter to solve this were to use awk, but as it happens, these logs are not actually parseable with a straightforward field separation.

Lacking finding a good set of tools to handle these formats directly, I have been complaining that we should probably start moving away from simple text files into more structured log formats. Indeed, I know that there used to be at least some support for logging directly to MySQL and other relational databases, and that there are more complicated machinery often used by companies and startups that process these access logs into analysis software and so on. But all of these tend to be high overhead, much more than what I or someone else with a small personal blog would care about implementing.

Instead I think it’s time to start using structured file logs. A few people including thresh from VideoLAN suggested using JSON to write the log files. This is not a terrible idea, as the format is at least well understood and easy to interface with most other software, but honestly I would prefer something with an actual structure, a schema that can be followed. Of course I’m not meaning XML, and I would rather suggest having a standardized schema for proto3. Part of that I guess is because I’m used to use this at work, but also because I like the idea of being able to just define my schema and have it generate the code to parse the messages.

Unfortunately currently there is no support or library to access a sequence of protocol buffer messages. Using a single message with repeated sub-messages would work, but it is not append-friendly so there is no way to just keep writing this to a file, and being able to truncate and resume writing to it, which is a property needed for a proper structured log format to actually fit in the space previously occupied by text formats. This is something I don’t usually have to deal with at work, but I would assume that a simple LV (Length-Value) or LVC (Length-Value-Checksum) encoding would be okay to solve this problem.

But what about other properties of the current format? Well, the obvious answer is that, assuming your structured log contains at least as much information (but possibly more) as the current log, you can always have tools that convert on the fly to the old format. This would for instance allow to have a special tail-like command and a grep-like command that provides compatibility with the way the files are currently looked at manually by your friendly sysadmin.

Having more structured information would also allow easier, or deeper analysis of the logs. For instance you could log the full set of headers (like ModSecurity does) instead of just the referrer and User-Agent. And allow for customizing the output on the conversion side rather than lose the details when writing.

Of course this is just one possible way to solve this problem, and just because I would prefer working with technologies that I’m already friendly with it does not mean I wouldn’t take another format that is similarly low-dependency and easy to deal with. I’m just thinking that the change-averse solution of not changing anything and keeping logs in text format may be counterproductive in this situation.

Apache, ETag and “Not Modified”

In my previous post on the matter I incorrectly blamed NewsBlur – which I still recommend as the best feed reader I’ve ever used! – for not correctly supporting HTTP features to avoid wasting bandwidth for fetching repeatedly unmodified content.

As Daniel and Samuel pointed out immediately, NewsBlur does support those features, and indeed I even used it as an example four years ago, oops for my memory being terrible that way, and me assuming the behaviour from the logs rather than inspecting the requests. And indeed the requests were not only correct, but matched perfectly what Apache reported:

GET /index.xml HTTP/1.1
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/atom+xml, application/rss+xml, application/xml;q=0.8, text/xml;q=0.6, */*;q=0.2
User-Agent: NewsBlur Feed Fetcher - 59 subscribers - (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36)
A-IM: feed
If-Modified-Since: Wed, 16 Aug 2017 04:22:52 GMT
If-None-Match: "27dc5-556d73fd7fa43-gzip"

HTTP/1.1 200 OK
Strict-Transport-Security: max-age=31536000; includeSubDomains
Last-Modified: Wed, 16 Aug 2017 04:22:52 GMT
ETag: "27dc5-556d73fd7fa43-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=1800
Expires: Wed, 16 Aug 2017 18:56:33 GMT
Content-Length: 54071
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: application/xml

So what is going on here? Well, I started looking around, both because I now felt silly, and because I owed more than just an update on the post and an apology to Samuel. And a few searches later, I found Apache bug #45023 that reports how mod_deflate prevents all 304 responses from being issued. This is a bit misleading (as you can still have them in some situations), but it is indeed what is happening here, and it is a breakage introduced by Apache 2.4.

What’s going on? Well, let’s first start to figure out why I could see some 304, but not from NewsBlur. Willreadit was one of the agents that received 304 responses at least some of the time and in the landing page it says explicitly that it supports If-Modified-Since. In particular, it does not support If-None-Match.

The If-None-Match header in the request compares with the ETag header (Entity Tag) in the response coming from Apache. This header is generally considered opaque, and the client should have no insights in what it is meant to do. The server generally calculates its value based on either a checksum of the file (e.g. md5) or based on file size and last-modified time. On Apache HTTP Server, the FileETag directive is used to define which properties of the served files are used to generate the value provided in the response. The default that I’m using is MTime Size, which effectively means that changing the file in any way causes the ETag to change. The size part might actually be redundant here, since modification time is usually enough for my use cases, but this is the default…

The reason why I’m providing both Last-Modifed and ETag headers in the response is that HTTP client can just as well only implement one of the two methods, rather than both, particularly as they may think that handling ETag is easier as it’s an opaque string, rather than information that can be parsed — but it really should be considered opaquely as well as it’s noted in RFC2616. Entity Tags are also more complicated because they can be used to collapse caching of different entities (identifed by an URL) within the same space (hostname) by caching proxies. I have lots of doubts that this usage is in use, so I’m not going to consider it a valid one, but your mileage may vary. In particular, since the default uses size and modification time, it ends up always matching the Last-Modified header, for a given entity, and the If-Modified-Since request would be just enough.

But when you provide both If-Modified-Since and If-None-Match, you’re asking for both conditions to be true, and so Apache will validate both. And here is where the problem happens: the -gzip suffix – which you can see in the header of the sample request above – is added at different times in the HTTPD process, and in particular it makes it so that the If-None-Match will never match the generated ETag, because the comparison is with the version without -gzip appended. This makes sense in context, because if you have a shared caching proxy, you may have different user agents that support different compression algorithms. Unfortunately, this effectively makes it so that entity tags disable Not Modified states for all the clients that do care about the tags. Those few clients that received 304 responses from my blog before were just implementing If-Modified-Since, and were getting the right behaviour (which is why I thought the title of the bug was misleading).

So how do you solve this? In the bug I already noted above, there is a suggestion by Joost Dekeijzer to use the following directive in your Apache config:

RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"'

This adds a version of the entity tag without the suffix to the list of expected entity tags, which “fools” the server into accepting that the underlying file didn’t change and that there is no need to make any change there. I tested with that and it does indeed fix NewsBlur and a number of other use cases, including browsers! But it has the side effect of possibly poisoning shared caches. Shared caches are not that common, but why risking it? So I decided onto a slightly different option

FileETag None

This disable the generation of Entity Tags for file-based entities (i.e. static files), forcing browsers and feed readers to rely on If-Modified-Since exclusively. If clients only implement If-None-Match semantics, then this second option loses the ability to receive 304 responses. I have actually no idea which clients would do that, since this is the more complicated semantics, but I guess I’ll find out. I decided to give a try to this option for two reasons: it should simplify Apache’s own runtime, because it does not have to calculate these tags at any point now, and because effectively they were encoding only the modification time, which is literally what Last-Modified provides! I had for a while assumed that the tag was calculated based on a (quick and dirty) checksum, instead of just size and modification time, but clearly I was wrong.

There is another problem at this point, though. For this to work correctly, you need to make sure that the modification time of files is consistent with them actually changing. If you’re using a static site generator that produces multiple outputs for a single invocation, which includes both Hugo and FSWS, you would have a problem, because the modification time of every file is now the execution time of the tool (or just about).

The answer to this is to build the output in a “staging” directory and just replace the files that are modified, and rsync sounds perfect for the job. But the more obvious way to do so (rsync -a) will do exactly the opposite of what you want, as it will preserve the timestamp from the source directory — which mean it’ll replace the old timestamp with the new one for all files. Instead, what you want to use is rsync -rc: this uses a checksum to figure out which files have changed, and will not preserve the timestamp but rather use the timestamp of rsync, which is still okay — theoretically, I think rsync -ac should work, since it should only preserve the timestamp only of the files that were modified, but since the serving files are still all meant to have the same permissions, and none be links, I found being minimal made sense.

So anyway, I’ll hopefully have some more data soon about the bandwidth saving. I’m also following up with whatever may not be supporting properly If-Modified-Since, and filing bugs for those software/services that allow it.

Update (2017-08-23): since now it’s a few days since I fixed up the Apache configuration, I can confirm that the daily bandwidth used by “viewed hits” (as counted by Awstats) went down to ⅓ of what it used to be, to around 60MB a day. This should be accounting not only for the feed readers now properly getting a 304, but also for browsers of readers who no longer have to fetch the full page when, for instance, replying to comments. Googlebot also is getting a lot more 304, which may actually have an impact on its ability to keep up with the content, so I guess I will report back.

Does your webapp really need network access?

One of the interesting thing that I noticed after shellshock was the amount of probes for vulnerabilities that counted on webapp users to have direct network access. Not only ping to known addresses to just verify the vulnerability, or wget or curl with unique IDs, but even very rough nc or even /dev/tcp connections to give remote shells. The fact that probes are there makes it logical to me to expect that for at least some of the systems these actually worked.

The reason why this piqued my interest is because I realized that most people don’t do the one obvious step to mitigate this kind of problems by removing (or at least limiting) the access to the network of their web apps. So I decided it might be a worth idea to describe a moment why you should think of that. This is in part because I found out last year at LISA that not all sysadmins have enough training in development to immediately pick up how things work, and in part because I know that even if you’re a programmer it might be counterintuitive for you to think that web apps should not have access, well, to the web.

Indeed, if you think of your app in the abstract, it has to have access to the network to serve the response to the users, right? But what happens generally is that you have some division between the web server and the app itself. People who have looked into Java in the early nougthies probably have heard of the term Application Server, which usually is present in form of Apache Tomcat or IBM WebSphere, but here is essentially the same “actor” for Rails app in the form of Passenger, or for PHP with the php-fpm service. These “servers” are effectively self-contained environments for your app, that talk with the web server to receive user requests and serve them responses. This essentially mean that in the basic web interaction, there is no network access needed for the application service.

Things gets a bit more complicated in the Web 2.0 era though: OAuth2 requires your web app to talk, from the backend, with the authentication or data providers. Similarly even my blog needs to talk with some services, to either ping them to tell them that a new post is out, and to check with Akismet for blog comments that might or might not be spam. WordPress plugins that create thumbnails are known to exist and to have a bad history of security and they fetch external content, such as videos from YouTube and Vimeo, or images from Flickr and other hosting websites to process. So there is a good amount of network connectivity needed for web apps too. Which means that rather than just isolating apps from the network, what you need to implement is some sort of filter.

Now, there are plenty of ways to remove access to the network from your webapp: SElinux, GrSec RBAC, AppArmor, … but if you don’t want to set up a complex security system, you can do the trick even with the bare minimum of the Linux kernel, iptables and CONFIG_NETFILTER_XT_MATCH_OWNER. Essentially what this allows you to do is to match (and thus filter) connections based of the originating (or destination) user. This of course only works if you can isolate your webapps on a separate user, which is definitely what you should do, but not necessarily what people are doing. Especially with things like mod_perl or mod_php, separating webapps in users is difficult – they run in-process with the webserver, and negate the split with the application server – but at least php-fpm and Passenger allow for that quite easily. Running as separate users, by the way, has many more advantages than just network filtering, so start doing that now, no matter what.

Now depending on what webapp you have in front of you, you have different ways to achieve a near-perfect setup. In my case I have a few different applications running across my servers. My blog, a WordPress blog of a customer, phpMyAdmin for that database, and finally a webapp for an old customer which is essentially an ERP. These have different requirements so I’ll start from the one that has the lowest.

The ERP app was designed to be as simple as possible: it’s a basic Rails app that uses PostgreSQL to store data. The authentication is done by Apache via HTTP Basic Auth over HTTPS (no plaintext), so there is no OAuth2 or other backend interaction. The only expected connection is to the PostgreSQL server. Pretty similar the requirements for phpMyAdmin: it only has to interface with Apache and with the MySQL service it administers, and the authentication is also done on the HTTP side (also encrypted). For both these apps, your network policy is quite obvious: negate any outside connectivity. This becomes a matter of iptables -A OUTPUT -o eth0 -m owner --uid-owner phpmyadmin -j REJECT — and the same for the other user.

The situation for the other two apps is a bit more complex: my blog wants to at least announce that there are new blog posts, and it needs to reach Akismet; both actions use HTTP and HTTPS. WordPress is a bit more complex because I don’t have much control over it (it has a dedicated server, so I don’t have to care), but I assume it mostly is also HTTP and HTTPS. The obvious idea would be to allow ports 80, 443 and 53 (for resolution). But you can do something better. You can put a proxy on your localhost, and force the webapp to go through it, either as a transparent proxy or by using the environment variable http_proxy to convince the webapp to never connect directly to the web. Unfortunately that is not straight forward to implement as neither Passenger not php-fpm has a clean way to pass environment variables per users.

What I’ve done is for now is to hack the environment.rb file to set ENV['http_proxy'] = '' so that Ruby will at least respect it. I’m still out for a solution for PHP unfortunately. In the case of Typo, this actually showed me two things I did not know: when looking at the admin dashboard, it’ll make two main HTTP calls: one to Google Blog Search – which was shut down back in May – and one to Typo’s version file — which is now a 404 page since the move to the Publify name. I’ll be soon shutting down both implementations since I really don’t need it. Indeed the Publify development still seems to go toward the “let’s add all possible new features that other blogging sites have” without considering the actual scalability of the platform. I don’t expect me to go back to it any time soon.

SSL Postmortem redux

It is a funny coincidence that the week I’m at LISA ‘13 I’m doing so much work on my own servers. It might be because I’m not spending my time in front of a computer at work like I usually do. It might be because I got unlucky and my SSL certificates failed at the wrong time.

Again, Johann pointed me on Twitter to the SSL Labs page for my blog, that noted how only a bunch of OS/Software combination fails for no SNI — but that made me notice that the website said that TLSv1.1 and TLSv1.2 were not enabled, although I was ready to swear I configured it to enable all TLS. And a quick check in my Puppet master shows that my idea was right:

SSLProtocol TLSv1.2 TLSv1.1 TLSv1

So what is going on? Well, the Apache logs don’t tell you anything about what’s going on, so I decided to try empirically and move the order:

SSLProtocol TLSv1 TLSv1.1 TLSv1.2

This worked with TLS 1.2 but not with 1.0 — which is pretty bad as most browsers do not support 1.2, only the newest ones do. Okay so what’s going on? Well, Turns out that this, taken from the Apache documentation, works:

SSLProtocol All -SSLv2 -SSLv3

And that’s what I have in my configuration right now; this also means it works very very well if a new version of TLS becomes supported, it will added. So, listen to my advice and do that!

A side note: turns out that IE6 on XP not only does not support SNI, but also it does not support any TLS protocol version (just SSLv3) which means it hasn’t been able to reach my blog for about an year already.

So I decided to look at the Apache source code, and it turns out that their documentation does not make it clear: unless you add a + in front of the protocol version, the last entry is the catch-all. And there is no warning when that happens. There is no example for not using All in the docs for Apache 2.2 — it’s actually even worse with the documentation for Apache 2.4 as the example now only enables TLSv1 and that’s it.

I’ll try to send a patch for either their documentation or the code to issue a warning when that setting is misused. In the mean time, please keep this in mind.

P.S.: seems like Readability has a problem with SNI and is now failing to fetch my articles. I’ve already contacted them about this and hopefully they’ll figure out how to fix it soon.

SNI Quest: how’s the support?

After yesterday’s incident my blog and all the other apps I’ve been hosting have moved to use SNI certificates (a downgrade to Class 1 from Class 2, but that’s okay).

SNI is still considered a partially experimental feature nowadays because Windows XP is unfortunately still a thing. Luckily for me, it doesn’t seem like I have many Windows XP users — and the few that are there are probably okay with using either Chrome, Firefox or Opera, all of which use their own implementation of SSL (two using NSS), that supports SNI just fine.

Internet Explorer uses the operating system level libraries, which are not capable of using SNI at all, even if you updated to IE8. With a bit of luck, this will also mean fewer spammers using real WinXP-based browsers will be able to post. I don’t hold my breath, but it’s still possible. A few spammers were kicked off by the HTTPS move after all, so who knows.

What turned out to be interesting is the support for dropping SNI-backed links into various web apps out there — the kind of test I’ve done many times before while testing my ModSecurity Rules. The results have been interesting. All the major websites, and RSS readers, seem to handle this pretty well, with two main exceptions.

LinkedIn has probably the worst HTTP client implementation I’ve seen on a serious web app. I already opened a ticket with them before because their fetcher does not use compressed answers. This is pretty bad, considering that non-compressed answers mean a multiple times increase, and since this is traffic upstream from your server, it means that you are paying for LinkedIn’s laziness.

Due to this, LinkedIn links to my blog were already showing a (wrong) 403 message (the actual error they would get is 406, but then they process is wrongly, and I don’t care much about that). With the new SNI certificate, the LinkedIn fetcher now can only report the hostname of my blog, and no log in Apache can be found about it, which makes me guess that they try to validate the connection’s certificate, and fail.

NewsBlur is interesting as well. At first it seemed to me like it was not supporting SNI, as the settings page for my blog’s feed showed “401 Bad URL” error messages — without any matching log in Apache, which meant that the SSL connection was not completed either. On the other hand, the feed is fetched. While Samuel at first said that he did not care enough to implement SNI support for just one customer, and that made me look for alternatives for half an hour, he’s been very helpful with debugging a bit around it. Turns out that the problem is only for real-page fetching, and I haven’t spent much more time than this working on it. If somebody wants to look at it I’m happy to point you to what’s going on.

Luckily, Python’s httplib does not verify the certificates, which means Planet Gentoo still works. I’ve not checked Planet Multimedia yet — but at least that one if it fails I can fix.

What happened to my SSL certificates? A personal postmortem

I know that for most people this is not going to be very interesting, but my current job is teaching me that it’s always a good idea to help people learn from your own mistakes; especially so if you let others comment on said mistakes to see what you could have done better. So here it goes.

Let’s start to say that I’m an idiot. Last month I was clever enough to update the certificate for xine-project which was almost to expire. Unfortunately, I wasn’t so clever as to notice that the rest of my certificates were going to expire give or take at the same time. Nor I went remembering that my StartSSL verification was expiring, as last year I was in the US when that happened, and I had some trouble as my usual Italian phone number was unavailable. I actually got a notification that my certificate was expiring already when I was in London, last week. I promised myself to act on it as soon as I would get home to Dublin, but of course I ended up forgetting about it.

And then this morning came, when I got notified via Twitter that my blog’s certificate expired. And then the panic. I’m not in Dublin; I’m not in Ireland, I’m not in Europe even. I’m in Washington, DC at LISA ‘13, without either my Italian or US phone number, without my client certificate, which was restricted to my Dell laptop which is sitting in my living room in Dublin, and of course, no longer living in Italy!

Thankfully, the StartSSL support are great guys, and while they couldn’t verify me for a Class 2 as I was before right away, I got at least further enough to be able to get new Class 1 certificates, and start the process for Class 2 re-verification. Unfortunately, Class 1 means that I can’t have multiple hostnames for the cert, or even wildcard certificates. So I decided to bit the bullet and go with SNI certificates, which basically means that each vhost now has its own certificate. Which is fine, just a bit more convoluted to set up, as I had to create a number of Certificate Signature Request (CSR) as letting StartSSL generate the keys as 4096 bit SHA-256 RSA takes a very long time.

Unfortunately, SNI means that there are a few people who won’t be able to access my blog any more, although most of them were already disallowed from commenting thanks to my ModSecurity Ruleset as they would be Windows XP with Internet Explorer (any version, my ruleset would only stop IE6 from commenting). There probably are some issues for people stuck with Android 2 and the default browser. I’m sorry for you guys, I think Opera Mobile would work fine for it, but feel free to scream at me that being the case.

Unfortunately, there seems to be trouble with Firefox and with Safari at this point: both these browsers enabled OCSP by default quite a while ago, but newly minted certificates from StartSSL will fail the OCSP check for a few hours. Also there seems to be an issue with Firefox on Android, where SNI is not supported, or maybe it’s just the same OCSP problem which leads to a different error message, I’m not sure. Chrome, Safari on iOS and Opera all work fine.

What still needs to be found out is whether Planet Gentoo and NewsBlur will handle this properly. I’m not sure yet but I’m sure I’ll find out pretty soon. Some offline RSS readers could also not support SNI — that being the case, rather than just complaining to me, let upstream know that they are broken, I’m sure somebody is going to have a good fun with that.

Before somebody points out I should have alerts about certificate expiration, yes I know. I used to have these set up on the Icinga instance that was used by my previous employer, but ever since I haven’t set up anything new for that. I’m starting to do so as we speak, by building Icinga for my Puppetmaster host. I’m also going to write on my calendar to make sure to update the certificates before they expires, as for the OCSP problem noted above.

Questions and comments are definitely welcome, suggestions on how to make things better are too, and if you use Flattr remember to use your email address, as good suggestions will be rewarded!

The WebP experiment

You might have noticed over the last few days that my blog underwent some surgery, and in particular that some even now, on some browsers, the home page does not really look all that well. In particular, I’ve removed all but one of the background images and replaced them with CSS3 linear gradients. Users browsing the site with the latest version of Chrome, or with Firefox, will have no problem and will see a “shinier” and faster website, others will see something “flatter”, I’m debating whether I want to provide them with a better-looking fallback or not; for now, not.

But this was also a plan B — the original plan I had in mind was to leverage HTTP content negotiation to provide WebP variants of the images of the website. This was a win-win situation because, ludicrous as it was when WebP was announced, it turns out that with its dual-mode, lossy and lossless, it can in one case or the other outperform both PNG and JPEG without a substantial loss of quality. In particular, lossless behaves like a charm with “art” images, such as the CC logos, or my diagrams, while lossy works great for logos, like the Autotools Mythbuster one you see on the sidebar, or the (previous) gradient images you’d see on backgrounds.

So my obvious instinct was to set up content negotiation — I’ve used it before for multiple-language websites, I expected it to work for multiple times as well, as it’s designed to… but after setting all up, it turns out that most modern web browsers still do not support WebP *at all*… and they don’t handle content negotiation as intended. For this to work we need either of two options.

The first, best option, would be for browsers only Accept the image formats they support, or at least prefer them — this is what Opera for Android does: Accept: text/html, application/xml;q=0.9, application/xhtml+xml, multipart/mixed, image/png, image/webp, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1 but that seems to be the only browser doing it properly. In particular, in this listing you’ll see that it supports PNG, WebP, JPEG, GIF and bimap — and then it accepts whatever else with a lower reference. If WebP was not in the list, even if it had an higher preference for the server, it would not be sent to the client. Unfortunately, this is not going to work, as most browsers send Accept: */* without explicitly providing the list of supported image formats. This includes Safari, Chrome, and MSIE.

Point of interest: Firefox does explicit one image format before others: PNG.

The other alternative is for the server to default to the “classic” image formats (PNG, JPEG, GIF) and then expect the browsers supporting WebP prioritizing it over the other image formats. Again this is not the case; as shown above, Opera lists it but does not prioritize, and again, Firefox prioritizes PNG over anything else, and makes no special exception for WebP.

Issues are open at Chrome and Mozilla to improve the support but they haven’t reached mainstream yet. Google’s own suggested solution is to use mod_pagespeed instead — but this module – which I already named in passing in my post about unfriendly projects – is doing something else. It’s on-the-fly changing the content that is provided, based on the reported User-Agent.

Given that I’ve spent some time on user agents, I would say I have the experiences to say that this is a huge pandora’s vase. If I have trouble with some low-development browsers reporting themselves as Chrome to fake their way in with sites that check the user agent field in JavaScript, you can guess how many of those are going to actually support the features that PageSpeed thinks they support.

I’m going to go back to PageSpeed in another post, for now I’ll stop to say that WebP has the numbers to become the next generation format out there, but unless browser developers, as well as web app developers start to get their act straight, we’re going to have hacks over hacks over hacks for the years to come… Currently, my blog is using a CSS3 feature with the standardized syntax — not all browsers understand it, and they’ll see a flat website without gradients; I don’t care and I won’t start adding workarounds for that just because (although I might use SCSS which will fix it for Safari)… new browsers will fix the problem, so just upgrade, or use a sane browser.

Passive web logs analysis. Replacing AWStats?

You probably don’t know that, but for my blog I do analyse the Apache logs with AWStats over and over again. This is especially useful at the start of the month to identify referrer spam and other similar issues, which in turn allows me to update my ModSecurity ruleset so that more spammers are caught and dealt with.

To do so, I’ve been using for, at this point, years, AWStats which is a Perl analyzer, generator and CGI application. It used to work nicely, but nowadays it’s definitely lacking. It doesn’t filter referrers search engines as much as it used to be (it’s still important to filter out requests coming from Google, but newer search engines are not recognized), and most of the new “social bookmark” websites are not there at all — yes it’s possible to keep adding to them, but with upstream not moving, this is getting harder and harder.

Even more important, for my ruleset work, is the lack of identification of modern browsers. Things like Android versions and other fringe OSes would be extremely useful for me, but adding support for all of them is a pain and I have enough things on my plates that this is not something I’m looking forward to tackle myself. It’s even more bothersome when you consider that there is no way to reconsider the already analyzed data, if a new URL is identified as a search engine, or an user agent a bot.

One of the most obvious choices for this kind of work is to use Google Analytics — unfortunately, this means that it will only work if it’s not blacklisted from the user side — that includes NoScript users and of course most of the spammers. So this is not a job for them. It’s something that has to be done on the backend, on the logs side.

The obvious point at that point is to find something capable to import the data out of the current awstats datafiles I got, and keep importing data from the Apache log files. Hopefully this should be done by saving the data in a PostgreSQL database (which is what I usually use); native support for vhost data, but the ability to collapse it in a single view would also be nice.

If somebody knows of a similar piece of software, I’d love to give it a try — hopefully, something that is written in Ruby or Perl might be the best for me (because I can hack on those) but I wouldn’t say no to Python or even Java (the latter if somebody helped me making sure the dependencies are all packed up properly). This will bring you better modsec rules, I’m sure!

Apache, Passenger, Rails: log shmock

You might or might not remember my fighting with mod_perl and my finding a bug in the handling of logs if Apache’s error log is set to use the syslog interface (which in my case would be metalog). For those wondering the upstream bug is still untouched goes without saying. This should have told me that there aren’t many people using Apache’s syslog support, but sometimes I’m stubborn.

Anyway, yesterday I finally put into so-called “production” the webapp I described last week for handling customers’ computers. I got it working in no time after mongoid started to behave (tests are still restricted, because a couple fail and I’m not sure why — I’ll have to work on that with the next release that require quite fewer hacks to test cleanly). I did encounter a nasty bug in best_in_place which I ended up fixing in Gentoo even though upstream hasn’t merged my branch yet.

To get it in “production” I simply mean configuring it to run on the twin server of this blog’s, which I’ve been using for another customer as well — and got ready for a third. Since Rails 3.1 was already installed on that box, it was quite easy to move my new app there. All it took was installing the few new gems I needed and…

Well here’s the interesting thing: I didn’t want for my application to run as my user, while obviously I wanted to check out the sources with my user so that I could get it to update with git … how do you do that? Well, Passenger is able to run the application under whatever user owns the config/environment.rb file, so you’d expect it to be able to run under an arbitrary user as well — which is the case, but only if you’re using version 3 (which is not stable in Gentoo as of yet).

So anyway I set up the new passenger to change the user, make public/assets/ and another directory I write to group-writable (the app user and my user are in the same group), and then I’m basically done, I think. I start up and I’m done with it, I think… but the hostnames tell me that “something went wrong”, without any clue as to what.

Okay so the default for Passenger is to not have any log at all, not a problem, I’ll just increase the level to 1 and see the error… or not? I still get no output in Apache’s error log .. which is still set to syslog… don’t tell me… I set Passenger to log to file, and lo and behold it works fine. I wonder if it’s time for me to learn Apache’s API and get to fix both, since it looks like I’m one of the very few people who would like to use syslog as Apache’s error log.

After getting Passenger to finally tell me what’s wrong, I find out both the reason why Rails wasn’t starting (I forgot to enable two USE flags in dev-ruby/barby which I use for generating the QR code on the label), but I also see this:

Rails Error: Unable to access log file. Please ensure that /var/www/${vhost}/log/production.log exists and is chmod 0666. The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
Please note that logging negatively impacts client-side performance. You should set your logging level no lower than :info in production.

What? Rails is really telling its users to create a world writeable log file, when it fails to write to it? Are they freaking kidding me? Is this really a suggestion coming from the developers of a framework for Web Applications which should be security-sensitive? … Okay so one can be smarter than them and do the right thing (in my case make sure that the log file is actually group-writeable) but if this is the kind of suggestions they find proper to tell you, it’s no wonder what happened with Diaspora. So it’s one more reason why Rails shouldn’t be for the faint hearted and that you should pay a very good sysadmin if you want to run a Rails application.

Oh and by the way the cherry on top of this is that instead of just sending the log to stderr, leaving it to Passenger to wrangle – which would have worked out nicely if Passenger had a way to distinguish which app the errors are coming from – Rails also moves the log level to warning, just to spite you. And then tells you that it impacts performances! Ain’t that lovely?

Plan for the day? If I find some extra free time I’d like to give a try and package (not necessarily in this order) syslogger so that the whole production.log thing can go away fast.

Configuring Visual Paradigm Server on Tomcat in Gentoo — Sorta

This post might offend some of you as I’m going to write about a piece of proprietary software. It’s software I’ve already discussed before: the UML modeller I’m using on my dayjobs as well as FLOSS work to help me out with design decisions. If you don’t care about non-Free software, feel free to skip this post.

A couple of months ago I discussed the trouble of getting JSP, Rails and mod_perl to play all together in the same pen Apache configuration. The reason why I had JSP in the mix was that the Visual Paradigm software I bought a couple of years ago is (optionally) licenses on a seat-counting floating license, which I much prefer to a single box’s license, as I often have to move around from one computer to the other.

Back in november, it seemed like I was going finally to work with someone else assisting me, so I bought a license for a second seat, and moved the license server (which is a JSP application, or to be precise a module in a complex JSP application) from my local Yamato to the server that serves this blog as well. The idea was that by making it accessible outside of my own network I could use it on my laptops as well as allowing a colleague to access it to coordinate design decisions.

Unfortunately I needed to make it run fast, and at the end of the day I didn’t set it up properly at all, just hacky enough to work… until a Tomcat update arrived. With the update, the ROOT web application was replaced by Tomcat’s own, taking the place of the VPServer application… and all hell broke loose. I didn’t have time to debug this up to today, when I really felt the need to have my UML models in front of me again, so I decided to spend some time to understand how to set this up properly.

My current final setup is something like this: Apache is the frontend server, it handles SSL and proxies the host – – to the Tomcat server. The Tomcat server is configured with an additional WebApp (rather than replacing the ROOT one) for the VPServer application, which means that I have to do some trickery with mod_rewrite to get the URLs straightened out (Visual Paradigm does not like it if the license server is not top-level, but the admin interface does not like if it’s accessed with a different prefix between Tomcat and Apache).

The application does not only provide floating license entrypoints, it also performs duties for three other modules, mostly collaborative designing tools that need to be purchased separately to be enabled, which I don’t really care about. Possibly for this it allows more than just file-based data storage, which is still the default. You can easily select a MySQL or PostgreSQL instance to store the data — in my case I decided to use PostgreSQL, since the server already had one running, and I’m very happy to lift the I/O task of managing storage from the Java process. For whatever reason, though, the JDBC connector for PostgreSQL is unable to connect to the unix socket path, so I had to sue TCP over localhost. Nothing major, just bothersome.

At the end of the day, all I needed to do was fetching the VPServer WebApp package (.zip file), extract it and move its ROOT/ directory to /var/lib/tomcat-6/webapps/VPServer, make the WEB-INF sub-directory writable to tomcat user (I don’t like it but it seems to be required — the code would like to write to the whole directory structure to be able to auto-update, I’m not really keen on that), and then configure Apache this way:

DocumentRoot /var/lib/tomcat-6/webapps/VPServer

<Directory /var/lib/tomcat-6/webapps/VPServer>
Order allow,deny
Allow from all

<Directory /var/lib/tomcat-6/webapps/VPServer/WEB-INF>
Order deny,allow
Deny from all

RewriteEngine On

RewriteRule ^/VPServer(.*)$ $1 [PT]

ProxyPassMatch ^/(VPServer|images).*$ !
ProxyPassMatch ^/.*\.css$ !
ProxyPass / ajp://localhost:8009/VPServer/
ProxyPassReverse / ajp://localhost:8009/VPServer/

SecRuleRemoveById flameeyes-2

(Before you ask, the SecRuleRemoveById above is just for documentation: the problem was that up to a couple of versions ago, Visual Paradigm left the default Java User-Agent string, which was filtered by my ModSecurity ruleset — nowadays it reports properly more details about its version and the operating system it is running on.)

The end result is pleasant, finally: with all this in mind it should be possible for me to create a (non-QA-compliant, unfortunately) ebuild for the VPServer software for my overlay, to avoid managing it all by myself, risking to forget how to set it up properly. I’m afraid though that it’ll take me much time to properly unbundle all the JARs, but in the mean time I can at least make it easier for me to update it.