This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement. It’s effectively a very sketched proposal that comes with no design attached, but if you have time you would like to spend learning something new, but no idea what to do, it may be a good fit for you.
I have been commenting on Twitter a bit about the lack of decent tooling to deal with Apache HTTPD’s Combined Logging Format (inherited from NCSA). For those who do not know about it, this is hte format used by standard access_log files, which include information about requests, including the source IP, the time, the requested path, the status code and the User-Agent used.
These logs are useful for debugging but are also consumed by tools such as AWStats to produce useful statistics about the request patterns of a website. I used these extensively when writing my ModSecurity rulesets, and I still keep an eye out on them for instance to report wasteful feed readers.
The files are simple text files, and that makes it easy to act on them: you can use tail and grep, and logrotate needs no special code beside moving the file and reloading Apache to have it re-open the paths. This makes it hard to query for particular entries in fields, such as to get the list of User-Agent strings present in a log. Some of the suggestions I got over Twitter to solve this were to use awk, but as it happens, these logs are not actually parseable with a straightforward field separation.
Lacking finding a good set of tools to handle these formats directly, I have been complaining that we should probably start moving away from simple text files into more structured log formats. Indeed, I know that there used to be at least some support for logging directly to MySQL and other relational databases, and that there are more complicated machinery often used by companies and startups that process these access logs into analysis software and so on. But all of these tend to be high overhead, much more than what I or someone else with a small personal blog would care about implementing.
Instead I think it’s time to start using structured file logs. A few people including thresh from VideoLAN suggested using JSON to write the log files. This is not a terrible idea, as the format is at least well understood and easy to interface with most other software, but honestly I would prefer something with an actual structure, a schema that can be followed. Of course I’m not meaning XML, and I would rather suggest having a standardized schema for proto3. Part of that I guess is because I’m used to use this at work, but also because I like the idea of being able to just define my schema and have it generate the code to parse the messages.
Unfortunately currently there is no support or library to access a sequence of protocol buffer messages. Using a single message with repeated sub-messages would work, but it is not append-friendly so there is no way to just keep writing this to a file, and being able to truncate and resume writing to it, which is a property needed for a proper structured log format to actually fit in the space previously occupied by text formats. This is something I don’t usually have to deal with at work, but I would assume that a simple LV (Length-Value) or LVC (Length-Value-Checksum) encoding would be okay to solve this problem.
But what about other properties of the current format? Well, the obvious answer is that, assuming your structured log contains at least as much information (but possibly more) as the current log, you can always have tools that convert on the fly to the old format. This would for instance allow to have a special tail-like command and a grep-like command that provides compatibility with the way the files are currently looked at manually by your friendly sysadmin.
Having more structured information would also allow easier, or deeper analysis of the logs. For instance you could log the full set of headers (like ModSecurity does) instead of just the referrer and User-Agent. And allow for customizing the output on the conversion side rather than lose the details when writing.
Of course this is just one possible way to solve this problem, and just because I would prefer working with technologies that I’m already friendly with it does not mean I wouldn’t take another format that is similarly low-dependency and easy to deal with. I’m just thinking that the change-averse solution of not changing anything and keeping logs in text format may be counterproductive in this situation.