
Free Idea: structured access logs for Apache HTTPD

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement. It’s effectively a very sketched proposal that comes with no design attached, but if you have time you would like to spend learning something new, but no idea what to do, it may be a good fit for you.

I have been commenting on Twitter a bit about the lack of decent tooling to deal with Apache HTTPD’s Combined Log Format (inherited from NCSA). For those who do not know it, this is the format used by the standard access_log files, which record information about each request: the source IP, the time, the requested path, the status code and the User-Agent, among other fields.

These logs are useful for debugging, but they are also consumed by tools such as AWStats to produce useful statistics about the request patterns of a website. I used them extensively when writing my ModSecurity rulesets, and I still keep an eye on them, for instance to report wasteful feed readers.

The files are simple text files, and that makes it easy to act on them: you can use tail and grep, and logrotate needs no special code besides moving the file and reloading Apache so that it re-opens the paths. On the other hand, that same simplicity makes it hard to query for particular fields, such as getting the list of User-Agent strings present in a log. Some of the suggestions I got over Twitter were to use awk, but as it happens, these logs are not actually parseable with a straightforward field separation.
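To make the point concrete, here is a minimal Python sketch (the regex and the sample line are mine, not something shipped with Apache) showing why whitespace-based splitting breaks: the request line, referrer and User-Agent are quoted strings that routinely contain spaces, so you need a parser that understands the quoting.

```python
import re

# Combined Log Format: quoted fields (request, referrer, User-Agent) can contain
# spaces, so awk-style field splitting misparses them. Note that Apache escapes
# embedded quotes as \" which this simple [^"]* pattern does not handle.
CLF_COMBINED = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)

line = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /feed.xml HTTP/1.1" 200 2326 '
        '"-" "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"')

match = CLF_COMBINED.match(line)
if match:
    entry = match.groupdict()
    print(entry["user_agent"])  # the User-Agent alone spans several whitespace "fields"
```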

Not having found a good set of tools to handle these formats directly, I have been arguing that we should probably start moving away from simple text files towards more structured log formats. Indeed, I know there used to be at least some support for logging directly to MySQL and other relational databases, and that there is more complicated machinery, often used by companies and startups, that processes these access logs into analysis software and so on. But all of these tend to carry a high overhead, much more than what I or someone else with a small personal blog would care to implement.

Instead I think it’s time to start using structured file logs. A few people, including thresh from VideoLAN, suggested using JSON to write the log files. This is not a terrible idea, as the format is at least well understood and easy to interface with from most other software, but honestly I would prefer something with an actual structure, a schema that can be followed. Of course I don’t mean XML; I would rather suggest a standardized schema for proto3. Part of that, I guess, is because I’m used to using it at work, but also because I like the idea of just defining my schema and having it generate the code to parse the messages.
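As a rough sketch of what that buys you: once a schema is defined and compiled with protoc, the generated code handles all the serialization and parsing, and application code only touches typed fields. The module name access_log_pb2, the AccessLogEntry message and its field names below are invented for illustration; only the SerializeToString/ParseFromString calls are the standard protobuf Python API.

```python
# Assumes a hypothetical access_log.proto compiled with protoc into
# access_log_pb2; the message and field names here are illustrative only.
from access_log_pb2 import AccessLogEntry

entry = AccessLogEntry()
entry.remote_addr = "203.0.113.7"
entry.request_path = "/feed.xml"
entry.status = 200
entry.user_agent = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"

blob = entry.SerializeToString()   # compact binary encoding of one record

parsed = AccessLogEntry()
parsed.ParseFromString(blob)       # round-trip without any hand-written parsing
print(parsed.status, parsed.user_agent)
```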

Unfortunately there is currently no support or library for accessing a sequence of protocol buffer messages. Using a single message with repeated sub-messages would work, but it is not append-friendly: there is no way to just keep writing records to a file, and to truncate it and resume writing, which is a property a structured log format needs if it is to fit in the space currently occupied by text formats. This is something I don’t usually have to deal with at work, but I would assume that a simple LV (Length-Value) or LVC (Length-Value-Checksum) encoding would be enough to solve this problem.
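A minimal sketch of what such an LVC framing could look like, assuming each record is an already-serialized message (plain bytes stand in here); the 4-byte length prefix and CRC32 checksum are my own arbitrary choices, not an established format:

```python
import struct
import zlib

# Each record is framed as: 4-byte big-endian length, 4-byte CRC32, payload.
# The payload would normally be a serialized protobuf message; raw bytes are
# used here to keep the sketch self-contained.

def append_record(path: str, payload: bytes) -> None:
    header = struct.pack(">II", len(payload), zlib.crc32(payload))
    with open(path, "ab") as fh:        # append-only: safe to keep adding records
        fh.write(header + payload)

def read_records(path: str):
    with open(path, "rb") as fh:
        while True:
            header = fh.read(8)
            if len(header) < 8:         # clean EOF or a truncated final header
                return
            length, checksum = struct.unpack(">II", header)
            payload = fh.read(length)
            if len(payload) < length:   # writer was interrupted mid-record
                return
            if zlib.crc32(payload) != checksum:
                return                  # corrupted tail: stop, earlier records are intact
            yield payload

append_record("access.lvc", b'{"path": "/feed.xml", "status": 200}')
for record in read_records("access.lvc"):
    print(record)
```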

But what about the other properties of the current format? Well, the obvious answer is that, assuming your structured log contains at least as much information as the current log (and possibly more), you can always have tools that convert on the fly to the old format. This would for instance allow for a tail-like command and a grep-like command that stay compatible with the way the files are currently inspected manually by your friendly sysadmin, as sketched below.
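The compatibility shim is essentially a formatting step. Here is a hedged sketch where a dict stands in for a parsed structured record (field names are invented): anything extra, such as a full set of headers, is simply ignored when rendering the classic combined line, so a tail-like wrapper can keep the familiar view.

```python
# Hypothetical structured record, standing in for a parsed log message;
# field names are illustrative. Extra fields (e.g. full headers) are simply
# dropped when rendering the old combined format.
record = {
    "remote_addr": "203.0.113.7",
    "remote_user": "-",
    "time": "10/Oct/2023:13:55:36 +0000",
    "request": "GET /feed.xml HTTP/1.1",
    "status": 200,
    "bytes_sent": 2326,
    "headers": {
        "Referer": "-",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
    },
}

def to_combined(rec: dict) -> str:
    """Render a structured record as a classic combined log line."""
    headers = rec.get("headers", {})
    return '{addr} - {user} [{time}] "{req}" {status} {size} "{ref}" "{ua}"'.format(
        addr=rec["remote_addr"],
        user=rec["remote_user"],
        time=rec["time"],
        req=rec["request"],
        status=rec["status"],
        size=rec["bytes_sent"],
        ref=headers.get("Referer", "-"),
        ua=headers.get("User-Agent", "-"),
    )

print(to_combined(record))
```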

Having more structured information would also allow easier, or deeper, analysis of the logs. For instance, you could log the full set of headers (like ModSecurity does) instead of just the referrer and User-Agent, and customize the output on the conversion side rather than losing the details at write time.

Of course this is just one possible way to solve the problem, and just because I would prefer working with technologies I’m already familiar with does not mean I wouldn’t take another format that is similarly low-dependency and easy to deal with. I just think that the change-averse solution of not changing anything and keeping logs in text format may be counterproductive in this situation.

Comments 7
  1. We’ve used LogStash and ElasticSearch to do things like this… but it’s a behemoth, very hungry on RAM and system resources. My immediate thought on this though was: what about using SQLite3? Yes, it’s a relational database, but it’s a fairly self-contained engine that is light on resources and setup is trivial. Tailing logs is a challenge there, but it does give you that querying ability. Looking around for implementations, I found this analysis tool which just reads existing logs for querying: https://steve.fi/Software/a… Perhaps a logging plug-in could be made that logs direct to that format? The other option is as you say, JSON logging… or YAML, since one file can natively have multiple “documents”, which can be presented in a human-readable form.

    1. The ELK stack (Logstash/Filebeat – Elasticsearch – Kibana) becomes much more useful when you have more servers and applications. We write structured logs (in json) from Varnish, Apache and plain text logs from syslog, auth.log, etc. to files. Filebeat watches those files and sends new events to logstash, which does some processing (normalizing timestamps, adding geoip info, analysing user-agent field, etc.) then sends them to ES.

      ES can handle moving older events from fast hot to cheaper warm databases, expiring old events, etc.

      Kibana is great for searching and visualizing and we can easily find patterns across servers or see requests progress across applications (e.g. from our SSL offloader to Varnish to Apache to PHP-FPM).

      All in all it’s a nice stack, especially if your logs are already structured, and cheaper than tools like Splunk.

  2. RecordIO seems to be (at least partially) available via a Google optimisation/data analysis package.

  3. Have a link? I have avoided naming that explicitly because I couldn’t find if we published anything about it. I found we wrote about Capacitor, but that’s a post-processed format.

  4. Uh I like that tool, I should check if we have it in Gentoo or package it because it looks useful. But I don’t think sqlite3 is a good option for this: it requires journaling and it’s not easily truncated for rotation. For logs you really want the server to always do the least amount of work. Sending the data to another service fits the needs, but using a full SQL database doesn’t. But as you said, using LogStash and ElasticSearch is overkill for most people.

  5. Yeah, log rotate in the traditional sense would have to be replaced by “dump to some archive format then do `DELETE FROM records WHERE timestamp < ${SOME_TIME}; COMMIT;`”. Not elegant, but doable. Not sure if Apache supports sending it to `syslog`, as some of the `syslog` daemons (`rsyslog` for example) have plug-ins that might fit the bill. From an overhead point-of-view, a CoAP (instead of HTTP… so you don’t have to set up and tear down a TCP connection) service that just accepts a JSON blob and stashes it would be the go in terms of preventing the server from spending too much time on logging.
