Progress Logging and Results Logging

If there is one thing that my role as a software mechanic keeps drawing me to, it’s the importance of logging information. Logging is one of those areas that tend to bring up strong opinions, and, expanded into the wider area of observability, it has spawned entire businesses (shout out to friends at Honeycomb.io). But even at smaller scales I have found myself caring about logging, whether by setting up complex routing with metalog, or by hoping for a way to access Apache logs in a structured format.

Obviously, when talking about logging in bubbles, there’s a lot more to consider than just which software you send the logs to — even smaller companies nowadays need to be careful with PII, since GDPR makes most data toxic to handle. I can definitely tell you that some of the analysis I used to do for User-Agent filtering would not pass muster for a company in the age of GDPR — in a very similar fashion to the pizzeria CRM.

But leaving aside the whole complicated legal landscape, there’s a distinction in logs that I have not seen well understood by engineers – no matter where they come from – and that is the difference between what I call progress logging and results logging. I say “what I call” because I have found a number of different categorizations of logs, but none that matches my thoughts on the matter, so I needed to give them names.

The distinctions I did hear people talk about are more along the lines of “debug logs” versus “request logs”, or “text logs” versus “binary logs”. But in too many cases this feels like mixing up medium and message — as I said in my post about Apache request logs, I would love structured (even binary would do) request logs, which are currently “simple” text logs.

Indeed, Apache request logs (and any other server’s) fit, to me, neatly in the category of results logging. They describe what happened when an action completed: the log entry for an HTTP request includes some information about the request and some information about the response. It records the result of what happened.

If you were to oversimplify this, you could log each full request and each full response, and call that results logging: a certain request resulted in a certain response. But I would expect that there is a lot more information available on the server, which does not otherwise make it into the response, for many different reasons (e.g. it might be information that the requestor is not meant to find out, or simply doesn’t need to know, and the response is meant to be as small as possible). In the case of an HTTP request to a server that acts as a reverse proxy, the requestor should not be told which backend handled the request — but it would be a useful thing to log as part of the result.

In practical terms, servers implementing results logging tend to accrue the information needed for the result log in data structures that are kept around for the lifetime of the request (or whatever other process), and then extract all of the required information from them at the time of generating the log.
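To make this concrete, here is a minimal sketch in Python (my own illustration, not code from any particular server): a per-request object accrues fields over the request’s lifetime, including server-only details such as the backend that handled it, and emits a single structured result entry at the end.

  import json
  import logging
  import time

  logging.basicConfig(level=logging.INFO)
  logger = logging.getLogger("results")


  class RequestResult:
      """Accrues fields over the lifetime of a request and emits one result entry."""

      def __init__(self, method, path):
          self._start = time.monotonic()
          self._fields = {"method": method, "path": path}

      def note(self, **fields):
          # Record facts as they become known (e.g. backend, cache status).
          self._fields.update(fields)

      def emit(self, status, response_bytes):
          self._fields["status"] = status
          self._fields["response_bytes"] = response_bytes
          self._fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 2)
          # One structured entry per completed request: this is the "result".
          logger.info(json.dumps(self._fields))


  # Hypothetical usage inside a request handler:
  result = RequestResult("GET", "/index.html")
  result.note(backend="app-server-2", cache="MISS")  # never sent back to the client
  result.emit(status=200, response_bytes=5120)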

This does mean that if the server terminates (either because it’s killed, the power goes off, or the request caused it to crash), and the result is never produced, then you don’t get a log about the request happening at all — this is the dirty secret of Apache request logs (and of many other servers): they’re called request logs but they actually log responses. There are ways around this, by writing parts of the results logs as they are identified – this helps both in terms of persistence and in terms of memory usage (if you’re storing something in memory just because you should be logging it later) – but that ends up getting much closer to the concept of tracing.

Progress logs, instead, are closer to what is often called shotgun debugging or printf() debugging. They are log statements emitted as the code passes through them, and they are usually free-form for the developer writing the code. This is what you get with libraries such as Python’s logging, and they can take a more or less structured form depending on a number of factors. For instance, you can have a single formatted string with maybe the source file and line, or you may have a full backtrace of where the log event happened and what the local variables in each of the function calls were. What usually makes you choose between the two is cost, and signal-to-noise ratio, of course.
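As a minimal example of the cheaper form, using Python’s standard logging module (the formatter fields are standard; the messages and the rewrite function are a made-up illustration):

  import logging

  # Include the source file and line of each progress message in the output.
  logging.basicConfig(
      level=logging.DEBUG,
      format="%(asctime)s %(levelname)s %(filename)s:%(lineno)d %(message)s",
  )
  log = logging.getLogger(__name__)


  def handle_rewrite(path):
      log.debug("considering rewrite for %s", path)
      if path.startswith("/old/"):
          new_path = "/new/" + path[len("/old/"):]
          log.debug("rewrite matched: %s -> %s", path, new_path)
          return new_path
      log.debug("no rewrite rule matched for %s", path)
      return path


  try:
      handle_rewrite("/old/index.html")
      raise ValueError("simulated failure")
  except ValueError:
      # The more expensive form: a full traceback attached to the log event.
      log.exception("request handling failed")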

For example, Apache’s mod_rewrite has a comprehensive progress log that provides a lot of detail about how each rewrite is executed, but if you turn it on, it’ll fill up your server’s filesystem fairly quickly, and it will also make the webserver’s performance go down the drain. You do want this log if you’re debugging an issue, but you most definitely don’t want it for every request. The same applies to results logs — take for instance ModSecurity: when I used to maintain my ruleset, I wouldn’t audit-log every request, but I had a special rule that, if a certain header was provided in the request, would turn on audit logging. This allowed me to identify problems when I was debugging a possible new rule.
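The actual ModSecurity rule is beside the point here, but the same pattern, expensive audit logging that is off by default and switched on per request by a special header, can be sketched in Python; the header name and the fields dumped are purely hypothetical:

  import logging

  logging.basicConfig(level=logging.INFO)
  log = logging.getLogger("audit")

  # Hypothetical header used to opt a single request into full audit logging.
  AUDIT_HEADER = "X-Debug-Audit"


  def handle_request(headers, body):
      audit = headers.get(AUDIT_HEADER) == "1"
      if audit:
          # Expensive: dump the full set of request headers and the body.
          log.info("audit: headers=%r", headers)
          log.info("audit: body=%r", body)
      # ... normal request handling would go here ...
      status = 200
      if audit:
          log.info("audit: status=%d", status)
      return status


  handle_request({"Host": "example.com", AUDIT_HEADER: "1"}, b"payload")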

Unfortunately, my experience straddling open-source development and industry bubbles means I don’t have much hope for an easy way to implement logging “correctly”, both because “correctly” is subjective, and because I really haven’t found a good way to do this that scales all the way from a simple tool like my pdfrename to a complex Cloud-based solution. Indeed, while the former generally cares less about structured logs and request tracing, a Cloud-based application like my planned-and-never-implemented Tanuga would get a significant benefit from using OpenTelemetry to connect feed fetching and rendering.

Flexible and configurable logging libraries, such as those available for Python, Ruby, Erlang, and many more, provide a good “starting point”, but in my experience they don’t scale well across the boundary between the inside and the outside of an organization or unit. It’s a combination of problems similar to the schema issue and the RPC issue: within an organization you can build a convention of what you expect logs to be, and you can pay the cost of updating the configuration of all sorts of tools to do the right thing, but if you’re an end user, that’s unlikely to happen — and besides, sometimes it’s simply untested.

So it makes sense that, to this day, we still rely significantly on “simple”, unstructured text logs. They are the one universally accessible way to provide information to users. But I would argue that we would be better off building an ecosystem of pluggable, configurable backends, where the same tool, without being recompiled or edited, can be made to output simple text on the standard error stream, or to a more structured event log. Unfortunately, judging by how the FLOSS world took the idea of standardizing services’ behaviours with systemd, I doubt that’s going to happen any time soon in the wide world… but you can probably get away with it in big enough organizations that control what they run.
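As a rough illustration of what I mean by a pluggable backend (this is only a sketch, and the LOG_FORMAT environment variable is a hypothetical switch), the same tool could emit plain text on standard error or a structured JSON event log depending on configuration alone:

  import json
  import logging
  import os
  import sys


  class JsonFormatter(logging.Formatter):
      """Render each log record as a structured JSON event instead of plain text."""

      def format(self, record):
          return json.dumps({
              "time": self.formatTime(record),
              "level": record.levelname,
              "logger": record.name,
              "message": record.getMessage(),
          })


  def setup_logging():
      handler = logging.StreamHandler(sys.stderr)
      # Hypothetical switch: same tool, no recompilation or edits needed.
      if os.environ.get("LOG_FORMAT") == "json":
          handler.setFormatter(JsonFormatter())
      else:
          handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
      logging.basicConfig(level=logging.INFO, handlers=[handler])


  setup_logging()
  logging.getLogger("mytool").info("fetched %d feeds", 3)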

Also, as a fun related tidbit: verbose (maybe excessively so) progress logging is what made reverse engineering the OneTouch Verio so easy for me. On Windows the standard error stream is not usually readable… unless you run the application through a debugger. Once I did that, I could see every single part of the code as it processed the requests and responses for the device. Now, you might think that just hiding the logs by default, without documenting the flag to turn them on, would be enough — but as it turns out, as long as the logging calls are built into a binary, it’s not too hard to make sense of them while reverse engineering.

What this is meant to say is that, while easy access to logs is a great feature for open source tools, and for most internal tools in companies and other institutions, the same cannot be said for proprietary software: indeed, the ability to obfuscate logs, or even to generate “encrypted” logs, is something that proprietary software (and hardware) thrives on, because it makes reverse engineering harder. So it’s no surprise that logs are a complicated landscape, with requirements that are not only divergent, but at times opposite, between different stakeholders.

Free Idea: structured access logs for Apache HTTPD

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement them. It’s effectively a very rough proposal that comes with no design attached; if you have time you would like to spend learning something new, but no idea what to do with it, it may be a good fit for you.

I have been commenting on Twitter a bit about the lack of decent tooling to deal with Apache HTTPD’s Combined Log Format (inherited from NCSA). For those who do not know it, this is the format used by standard access_log files, which record information about each request: the source IP, the time, the requested path, the status code and the User-Agent used.

These logs are useful for debugging but are also consumed by tools such as AWStats to produce useful statistics about the request patterns of a website. I used these extensively when writing my ModSecurity rulesets, and I still keep an eye on them, for instance to report wasteful feed readers.

The files are simple text files, and that makes it easy to act on them: you can use tail and grep, and logrotate needs no special code besides moving the file and reloading Apache so that it re-opens the paths. On the other hand, that same simplicity makes it hard to query for particular fields, such as getting the list of User-Agent strings present in a log. Some of the suggestions I got over Twitter were to use awk, but as it happens, these logs are not actually parseable with a straightforward field separation: quoted fields such as the request line and the User-Agent contain spaces of their own.
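To show what “not a straightforward field separation” means in practice, here is a minimal sketch of parsing the Combined Log Format with a regular expression in Python; the sample line is made up for illustration:

  import re

  # %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
  COMBINED = re.compile(
      r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
      r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
      r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
  )

  sample = ('203.0.113.7 - - [10/Oct/2017:13:55:36 +0000] "GET /index.html HTTP/1.1" '
            '200 2326 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"')

  match = COMBINED.match(sample)
  if match:
      fields = match.groupdict()
      # Splitting on whitespace would have cut both quoted fields into pieces.
      print(fields["user_agent"])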

Having failed to find a good set of tools to handle these formats directly, I have been complaining that we should probably start moving away from simple text files towards more structured log formats. Indeed, I know that there used to be at least some support for logging directly to MySQL and other relational databases, and that there is more complicated machinery, often used by companies and startups, that processes these access logs into analysis software and so on. But all of these tend to be high overhead, much more than what I, or anyone else with a small personal blog, would care to implement.

Instead, I think it’s time to start using structured file logs. A few people, including thresh from VideoLAN, suggested using JSON to write the log files. This is not a terrible idea, as the format is at least well understood and easy to interface with most other software, but honestly I would prefer something with an actual structure, a schema that can be followed. Of course I don’t mean XML; I would rather suggest a standardized schema for proto3. Part of that, I guess, is because I’m used to using it at work, but it’s also because I like the idea of being able to just define my schema and have it generate the code to parse the messages.

Unfortunately, there is currently no support or library to access a sequence of protocol buffer messages. Using a single message with repeated sub-messages would work, but it is not append-friendly, so there is no way to just keep writing to a file, and to truncate it and resume writing to it — a property needed for a proper structured log format to actually fit in the space previously occupied by text formats. This is something I don’t usually have to deal with at work, but I would assume that a simple LV (Length-Value) or LVC (Length-Value-Checksum) encoding would be enough to solve this problem.
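Here is a minimal sketch of such a framing in Python; the records are JSON payloads standing in for serialized proto3 messages, and the 4-byte length plus 4-byte CRC32 layout is simply my assumption of what an LVC encoding could look like:

  import json
  import struct
  import zlib

  # Each record: 4-byte big-endian length, 4-byte CRC32, then the payload bytes.
  HEADER = struct.Struct(">II")


  def append_record(path, record):
      payload = json.dumps(record).encode("utf-8")  # stand-in for a proto3 message
      with open(path, "ab") as f:
          f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
          f.write(payload)


  def read_records(path):
      with open(path, "rb") as f:
          while True:
              header = f.read(HEADER.size)
              if len(header) < HEADER.size:
                  return  # end of file, possibly a truncated final header
              length, crc = HEADER.unpack(header)
              payload = f.read(length)
              if len(payload) < length or zlib.crc32(payload) != crc:
                  return  # incomplete or corrupted tail record: stop here
              yield json.loads(payload)


  append_record("access.lvc", {"path": "/index.html", "status": 200})
  append_record("access.lvc", {"path": "/feed.xml", "status": 304})
  print(list(read_records("access.lvc")))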

But what about other properties of the current format? Well, the obvious answer is that, assuming your structured log contains at least as much information as the current log (but possibly more), you can always have tools that convert on the fly to the old format. This would, for instance, allow for a special tail-like command and a grep-like command that provide compatibility with the way the files are currently inspected manually by your friendly sysadmin.
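Continuing the sketch above, a converter that renders the structured records back into Combined-Log-Format-style lines, so that tail- and grep-style workflows keep working, could look like this; the field names are the hypothetical ones used earlier, with “-” filling in anything missing:

  def to_combined(rec):
      """Render a structured record as a Combined-Log-Format-style text line."""
      return ('{host} - {user} [{time}] "{request}" {status} {size} '
              '"{referer}" "{user_agent}"').format(
          host=rec.get("host", "-"),
          user=rec.get("user", "-"),
          time=rec.get("time", "-"),
          request=rec.get("request", "-"),
          status=rec.get("status", "-"),
          size=rec.get("size", "-"),
          referer=rec.get("referer", "-"),
          user_agent=rec.get("user_agent", "-"),
      )


  # A grep-like pass over the structured file from the previous sketch,
  # emitting plain text for human eyes:
  # for rec in read_records("access.lvc"):
  #     line = to_combined(rec)
  #     if "feed.xml" in line:
  #         print(line)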

Having more structured information would also allow easier, or deeper, analysis of the logs. For instance, you could log the full set of headers (like ModSecurity does) instead of just the Referer and User-Agent, and allow for customizing the output on the conversion side rather than losing the details at write time.

Of course this is just one possible way to solve the problem, and just because I would prefer working with technologies I’m already familiar with, it does not mean I wouldn’t take another format that is similarly low-dependency and easy to deal with. I just think that the change-averse option of not changing anything and keeping logs in text format may be counterproductive in this situation.

From Rails to Syslog or: How I Learned to Stop Worrying and Ditch production.log

In my previous installment I ranted about, among other things, the way Rails suggests you keep a world-writeable log file for the production environment. As I said at the end, I planned on looking at the syslogger gem, and that turned out to be quite helpful.

The idea goes like this: by using syslogger you can tell Rails that the logs have to go through syslog; in my case that means they go to metalog, which then filters on the webapp names and pushes them to /var/log/rails, taking care of rotating the log as needed (either due to size or time — the former is quite useful to avoid rogue bots causing a DoS, which happened to me when I was inexperienced with these technologies!). Of course, this only works on Unix, but that’s what I care about anyway.

Besides the placement of the logs, using metalog also means I can filter important messages and show them in the important-messages log, rather than being limited to a hidden log file within the app’s own tree, and that I can mix together the messages of all the running applications, rather than having each report to a different file. If I were to use syslog-ng instead, I could easily make it send the logs over the network to another box and aggregate all of them there… but I really don’t see the point of that (yet), and the features that metalog comes with easily trump the network support.

So how do you achieve this? It’s actually pretty easy. Obviously it starts with installing dev-ruby/syslogger (in Gentoo through Portage, everywhere else via gem); then you can configure this very easily on both the Rails 2.3 and 3.x series (I have one server running Rails 2.3, the other 3.1… I have yet to set up Typo 6.x, but I’ll probably do that at some point in the near future, although unlikely before FOSDEM).

The trick is all in config/environments/production.rb, where you have to tell Rails to use a custom Logger; there is already an example, commented out, that refers to the other gem, SyslogLogger, but you should change it to something like this:

  config.logger = Syslogger.new("yourappname")

This way you can distinguish each application’s messages in the log. Then in the metalog.conf file you can have:

Rails apps : 
  program_regex = "^(typo|radiant|yourappname)"
  logdir = "/var/log/rails"
  maxfiles = 5
  break = 1

so that everything is then readable as /var/log/rails/current.

I’m not sure how much this impacts performance; I’d be surprised if it made things worse, as metalog also buffers the disk writes, but you never know until you check for sure. In general I still prefer having the (multiple) Rails processes send everything to metalog, for my own convenience.

Interestingly, if you have a webapp that does not deal with on-disk files directly, but only with a database, then by using syslogger you’re basically limiting its writes to the cache directories only, which is probably a plus.

Apache, Passenger, Rails: log shmock

You might or might not remember my fights with mod_perl, and my finding a bug in the handling of logs when Apache’s error log is set to use the syslog interface (which in my case means metalog). For those wondering: it goes without saying that the upstream bug is still untouched. This should have told me that there aren’t many people using Apache’s syslog support, but sometimes I’m stubborn.

Anyway, yesterday I finally put into so-called “production” the webapp I described last week for handling customers’ computers. I got it working in no time once mongoid started to behave (tests are still restricted, because a couple fail and I’m not sure why — I’ll have to work on that with the next release, which requires far fewer hacks to test cleanly). I did encounter a nasty bug in best_in_place, which I ended up fixing in Gentoo even though upstream hasn’t merged my branch yet.

By putting it into “production” I simply mean configuring it to run on the twin server of this blog’s, which I’ve been using for another customer as well — and have readied for a third. Since Rails 3.1 was already installed on that box, it was quite easy to move my new app there. All it took was installing the few new gems I needed and…

Well, here’s the interesting thing: I didn’t want my application to run as my user, while obviously I wanted to check out the sources with my user so that I could update it with git… how do you do that? Well, Passenger is able to run the application under whatever user owns the config/environment.rb file, so you’d expect it to be able to run under an arbitrary user as well — which is the case, but only if you’re using version 3 (which is not stable in Gentoo as of yet).

So anyway I set up the new Passenger to change the user, made public/assets/ and another directory I write to group-writable (the app user and my user are in the same group), and then I’m basically done, I think. I start it up… but the hostnames tell me that “something went wrong”, without any clue as to what.

Okay, so the default for Passenger is to not have any log at all; not a problem, I’ll just increase the level to 1 and see the error… or not? I still get no output in Apache’s error log… which is still set to syslog… don’t tell me… I set Passenger to log to a file, and lo and behold, it works fine. I wonder if it’s time for me to learn Apache’s API and fix both, since it looks like I’m one of the very few people who would like to use syslog as Apache’s error log.

After getting Passenger to finally tell me what’s wrong, I find out not only the reason why Rails wasn’t starting (I had forgotten to enable two USE flags in dev-ruby/barby, which I use for generating the QR code on the label), but also this:

Rails Error: Unable to access log file. Please ensure that /var/www/${vhost}/log/production.log exists and is chmod 0666. The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
Please note that logging negatively impacts client-side performance. You should set your logging level no lower than :info in production.

What? Rails is really telling its users to create a world-writeable log file when it fails to write to it? Are they freaking kidding me? Is this really a suggestion coming from the developers of a framework for Web Applications, which should be security-sensitive? … Okay, so one can be smarter than them and do the right thing (in my case, make sure that the log file is actually group-writeable), but if this is the kind of suggestion they find proper to give you, it’s no wonder what happened with Diaspora. So it’s one more reason why Rails isn’t for the faint-hearted, and why you should pay a very good sysadmin if you want to run a Rails application.

Oh, and by the way, the cherry on top of this is that instead of just sending the log to stderr, leaving it to Passenger to wrangle – which would have worked out nicely if Passenger had a way to distinguish which app the errors are coming from – Rails also moves the log level to warning, just to spite you. And then tells you that it impacts performance! Ain’t that lovely?

Plan for the day? If I find some extra free time, I’d like to try out and package (not necessarily in that order) syslogger, so that the whole production.log thing can go away fast.

Sysadmining tips from a non-sysadmin

I definitely am not a sysadmin: although as a Gentoo developer I have to have some general knowledge of sysadmining, my main work is development, and that’s where most of my experience lies. On the other hand, I have picked up some skills by maintaining two VPSes (the one where this blog is hosted, and the one hosting xine’s bugzilla — as well as its site).

Some of these tricks are related to the difficulties I have reported when using Gentoo as a guest operating system on virtual servers, but a few are entirely unrelated. Let me try to relay some of the tricks I picked up.

The first trick is to use metalog for logging; while syslog-ng has some extra features missing in metalog (like network logging support), for a single server the latter is much, much easier to set up and deal with. But I find the default configuration a bit difficult to deal with. My first step is then to replace the catch-all “everything” logging with a “the rest of important stuff” section, by doing something along these lines:

Postgresql :
  program_regex = "^postmaster"
  program_regex = "^postgres"
  logdir   = "/var/log/postgres"
  break    = 1

Apache :
  program_regex = "^httpd"
  logdir   = "/var/log/http"
  break    = 1

The rest of important stuff :
  facility = "*"
  minimum  = 6
  logdir   = "/var/log/therest"

See that break statement? The whole point of it is to not fall through to the entries below in the file; at the end, the “rest of important stuff” block will log everything that did not fall into the previous sections. My reason for splitting things this way is that I can easily check the logs for cron or PostgreSQL, and at the same time check whether there is something I’m not expecting.

While using metalog drops the requirement for logrotate for the system logs, it doesn’t stop it being needed for other log systems: quassel doesn’t log to syslog, nor does Portage, and the Apache access logs are better handled without syslog, so that they can be passed through AWStats later. Note: having Portage log to syslog is something I might make good use of; it would break qlop, but it might be worth it for some setups, like my two VPSes. But even with this limitation, metalog makes it much easier to deal with the basic logs.

The next step to simplify the management, for me, has been switching from Paul Vixie’s cron to fcron. The main reason is that fcron feels “modern” compared with Vixie’s, and it has a few very useful features that make it much easier to deal with: erroronlymail sends you mail about a cron job only if its exit status is non-zero (failure), rather than every time there is output; random makes it possible to avoid running heavy jobs always at the same time (it makes the system altogether more secure, as an attacker cannot guess that at a given time the system will be under extra load!); and the lavg options allow you to skip running a series of jobs if the system is busy doing something else.

Oh, and another important note for those of you using PostgreSQL: I learnt the hard way the other day that the default logging behaviour of the PGSQL server is to write a postmaster.log file inside the PostgreSQL data directory. This file does not really need to be rotated, as postgres seems to take care of that itself; on the other hand, it makes much more sense to leave the task to the best software for it: the logger! To fix this up you have to edit the /var/lib/postgresql/8.4/data/postgresql.conf file (you may have to change 8.4 to the version of PGSQL you’re running), and add the following line:

log_destination = 'syslog'

Thanks to metalog’s buffering and its other features, this should go much easier on the system’s I/O, especially when the load is very, very high — which sometimes happened when my blog got hammered.

Okay, probably most of this is nothing new to seasoned sysadmins, but small hints from a non-sysadmin can be useful to other non-sysadmins who end up doing the same job.