Free Idea: structured access logs for Apache HTTPD

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement. It’s effectively a bare sketch of a proposal that comes with no design attached, but if you have time you would like to spend learning something new, and no idea what to do with it, it may be a good fit for you.

I have been commenting on Twitter a bit about the lack of decent tooling to deal with Apache HTTPD’s Combined Log Format (an extension of the format inherited from NCSA). For those who do not know about it, this is the format used by standard access_log files, which record information about each request: the source IP, the time, the requested path, the status code and the User-Agent used.
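As a reference, here is the canonical example line from the Apache documentation; note that the request, the Referer and the User-Agent are quoted fields that contain spaces:

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"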

These logs are useful for debugging but are also consumed by tools such as AWStats to produce useful statistics about the request patterns of a website. I used these extensively when writing my ModSecurity rulesets, and I still keep an eye on them, for instance to report wasteful feed readers.

The files are simple text files, and that makes it easy to act on them: you can use tail and grep, and logrotate needs no special code besides moving the file and reloading Apache to have it re-open the paths. On the other hand, the format makes it hard to query for particular entries by field, for instance to get the list of User-Agent strings present in a log. Some of the suggestions I got over Twitter to solve this were to use awk, but as it happens, these logs cannot actually be parsed with straightforward field separation: the request line, Referer and User-Agent are quoted strings that can themselves contain spaces, and even the timestamp contains a space inside its brackets.
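To make that concrete, here is a minimal sketch of what parsing the combined format actually takes, using a regular expression rather than whitespace splitting; the group names are my own choice, not any standard:

    import re

    # The quoted fields (request line, Referer, User-Agent) can contain
    # spaces, which is what defeats awk-style whitespace splitting.
    # Real logs can also contain escaped quotes, which even this
    # regular expression does not handle.
    COMBINED = re.compile(
        r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
        r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\S+) '
        r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    )

    def parse_line(line):
        match = COMBINED.match(line)
        return match.groupdict() if match else None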

Having failed to find a good set of tools to handle these formats directly, I have been complaining that we should probably start moving away from simple text files to more structured log formats. Indeed, I know that there used to be at least some support for logging directly to MySQL and other relational databases, and that there is more complicated machinery, often used by companies and startups, that processes these access logs into analysis software and so on. But all of these tend to be high overhead, much more than what I or someone else with a small personal blog would care to implement.

Instead I think it’s time to start using structured file logs. A few people, including thresh from VideoLAN, suggested using JSON to write the log files. This is not a terrible idea, as the format is at least well understood and easy to interface with most other software, but honestly I would prefer something with an actual structure, a schema that can be followed. Of course I don’t mean XML; I would rather suggest having a standardized schema for proto3. Part of that, I guess, is because I’m used to using it at work, but also because I like the idea of being able to just define my schema and have the code to parse the messages generated for me.
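To sketch what I mean, here is a strawman proto3 schema; every message and field name in it is my own invention, not a proposed standard:

    syntax = "proto3";

    message AccessLogEntry {
      string remote_addr = 1;
      string remote_user = 2;
      // Microseconds since the Unix epoch, instead of a formatted date.
      int64 timestamp_usec = 3;
      string method = 4;
      string path = 5;
      string protocol = 6;
      uint32 status = 7;
      uint64 response_bytes = 8;
      // The full header set could be logged here, not just
      // Referer and User-Agent.
      map<string, string> request_headers = 9;
    }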

Unfortunately, there is currently no standard support or library for accessing a sequence of protocol buffer messages in a file. Using a single message with repeated sub-messages would work, but it is not append-friendly: there would be no way to just keep writing it to a file, or to truncate the file and resume writing to it, which is a property a proper structured log format needs if it is to fit in the space previously occupied by text formats. This is something I don’t usually have to deal with at work, but I would assume that a simple LV (Length-Value) or LVC (Length-Value-Checksum) encoding would be enough to solve this problem.
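A minimal sketch of such LV framing, in Python for illustration, assuming each payload is an already-serialized message; the four-byte little-endian length prefix is an arbitrary choice of mine:

    import struct

    def append_record(log, payload):
        """Append one serialized message with a 4-byte length prefix."""
        log.write(struct.pack("<I", len(payload)))
        log.write(payload)

    def read_records(log):
        """Yield each length-prefixed payload back, stopping cleanly at
        end of file or at a truncated trailing record."""
        while True:
            header = log.read(4)
            if len(header) < 4:
                return
            (length,) = struct.unpack("<I", header)
            payload = log.read(length)
            if len(payload) < length:
                return  # interrupted write; this is where a checksum would help
            yield payload

Opened in append mode, such a file survives the truncate-and-resume cycle, and a reader can tail it simply by polling for complete records.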

But what about the other properties of the current format? Well, the obvious answer is that, assuming your structured log contains at least as much information as the current log (but possibly more), you can always have tools that convert on the fly to the old format. This would, for instance, allow for a special tail-like command and a grep-like command that provide compatibility with the way your friendly sysadmin currently looks at the files manually.
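Continuing the strawman schema from above, converting a structured entry back into the familiar text line is a few lines of code, which is what would make such tools cheap to write (the field names match my earlier sketch, so take them as assumptions):

    import time

    def to_combined(entry):
        """Render a structured entry back as a Combined Log Format line."""
        # Renders in UTC; Apache's own format uses the local timezone offset.
        when = time.strftime(
            "%d/%b/%Y:%H:%M:%S +0000",
            time.gmtime(entry.timestamp_usec / 1_000_000),
        )
        return (
            f'{entry.remote_addr} - {entry.remote_user or "-"} [{when}] '
            f'"{entry.method} {entry.path} {entry.protocol}" '
            f'{entry.status} {entry.response_bytes} '
            f'"{entry.request_headers.get("Referer", "-")}" '
            f'"{entry.request_headers.get("User-Agent", "-")}"'
        )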

Having more structured information would also allow easier, or deeper, analysis of the logs. For instance, you could log the full set of headers (like ModSecurity does) instead of just the Referer and User-Agent, and allow for customizing the output on the conversion side rather than losing the details at write time.

Of course this is just one possible way to solve the problem, and just because I would prefer working with technologies that I’m already friendly with, it does not mean I wouldn’t take another format that is similarly low-dependency and easy to deal with. I’m just thinking that the change-averse solution of not changing anything and keeping logs in plain text may be counterproductive in this situation.

Free Idea: a filtering HTTP proxy for securing web applications

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement. It’s effectively a bare sketch of a proposal that comes with no design attached, but if you have time you would like to spend learning something new, and no idea what to do with it, it may be a good fit for you.

Going back to a previous topic I wrote about, and the fact that I’m trying to set up a secure WordPress instance, I would like to throw out another idea I won’t have time to implement myself any time soon.

When running complex web applications, such as WordPress, defense-in-depth is a good security practice. This means that in addition to locking down what the code itself can do to the state of the local machine, it also makes sense to limit what it can do to external state and the Internet at large. Indeed, even if an attacker cannot drop a shell on a remote server, there is value (negative for the world, positive for the attacker) in at least being able to use it for DDoS (e.g. through an amplification attack).

With that in mind, if your app does not require network access at all, or the network dependency can be sacrificed (like I did for Typo), just blocking the user from making outgoing connections with iptables would be enough. The --uid-owner option makes it very easy to figure out who’s trying to open new connections, and thus to stop a single user from transmitting unwanted traffic. Unfortunately, this does not always work, because sometimes the application really needs network support. In the case of WordPress, there is a definite need to contact the WordPress servers, both to install plugins and to check whether it should self-update.
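For reference, the no-network case boils down to a single rule; the wordpress user name here is just a placeholder for whichever uid the webapp runs as:

    # Reject any outgoing TCP connection attempt made by the webapp's uid.
    iptables -A OUTPUT -m owner --uid-owner wordpress -p tcp --syn -j REJECT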

You could try to limit which hosts the user can access instead. But that’s not easy to implement right either. Take WordPress as the example again: if you wanted to limit access to the WordPress infrastructure, you would effectively have to allow it to access *.wordpress.org, and this can’t really be done in iptables, as far as I know, since those connections go to literal IP addresses. You could rely on FCrDNS to verify the connections, but that can be slow, and anyone with the access needed to poison the server’s DNS cache would effectively be in control of this kind of ACL. I’m ignoring the option of just using “standard” reverse DNS resolution, because in that case you don’t even need to poison DNS: you can just decide what your IP will reverse-resolve to.

So what you actually need is to filter at the connection-request level, which is what proxies are designed for. I’ll assume we want a non-terminating proxy (because terminating proxies are hard), but even in that case the proxy knows which (forward) address the client wants to connect to, and at that point *.wordpress.org becomes a valid ACL to use. This is something you can actually do relatively easily with Squid, for instance; indeed, this is the whole point of tools such as ufdbguard (which I used to maintain for Gentoo) and the ICP protocol. But Squid is primarily designed as a caching proxy: it’s not lightweight at all, and it can easily become a liability to have in your server stack.

Up to now, what I have used to reduce the attack surface of my webapps is to set them behind tinyproxy, which does not really allow for per-connection ACLs. This only provides isolation against random non-proxied connections, but it’s a starting point. And here is where I want to provide a free idea for anyone who has the time and would like to build better security tools for server-side defense-in-depth.

A server-side proxy for this kind of security usage would have to provide ACLs, with both positive and negative lists. You may want to allow all access to *.wordpress.org, but at the same time block all non-TLS-encrypted traffic, to avoid the possibility of a downgrade (given that WordPress silently downgrades requests to api.wordpress.org, as I talked about before).
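To make the semantics concrete, here is a minimal sketch of the kind of ACL evaluation I have in mind; the rule structure and user names are hypothetical, not taken from any existing tool:

    import fnmatch

    # Hypothetical per-user rules. Deny patterns win over allow
    # patterns, and anything matching neither list is denied.
    ACLS = {
        "wordpress": {
            "allow": ["*.wordpress.org", "api.wordpress.org"],
            "deny": [],
            "require_tls": True,
        },
    }

    def is_allowed(user, host, is_tls):
        acl = ACLS.get(user)
        if acl is None:
            return False  # unknown users get no access at all
        if acl["require_tls"] and not is_tls:
            return False  # the negative rule that blocks downgrades
        if any(fnmatch.fnmatch(host, p) for p in acl["deny"]):
            return False
        if any(fnmatch.fnmatch(host, p) for p in acl["allow"]):
            return True
        return False  # default deny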

Even better, such a proxy should have the ability to distinguish the ACLs based on which user (i.e. which webapp) is making the request. The obvious way would be to provide separate usernames to authenticate to the proxy — which, again, Squid can do, but it’s designed for clients for which the validation of username and password is actually important. Indeed, for this target usage, I would ignore the password altogether and just take the username at face value, since the connection should always only be local. I would be even happier if, instead of pseudo-authenticating to the proxy, the proxy could figure out which (local) user the connection came from by inspecting the TCP socket, kind of like the way querying the ident protocol used to work for IRC.
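At least on Linux, this is feasible without any authentication at all: for a connection coming from localhost, the kernel already knows which uid owns the peer socket, and it exposes that in /proc/net/tcp, which is the same information an ident daemon would serve. A sketch (IPv4 only; the function name is mine):

    import socket
    import struct

    def uid_of_peer(conn):
        """Map the peer of a local IPv4 TCP connection to the uid
        that owns it, by looking the socket up in /proc/net/tcp."""
        peer_ip, peer_port = conn.getpeername()[:2]
        # /proc/net/tcp shows IPv4 addresses in host byte order
        # (little-endian on x86), and ports as plain hex.
        want = "%08X:%04X" % (
            struct.unpack("<I", socket.inet_aton(peer_ip))[0],
            peer_port,
        )
        with open("/proc/net/tcp") as tcp:
            next(tcp)  # skip the header line
            for line in tcp:
                fields = line.split()
                if fields[1] == want:  # the peer's local_address
                    return int(fields[7])  # the uid column
        return None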

So, to summarise, what I would like to have is an HTTP(S) proxy that focuses on securing server-side web applications. It does not have to support TLS transport (because it should only accept local connections), nor should it be a terminating proxy. It should support ACLs that allow/deny access to a subset of hosts, possibly per-user, without needing a user database of any sort, and even better if it can tell by itself which user a connection came from. I’m more than happy if someone tells me this already exists, or, if not, if someone starts writing it… thank you!

Free Idea: Free Software stack for audiobooks

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement. It’s effectively a bare sketch of a proposal that comes with no design attached, but if you have time you would like to spend learning something new, and no idea what to do with it, it may be a good fit for you.

This is clearly not a new idea, as I posted about something very similar over eight years ago. At the time I was looking for a way of encoding audiobooks coming from audio CDs in a format that was compatible with the iPod Classic. Since then, Apple appears to have done their best to make the audiobook experience on iOS the worst possible, to the point that I don’t really use my iPod Touch as my primary audiobook player any more.

As an aside to the free idea, which can probably give a bit more context for you all, let me describe the problems I have with Apple’s current approach to audiobooks. A few major iOS versions ago, they decided to move audiobook handling from the Music app to the iBooks app; this would be reasonable, given that they are books, and it was always a bit strange to have them in a separate application, but it also meant you lost the ability to build playlists with them.

Playlists with audiobooks are great, because they allow you to “stitch” together multiple books of the same series, so that you can play them for hours on end, for instance if you use them to fall asleep. I used to have a playlist for the Hitchhiker’s Guide to the Galaxy radio series and one for the books, one for the Dresden Files, and one for the News Quiz, including both the collected editions on CD by the BBC, my own “audiobooks” built out of the podcasts, and the more recent podcast episodes that I have not collected into audiobook files yet.

So what is the idea? There are two components that, as far as I can see, are currently heavily lacking in the FLOSS world. The first is a way to generate audiobook files, which is what I complained about eight years ago. Indeed, if you look at even a random sample on Project Gutenberg, the audiobook is actually a ton of files (47!), each with a chapter in it. A proper audiobook file would be a single file, with chapter markers and per-chapter metadata (the chapter title and, in that case, the performer).
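For what it’s worth, ffmpeg can already be coaxed into producing such a file by hand, since its FFMETADATA format carries chapter markers and per-chapter titles; what follows is a sketch with made-up file names and timings. First the chapter list (chapters.txt):

    ;FFMETADATA1
    title=Example Audiobook
    ; Chapter times are in milliseconds (TIMEBASE=1/1000).
    [CHAPTER]
    TIMEBASE=1/1000
    START=0
    END=1382000
    title=Chapter 1
    [CHAPTER]
    TIMEBASE=1/1000
    START=1382000
    END=2718000
    title=Chapter 2

Then the per-chapter files (listed in files.txt in concat demuxer syntax) get stitched into a single M4B:

    ffmpeg -f concat -safe 0 -i files.txt -i chapters.txt \
        -map_metadata 1 -map_chapters 1 -c:a aac book.m4b

The missing piece is tooling that computes the chapter boundaries from the input files automatically, which is exactly the kind of small project this idea calls for.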

It’s more than just a matter of having a single file to move around. While of course hardware improvements have made a number of these points moot, the original reason to have a single big file rather than multiple small files was to avoid having to seek to a different point on the disk in between chapters. It also allows the decoder to keep going between chapters, as there is no “end of stream”, but rather just a marker that from a given point in time some different metadata applies. Again, as I said, this is no longer as relevant as it used to be, but it’s also not entirely gone.

The other component that is currently lacking is a good playback solution. While VLC can obviously play those files right now, and if I’m not mistaken it also extracts the per-chapter metadata correctly, it lacks two features that make enjoying audiobooks possible. The first is possibly complicated, and relates to the ability to store bookmarks and the current playing time. While VLC supposedly supports resuming from the last playback position, I have heard it’s still sometimes unreliable (I have no idea how it’s implemented), plus it does not support just bookmarking a given time in a file/book. Bookmarking is particularly important when listening to non-fiction audiobooks, as you may want to go back afterwards, to re-listen to advice or look up a reference for further details.

The other feature is basically UI-heavy, mostly involving the mobile UI (at least the Android one): the ability to skip backward and forward in the file. You have probably seen this in other players, including Netflix’s own app, which allows you to skip back 30 seconds — in audiobooks it’s also useful to skip forward 30 seconds, particularly in combination with the bookmarks above.

As usual for Free Ideas, I have no time to work on this myself. I can give out details of the idea and, depending on circumstances, I may be able to contribute to a bounty on it, but otherwise there is no code I can share about this yet.

Free Idea: a QEMU Facedancer fuzzer

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement. It’s effectively a bare sketch of a proposal that comes with no design attached, but if you have time you would like to spend learning something new, and no idea what to do with it, it may be a good fit for you.

Update (2017-06-19): see last paragraph.

You may already be familiar with the Facedancer, the USB fuzzer originally designed and developed by Travis Goodspeed of PoC||GTFO fame. If you’re not, in a few words: it’s a board (and framework) that allows you to simulate the behaviour of any USB device. It works thanks to a Python framework which supports a few other boards and could – theoretically – be expanded to support any device with gadgetfs support (such as the BeagleBone Black I have at home, but I digress).

I found out about this through Micah’s videos, and I have been thinking for a while of spending some time on that gadgetfs extension, so that I can use it to simulate glucometers with their original software — in particular, it should allow me to figure out which value represents what, by changing what is reported to the software and seeing how it behaves. While I have had no time to do that yet, this is anyway a topic for a different post.

The free idea I want to give out instead is to integrate, somehow, the Facedancer framework with QEMU, so that you can run the code behind a Facedancer device as if it were connected to a QEMU guest, without having to use any hardware at all. A “Virtualdancer”, which would not only obviate the need for hardware in the development phase (say, if a Facedancer proof of concept were to be used off-site), but would also integrate more easily into fuzzing projects such as Bochspwn or TriforceAFL.

In particular, I’m interested not only in writing simulated glucometers for debugging purposes (although a testsuite that requires QEMU and a simulated device may be a bit overkill), but also in simulating HID devices. You may remember that I recently had to fix my ELECOM trackball, and this is not the first time I have had to deal with broken HID descriptors. I have spent some more time looking into the Linux HID subsystem, and I’m trying to figure out whether I can make some simplifications here and there (again, a topic for another time), so having a way to simulate an HID device with strange behaviour, and to see whether my changes fix it, would be extremely beneficial.

Speaking of HID, and report descriptors in particular, Alex Ionescu (of ReactOS fame) pointed out at REcon that there appear to be very few reported security issues with HID report descriptor parsing, particularly on Windows, which seems strange given how hard parsing those descriptors is, and given that there are some very seriously broken descriptors out there. This would be another very interesting surface for a QEMU-based dancer to explore: running through a number of broken HID report descriptors and sending data to see how the system behaves. I would be very surprised if there were no bugs, particularly in the many small and obscure drivers that apply workarounds such as the one I did for the ELECOM.
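To give an idea of what such a fuzzer would start from, here is the classic three-button mouse report descriptor from the USB HID specification, together with a trivial mutator; a real fuzzer would be much smarter about which items it corrupts, so treat this purely as a sketch:

    import random

    # The standard 3-button mouse report descriptor from the
    # USB HID specification, annotated item by item.
    MOUSE_DESCRIPTOR = bytes([
        0x05, 0x01,  # Usage Page (Generic Desktop)
        0x09, 0x02,  # Usage (Mouse)
        0xA1, 0x01,  # Collection (Application)
        0x09, 0x01,  #   Usage (Pointer)
        0xA1, 0x00,  #   Collection (Physical)
        0x05, 0x09,  #     Usage Page (Buttons)
        0x19, 0x01,  #     Usage Minimum (1)
        0x29, 0x03,  #     Usage Maximum (3)
        0x15, 0x00,  #     Logical Minimum (0)
        0x25, 0x01,  #     Logical Maximum (1)
        0x95, 0x03,  #     Report Count (3)
        0x75, 0x01,  #     Report Size (1)
        0x81, 0x02,  #     Input (Data, Variable, Absolute)
        0x95, 0x01,  #     Report Count (1)
        0x75, 0x05,  #     Report Size (5)
        0x81, 0x01,  #     Input (Constant) -- padding bits
        0x05, 0x01,  #     Usage Page (Generic Desktop)
        0x09, 0x30,  #     Usage (X)
        0x09, 0x31,  #     Usage (Y)
        0x15, 0x81,  #     Logical Minimum (-127)
        0x25, 0x7F,  #     Logical Maximum (127)
        0x75, 0x08,  #     Report Size (8)
        0x95, 0x02,  #     Report Count (2)
        0x81, 0x06,  #     Input (Data, Variable, Relative)
        0xC0,        #   End Collection
        0xC0,        # End Collection
    ])

    def mutate(descriptor, flips=1):
        """Corrupt a few random bytes, yielding a 'broken' descriptor."""
        broken = bytearray(descriptor)
        for _ in range(flips):
            broken[random.randrange(len(broken))] = random.randrange(256)
        return bytes(broken)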

Anyway, as I said I haven’t even had time to make the (probably minor) modifications to the framework to support BBB (which I already have access to), so you can imagine I’m not going to be working on this any time soon, but if you feel like working on some USB code, why not?

Update (2017-07-19): I pointed Travis at this post over Twitter, and he showed me vUSBf. While this does not have the same interface as Facedancer, it proves that it is possible to provide a virtual USB device implemented in Python.

As a follow-up, Binyamin Sharet linked to Umap2, which supports fuzzing on top of Facedancer, but does not support QEMU as it stands.

While it’s not quite (yet) what I had in mind, it proves that it is a feasible goal, and that there is already some code out there getting very close!