The complexity of request validation

You might know that I use a somewhat complicated setup with ModSecurity to prevent spam on this blog — I have written about it extensively before, and I also published my rules so that other sites can use them (VideoLAN's forums are using them as well).

Well, maintaining this ruleset is not that easy; the problem comes when new browsers are introduced into the mix, making it difficult to validate their requests. This is what happened a few months ago when Google first published Chrome for Ice Cream Sandwich (ICS) — which I still don't have access to; I think I'll get an HTC One X as soon as I get to California. Well, they did it again with the new Chrome for iOS.

Chrome can identify itself in three different ways: Chrome, CrMo (for Android) and CriOS (for iOS devices). This means that any special case put in place for Chrome on Android did not automatically extend to the new Chrome on iOS — which is probably intended, given that Chrome on iOS has to use Safari's standard WebKit engine rather than bring its own; the main reason to use it is to have your bookmarks synchronised with your computer.

Here is where the problems start cropping up: the new Chrome on iOS has the same quirk as the one on ICS: it doesn't send an Accept header, which almost every other browser, including the main desktop Chrome builds, does send. So it was a matter of adding CriOS to the list of special cases, together with CrMo.

But there is one more issue: Chrome for iOS has a feature that lets you request the so-called "desktop interface" of a site — assuming the site serves different interfaces depending on the User-Agent value. What you would expect at that point is for the application to keep reporting Chrome as its user agent, but that's not the case: it reports itself as Safari instead. The problem is that it still advertises some peculiarities that are generally limited to Chrome, including SDCH, which is something I used to rely on for validation before.

So what I ended up doing was removing the validation of browsers that advertise sdch as an accepted encoding — although I kept the check that, if a request reports itself as Chrome, it has to advertise sdch (unless, of course, it's passing through a proxy). This still makes it possible to weed out most of the unsophisticated crawlers and tools that try to pass as a browser.
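
To give an idea of what the kept check can look like, here is a minimal ModSecurity sketch; it is illustrative only, not my published ruleset, and it assumes that the presence of a Via header is a good enough hint that a proxy is involved:

    # Illustrative only: a UA claiming to be desktop Chrome, not going through
    # a proxy (no Via header), is expected to advertise sdch; otherwise deny.
    SecRule REQUEST_HEADERS:User-Agent "@contains Chrome/" \
        "chain,phase:1,t:none,deny,status:403,msg:'Chrome without sdch support'"
    SecRule &REQUEST_HEADERS:Via "@eq 0" "chain"
    SecRule REQUEST_HEADERS:Accept-Encoding "!@contains sdch"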

Protecting yourself from R-U-Dead-Yet attacks on Apache

Do you remember the infamous "slowloris" attack on HTTP web servers? Well, it turns out there is a new variant of the same technique that, rather than making the server wait for headers to arrive, makes it wait for POST data before processing; it's difficult to explain exactly how that works, so I'll leave it to the expert explanation from ModSecurity.

Thankfully, since a lot of work was done to mitigate the slowloris attack, there are easy protections to put in place, the first of which is the use of mod_reqtimeout… unfortunately, it isn't currently enabled by the Gentoo configuration of Apache – see bug #347227 – so the first step is to work around this limitation. Until the Gentoo Apache team appears again, you can do so simply by making use of the per-package environment hack, much like what I described in my nasty tricks post a few months ago.

# to be created as /etc/portage/env/www-servers/apache

export EXTRA_ECONF="${EXTRA_ECONF} --enable-reqtimeout=static"

*Do note that here I'm building the module statically; I'd suggest everybody build all the modules statically, since the overhead of loading them as plugins is usually quite a bit higher than that of carrying a built-in module you don't care about.*

Now that you have this set up, you should make sure to set a timeout for the requests; the mod_reqtimeout documentation is quite brief, but shows a number of possible configurations. I'd say that in most cases what you want is simply the one shown in the ModSecurity examples. Please note that they made a mistake there: it's not RequestReadyTimeout but RequestReadTimeout.
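
For reference, this is roughly the shape of the configuration shown in the mod_reqtimeout documentation; the exact values are a matter of taste:

    # Allow 20 to 40 seconds to receive the request headers and 20 seconds for
    # the body, extending the timeouts as long as data arrives at 500 bytes/s.
    RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500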

Additionally, when using ModSecurity you can stop the attack in its tracks after a few requests have timed out, by blacklisting the IP and dropping its connections, freeing slots for other requests to arrive; this can easily be configured through this snippet, taken directly from the above-linked post:

RequestReadTimeout body=30

SecRule RESPONSE_STATUS "@streq 408" "phase:5,t:none,nolog,pass, setvar:ip.slow_dos_counter=+1,expirevar:ip.slow_dos_counter=60"
SecRule IP:SLOW_DOS_COUNTER "@gt 5" "phase:1,t:none,log,drop, msg:'Client Connection Dropped due to high # of slow DoS alerts'"

This should cover you quite nicely, at least if you're using hardened, with grsecurity enforcing per-user limits. But if you're on hosting where you have no say over the kernel – as I am – there is one further problem: the init script for Apache does not respect the system limits at all — see bug #347301.

The problem here is that when Apache is started during the standard system init, there are no limits set for the session it is running from, and since the script doesn't use start-stop-daemon to launch the apache process itself, no limits are applied at all. This makes for quite an easy DoS of the whole host, as it will quickly exhaust the system's memory.

As I posted on the bug, there is a quick and dirty way to fix the situation by editing the init script itself and changing the way Apache is started up:

# Replace the following:
        ${APACHE2} ${APACHE2_OPTS} -k start

# With this

        start-stop-daemon --start --pidfile "${PIDFILE}" ${APACHE2} -- ${APACHE2_OPTS} -k start

This way at least the generic system limits are applied properly. Please note, though, that the limitations of start-stop-daemon mean you cannot set per-user limits this way.

On a different note, I'd like to spend a few words on why this particular vulnerability is interesting to me: the attack relies on long-winded POST requests with very low bandwidth, where just a few bytes are sent before the timeout is hit… it is not unlike the RTSP-in-HTTP tunnelling that I have designed and documented in feng over the past years.

This also means that application-level firewalls will, sooner or later, start filtering these long-winded requests, and that will likely put the final nail in the coffin of RTSP-in-HTTP tunnelling. I guess it's definitely time for feng to move on and implement real HTTP-based pseudo-streaming instead.

So You Think You Can Crawl

I have written about crawlers before, ranting about the bad behaviour of some of them, especially those of "marketing sites" that gather information about your site to resell to other companies (usually with the idea of finding out who is talking about their products).

Now, I don’t have a website that has so much users that it can be taken down by crawlers, but I just don’t like waiting time, disk space and bandwidth for software that makes neither me nor anybody else any good.

I don’t usually have trouble with newly-created crawlers and search engines, but I do have problems when their crawlers are just hitting my websites without following at least some decency rules:

  • give me a way to find out who the heck you are, like a link to a description of the bot — *in English, please*; like it or not, it is the international language;
  • let me know what your crawler is showing: give me a sample search or something; even if your business is reselling services, it shouldn't be impaired by letting everybody run the same search;
  • if your description explicitly states that robots.txt is supported, make sure you're actually fetching it; I had one crawler trying to fetch each and every article of my blog the other day, without ever having fetched robots.txt, and with the crawler's website stating that it supported it;
  • support deflate compression! XHTML is an inherently redundant language, and deflate compression works miracles on it; even better on pages that contain RDF information (as the content of other tags is likely to be repeated in a semantic context); the crawler above claimed to be dedicated to fetching RDF information and yet didn't support deflate;
  • don’t be a sore loser: if you state (again, that’s the case for the crawler above) that you always wait at least two seconds between requests, don’t start fetching without any delay at all when I start rejecting your requests with 403;
  • provide a support contact, for I might be interested in allowing your crawler, but want it to behave first;
  • support proper caching; too many feed fetchers seem to ignore the ETag and If-Modified-Since headers, which gets pretty nasty especially if you have a full-content feed; even worse, if you support neither these nor deflate, your software is likely to get blacklisted;
  • make yourself verifiable via forward-confirmed reverse DNS (FCrDNS); as I said in another post of mine, most search engine crawlers already follow this idea, checking for it is easy to implement with ModSecurity, and it's something that even Google suggests webmasters do; now, a few people misunderstand this as a security protection for the websites, but that couldn't be farther from the real reason: by making your crawler verifiable, you don't risk getting hit by sanctions aimed at crawlers trying to pass for you (which is almost impossible to avoid if your crawler runs on EC2, of course); the sketch after this list shows what the verification looks like.
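
The FCrDNS verification itself boils down to a reverse lookup followed by a forward lookup; the following Python sketch shows the idea (the accepted suffixes here are just Google's documented ones, adjust them for whichever crawler you care about):

    # Sketch of forward-confirmed reverse DNS: resolve the client IP back to a
    # hostname, check it belongs to the crawler's domain, then resolve that
    # hostname forward again and make sure the original IP is among the results.
    import socket

    def verify_crawler(ip, allowed_suffixes=(".googlebot.com", ".google.com")):
        try:
            hostname = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not hostname.endswith(allowed_suffixes):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(hostname)[2]
        except socket.gaierror:
            return False
        return ip in forward_ips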

Maybe rather than SEOs – by the way, is it just me who dislikes the term and finds that most people describing themselves that way are just trying to put themselves on the same level as CEOs? – we should have Crawler Experts running around to fix all the crappy crawlers that people write for their own startups.

Apple’s HTTP tunnel, and new HTTP streaming

Finally, last night, I was able to finish, at least in a side branch, support for Apple's RTSP-in-HTTP tunnelling, as dictated by their specification. Now that the implementation is complete (and it really didn't take that much work once the parser behaved as needed), I can say a few things about that specification and about Apple phasing it out in favour of a different, HTTP-only streaming system.

First of all, supporting both plain RTSP and RTSP-in-HTTP, while using the same exact streaming logic behind the scenes, requires a much more flexible parser, which isn't easy because of the HTTP design issues I already discussed. Of course, once the work is done it's done, but the complexity of such a parser isn't negligible.

But since the work took me quite a short time, it wouldn't really be that bad, if the technique worked as well as it's supposed to. Unfortunately, that's not the case. For instance, the default configuration of net-proxy/polipo (a French HTTP proxy) does not allow the technique to work, because of the way the tunnel is designed: pipelining and connection re-use, which are very common things for proxies to do to improve performance, usually mean waiting for the server to complete a request before the response is returned to the client; unfortunately the GET request made by the client is one that will never complete, as it is where the actual streaming happens.

In the end I found it definitely easier to use good old squid for testing purposes, even though its documentation only explains at one (very well hidden) point which parameters to set to make it work with QuickTime. But it definitely means that not all HTTP proxies will let this technique work correctly.

And it’s definitely not the only reason. Since the HTTP and RTSP protocols are pretty similar, even the documentation says that if it POSTed the RTSP requests directly, it would have been seen as a bad HTTP requet by the proxy; to avoid that the requests are sent base64-encoded (which means, bigger than the original). But while the data coming from the client is usually scrutinised more, proxies nowadays probably scrutinise the responses as well as the requests, to make sure that they are not dealing with a malicious server (phising or stuff like that); and if they do, they are very likely to find the response coming from the GET request quite suspicious, likely considering it a tentative to HTTP response splitting (which is a common webapp vulnerability).

Now, of course it would have been possible for Apple to simply extend the trick by encoding the response as well as the request, but that has one huge drawback: it would both increase the latency of the stream (because the base64 content would have to be decoded before it's used) and increase the size of the response by one third, due to the nature of that encoding. Another alternative would have been to base64-encode only the pure RTSP responses, and keep the RTP streams (which are carried over interleaved RTSP) unencoded. Unfortunately this would have required more work, since at that point the GET body would no longer be stream-compatible with a pure RTSP stream, and thus wouldn't be very transparent for either the client or the server.
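
The one-third figure is simply a property of base64, which turns every 3 bytes of input into 4 bytes of output; a quick sketch, with a made-up RTSP request, shows it:

    # base64 produces 4 output bytes for every 3 input bytes, so the encoded
    # message is roughly a third bigger (a little more, with padding).
    import base64

    rtsp_request = (b"OPTIONS rtsp://example.com/stream RTSP/1.0\r\n"
                    b"CSeq: 1\r\n\r\n")
    encoded = base64.b64encode(rtsp_request)
    print(len(rtsp_request), len(encoded))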

On the other hand, the idea of implementing that as an extension hasn't entirely disappeared from my mind; since channels one and up are used by the RTP streams, channel zero is still unused, and it would be possible to simply use it to send the RTSP responses encoded in base64. At least in feng this wouldn't require huge changes to the code, since we already treat channel zero as special for the SCTP connection.

With all these details considered, I can understand why Apple was looking into alternatives. What I still cannot understand is what they decided to use as the alternative, since the new HTTP Live Streaming protocol looks tremendously hacky to me. Hopefully, our next step is rather going to be Adobe's take on a streaming protocol.

Please use HTTP features

I know I’m pretty needy on this matters, but there is one thing that drives me crazy about software using HTTP and that is the non-use of the HTTP features that have been created to save bandwidth. I hate that kind of software because from one side, I don’t always connect via flatrate (my phone provider has a semi-flatrate by the traffic) and from the other, I know that servers don’t always have free bandwidth.

So when I find a free software project designed to warn you about changes in web pages that does not seem to know about If-Modified-Since and If-None-Match, I tend to discard it; I'm especially worried if the software does not warn you against setting the polling interval to every minute.
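
For the record, using these headers takes only a handful of lines; here is a minimal Python sketch of a conditional fetch (a real reader would persist the validators between runs rather than keep them in memory):

    # Send back the validators from the previous fetch; a 304 reply means the
    # page has not changed and nothing gets re-downloaded.
    import urllib.request
    import urllib.error

    def fetch_if_changed(url, etag=None, last_modified=None):
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None, etag, last_modified
            raise
        return (resp.read(), resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"))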

But it doesn’t stop at free software; in the last week I’ve noticed an extensive increase in the traffic generated from my blog; rather than the usual 200300 MB a day, it started generating 500/600MB a day, constantly, without any new referrer that might explain the increase. After looking at the statistics for a little longer I noticed that someone from Germany were making over 1000 requests a day.. quick check on the logs and it turned out that the user was requesting my main article feed once per minute, via his feed reader.

Now, checking once per minute is already a bit too much; most planets and other feed readers check at most once per hour, and I usually try hard not to post more than twice per day. But the real problem is that the feed reader software (FeedReader by NewsBrain) is actually braindamaged: it does not use the HTTP headers to only request the feed if it has changed, so it kept requesting the same content over and over and over. Given that it seems to be commercial proprietary software, and it doesn't seem to have a clue about the protocol it's designed to use, that feed reader is now blacklisted on my server and will not work with the websites hosted there.

So please, if you develop software that makes use of the HTTP protocol, learn to use its features!

Never a panacea: ModSecurity drawbacks

This article was originally published on the Axant Technical Blog.

I’ve written before of using ModSecurity for reducing bots traffic, especially for those bots that are not important to the success of a site, like almost all of the so-called “marketing bots”. Unfortunately, installing and setting up ModSecurity with the defaults parameter can cause quite a bit of headaches, especially for technically-oriented applications.

There are indeed quite a few drawbacks to the use of that module, in particular related to the Core Rule Set that ships with it; some of the rules are quite taxing on the web server (since it may end up having to parse a lot of data), and others simply hit false positives quite easily.

For instance, the rule with id 960017 (Host header is a numeric IP address), while quite valid in general, usually breaks the Nagios HTTP check, while the very draconian 950005 will stop any application from receiving posts that talk about most Unix paths, including /etc. Luckily enough, mod_security does provide multiple ways to whitelist against rules: you can use rules that match on the User-Agent (a bad idea for whitelisting) or on the source IP (better), or you can use Apache environment variables.
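
As an example of the source-IP approach, something like the following (the address is made up) skips the numeric-Host rule only for the monitoring box that runs the Nagios check, rather than disabling it everywhere:

    # Illustrative whitelist: requests from the monitoring host skip rule 960017
    # (Host header is a numeric IP address) instead of removing it globally.
    SecRule REMOTE_ADDR "@streq 192.0.2.10" \
        "phase:1,t:none,nolog,pass,ctl:ruleRemoveById=960017"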

For instance, my blog has the following entry in its vhost definition to both apply my antispam rules and exclude the draconian rule:

Include /etc/apache2/vhosts.d/modsec_antispam.include
SecRuleRemoveById 950005

(Yes, there are a few things that still need to be cleared up, especially regarding trackbacks, which should probably have different antispam rules from the comments; in particular, trackbacks probably shouldn't arrive from browsers at all.)

So unfortunately, some time will have to pass before mod_security can be set up as a standard piece of software on Apache servers…

CrawlBot Wars

This article was originally published on the Axant Technical Blog.

Everybody who has ever wanted to run a "successful website" (or, more recently, thanks to the Web 2.0 hype, a "successful blog") knows the blessing and curse of crawlers, or bots, that are unleashed by all kinds of entities to scan the web and report the content back to their owners.

Most of these crawlers are run by search engines, such as Google, Microsoft Live Search, Yahoo! and so on. With the widespread use of feeds, at least Google and Yahoo! have added, alongside their standard crawler bots, feed-specific crawlers that aggregate blogs and other feeds into nice interfaces for their users (think Google Reader). Together with this kind of crawler, though, there are less useful, sometimes nastier crawlers that either don't answer to search engines, or answer to search engines whose ethics make one wonder.

Good or bad, at the end of the day you might not want some bots to crawl your site; some time ago certain Free Software -bigots- activists wanted, for instance, to exclude the Microsoft bot from their sites (I have other ideas about that), but there are bots that are even more useful to block, like the so-called "marketing bots".

You might like Web 2.0 or you might not, but certainly lots of people have found the new paradigm of the Web to be a gold mine for making money out of content others have written – incidentally, these are not, as RIAA, MPAA and SIAE insist, the "pirates" that copy music and movies, but rather companies whose objective is to provide other companies with marketing research and data based on the content of blogs and similar services. While some people might be interested in getting their blog scanned by these crawlers anyway, I'd guess that for most users who host their own blog this is just a waste of bandwidth: the crawlers tend to be quite pernicious, since they don't use the If-Modified-Since or ETag headers in their requests, and even when they do, they tend to make quite a few requests to the feeds per hour (compare this with Google's Feedfetcher bot, which requests at most one copy of the same feed per hour – well, unless it is confused by multiple compatibility redirects, like it unfortunately is with my main blog).

While there is a voluntary exclusion protocol (represented by the omnipresent robots.txt file), only the actually "good" robots consider it, while evil or rogue robots can simply ignore it. Also, it might be counter-productive to block rogue robots even when they do look at it. Say a rogue robot wants your data and, to pass as a good one, advertises itself in the User-Agent string, complete with a link to a page explaining what it is supposedly doing and stating that it honours the exclusion protocol. If you exclude it in robots.txt, you give it enough information to choose a different User-Agent string that is not listed in the exclusion file.

One way to deal with the problem is to block the requests at the source, answering straight away with an HTTP 403 (Forbidden) on the web server when a blacklisted bot makes a request. When using the Apache web server, the easiest way to do this is with mod_security and a blacklist rule for rogue robots, similar to the antispam system I've been using for a few months already. The one problem I see with this is that Apache's mod_rewrite seems to be executed before mod_security, which means that for any request that is rewritten by compatibility rules (moved, renamed, …) there is first a 301 response and only after that an actual 403.
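
The blacklist rule itself can be as simple as a phrase match on the User-Agent; the bot names below are placeholders, since compiling the real list is the whole point of the exercise:

    # Hypothetical blacklist: deny requests whose User-Agent contains any of the
    # listed (placeholder) crawler names, logging the hit for later review.
    SecRule REQUEST_HEADERS:User-Agent "@pm SomeMarketingBot AnotherRogueCrawler" \
        "phase:1,t:none,log,deny,status:403,msg:'Blacklisted crawler'"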

I’m currently working on compiling such a blacklist by analysing the logs of my server, the main problem is deciding which crawlers to block and which to keep. When the description page explicitly states they are marketing research, blocking them is quite straightforward; when they seem to provide an actual search service, that’s more shady, and it turns down to checking the behaviour of the bot itself on the site. And then there are the vulnerability scanners.

Still, it doesn’t stop here: given that in the Google description of GoogleBot they provide a (quite longish to be honest) method to verify that a bot is actually GoogleBot as it advertises itself to be, one has to assume that there are rogue bots out there trying to pass for GoogleBot or other good and lecit bot. This is very likely the case because some website that are usually visible only by registered users make an exception for search engine crawlers to access and index their content.

Malware in particular, looking for backdoors into a web application, is likely to forge the User-Agent of a known good search engine bot (one that is probably not blocked by the robots.txt exclusion list), so that it doesn't set off any alarm in the logs. So finding "fake" search engine bots is likely to be an important step in securing a web server running web applications, whether they are trusted or not.

As far as I know there is currently no way in Apache to check that a request actually comes from the bot it declares itself to come from. The nslookup method that Google suggests works fine for forensic analysis, but it's almost impossible to perform properly within Apache itself, and not even mod_security, by itself, can do much about it. On the other hand, there is one thing in the recent 2.5 versions of mod_security that can probably be used to implement an actually working check: the loading of Lua scripts. Which is what I'm going to work on as soon as I find some extra free time.

Feed readers and HTTP features

Please note: this post was written quite some time ago, before the Typo upgrade, among other things, and while I'm going to re-read and fix it up, I might have left something out of sync, sorry.

Following some of the older changes to the feeds (removing the RSS feeds and replacing all of them with Atom feeds), I started looking at the behaviour of news readers to make sure they work as expected. This made me notice quite a few behaviours that I really wonder about.

First of all, most newsreaders seem to properly implement the HTTP/1.1 rules that allow for 304 (Not Modified) responses to avoid re-fetching the whole feed if there have been no changes; this is very good, because it saves bandwidth on both sides. On the other hand, none of them seems to record 301 (Moved Permanently) replies, which causes the server to keep receiving requests on the old URLs after a move (and since I migrated from an old Typo to a new one, I have lots of URL rewriting going on). Crawlers and aggregators like Google's or Yahoo's also fail to record that.

While 302 is a temporary move that should not be recorded, one could argue that a permanent move should be saved, at least in an application that keeps a collection of URLs, like a feed reader does. Now of course it's also true that if you could hijack the DNS of a domain and send a Moved Permanently pointing to a different server, it would be quite nasty, but it's something I think should be looked into.
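
What I'd like readers to do is roughly this (a Python sketch with a made-up storage API; the point is only that the stored URL gets updated on a 301 and left alone on a temporary redirect):

    # Handle redirects manually so a permanent move updates the stored feed URL
    # while a temporary one is followed without being recorded.
    import urllib.request
    import urllib.error

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None  # don't follow automatically; surface the 3xx instead

    opener = urllib.request.build_opener(NoRedirect)

    def poll_feed(store, feed_id):
        url = store.get_url(feed_id)          # hypothetical storage API
        try:
            return opener.open(url).read()
        except urllib.error.HTTPError as err:
            if err.code == 301:               # permanent: remember the new URL
                store.set_url(feed_id, err.headers["Location"])
                return poll_feed(store, feed_id)
            if err.code in (302, 307):        # temporary: follow, don't record
                return opener.open(err.headers["Location"]).read()
            raise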

But one thing that I do find disturbing is that some feed readers don't implement these HTTP/1.1 features; for instance newsbeuter (the behaviour actually comes from the librss library it uses; I have already asked the author about this), instead of using the If-Modified-Since or If-None-Match headers, runs a HEAD request for the feed repeatedly, and a GET when something has indeed changed. It's not like I have a big problem with that, since a HEAD request is still better than a GET repeated over and over and over. Which is what some services seem to be doing, especially some "enterprise" services that seem to re-sell search services on a per-keyword basis.

In general, I’m now considering finding a way to check whether I can identify the “rogue” agents who request the feeds without conditional gets, and see if I can contact their technical support to get the thing fixed, but sure it’s tremendous to see that nowadays there are still people writing “enterprise” crawlers who don’t know HTTP/1.1 provides feature to avoid wasting others’ bandwidth! If you’re using some free feed reader and you don’t know how it behave, you can try to check with wireshark which kind of requests it does, and in case you might want to tell upstream about these features.

Remember that it doesn't just save my bandwidth, it also saves yours, and the whole Internet's. It's also why feeds are much more useful than web pages when you just want to read an article, if it's in the feed, that is. And don't think the feed is all that small: my articles feed weighs in at slightly under 200KB of data, in Atom.

Why RTSP?

Lately I’ve been writing more often about my work on feng and the lscube project ; the idea behind lscube is to get a well-working and well-scaling entirely free streaming software stack, both server-side (feng) and client-side (libnemesi), with the ability to stream live content (with flux). The protocol used by this stack is the Real Time Streaming Protocol, currently version 1, as designed by the RFC2326. RTSP, originated from RealNework, is just the control protocol, and uses out of band connections for sending the data, using, in our case, the RTP protocol (or it can use multiplexed connection, like interleaved RTSP or SCTP). The whole protocol description is quite tedious and is not what I’m interested in writing about right now.

What I think might be worth explaining is why we still care about RTSP, given that the world of audio/video streaming lately seems to focus on the much more generic HTTP protocol (calling it the Hyper Text Transfer Protocol was probably underestimating its actual use, I guess). Indeed, even Cherokee implemented support for a/v streaming, and while Alvaro shows how this can be used to implement the <video> tag, we also know that the video tag is not going to be the future, at least not for the "big" streaming sites; indeed, most streaming sites will try their best to keep external players from accessing their content. But RTSP is, after all, implemented by a very wide range of companies, including two open-source server projects (Helix DNA Server by RealNetworks and Darwin Streaming Server by Apple), and both Apple and Microsoft in their own multimedia stacks.

The idea is not to use RTSP for short video clips, which can, after all, be cached very well, but rather to stream longer-playing content, with a few advantages over the HTTP method. First of all, HTTP isn't really practical for live streaming (unicast or multicast doesn't really matter here); it's much easier to do with RTSP than with HTTP. Also, RTSP allows for precise seeking and pausing, which HTTP does not, at least not without lots of tricks and hacks. And then there is multicast, the magic keyword that I've heard spoken many times since I had my first internet connection. Indeed, my ISP used to have some experimental multicast-based streams for 56k dial-up and 256kbit ADSL; nowadays they no longer provide that feature (I know nobody who was ever able to get it to work anyway), but I guess they did use what they learned at the time to implement their IPTV system on ADSL2.
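
To make the seeking point concrete, this is roughly what a seek looks like at the protocol level: a PLAY request with a Range header expressed in Normal Play Time, as defined in RFC 2326 (the URL and session identifier here are made up):

    PLAY rtsp://example.com/anime/episode01 RTSP/1.0
    CSeq: 5
    Session: 12345678
    Range: npt=1800-

The server then starts sending RTP packets from the 30-minute mark, without the client having to download or discard anything that comes before it.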

I have one situation clear in mind where multicast streaming, together with precise seeking, would be very helpful. At my high school we had an "English multimedia laboratory", which basically was a classroom with fourteen crusty PCs wired up together; half the time we would use them as normal computers to browse the net for whatever reason (usually because neither the teacher nor the lab assistant wanted to do anything during the day), the other half they would be switched to just repeat the same video signal to all the monitors (which were, obviously for the time, CRTs; on the other hand, the way the stuff was wired up, in either mode the monitors further at the back had a very bad signal). Doing the same thing, all digital, with a simple Gigabit Ethernet connection, would probably have given much better results (on the other hand, one could argue that having a single big TV would have saved the hassle).

Now, while all these things can probably be forced down HTTP's throat with webapps, services, protocol extensions and whatever, having a dedicated protocol like RTSP that handles them is probably quite an improvement; I'm certainly looking forward to the day when the set-top box in my bedroom (an AppleTV right now) will be able to stream my anime over the Ethernet connection with RTSP, so that I can seek without having to wait for the buffer to catch up, and easily skip to the middle of an episode without having to wait for all of it to download.

So basically, yeah, RTSP is a bit more niche than HTTP right now, but I don't see it as dead at all; it's actually technologically pretty cool, just underutilised.

Code reuse and RFC 822 message parsing

If you’re just an user with no knowledge of network protocols you might not think there is any difference between an email, a file downloaded through the web, or a video streamed from a cerntral site. If you have some basic knowledge, you might expect the three to instead have little in common, since they come in three different protocols, IMAP (for most modern email systems, that is), HTTP and (for the sake of what I’m going to say), RTSP. In truth, the three of them have quit a bit in common, represented by RFC 822. A single point of contact between this, and many other, technologies.

The RTSP protocol (commonly used by both RealNetworks and Apple, besides being quite a nice open protocol) uses a request/response system based on the HTTP protocol, so the similarity between the two is obvious. And both requests and responses in HTTP and RTSP are almost completely valid messages according to the RFC 822 specification; the same one used for email messages.
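
As a toy demonstration of the overlap: once the request line is stripped, the header block of an RTSP (or HTTP) request parses just fine with a generic RFC 822 parser; here Python's email.parser stands in for one (the request itself is made up):

    # Strip the request line, then hand the header block to an RFC 822 parser.
    from email.parser import Parser

    raw = ("SETUP rtsp://example.com/stream/track1 RTSP/1.0\r\n"
           "CSeq: 3\r\n"
           "Transport: RTP/AVP;unicast;client_port=5000-5001\r\n"
           "\r\n")
    request_line, _, header_block = raw.partition("\r\n")
    headers = Parser().parsestr(header_block, headersonly=True)
    print(request_line)
    print(headers["CSeq"], headers["Transport"])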

This is something that is indeed very nice, because it means that the same code used to parse email messages can be used to parse requests and responses for those two protocols. Unfortunately, it's easier said than done. Since I've been working on feng, I've been trying to reduce the amount of specific code that we ship, trying to re-use as much generic code as possible, which is what brought us to use ragel for parsing and glib for most of the utility functions.

For this reason, I also considered using the gmime library to handle the in-memory representation of the messages, as well as possibly the whole parsing further on. Unfortunately, when trying to implement it I noticed that in quite a few places I would end up doing more work than needed, duplicating parts of the strings and freeing them right away, with the gmime library doing a final duplication to save them in the hash table (because both my original parser and gmime end up with a GHashTable object).

For desktop applications this overhead is not really important, but it is for a server project like feng: not only does it add an overhead that can be considerable given the target of hundreds of requests per second the project aims for, it also adds one more failure point where the code can abort on out-of-memory. Unfortunately, Jeffrey Stedfast, the gmime maintainer, is more concerned with the cleanness of the API, and its use on the desktop, than with micro-optimisation; I understand his point, and I thus think it might be a better choice for me to write my own parser to do what I need.

Since the parser can be a reusable component in its own right, I'm also going to make sure that it can sustain a high load of messages to parse. Unfortunately, I have no idea how to properly benchmark the code; I'd sincerely like to compare, after at least a draft is done, the performance of gmime's parser against mine, both in terms of memory usage and of speed. For the former I would have used the old massif tool from valgrind, but I can't get myself to work with the new one; and I have no idea how to benchmark the speed of the code. If somebody knows how I could do that, I'd be glad to hear it.

Basically, my idea is to make the parser work in two modes: a debug/catch-all mode where the full headers are parsed and copied over, and another one where the headers are parsed but only saved when they are accepted by a caller-provided function. I haven't yet put the idea to the test, but I guess the hard work would be done more by the storage than by the actual parser, especially considering that the parser is implemented with the ragel state machine generator, which is quite fast by itself. And if it doesn't help the speed of the parser itself, it would certainly reduce the amount of memory used, especially while parsing possibly crafted messages.
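
In pseudo-code terms (this is just an illustration of the filtering idea, not feng's actual ragel-based parser), the second mode looks something like this:

    # Parse every header, but only store those that the caller-provided
    # predicate accepts; everything else is dropped as soon as it is seen.
    def parse_headers(lines, accept=lambda name: True):
        headers = {}
        for line in lines:
            if not line.strip():
                break                      # blank line ends the header section
            name, _, value = line.partition(":")
            name = name.strip().lower()
            if accept(name):
                headers[name] = value.strip()
        return headers

    # Debug/catch-all mode keeps everything; server mode keeps a whitelist.
    wanted = {"cseq", "session", "transport", "content-length"}
    parse_headers(["CSeq: 3", "X-Junk: aaaa", "Session: 42", ""],
                  accept=lambda name: name in wanted)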

Hopefully, given enough time and effort, this might produce a library that can be used as a basis for parsing and generating requests and responses for both RTSP and HTTP, as well as for parsing e-mail messages and other RFC 822 applications (I think, but I'm not sure, that the MSN Messenger protocol uses something like that too; I do know that git uses it, though).

Who knows, maybe I’ll resume gitarella next, and write it using ruby-liberis, if that’s going to prove faster than the current alternatives. I sincerely hope so.