If you’re just a user with no knowledge of network protocols, you might not think there is any difference between an email, a file downloaded through the web, or a video streamed from a central site. If you have some basic knowledge, you might instead expect the three to have little in common, since they travel over three different protocols: IMAP (for most modern email systems, that is), HTTP and (for the sake of what I’m going to say) RTSP. In truth, the three of them have quite a bit in common, and that common ground is RFC 822: a single point of contact between these and many other technologies.
The RTSP protocol (commonly used by both Real Networks and Apple, besides being quite a nice open protocol) uses a request/response system based on the HTTP protocol, so the similarity between the two is obvious. And both requests and responses of HTTP and RTSP are almost completely valid messages under the RFC 822 specification, the same one used for email messages. The "almost" is mostly the first line: the request or status line is not an RFC 822 header, but everything after it is.
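To make the overlap concrete, here is a minimal sketch that feeds an RTSP request and an HTTP response to Python's stock email header parser, after splitting off the first line. The `parse_message` helper and the two sample messages are made up for illustration; this is not how feng does it, just a quick way to see the shared format.

```python
from email.parser import HeaderParser


def parse_message(raw):
    """Split off the request/status line, then parse the rest as RFC 822 headers.

    Hypothetical helper for illustration: the first line is protocol-specific,
    everything after it goes to the ordinary email header parser.
    """
    start_line, _, rest = raw.partition("\r\n")
    headers = HeaderParser().parsestr(rest)
    return start_line, headers


# Sample messages, invented for this example.
RTSP_REQUEST = (
    "DESCRIBE rtsp://example.com/stream RTSP/1.0\r\n"
    "CSeq: 2\r\n"
    "Accept: application/sdp\r\n"
    "\r\n"
)
HTTP_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 5\r\n"
    "\r\n"
    "hello"
)

rtsp_line, rtsp_headers = parse_message(RTSP_REQUEST)
http_line, http_headers = parse_message(HTTP_RESPONSE)

# Header lookup is case-insensitive, just as it is for email messages.
print(rtsp_line, "->", rtsp_headers["cseq"])
print(http_line, "->", http_headers["content-type"])
```

The same parser object handles both protocols without knowing anything about either; only the first line needs protocol-specific code.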
This is something that is indeed very nice, because it means that the same code used to parse email messages can be used to parse requests and responses for those two protocols. Unfortunately, it’s easier said than done. Since I started working on feng, I’ve been trying to reduce the amount of specific code that we ship and to re-use as much generic code as possible, which is what brought us to use ragel for parsing and glib for most of the utility functions.
For these reasons, I also considered using the gmime library to handle the in-memory representation of the messages, and possibly the whole parsing further on. Unfortunately, when trying to implement it I noticed that in quite a few places I would end up doing more work than needed: duplicating parts of the strings and freeing them right away, with the gmime library doing the final duplication to save them in the hash table (both my original parser and gmime end up with a GHashTable object).
For desktop applications, this overhead is not really important, but it really is for a server project like feng: not only does it add an overhead that can be considerable for the target of hundreds of requests a second that the project aims for, it also adds one more failure point where the code can abort when out of memory. Unfortunately, Jeffrey Stedfast, the gmime maintainer, is more concerned with the cleanness of the API, and its use on the desktop, than with its micro-optimisation; I understand his point, and I thus think it might be a better choice for me to write my own parser to do what I need.
Since the parser can be a component of its own that can be reused, I’m also going to make sure that it can sustain a high load of messages to parse. Unfortunately, I have no idea how to properly benchmark the code; I’d sincerely like to compare, once I have at least a draft, the performance of gmime’s parser against mine, both in terms of memory usage and speed. For the former I would have used the old massif tool from valgrind, but I can’t get myself to work with the new one. And I have no idea how to benchmark the speed of the code. If somebody knows how I could do that, I’d be glad to hear it.
Basically, my idea is to make sure that the parser works in two modes: a debug/catch-all mode where the full headers are parsed and copied over, and another one where the headers are parsed, but are saved only when they are accepted by a provided function. I haven’t put my idea to the test yet, but I guess that the hard work would be done more by the storage than by the actual parser, especially considering that the parser is generated by the ragel state machine generator, which is quite fast by itself. And even if it doesn’t help the speed of the parser itself, it would certainly reduce the amount of memory used, especially when parsing possibly crafted messages.
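The two modes could look roughly like the following sketch. It is plain Python rather than ragel-generated C, and the names (`parse_headers`, `accept`, the sample message) are hypothetical; it also ignores folded continuation lines. In C the point would be to call the accept function before duplicating the value, which a Python sketch can only hint at.

```python
def parse_headers(raw, accept=None):
    """Parse 'Name: value' lines up to the first blank line.

    accept=None      -> catch-all mode: every header is stored.
    accept=callable  -> filtered mode: a header is stored only if
                        accept(name) returns true.

    Illustrative sketch only: folded (continuation) lines are not handled.
    """
    stored = {}
    for line in raw.split("\r\n"):
        if not line:
            break  # blank line: end of the header section
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a header line; a real parser would report an error
        name = name.strip().lower()
        # Ask the caller before storing anything, so that rejected headers
        # never end up in the table.
        if accept is None or accept(name):
            stored[name] = value.strip()
    return stored


# Sample message, invented for this example.
RAW = (
    "CSeq: 2\r\n"
    "User-Agent: test\r\n"
    "X-Junk: aaaaaaaaaaaaaaaa\r\n"
    "\r\n"
    "body"
)

WANTED = {"cseq", "user-agent"}

everything = parse_headers(RAW)                            # debug/catch-all mode
filtered = parse_headers(RAW, lambda name: name in WANTED)  # filtered mode
```

With a crafted message carrying thousands of junk headers, the filtered mode keeps the table limited to the headers the server actually cares about.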
Hopefully, given enough time and effort, this might produce a library that can be used as a basis for parsing and generating requests and responses for both RTSP and HTTP, as well as parsing e-mail messages and other RFC 822 applications (I think, but I’m not sure, that the MSN messenger protocol uses something like that too; I do know that git uses it, though).
Who knows, maybe I’ll resume gitarella next, and write it using ruby-liberis, if that’s going to prove faster than the current alternatives. I sincerely hope so.
What about the new Massif is giving you trouble? Running it isn’t much different to the old one, except that you have to run the ‘ms_print’ script on the output file.
The new massif is not showing the actual memory usage over time, is it? And last I checked, it does not show an actual graphical graph at all.
I am considering GMime or Ragel to implement a Ruby library for fast parsing of RFC2822/MIME encoded mails. Are there any starting points I could (re-)use?