Parsers, tokenizers and state machines

You might remember that I have been having fun with Ragel in the paste, since Luca proposed using it for LScube projects (which reduced to using it on feng, for what I am concerned). The whole idea sounded overly complicated in the first place, but then we were able to get most, if still not all, kinks nailed.

Right now, feng uses Ragel to parse the combined RTSP/HTTP request line (with a homebrew backtracking parser, which became the most difficult feature to implement), to split the headers into a manageable hash table (converting them from string into enumerated constants at the same time, another kinky task), to convert both the Range and Transport headers for RTSP into lists of structures that can be handled in the code directly, and for splitting and validating URLs provided by the clients.

While the Ranges, Transports and URLs parsing is done directly with Ragel, and without an excessive amount of blunt force, both the parsing of the request line and the tokenizing of header names is “helped” by using a huge, complex and absolutely ludicrous (if I may say so) Rube Goldberg machine using XML and XSLT to produce Ragel files, which then produce C code, that is finally compiled. You might note here that we have at least three sources and four formats: XML → Ragel → C → relocatable object.

The one thing that we aren’t doing with Ragel right now is the configuration file parser. Originally, Luca intended the configuration file for feng to be compatible with the lighttpd one, in particular allowing to include bits and pieces of the two. As it happens, feng really needs much less configuration, right now, than lighttpd ever did, and the only conditional we support is based on the $SOCKET variable to produce something akin to vhost support. The lighttpd compatibility here is taking a huge toll in the size of code needed to actually parse the file, as well as the waste of time and space to parse it in a tree of tokens and then run through them to set the parameters. And this is without launching into the problems caused by actually wanting to use the “lemon” tool.

Unfortunately, neither replacing the parser (for the same format) with Ragel, nor inventing a new format (keeping the same features) to parse with Ragel is looking interesting. The problem is that Ragel provides a very low-level interface for the parsers, and does not suit well the kind of parsing a configuration file needs. There is an higher-level parser based upon Ragel, by the same author, called Kelbt, but it’s barely documented, and seems only to produce C++ code (which you should probably know already how I dislike).

I definitely don’t intend to write my own Ragel-based parser generator like I did for RTSP (and HTTP); I also don’t want to lose support for “variables” (since many paths are created by appending root paths with sub-directories), so I don’t think switching to INI itself would be a good idea. I considered using the C pre-processor to convert an user-editable key-value pair file into a much simpler one (without conditionals, vhosts and variables) to then parse it, but it’s still a nasty approach, as it requires a further pass after editing and depends on the presence of a C preprocessor anyway.

I also don’t intend using XML for the configuration file! While I have used XML in the above-mentioned Goldberg machine, it smells funny to me as well as anyone else. Even though I find those people judging anything using XML as “crap” delusional, I don’t think XML is designed for user-editable configuration files (and I seriously abhor non-user-editable configuration files, such as those provided by Cherokee, to name one).

Mauro, some time ago, suggested me to look into Antlr3, unfortunately at the time I had to stop because of the lack of documentation on how to do stuff with it. I had already enough trouble to get proper information out of the Ragel documentation (which looks like the thesis document it is). Looking up Wikipedia earlier to look for a couple of links for documentation, it seems like the only source of good documentation is a book — since my time to read another technical book right now is a bit limited, and spending €20 just to get the chance to decide whether I want Antlr or not, doesn’t look appealing either.

Anyway, setting aside the Antlr option for a moment, Luca prodded me about one feature I overlooked in Ragel, maybe because it was introduced later than I started working on implementing it: Scanners. The idea of a scanner is that it reduces the work needed to properly assign the priorities to the transitions so that the longest-matching pattern executes the final code. This would be nice, wouldn’t it? Maybe it would even allow us to drop the Goldberg machine, since the original reason to have it was, indeed, the need to get those priorities right without having to change all the code repeatedly until it works properly.

Unfortunately, to use Ragel scanners, I had to change first of all the Goldberg machine, and splitting it further into another file, since the scanners can only be called into, rather than included in the state machine (and having the machine defined in the same file seem to trigger the need to declare all the possible variables at the same time, which is definitely not something I want to do for all state machine). This was bad enough (making the sources more disorganises, and the build system more complex), but after I was able to build the final binary, I also found an increment of over 3KB of executable code, without optimisations turned on (and using state tables, rather than goto statements, an increase in code size is a very bad sign).

Now it is very well possible that the increase in code size gets us a speed-up in parsing (I sincerely doubt so, though), and maybe using alternative state machine types (such as the above-mentioned goto based one) will provide better results still. Unfortunately, assessing this gets tricky. I already tried writing some sort of benchmark of possible tokenizers, mostly because I fear that gperf, lying on its name, is far from perfect and is producing suboptimal code for too many cases (and gperf is used by many many projects, even at a basic system level. The problem with that is that I suck at both benchmarking and profiling, so it started.

Anyway, comments open, if you have any ideas on how I can produce the benchmark, or suggestions about the configuration file parsing and so on, please let me know. Oh and actually there are two further custom file formats (domain-specific languages at all effects) that are currently parsing in the most suboptimal of ways (strcmp()), and we would need at least one more. If we could get one high-level, but still efficient (very efficient if possible) parser generator, we would definitely gain a lot in tem of managing sources.

4 thoughts on “Parsers, tokenizers and state machines

  1. Ragel is really well suited to tokenizing though, and you already use it. I would just bear it and go along with that. Feeding Ragel into a proper parser like Bison is not hard either, but I assume that’s a step you’re trying to avoid.However I do think that rolling your own config DSL is a bit cumbersome. If you’re willing to add dependencies, how about something like yaml or json? Lua, even? That opens the door to scriptable behavior, which would be a killer feature. I can see myself using that. If you don’t add it, I just might…

    Like

  2. We are already using json-c so we might switch to pure json configuration files, that might have the added bonus of providing an simpler way to produce them from external apps (e.g. Theater).

    Like

  3. Making my own toy language, I experimented with Ragel, re2c, ANTLR, and writing my own lexer by hand in C (which I found to be more trouble than it was worth compared to the others). In my experience, nothing beat Ragel and re2c; both were easy to use and produced code with slightly better performance to my hand-written lexer. You were right in staying away from ANTLR: Besides the fact that ANTLR performed much worse, it was also a Byzantine, ill-documented mess to work with, as you found out yourself. It is meant for Java, and the C version basically makes you write Java in C.IMHO, re2c is easier to use than Ragel for a small language, but Ragel gives you more power if you need it. Ragel’s other advantage is that it is more actively developed, but then re2c is mature enough that it doesn’t really need too many tweaks anymore. That’s my two copper pieces on lexers; make of it what you will.(FWIW, for parsers, I found Bison to perform much better than Lemon, with both being equivalently easy to use. However, in both it seemed much harder to generate reasonable error messages than in a simple hand-written recursive descent parser.)

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s