Runtime vs buildtime

No, this time I’m not talking about failures but about parsing.

For feng I’m now working on replacing the main configuration parser with something neater and possibly simpler. Unfortunately, as it happens, most of the parsers out there follow the flex and yacc model of splitting lexing and parsing into two separate stages. This is all fine and dandy when you have extensible syntax and all the frills, but I find it a bit cumbersome, as the work ends up split between runtime and build time.

This is why for the smaller stuff we’ve been using Ragel: we can embed lexing and parsing in a single step, which is very fast (and thus very good for parsing the data that arrives from the client, which we have to answer right away). Unfortunately, while generally flexible, Ragel has its limits, as Luca and I found; one of these is recursive definitions, which luckily we don’t need in the context of feng, but which I tried to use for Ruby-Elf to demangle C++ names.

On the other hand, I found before that Ragel tends to be quite flexible when used as an intermediate language; in feng, to properly parse request lines (and in particular the method name) I use an (admittedly bloated) Rube Goldberg machine made of XML, XSLT and Ragel, which translates a simple list of protocols and their methods into C code that parses the request lines. This is needed because Ragel has no backtracking, and I’d have to parse the same line twice to emulate it. An alternative would be validating the methods against the protocol, but that’s also a difficult thing to do… so for now this will do.
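As a rough illustration of the "validate the method against the protocol" alternative, here is a minimal Python sketch. It is not how feng does it: the table contents and the `parse_request_line` name are made up for the example, and it sidesteps whatever makes this awkward inside Ragel-generated C. The point is only that the method comes first on the line while the protocol that decides whether it is valid comes last.

```python
# Hypothetical table for illustration; feng generates its real one from XML.
METHODS = {
    "RTSP/1.0": {"DESCRIBE", "SETUP", "PLAY", "PAUSE", "TEARDOWN", "OPTIONS"},
    "HTTP/1.0": {"GET", "POST"},
}

def parse_request_line(line):
    """Split 'METHOD target PROTOCOL', then check the method afterwards."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError("malformed request line")
    method, target, protocol = parts
    # The protocol is only known once the whole line has been read,
    # so the method can only be validated at this point.
    if method not in METHODS.get(protocol, ()):
        raise ValueError(f"{method} is not a {protocol} method")
    return method, target, protocol

request = parse_request_line("DESCRIBE rtsp://myhost.tld/stream RTSP/1.0")
```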

But already when I wanted to replace the custom “sd” format, which was implemented as a loop of fgets() and g_ascii_strcasecmp() calls (a definite waste, because it was lowercasing the text on both sides every time!), Ragel didn’t seem like that good a choice. In the end I simply used the glib parser for INI-like files, which was already present, but that didn’t reduce the code as much as I wished, because I still had to look up the configuration lines by string in the loaded file structure.
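To make the waste concrete: a case-insensitive comparison in a loop folds both strings on every call, while folding the input key once and looking it up in a table of already-lowercase keys does that work a single time per line. A minimal Python sketch of the second approach follows; the key names and the `Key value` line layout are invented for illustration and are not the real “sd” format.

```python
# Hypothetical keys; the real "sd" keys are not listed in this post.
KNOWN_KEYS = {"file_name": str, "clock_rate": int, "payload_type": int}

def parse_line(line):
    """Split 'Key value' and fold the key's case exactly once."""
    key, _, value = line.strip().partition(" ")
    key = key.lower()                # single case fold, on the input side only
    convert = KNOWN_KEYS.get(key)    # table keys are stored lowercase already
    if convert is None:
        return None                  # unknown key: the caller decides what to do
    return key, convert(value.strip())

lines = ["File_Name test.mp4", "CLOCK_RATE 90000", "bogus 1"]
parsed = [parse_line(line) for line in lines]
```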

So on one side the glib parser is neat enough because it’s extensible: anything that is not known, but has a valid syntax, is kept in memory, so that extended files can be used by older versions of feng without problems; on the other side, this also means that if the file is huge, nonsense is loaded into memory as well. While it does strip comments, it doesn’t strip unknown elements, because it doesn’t know which elements are valid.
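The same trade-off can be seen with any generic INI-style loader. The sketch below uses Python’s standard `configparser` purely as a stand-in for the glib key-file parser (the two APIs differ, and the section and key names are invented): comments disappear, but a key the program has never heard of stays in memory, and known settings still have to be fetched one by one by string name.

```python
import configparser

SAMPLE = """\
[global]
port = 8554
# comments are dropped by the parser
future_option = something this version does not know about
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Known settings are looked up by string, one at a time.
port = parser.getint("global", "port")

# The unknown key was loaded too: the parser cannot tell that it is useless.
unknown = [key for key in parser["global"] if key not in {"port"}]
```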

Argument parsing, by contrast, usually works the other way around: rather than letting the parser understand everything and then asking it whether the parameters you’re looking for are there, you write in the code a table describing the accepted parameters, and the parser compares each option against that table; depending on the format/library/concept used, it either sets variables or calls functions registered as callbacks to tell you that the option was encountered. Unrecognised options can either be ignored or cause an error.
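A minimal sketch of that table-driven style, in Python for brevity. The option names, the `settings` dictionary and the `strict` switch are assumptions made for the example; no particular library is being imitated.

```python
settings = {}

def set_int(name):
    """Build a callback that stores the value as an integer."""
    def handler(value):
        settings[name] = int(value)
    return handler

def set_flag(name):
    """Build a callback that stores the value as a boolean."""
    def handler(value):
        settings[name] = value in ("on", "yes", "true")
    return handler

# The table of accepted parameters lives in the code, not in the file.
OPTIONS = {
    "port": set_int("port"),
    "sctp": set_flag("sctp"),
}

def parse_options(pairs, strict=True):
    """Compare each (name, value) pair against the table."""
    ignored = []
    for name, value in pairs:
        handler = OPTIONS.get(name)
        if handler is None:
            if strict:
                raise ValueError(f"unrecognised option: {name}")
            ignored.append(name)     # lenient mode: skip it and carry on
            continue
        handler(value)               # callback fires as the option is seen
    return ignored

ignored = parse_options(
    [("port", "554"), ("sctp", "on"), ("colour", "red")], strict=False
)
```

Nothing unknown is ever stored here: an option either matches the table and triggers its callback, or it is dropped or rejected on the spot.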

Special-format parsers, such as most JSON parsers, seem to do something similar: they let you register callbacks that are called when specific tokens are detected… the result, honestly, is often clumsy to me, and the compiler has no way to understand well what you’re doing. Others, such as json-c, prefer to load everything into memory, and then you’re on your own to parse it; if it’s structured data… good luck.

The lemon parser generator that lighttpd (and thus feng, for better or worse) used seems to have produced a combined approach: on one side the configuration file is parsed and loaded into memory, and on the other side two long tables are used to scan through the parsed results. This N-pass method is complex, though, and still does a lot more work than it should, while I still cannot report a broken constraint to the user with the line number where the error was in the first place.

So I’m now considering writing a Ragel parser generator: you describe in some verbose way the format of the configuration file, define actions for what to do when a described configuration variable is found, add support for reporting an error to the user if there is a mistake in the configuration variable… this kind of thing. What I’m not sure of is the format.
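To give an idea of the three ingredients (a description of the format, an action per variable, and an error that carries the line number), here is a hand-written Python sketch. It is neither Ragel nor a generator, and every name in it (`ConfigError`, `to_port`, `make_schema`, the `name value;` layout) is hypothetical; it only shows that a single pass over the file is enough to report where a bad value sits.

```python
class ConfigError(Exception):
    """Raised with the offending line number in the message."""

def to_port(text):
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    return port

def make_schema(config):
    """Describe each directive as name -> (converter, action)."""
    return {
        "port": (to_port, lambda value: config.__setitem__("port", value)),
        "user": (str, lambda value: config.__setitem__("user", value)),
    }

def parse(text):
    config = {}
    schema = make_schema(config)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        if not line.endswith(";"):
            raise ConfigError(f"line {lineno}: missing ';'")
        name, _, value = line[:-1].partition(" ")
        if name not in schema:
            raise ConfigError(f"line {lineno}: unknown directive '{name}'")
        convert, action = schema[name]
        try:
            action(convert(value.strip()))    # validate, then run the action
        except ValueError as exc:
            raise ConfigError(
                f"line {lineno}: bad value for '{name}': {exc}"
            ) from exc
    return config

config = parse("user feng;\nport 8554; # rtsp\n")
```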

Honestly, out of experience, the configuration file format that makes the most sense for feng is the one used by ISC for DHCP and BIND: it provides top-level global parameters, then named and unnamed sections. Luca’s original idea was to use the same syntax as lighttpd, to be able to reuse the files and share part of them, but we never came to have as many features as that, and it is actually proving troublesome; the fact that you use comparisons to define vhosts and sockets makes it very difficult to deal with.

Even more so, experience shows me that we need to keep the listening interfaces (and their parameters) separate from the virtual hosts; this gets easier to do since RTSP, at least, mandates providing the full URL on requests rather than just the local path. What I’m aiming for at this point would be something along these lines:

log level debug; # or 5
log syslog;

user feng;
group feng;

socket {
    port 8554;
}

socket {
    port 554;
    family all; # or ipv4 or ipv6
    sctp on;
    sctp streams 16;
}

vhost myhost.tld {
    root /var/lib/feng/myhost;
    log access syslog;
}

vhost yourhost.tld {
    root /var/lib/feng/yourhost;
    log access /var/log/feng/yourhost.tld;
}
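A syntax like this is small enough to be read in a single pass. The Python sketch below is only an illustration under assumptions: it treats `#` as starting a comment anywhere on a line, allows one level of sections, keeps statements as plain word lists, and its function names are invented; it says nothing about how feng will actually implement the parser.

```python
import re

# A token is one of the punctuation marks, or a run of other characters.
TOKEN = re.compile(r"[{};]|[^\s{};]+")

def tokenize(text):
    """Yield (line number, token) pairs, dropping '#' comments."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN.finditer(line.split("#", 1)[0]):
            yield lineno, match.group()

def parse(text):
    """Return (global statements, sections) for the syntax above."""
    top, sections = [], []
    current = None      # section being filled, or None at top level
    words = []          # words of the statement being read
    lineno = 0
    for lineno, tok in tokenize(text):
        if tok == ";":
            if not words:
                raise ValueError(f"line {lineno}: empty statement")
            (top if current is None else current["body"]).append(words)
            words = []
        elif tok == "{":
            if current is not None or not words or len(words) > 2:
                raise ValueError(f"line {lineno}: unexpected '{{'")
            name = words[1] if len(words) == 2 else None   # unnamed section
            current = {"kind": words[0], "name": name, "body": []}
            words = []
        elif tok == "}":
            if current is None or words:
                raise ValueError(f"line {lineno}: unexpected '}}'")
            sections.append(current)
            current = None
        else:
            words.append(tok)
    if current is not None or words:
        raise ValueError(f"line {lineno}: unexpected end of file")
    return top, sections

SAMPLE = """\
log level debug; # or 5
user feng;

socket {
    port 8554;
}

vhost myhost.tld {
    root /var/lib/feng/myhost;
    log access syslog;
}
"""
top, sections = parse(SAMPLE)
```

Because the line number travels with every token, a broken statement can be reported at the place where it was written, which is exactly what the multi-pass approach made hard.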