This Time Self-Hosted
dark mode light mode Search

Tokenizing in Ragel

For feng I’m currently working toward getting the parser(s) rewritten using Ragel the state machine generator. Using Ragel for the parsers usually means having a much faster conversion from the RFC822-based RTSP protocol to something that the program can access and understand directly.

One of the interesting things that I’ve been working on, though, is proper tokenisation of the protocol, for instance converting an header name into an enumerated type. The reason why that’s a problem is that it’s not really straightforward to produce, with Ragel, a state machine that can parse a string into an enumerated code, and still discern between invalid and unsupported values; and this in particular is important for us because we have to either ignore or give a 400-response (bad request) depending on that.

The first obvious way to define a token is this one:

MyToken = "FOO" % { token  = MyTokenFoo; } |
    "BAR" % { token = MyTokenBar; } |
    [A-Z]{3};

This works, but does not distinguish extremely well between invalid and unsupported; yes of course you can check in the code whether the parser reached the first final, but it can be made, in my opinion, more solid by setting the validity at the end of the last transition as well; unfortunately code like the following:

MyToken = "FOO" % { token  = MyTokenFoo; } |
    "BAR" % { token = MyTokenBar; } |
    [A-Z]{3} % { token = MyTokenUnsupported };

Will always hit the unsupported case, because the general rule is a superset of the rest of the rules. To solve this problem, you should use priorities, which are documented in the Ragel documentation, but not really explained. In this case, we must give the priority to entering in a given state so that the known token beat the general token; one way to do that is using this:

MyToken = "FOO" > 2 % { token  = MyTokenFoo; } |
    "BAR" > 2 % { token = MyTokenBar; } |
    [A-Z]{3} > 1 % { token = MyTokenUnsupported };

Higher priority means that the state will be considered before the others and the first one wins. Unfortunately this can mess up when you have multiple state machines, since you risk conflicts between priorities. To solve this, you can use named priorities:

MyToken = "FOO" > (mytoken, 2) % { token  = MyTokenFoo; } |
    "BAR" > (mytoken, 2) % { token = MyTokenBar; } |
    [A-Z]{3} > (mytoken, 1) % { token = MyTokenUnsupported };

At that point the states can have priority one over the other only inside the same namespace. And this solves the parser completely; just make sure your variable is set by default to Invalid, and don’t forget to make sure that you still got to make sure that the parser reached the correct final state.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.