Friday, March 25, 2011

HTTP parsing with Javascript generators

Event-based programming is all the rage these days (see node.js, etc.).
Parsing HTTP/1.1 in an event-based system usually requires complicated state machines.

Handling those state machines can be greatly simplified using Javascript generators.
Any Javascript function containing a yield statement is a generator function; calling it creates a generator. Client code interacts with the generator through its next() and send(obj) methods: the result of next() or send(...) is the yielded value, while inside the generator the yield expression evaluates to whatever argument send(...) was passed (or undefined when plain next() was used).

function fib() {
  var i = 0, j = 1;
  while (true) {
    yield i;
    var t = i; i = j; j += t;
  }
}
var g = fib();
g.next() -> 0
g.next() -> 1
g.next() -> 1
g.next() -> 2
g.next() -> 3
...
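
The fib example only uses next(). Here is an equally minimal sketch of send(), in the same legacy (SpiderMonkey/Rhino) generator syntax; echo is a made-up example, not part of any library. Note that a fresh generator has to be started with next() before send() can deliver a value:

function echo() {
  var received = yield "ready";
  while (true) {
    received = yield "got: " + received;
  }
}
var e = echo();
e.next()        -> "ready"      (run to the first yield)
e.send("hello") -> "got: hello" (the pending yield evaluates to "hello")
e.send("world") -> "got: world"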

See here for more details on generators and iterators.

Now imagine:
  1. The input data arrives in randomly sized chunks (buffers) from the network. Buffers have methods to query the remaining size and to fetch the next byte or character. Every received buffer triggers some kind of data event.
  2. For the sake of this post, let's assume you have a simple scanner that can read a line, or part of a chunk, from an input buffer (a sketch follows below). All of its methods return null if they do not have enough data to continue, and internally keep whatever partial data is needed to resume later.
When a buffer arrives from the network and we get a network event, how do we parse it as far as we can while keeping the current state alive until the next buffer arrives?
Note that a buffer may contain a partial line, a whole line and another line, some of the headers, some of the headers plus part of a chunk-encoded body, multiple chunks, and so on.
With HTTP pipelining, a single buffer may finish the current request and contain part of the next request as well.
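
To make the scanner contract concrete, here is a minimal sketch of how getNextLine() might carry a partial line across chunks. HttpScanner's internals and the makeBuffer helper are assumptions for illustration; only the return-null-when-starved contract matters for the rest of this post:

// A hypothetical string-backed buffer with the interface assumed above.
function makeBuffer(s) {
    var i = 0;
    return {
        remaining: function() { return s.length - i; },
        nextChar:  function() { return s.charAt(i++); }
    };
}

// Minimal line scanner: returns a completed line, or null if the
// buffer ran dry; the incomplete line is kept in this.partial.
function HttpScanner() {
    this.partial = "";
}
HttpScanner.prototype.getNextLine = function(b) {
    while (b.remaining() > 0) {
        var c = b.nextChar();
        if (c == "\n") {
            var line = this.partial.replace(/\r$/, "");
            this.partial = "";
            return line;
        }
        this.partial += c;
    }
    return null; // not enough data yet
};

var sc = new HttpScanner();
sc.getNextLine(makeBuffer("GET / HT"))     -> null (partial line kept)
sc.getNextLine(makeBuffer("TP/1.1\r\nX")) -> "GET / HTTP/1.1"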

We want something like:

    function onData(buffer) {
        while there is data in the buffer left:
            parse buffer and generate HTTP events,
            until either more data is required to continue
            or the current request is finished
    }

Using generators you could now write something like this:
function pumpReader(state, buffer, ...) {
    // drain the buffer by parsing requests
    while (buffer.remaining() && state.send(buffer)) {
        // try the rest of the buffer for the next request
        state.next();
    }
}
Here, state is a generator that is passed the current buffer on each iteration. The next() and send(obj) methods run the generator until the next yield statement is reached. At that point the generator yields false if more data is required, and true once parsing of the current HTTP/1.1 request is complete.

The nice thing is that this can be driven purely by network events: when new data arrives it is "pumped" through the state machine. The generator itself keeps the current execution state of the parsing function alive between events, so no explicit state bookkeeping is needed.
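
To see this protocol in isolation, here is a toy generator that follows the same yield-false / yield-true convention. It is not HTTP, just a stand-in that treats every three characters as one "request", and it reuses the hypothetical makeBuffer helper from the scanner sketch above:

function toyState() {
    while (true) {
        var b = yield false;       // wait for the first buffer
        var seen = 0;
        while (seen < 3) {         // a "request" is three characters
            if (b.remaining() == 0) b = yield false; // starved: wait for more data
            else { b.nextChar(); seen++; }
        }
        print("request complete");
        yield true;                // signal completion to pumpReader
    }
}
var state = toyState();
state.next(); // start it
pumpReader(state, makeBuffer("ab"));   // prints nothing, needs more data
pumpReader(state, makeBuffer("cdef")); // prints "request complete" twice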

The "state machine" itself could be represented like this:
function parseState(<some http event handler>) {
    while (true) {
        var b = yield false; // seed b, the buffer to be used
        var scanner = new HttpScanner();
        var line;
        // get the request line
        while ((line = scanner.getNextLine(b)) == null) b = yield false;
        ... deal with the request line ...
        // read the headers, one line at a time, until the empty line
        do {
            while ((line = scanner.getNextLine(b)) == null) b = yield false;
            ... add to current headers, or generate header events ...
        } while (line.length > 0);
        ... we have a full HTTP/1.1 header now, generate a useful event ...
        if (<has body based on request line>) {
            var len, data, size;
            scanner.setEncoding(<encoding from headers>);
            if (<is chunked>) {
                var chunksize;
                do {
                    // get the chunk size (a hex number on its own line)
                    while ((line = scanner.getNextLine(b)) == null) b = yield false;
                    chunksize = parseInt(line, 16) || 0;
                    len = chunksize;
                    // now consume the chunk
                    while (len > 0) {
                        [size, data] = scanner.getContent(b, len);
                        if (size > 0) {
                            ... generate data event ...
                            len -= size;
                        }
                        if (len > 0) b = yield false;
                    }
                    // skip the trailing \r\n (unless this is the last chunk)
                    if (chunksize > 0) while (scanner.getNextLine(b) == null) b = yield false;
                } while (chunksize > 0);
                // read the optional trailer headers after the last chunk
                var trailers;
                while ((trailers = scanner.getHeaders(b)) == null) b = yield false;
            } else {
                len = parseInt(<content-length from headers>, 10) || 0;
                // a bit of duplicated code here...
                while (len > 0) {
                    [size, data] = scanner.getContent(b, len);
                    if (size > 0) {
                        ... generate data event ...
                        len -= size;
                    }
                    if (len > 0) b = yield false;
                }
            }
        }
        ... done with the request ...
        yield true; // done, start over for the next pipelined request
    }
}
The most we ever have to store in a temporary buffer is one line, the headers, or one chunk of data.

And if we had a network socket receiving asynchronous data events we could set it up like this:
    var state = parseState(<some http event handler object>);
    state.next(); // start it
    socket.on("data",function(data) {pumpReader(state,data);});
I glossed over a bunch of details. The point is that state management is done by the generator itself; the parts are easy to understand and can be validated individually.
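
For example, the claim that buffer boundaries don't matter can be checked directly: replay a recorded request split at every possible position and assert that the same events come out each time. A hypothetical test sketch, reusing makeBuffer from above and a made-up recordingHandler:

// Hypothetical test sketch: a correct parser must produce identical
// events no matter where the network happens to split the bytes.
var raw = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n";
for (var cut = 1; cut < raw.length; cut++) {
    var state = parseState(recordingHandler); // collects the emitted events
    state.next(); // start it
    pumpReader(state, makeBuffer(raw.substring(0, cut)));
    pumpReader(state, makeBuffer(raw.substring(cut)));
    ... assert recordingHandler saw the same events as on every other run ...
}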

You can see this in action in RhiNode and RhiNodeII. (RhiNode is a Javascript wrapper around Java's NIO using Rhino).

Note that only some Javascript interpreters support generators. All of Mozilla's interpreters do; Google's V8 does not.