Wednesday, January 13, 2010


Since I'm currently occupied with implementing Coroutines for the MLVM project (and thus for the Hotspot VM) I decided to share my thoughts and findings:

Coroutines allow a subroutine to return without terminating, so that the next time it is called execution continues after the return statement (most often called yield instead of return).
In contrast to other programming language concepts (e.g. continuations) coroutines were never precisely defined. Coroutines might not even be recognizable as such, and depending on their semantics and expressive power they are called generators, coexpressions, fibers, iterators, green threads, greenlets, tasklets, cooperative threads, etc. (did I miss an important one?)

Coroutines were introduced by [Conway, 1963], and [Marlin, 1980] defines some basic characteristics:
  • Local data (variables, ...) is restored on successive calls.
  • Execution of a coroutine is only suspended when it yields, to be resumed when the coroutine is called again.
Some useful classifications of coroutines are given in [Moura and Ierusalimschy, 2009]:
  • Coroutines that can only yield from within their main method are called stackless. They can be implemented by compile-time transformations but are only useful for a limited range of applications. If coroutines can yield in subsequently called methods they are called stackful.
  • Only coroutines that are not tied to specific language constructs (like a for-each loop) are considered to be first-class coroutines.
  • Symmetric coroutines are controlled by a scheduler-like component that decides which coroutine should run next after a yield. With asymmetric coroutines each yield needs to specify which coroutine should come next.
I propose another classification depending upon how the coroutines interact:
  • Simple coroutines communicate by their actions, without passing a value.
  • Coexpressions can receive and return values, which most of the time will only make sense for asymmetric coroutines (because when passing a value one wants to know where it will go). Coexpressions that only receive a value act as a sink (writing a file, etc.), coexpressions that only return values act as generators (reading a file, etc.) and coexpressions that both receive and return a value can perform complex transformations that need to keep an internal state.

During last years JVM Language Summit it became clear that there is an urgent need for coroutines in Java. Of course every algorithm that is implemented using coroutines can be manually split into an explicit state machine, but in many cases this will be prohibitively expensive. (not only in implementation effort but also performance-wise: compilers create much better code when dealing with local variables and the stack is great at ensuring memory locality!)

Let me introduce a simple example: the SAX parser (an XML parser that doesn't have to keep the whole file in memory because it passes one entity after the other to a callback)
For large or incomplete (streaming) XML documents a SAX parser might be the only way to go, but it is hard to use because the callback needs to keep track of its state and act accordingly.
Using coroutines we have two ways of solving this:
  • We can turn the main program into a coroutine that will be called by the callback or
  • We can use a few lines of code to turn the SAX parser into a coexpression that returns one entity each time we call it. (which IMHO is the most elegant solution)
Simula contains an early implementation of first-class stackful symmetric and asymmetric coroutines which is used for simulation applications. Scripting languages for games (which might be seen as a modern-day equivalent to simulations) also often provide support for coroutines (e.g. lua) because they allow a simulation method to suspend itself until the next event happens.
I'll give a short descritpion of how I think a coroutine API should look like in Java:

Symmetric and asymmetric coroutines should be implemented by the framework (although they could in theory be simulated using one another) in order to provide useful semantics out-of-the-box.
The simplest way of keeping an ordered set of symmetric coroutines is a doubly-linked list (adding, removing and changing order are constant time operations). The simplest scheduling algorithm always yields to the next coroutine in the ring, and a mechanism to implement custom schedules should be provided.

Because of the fundamental differences between symmetric and asymmetric coroutines (and between simple coroutines and coexpressions) I think that these should be provided by distinct interfaces:
Coroutine (for simple symmetric coroutines) and Coexpression
These should in principle work like Threads (created with a Runnable or subclassed).

Coexpression should be generic - this leads to a cleaner API: Coexpression<InT, OutT>

This is what I imagine the API could look like:

class Coroutine {
  public Coroutine();
  public Coroutine(Runnable target);

  public void run();

  public static void yieldNext();
  public static <OutT> OutT yieldCall(Coexpression<Void, OutT> expr);
  public static <InT> void yieldCall(Coexpression<InT, Void> expr, InT input);
  public static <InT, OutT> OutT yieldCall(CoExpression<InT, OutT> expr, InT input);

class Coexpression<InT, OutT> {
  public Coexpression();
  public Coexpression(CoRunnable<InT, OutT> target);

  public OutT run(InT input);

  public static <OutT> OutT yieldCall(Coexpression<Void, OutT> expr);
  public static <InT> void yieldCall(Coexpression<InT, Void> expr, InT input);
  public static <InT, OutT> OutT yieldCall(CoExpression<InT, OutT> expr, InT input);

  public InT yieldReturn(OutT output);

class Test extends Coroutine {
  public void main(String[] args) {
    new Test();

  public void run() {

The example program starts executing in the main method, where it creates a coroutine (which is automatically added to the current thread). Then it switches back and forth between main and run two times (the return from run is an implicit yieldNext).

Whats the current state of this?
I have an early prototype of this running which uses multiple stacks per thread. It is very early (fails horribly on GC, etc.) but I have managed to get very encouraging performance numbers. (an equivalent of ~7 method calls per coroutine switch and a maximum of ~1000000 concurrent coroutines using C2 on a 32-bit machine)
It works by allocating a fixed number of stacks per thread. If there are too many coroutines it will start to copy the stack contents to and from the heap in order to share stacks.
I'll certainly experiment with more esoteric ways of implementing coroutines, but for now it provides a baseline for experimenting with the API, etc.