Tuesday, April 6, 2010

JRuby + Coroutines = really fast

Today I did some tests to see how much, and if at all, JRuby could benefit from implementing Ruby fibers with real coroutines. I was hoping to see some improvements, and that maybe JRuby will even surpass the native Ruby implementation in fiber performance.

So I got myself the current Ruby (Ruby 1.9.1-p376) and JRuby (from git) and made up a simple program that passes messages through a number of fibers.

Running a number of tests with different parameters on Ruby and an unmodified JRuby wasn't too difficult (after realizing that -1.9 isn't the same as --1.9).
But in order to run JRuby with coroutine support I first had to hack together coroutine support for JRuby :-)
As I'm not familiar with the JRuby code I had to dive around a bit. Somewhere along the way I realized that I'll probably need CoroutineLocals, so I had to implement those first. Then some more hacking and I got myself the (probably) first JRuby implementation to use real coroutines.
I only had to modify:
FiberLibrary.java (to use coroutines instead of threads)
ThreadService.java (to use CoroutineLocals instead of ThreadLocals)
Ruby.java (I'm not sure why, but I had to change the way the fiber library is included to addLazyBuiltin)

That was easy! A good sign for the quality of the JRuby code, I suppose...

But here are the results:
I ran different numbers of messages through a line of 5, 50, 500 and 5000 consecutive fibers. Here are the charts that compare Ruby, JRuby with coroutine support and JRuby without coroutine support. (the x-axis shows the number of messages, and the y-axis is time taken in seconds)

5 fibers:

50 fibers:

500 fibers:

5000 fibers:

I somewhat expected JRuby with coroutine support to win over JRuby without by a large margin (because Threads are used to emulate fibers), but being faster than Ruby is of course a nice bonus!

But now I should get back to work and integrate all the fixes I had to do to make it work...

Wednesday, March 10, 2010

Coroutines for Java: Status Update

I'm currently working on bringing Coroutines to the Hotspot Java VM (as part of the MLVM project). I'm happy to announce that I have pushed a version of my code to the mlvm mercurial repository. The patches are named "coro.patch".

If you want to test the new coroutine features you can either compile your own binaries (see the wiki for details) or use a recent openjdk build (from here) together with one of my precompiled binaries:

  • Linux x86 binaries:
    Place the desired binaries (debug or product) in the jre/lib/i386 folder

  • Windows x86 binaries:
    Place the desired binaries (debug or product) in the jre\bin folder

  • Java classes:
    These classes have to be inserted before the boot classpath because some classes in rt.jar are replaced. This can be achieved via prepending them to the boot classpath:

If everything is configured correctly you shuold be able to run the following example program (source):

This program produces the following output:
main start
main end

The Coroutine is created simply by creating a new Coroutine instance. Analogous to Thread the code that will be executed within the coroutine can be defined either by overriding the run method or by passing a Runnable to the constructor. The Coroutine framework keeps all active coroutines in a doubly-linked ring, which defines an execution order for the coroutines and makes sure that no coroutine is "lost". (which is more important than you might think!)

These symmetric coroutines are great to allow, for example, periodic scheduling of agents and the like. But another way of using coroutines are asymmetric coroutines.
These are good at inverting the way a method behaves: Imagine a method that somehow generates a stream of values. Normally one would have to think about where to put and what to do with these values. But using coroutines we can write the method as if we had an infinite buffer for our values.

An example I like to bring up for this is the SAX parser: using asymmetric coroutines a (callback-based) SAX parser can be turned into a generator that returns one value at a time (source):

The extended for loop here is short for:
while(!parser.isFinished()) {
String element = coCall(parser);

The coroutine in this case extends CoGenerator. For other use cases there are two more classes: CoConsumer and CoExpression
A CoConsumer receives a value whenever it is called, but doesn't return anything, and a CoExpression both receives and returns a value.

The interface of the Coroutine classes looks like this:

The performance of this implementation is quite good - it takes ~15 ns per context switch on my machine in Ubuntu x86.

Have fun!


Sometimes seeing the actual assembly code that hotspot creates can be very helpful. "-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly" should do the trick.
But you will probably get the following message:

Could not load hsdis-i386.dll; library not loadable; PrintAssembly is disabled

This happens when the library that does the actual disassembly is missing.
Building this library can be really tricky, especially on Windows. Here is a precompiled binary: hsdis-i386.zip (Windows x86)

Thursday, February 11, 2010

Taking out the trash: Coroutines versus Garbage Collection

From the point of view of heap management a coroutine is just another object that contains pointers to (possibly lots of) other objects. On the other hand, coroutine data might be lying around on some coroutine stack outside the Java heap (instead of in an object inside the Java heap). So while designing a coroutine system one inevitably needs to think about how coroutines and garbage collection interact.

Ok, so where's the problem?
First of all, the Hotspot Java VM (like most VMs) handles stacks quite differently than other objects.
  • The pointers on the stacks are always strong root pointers. So any object that is referenced within a stack frame will always stay alive (=strong), and these pointers are the starting point for every Garbage Collection run (=root).
  • The VM makes adjustments to the stackframes before and after Garbage Collection (called prologue and epilogue). These mainly take care of converting bytecode pointers to bytecode indices (and back to bytecode pointers) in interpreted frames.
  • The VM looks at stackframes for a number of reasons, for example to see if a given method is in use, etc.
So what about Coroutine stackframes? Treating them like any other object doesn't work because of these special cases. In fact, if one doesn't want to make radical changes to the compilers and the compiler infrastructure the following condition needs to be fulfilled at all times:
The VM at all times needs to be able to efficiently locate all currently running instances of all methods.
Ok, so where's the problem? Well, there isn't one for symmetric Coroutines (which are periodically scheduled). They will be scheduled again and again until they die and when they die they will be thrown away and that's it.
But asymmetric Coroutines (which explicitly call each other) might never reach their end. The programmer can simply set the last reference to an asymmetric Coroutine to null and assume that it will be collected. But the this - pointer of the first stackframe of the Coroutine is still a root pointer, and thus the Coroutine and all the objects it references will never go away.
Of course they will be cleaned up as soon as the thread ends, but threads can (and most often will) live for a long time.

What can we do about it?
There are a few options that allow us to kill Coroutines before the thread ends:
Remove the special treatment for stacks
Removing the Garbage Collection prologue and epilogue seems possible. But having stackframes that are only reachable via a GC is hard: Imagine that the compiler needs to deoptimize a method because some assumptions it took while compiling are broken (more than one implementation of an interface, etc.). The compiler needs to find all current activations of the method and patch them so that they will be deoptimized.
Have the programmer close Coroutines when he's done
This will of course always work - and a close operation should probably be implemented in any case.
Let the references on Coroutine stacks be weak references
This looks very promising on first glance. The Coroutine is only weakly referenced by the stack, so as soon as it isn't referenced by the Java code any more it can go away.
But: This is too weak. As long as the Coroutine is alive none of the references on the stack should be weak!
Remove the Coroutine stack in the Coroutines finalizer
Well this is basically a variation of the aforementioned options. It won't work because the Coroutine will always be referenced from its stack.
Introduce a weakly referenced object between the thread and the coroutine stacks
That will be the way to go. This proxy object is referenced by the Coroutine, and as soon as the Coroutine dies it will go away, and the stack along with it. If the Coroutine still exists it will keep the whole stack alive.

Collecting Coroutines
One question remains: can we just collect Coroutines? What if they still hold locks, etc.?
Right. Coroutines are closed by resuming them and immediately throwing a CoroutineExitException. This exception will (or might) propagate down the stack until it ends the Coroutine.
A Coroutine can only by collected if throwing the exception ends it without any side effects.
Looking at a Coroutine stack it can be determined:
  • Are there any locks (synchronized) that will be freed?
  • Will the CoroutineExitException be caught?
  • Is there a finally clause?
If none of these is the case then we're on the safe side.

Wednesday, January 13, 2010


Since I'm currently occupied with implementing Coroutines for the MLVM project (and thus for the Hotspot VM) I decided to share my thoughts and findings:

Coroutines allow a subroutine to return without terminating, so that the next time it is called execution continues after the return statement (most often called yield instead of return).
In contrast to other programming language concepts (e.g. continuations) coroutines were never precisely defined. Coroutines might not even be recognizable as such, and depending on their semantics and expressive power they are called generators, coexpressions, fibers, iterators, green threads, greenlets, tasklets, cooperative threads, etc. (did I miss an important one?)

Coroutines were introduced by [Conway, 1963], and [Marlin, 1980] defines some basic characteristics:
  • Local data (variables, ...) is restored on successive calls.
  • Execution of a coroutine is only suspended when it yields, to be resumed when the coroutine is called again.
Some useful classifications of coroutines are given in [Moura and Ierusalimschy, 2009]:
  • Coroutines that can only yield from within their main method are called stackless. They can be implemented by compile-time transformations but are only useful for a limited range of applications. If coroutines can yield in subsequently called methods they are called stackful.
  • Only coroutines that are not tied to specific language constructs (like a for-each loop) are considered to be first-class coroutines.
  • Symmetric coroutines are controlled by a scheduler-like component that decides which coroutine should run next after a yield. With asymmetric coroutines each yield needs to specify which coroutine should come next.
I propose another classification depending upon how the coroutines interact:
  • Simple coroutines communicate by their actions, without passing a value.
  • Coexpressions can receive and return values, which most of the time will only make sense for asymmetric coroutines (because when passing a value one wants to know where it will go). Coexpressions that only receive a value act as a sink (writing a file, etc.), coexpressions that only return values act as generators (reading a file, etc.) and coexpressions that both receive and return a value can perform complex transformations that need to keep an internal state.

During last years JVM Language Summit it became clear that there is an urgent need for coroutines in Java. Of course every algorithm that is implemented using coroutines can be manually split into an explicit state machine, but in many cases this will be prohibitively expensive. (not only in implementation effort but also performance-wise: compilers create much better code when dealing with local variables and the stack is great at ensuring memory locality!)

Let me introduce a simple example: the SAX parser (an XML parser that doesn't have to keep the whole file in memory because it passes one entity after the other to a callback)
For large or incomplete (streaming) XML documents a SAX parser might be the only way to go, but it is hard to use because the callback needs to keep track of its state and act accordingly.
Using coroutines we have two ways of solving this:
  • We can turn the main program into a coroutine that will be called by the callback or
  • We can use a few lines of code to turn the SAX parser into a coexpression that returns one entity each time we call it. (which IMHO is the most elegant solution)
Simula contains an early implementation of first-class stackful symmetric and asymmetric coroutines which is used for simulation applications. Scripting languages for games (which might be seen as a modern-day equivalent to simulations) also often provide support for coroutines (e.g. lua) because they allow a simulation method to suspend itself until the next event happens.

I'll give a short descritpion of how I think a coroutine API should look like in Java:

Symmetric and asymmetric coroutines should be implemented by the framework (although they could in theory be simulated using one another) in order to provide useful semantics out-of-the-box.
The simplest way of keeping an ordered set of symmetric coroutines is a doubly-linked list (adding, removing and changing order are constant time operations). The simplest scheduling algorithm always yields to the next coroutine in the ring, and a mechanism to implement custom schedules should be provided.

Because of the fundamental differences between symmetric and asymmetric coroutines (and between simple coroutines and coexpressions) I think that these should be provided by distinct interfaces:
Coroutine (for simple symmetric coroutines) and Coexpression
These should in principle work like Threads (created with a Runnable or subclassed).

Coexpression should be generic - this leads to a cleaner API: Coexpression<InT, OutT>

This is what I imagine the API could look like:

class Coroutine {
  public Coroutine();
  public Coroutine(Runnable target);

  public void run();

  public static void yieldNext();
  public static <OutT> OutT yieldCall(Coexpression<Void, OutT> expr);
  public static <InT> void yieldCall(Coexpression<InT, Void> expr, InT input);
  public static <InT, OutT> OutT yieldCall(CoExpression<InT, OutT> expr, InT input);

class Coexpression<InT, OutT> {
  public Coexpression();
  public Coexpression(CoRunnable<InT, OutT> target);

  public OutT run(InT input);

  public static <OutT> OutT yieldCall(Coexpression<Void, OutT> expr);
  public static <InT> void yieldCall(Coexpression<InT, Void> expr, InT input);
  public static <InT, OutT> OutT yieldCall(CoExpression<InT, OutT> expr, InT input);

  public InT yieldReturn(OutT output);

class Test extends Coroutine {
  public void main(String[] args) {
    new Test();

  public void run() {

The example program starts executing in the main method, where it creates a coroutine (which is automatically added to the current thread). Then it switches back and forth between main and run two times (the return from run is an implicit yieldNext).

Whats the current state of this?
I have an early prototype of this running which uses multiple stacks per thread. It is very early (fails horribly on GC, etc.) but I have managed to get very encouraging performance numbers. (an equivalent of ~7 method calls per coroutine switch and a maximum of ~1000000 concurrent coroutines using C2 on a 32-bit machine)
It works by allocating a fixed number of stacks per thread. If there are too many coroutines it will start to copy the stack contents to and from the heap in order to share stacks.
I'll certainly experiment with more esoteric ways of implementing coroutines, but for now it provides a baseline for experimenting with the API, etc.