Bottoms Scheme implementation project
history | edit |

Virtual machine

We need to choose the "bottom" on which to build the language implementation, some kind of "processor" or "machine", and we want to build our thing "from the bottom up", which implies that we choose a lowlevel one, i.e. not a "processor" that takes, say, Haskell or Scheme code to process.

Writing directly for actual hardware seems too hard for our purposes, even standardized hardware like the one offered through virtualization software like Xen or QEMU. Not only is it complex to initialize, but usefully interfacing it to the world would be too much work.

By using a virtual machine running within an existing OS, we can cheat a little and make use of the host OS's capabilities (virtual memory, bridges to permanent storage / files, networking and more), which makes the result more useful of course.

By writing one of our own, we can decide on its language (instruction set) ourselves, which will give us the chance to think about that aspect and let us tailor it for our usage.

The VM Language: "VML"

The language (instruction set) that the VM implements. See VML.

VM implementation language

The language that the VM is implemented in.

We don't want to piggy-back on a garbage collector of the implementation language, this feels like cheating. (It could be useful for interoperation, but let's learn how to go all the way first. Also, some language's GC could not be used for correctness reasons (e.g. Go has an imprecise GC, although that seems to be changing).)

Plain C is not a good choice since security is one of the aims: it offers too many ways to make security-relevant errors that are not found by either the compiler nor testing. C++ is probably neither.

Languages that ensure memory safety usually employ garbage collection. While we could still implement our own GC within such a langage (not relying on the host GC to maintain guest (VML) objects), there would be two collectors running then, which would probably not be such a good idea for analyzing memory usage behaviour, and it feels a bit like cheating still (it will have some performance impact that real-world systems probably wouldn't want to pay, hence even if we're not bothered by this now, we're going down a "non-real" route).

There are few languages that both ensure memory safety and do not use GC:

Perhaps with the easier learning curves (Rust) we will be limited in our attempts to reach the goal of security. With the advanced ones (ATS, theorem provers) it seems likely that we will be unable to finish the VM in due time (ATS should be more forgiving as it gives the possibility to at least get a VM to run, even if not safely, and then still offers an upgrade path to safety we could pursue later on).

Perhaps we should pursue another option: implement the VM in a way that allows for runtime testing with properties-based testing, extensively. That doesn't give the same guarantees (in the interest of cutting the time needed to run the tests, the generated test values may be too sparse and miss issues), but would be a good start, especially since it allows us to test for any kind of property (when attaching the necessary runtime information to objects), and presented with the problem of exceeding test running times, we'll be motivated to think about how to make these more efficient (disable tests, partially prove manually, etc.?). It may be a good way to introduce us to the problems that advanced static verification tries to solve.

There is thus another suggestion:

Some more potential options (discovered too late to have been analyzed yet):

Questions & Answers

Why not write an interpreter instead of a compiler?

Direct interpreters (i.e. those not translating the source program into something else first) for languages with variables (i.e. almost all except concatenative languages (Forth and alike)) suffer from memory retention problems that makes them unsuitable for some kinds of programs:

Since they don't analyze lexical variable usage (because they don't analyze anything in advance), they need to retain variable bindings everywhere in their respective lexical scope. But that means that the values contained in variables are forced to stay around even if subsequent operations don't need them anymore, and a closure will retain references to all variables that are in its lexical scope, even those its body doesn't use, and this can prevent valid programs from working. And manual freeing by way of (set! foo #f) would be similarly problematic as memory handling in C is.

The problem is also occasionally seen e.g. in Java, but it's much more important in languages where it is common to refer to variables in lexical contexts with delay (closures), and worse still when working with streams as they can be infinitely long and hence failure to release the head can ensure running out of memory.

Thus, we need to write a compiler to actually make the language properly usable. And of course it's more interesting, too.

Note that those contributing to the VM will have a chance to work on an interpreter, too, just one for VML instead of Scheme. Arguably we're still writing an interpreter, but for a lower layer of the language stack.

(The sad part about this is that the underlying representation of the code at runtime isn't S-expressions anymore, and the latter would give the possibility of direct inspection and in-place modification from within the program/debugger. Would those be worthwhile features? Could they be re-introduced even with a compiler?)

Why not target a real machine (assembly language), or MMIX, or LLVM?

VML will already be quite close to assembly languages / LLVM. It should be possible to add a translation layer later on with little difficulty (either ignoring the security aspect, or we might know enough then to be able to make it secure with confidence (modulo bugs in LLVM itself)).

Why not target the JVM, JavaScript, Asm.js, or PyPy, or LuaJIT, or Beam, or...?

There will typically be difficulties matching the needs of the source language with the workings of those environments. For example, direct translations of Scheme procedures to JavaScript procedures would have issues with both missing tail-call optimization and the memory retention problems mentioned above, and it doesn't provide for first-class access to continuations; using JavaScript closures in continuation-passing style wouldn't work for the missing TCO (and possibly for issues with how closures are implemented in JS), thus the only workable approach would likely look much like assembly language. Also, questions like how to use the debugger of that environment on programs generated by our compiler and still map the display back to the original language. Also, there are some of the same security issues as discussed above.

Writing a system on top of these from the bottom up while working around the limitations will be even messier than such an approach is anyway, and looks too hard as a first time group project. Also, would it be interesting?

With our own VM, we have a clear security boundary at the VML boundary (processes in VML can't do harm other than work with what access they are given; the compiler above VML could still mess up programs in security relevant ways, but the scope of the breakage is limited to the process). Designing for this is probably a worthwhile approach: after the design works (and tests are written (what about proofs?)), alternative backends could be added more easily.

(Note that another way to get our programs running in the web browser would be to compile the VM to Asm.js, either by way of Emscripten, or by producing an Asm.js backend for S3C.)