We need to choose the "bottom" on which to build the language implementation, some kind of "processor" or "machine". Since we want to build our thing "from the bottom up", this implies choosing a low-level one, i.e. not a "processor" that takes, say, Haskell or Scheme code to process.
Writing directly for actual hardware seems too hard for our purposes, even standardized hardware like that offered by virtualization software such as Xen or QEMU. Not only is it complex to initialize, but usefully interfacing it to the world would be too much work.
By using a virtual machine running within an existing OS, we can cheat a little and make use of the host OS's capabilities (virtual memory, bridges to permanent storage / files, networking and more), which of course makes the result more useful.
By writing one of our own, we can decide on its language (instruction set) ourselves, which will give us the chance to think about that aspect and let us tailor it for our usage.
The language (instruction set) that the VM implements. See VML.
The language that the VM is implemented in.
We don't want to piggy-back on the implementation language's garbage collector; that feels like cheating. (It could be useful for interoperation, but let's learn how to go all the way first. Also, some languages' GCs could not be used for correctness reasons (e.g. Go has an imprecise GC, although that seems to be changing).)
Plain C is not a good choice since security is one of the aims: it offers too many ways to make security-relevant errors that are caught by neither the compiler nor testing. C++ probably isn't either.
Languages that ensure memory safety usually employ garbage collection. While we could still implement our own GC within such a language (not relying on the host GC to maintain guest (VML) objects), two collectors would then be running, which would probably not be a good idea for analyzing memory usage behaviour, and it still feels a bit like cheating (it will have some performance impact that real-world systems probably wouldn't want to pay, hence even if we're not bothered by this now, we're going down a "non-real" route).
There are few languages that both ensure memory safety and do not use GC:
ATS: has unsafe and safe modes (with the idea of making it possible to write unsafe code quickly, then make it safe later on), a strong type system, plus a proof language. The latter seems to make it possible to write proofs for any safety/security property we could come up with about the memory allocator/GC.
Rust: more "mainstream" (well-known, and with a focus on being easier to learn / use).
But Rust does not have a dependent type system. It had typestates early on, but those were removed. A suggested workaround looks limited and probably cannot be used to ensure arbitrary properties?
Does Rust allow for custom object allocators? If so, perhaps by using Rust objects (but allocated/deallocated through our own GC) it would be possible to retain safety?
(Are there any VM implementation stories? TODO: read JavaScript: Servo’s only garbage collector (HN).)
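As a concrete illustration of the custom-allocator idea, here is a minimal sketch in Rust (all names are hypothetical, not a real VM API): guest (VML) objects live in an arena and are referred to by integer handles, so Rust's safety checks still cover the implementation code while object lifetimes are decided by our own (future) collector rather than by Rust's ownership rules:

```rust
// A tiny arena for guest objects: the host language (Rust) guarantees
// memory safety of the implementation, while allocation and reclamation
// are driven by our own, hypothetical collector via explicit handles.

#[derive(Debug, Clone, PartialEq)]
enum GuestValue {
    Nil,
    Int(i64),
    Pair(usize, usize), // handles (indices) into the arena, not Rust references
}

struct Arena {
    slots: Vec<Option<GuestValue>>,
    free: Vec<usize>, // free list of reusable slots
}

impl Arena {
    fn new() -> Self {
        Arena { slots: Vec::new(), free: Vec::new() }
    }

    fn alloc(&mut self, v: GuestValue) -> usize {
        match self.free.pop() {
            Some(i) => { self.slots[i] = Some(v); i }
            None => { self.slots.push(Some(v)); self.slots.len() - 1 }
        }
    }

    // Would be called by the collector, never by guest code directly.
    fn release(&mut self, handle: usize) {
        self.slots[handle] = None;
        self.free.push(handle);
    }

    fn get(&self, handle: usize) -> &GuestValue {
        self.slots[handle].as_ref().expect("use after release")
    }
}

fn main() {
    let mut heap = Arena::new();
    let a = heap.alloc(GuestValue::Int(1));
    let nil = heap.alloc(GuestValue::Nil);
    let pair = heap.alloc(GuestValue::Pair(a, nil));
    assert_eq!(*heap.get(pair), GuestValue::Pair(a, nil));
    heap.release(a); // a collector decision; `get(a)` would now panic
    let b = heap.alloc(GuestValue::Int(2));
    assert_eq!(b, a); // the freed slot is reused
    println!("ok");
}
```

A dangling handle here degrades into a runtime panic rather than undefined behaviour, which is weaker than a static guarantee but still memory-safe in the Rust sense.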
PreScheme1: created some decades ago with the explicit aim of writing a virtual machine (for Scheme-48). It has an ML-style type system (using just type inference, no type declarations(?)) and does not seem to have features that would allow one to get static guarantees about code manipulating memory.
Since it is Scheme before compilation, though, one could use an alternative storage ('memory') implementation with runtime tagging+checking.
It does not support code that needs to allocate closures at runtime; supposedly that's not necessary for the kind of code one writes for a VM, or programs need to be written in a different style. Also, it does not statically safeguard against errors in freeing memory; the idea is to test as a Scheme program before compilation.
(Can PreScheme handle stack allocation? Nested structs? Does it optimize these automatically?)
What about the alternative PreScheme implementations?
1 more links in Reading
Theorem provers like Isabelle, Coq, or possibly Agda or Idris, allow one to encode guarantees by writing the program in C and modelling it at the same time, or by writing a generator that outputs a safe C program. These approaches seem to be outside our capabilities, though (also, wouldn't ATS allow encoding the same guarantees while being easier to use?).
Perhaps with the languages with easier learning curves (Rust) we will be limited in our attempts to reach the goal of security. With the advanced ones (ATS, theorem provers) it seems likely that we would be unable to finish the VM in due time (ATS should be more forgiving, as it offers the possibility of at least getting a VM to run, even if not safely, and then still provides an upgrade path to safety we could pursue later on).
Perhaps we should pursue another option: implement the VM in a way that allows for extensive runtime checking via property-based testing. That doesn't give the same guarantees (in the interest of cutting the time needed to run the tests, the generated test values may be too sparse and miss issues), but it would be a good start, especially since it allows us to test for any kind of property (when attaching the necessary runtime information to objects). And when faced with the problem of exceeding test running times, we'll be motivated to think about how to make these tests more efficient (disable tests, partially prove manually, etc.?). It may be a good way to introduce us to the problems that advanced static verification tries to solve.
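A hand-rolled sketch of what such property-based testing could look like, in Rust and without external crates (a real setup would presumably use a quickcheck/proptest-style library; the toy allocator and the chosen property are illustrative only):

```rust
// Hand-rolled property-based test: generate pseudo-random sequences of
// alloc/release operations against a toy free-list allocator and check
// the invariant that a live handle is never handed out twice.

use std::collections::HashSet;

struct Alloc {
    next: usize,
    free: Vec<usize>,
}

impl Alloc {
    fn new() -> Self { Alloc { next: 0, free: Vec::new() } }
    fn alloc(&mut self) -> usize {
        if let Some(h) = self.free.pop() {
            h
        } else {
            let h = self.next;
            self.next += 1;
            h
        }
    }
    fn release(&mut self, h: usize) { self.free.push(h); }
}

// Minimal linear-congruential PRNG so the test is deterministic and
// dependency-free; real code would use a proper testing library.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

fn run_property(seed: u64, steps: usize) {
    let mut rng = seed;
    let mut a = Alloc::new();
    let mut live: HashSet<usize> = HashSet::new();
    for _ in 0..steps {
        if live.is_empty() || lcg(&mut rng) % 2 == 0 {
            let h = a.alloc();
            // The property: a freshly allocated handle must not be live.
            assert!(live.insert(h), "handle {} handed out twice", h);
        } else {
            // Release a pseudo-randomly chosen live handle.
            let pick = *live.iter().nth((lcg(&mut rng) as usize) % live.len()).unwrap();
            live.remove(&pick);
            a.release(pick);
        }
    }
}

fn main() {
    for seed in 0..100 {
        run_property(seed, 1000);
    }
    println!("all properties held");
}
```

The "sparse test values" caveat from above applies directly: 100 seeds of 1000 steps each explore only a tiny corner of the operation space.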
There is thus another suggestion:
Some more potential options (discovered too late to have been analyzed yet):
Cyclone (no dependent type system?)
CCured (no dependent type system either?; abandoned except for a reanimation attempt) (revived as Checked C (HN)?)
Direct interpreters (i.e. those not translating the source program into something else first) for languages with variables (i.e. almost all except concatenative languages (Forth and the like)) suffer from memory retention problems that make them unsuitable for some kinds of programs:
Since they don't analyze lexical variable usage (because they don't analyze anything in advance), they need to retain variable bindings everywhere in their respective lexical scope. But that means that the values contained in variables are forced to stay around even if subsequent operations don't need them anymore, and a closure will retain references to all variables in its lexical scope, even those its body doesn't use; this can prevent valid programs from working. And manual freeing by way of (set! foo #f) would be about as error-prone as manual memory handling in C.
The problem is also occasionally seen e.g. in Java, but it's much more important in languages where it is common to refer to variables in lexical contexts with delay (closures), and worse still when working with streams, as they can be infinitely long and hence failure to release the head can guarantee running out of memory.
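The retention problem can be made concrete with a small Rust sketch (names illustrative): a closure built the naive-interpreter way holds the whole environment frame, pinning a large value that its body never touches:

```rust
// Sketch of the retention problem: a naive interpreter's closure captures
// its whole environment frame, so a large value stays reachable even
// though the closure's body never uses it.

use std::rc::Rc;

struct Frame {
    big_unused: Vec<u8>, // e.g. the head of a long stream
    small_used: i64,
}

// A "closure" as a direct interpreter would build it: the whole frame.
struct Closure {
    env: Rc<Frame>,
}

impl Closure {
    fn call(&self) -> i64 {
        self.env.small_used + 1 // only this field is actually needed
    }
}

fn main() {
    let frame = Rc::new(Frame {
        big_unused: vec![0u8; 10_000_000], // 10 MB that could be freed
        small_used: 41,
    });
    let clo = Closure { env: Rc::clone(&frame) };
    drop(frame); // the original binding is gone...
    // ...but the closure still pins all 10 MB, because it retains the
    // whole frame. A compiler that analyzes free variables would copy
    // only `small_used` into the closure's environment.
    assert_eq!(clo.call(), 42);
    assert_eq!(clo.env.big_unused.len(), 10_000_000);
    println!("big_unused still retained: {} bytes", clo.env.big_unused.len());
}
```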
Thus, we need to write a compiler to actually make the language properly usable. And of course it's more interesting, too.
Note that those contributing to the VM will have a chance to work on an interpreter, too, just one for VML instead of Scheme. Arguably we're still writing an interpreter, but for a lower layer of the language stack.
(The sad part about this is that the underlying representation of the code at runtime isn't S-expressions anymore, and the latter would give the possibility of direct inspection and in-place modification from within the program/debugger. Would those be worthwhile features? Could they be re-introduced even with a compiler?)
when bootstrapping manually from the bottom, i.e. writing code directly in that lowest layer, assembly language and LLVM seem overly painful, primarily because there's no GC. Although perhaps restricting programs to short-lived processes and not deallocating any memory might be feasible during those first stages.
after implementing GC, programs (either written manually or generated by the compiler) become more complex and error-prone because they need to cooperate correctly with the GC. One would want to write a layer on top of bare ASM/LLVM that abstracts this away as soon as possible, basically implementing VML on top of ASM/LLVM. It would be easier to do this later.
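What "cooperating correctly with the GC" means in practice can be sketched with an explicit shadow stack of roots (a common technique; all names here are hypothetical): generated code must root every temporary across any call that may trigger a collection, and unroot it afterwards:

```rust
// Sketch of GC cooperation: handles that must survive a possible
// collection are pushed onto an explicit shadow stack of roots and
// popped on scope exit. Forgetting either step is exactly the kind of
// error that makes such generated code hard to get right.

struct Roots {
    stack: Vec<usize>, // handles the collector must treat as live
}

impl Roots {
    fn new() -> Self { Roots { stack: Vec::new() } }
    fn push(&mut self, h: usize) { self.stack.push(h); }
    fn pop(&mut self) { self.stack.pop(); }
}

// What a compiler (or a careful human) would have to emit around any
// call that may trigger a collection:
fn emitted_code(roots: &mut Roots, tmp_handle: usize) -> usize {
    roots.push(tmp_handle);          // root it: the next call may collect
    let result = may_collect(roots); // anything unrooted could be freed here
    roots.pop();                     // unroot on the way out
    result
}

fn may_collect(roots: &Roots) -> usize {
    // A real collector would trace from `roots.stack`; here we only check
    // that the caller did root its temporary.
    assert!(!roots.stack.is_empty(), "temporary was not rooted");
    roots.stack[0] + 1
}

fn main() {
    let mut roots = Roots::new();
    assert_eq!(emitted_code(&mut roots, 41), 42);
    assert!(roots.stack.is_empty()); // push/pop must stay balanced
    println!("ok");
}
```

This bookkeeping around every allocation point is the boilerplate that the proposed abstraction layer (essentially VML on top of ASM/LLVM) would hide.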
LLVM and normal assembly languages don't provide safety mechanisms. A program can access anything directly, short of protections enforced by the host OS (which are generally weak (OSs have too many locally exploitable bugs) and difficult to use efficiently). How would we make sure that the compiler never produces programs that are unsafe? When writing a VM, the code to secure is written once, not changed for every program, and can probably be reasoned about, tested or proven secure more easily, whereas here the code producer would need to be made so that its output is never insecure, which is going to exhibit a higher order of difficulty. (An alternative to making the compiler secure could be to build a layer on top of assembly/LLVM that either uses static types enforcing that the program can't be unsafe, with the type checker then run on all output that the compiler generates, or mixes that with dynamic checks; really, all of the same safety checks that would be done when building or running the VM would need to be implemented, but in a context that is more difficult to understand and control.)
(Note that aside from securing memory access, there might be timing side channels that one would want to try to eliminate.)
(Similarly, it will be easier to implement debugging features in a VM. Yes, there are debuggers for assembly and LLVM code already, but with a VM we would be free to implement unusual features; and yes, one could supposedly implement the same debugging features by having the compiler emit the necessary infrastructure instead, but then again it will be easier in an interpreter.)
VML will already be quite close to assembly languages / LLVM. It should be possible to add a translation layer later on with little difficulty (either ignoring the security aspect, or we might know enough then to be able to make it secure with confidence (modulo bugs in LLVM itself)).
There will typically be difficulties matching the needs of the source language with the workings of those environments. For example, direct translations of Scheme procedures to JavaScript functions would suffer from both the missing tail-call optimization and the memory retention problems mentioned above, and JavaScript doesn't provide first-class access to continuations; using JavaScript closures in continuation-passing style wouldn't work either, because of the missing TCO (and possibly because of how closures are implemented in JS), thus the only workable approach would likely look much like assembly language. There are also questions like how to use the debugger of that environment on programs generated by our compiler and still map the display back to the original language. Also, some of the same security issues as discussed above apply.
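The "much like assembly language" approach usually boils down to a trampoline: compiled procedures return the next step instead of tail-calling each other, and a flat driver loop runs the steps, keeping stack depth constant. A sketch in Rust, standing in here for the generated JavaScript:

```rust
// Trampoline sketch: instead of a tail call (which grows the stack in an
// environment without TCO), each step returns either a final value or
// the next step to run; a flat loop drives the computation.

enum Step {
    Done(u64),
    More(Box<dyn FnOnce() -> Step>),
}

// even/odd via mutual "tail calls", safe for any n under the trampoline.
fn is_even(n: u64) -> Step {
    if n == 0 { Step::Done(1) } else { Step::More(Box::new(move || is_odd(n - 1))) }
}

fn is_odd(n: u64) -> Step {
    if n == 0 { Step::Done(0) } else { Step::More(Box::new(move || is_even(n - 1))) }
}

fn trampoline(mut step: Step) -> u64 {
    loop {
        match step {
            Step::Done(v) => return v,
            Step::More(thunk) => step = thunk(), // constant stack depth
        }
    }
}

fn main() {
    // Deep enough to overflow the stack if these were direct mutual calls.
    assert_eq!(trampoline(is_even(1_000_000)), 1);
    assert_eq!(trampoline(is_even(1_000_001)), 0);
    println!("ok");
}
```

The per-step closure allocation is part of the cost that makes such compiled output slow and hard to read, supporting the point above.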
Writing a system on top of these from the bottom up while working around the limitations will be even messier than such an approach already is, and looks too hard for a first-time group project. Also, would it be interesting?
With our own VM, we have a clear security boundary at the VML boundary (processes in VML can't do harm other than work with what access they are given; the compiler above VML could still mess up programs in security relevant ways, but the scope of the breakage is limited to the process). Designing for this is probably a worthwhile approach: after the design works (and tests are written (what about proofs?)), alternative backends could be added more easily.
(Note that another way to get our programs running in the web browser would be to compile the VM to Asm.js, either by way of Emscripten, or by producing an Asm.js backend for S3C.)