Into the Turing tar-pit


At the first Copenhagen Tech Polyglot Meetup I presented some options and gave some tips on how to get started with designing a programming language and writing a compiler for it. This is a more detailed follow-up post. You can find the slides on Speaker Deck and download them as a PDF.

Why?

First of all, it should be clear why you would want to create a new language or compiler. What is the problem you are trying to solve? Is there already an adequate language for it? If so, what is it missing, and why is using it problematic? Would it be a good idea to create a superset or subset of it? Maybe an existing language just needs better means to express the solution to your problem. Maybe you have ideas for new, cool language features. Another good reason is trying to understand how programming languages and compilers work. It is also a lot of fun!

What?

Once you know why, what kind of language do you want to design? What paradigms should it follow? Is the language more imperative or more declarative? Should it have statements and expressions, or only the latter? Is it low-level with raw access to memory, or does it support higher-order programming, i.e., are functions first-class values that can be passed as arguments to other functions and returned as results?

Implementing an imperative language is slightly easier, as it maps more closely to common low-level targets. Declarative languages are usually harder to implement efficiently. In any case, first-class functions and closures are very useful features that can be implemented fairly easily.

Another aspect is the degree of dynamism. Should the language be very dynamic and allow the redefinition of elements (e.g., functions) at run-time, like Lisp, Python and JavaScript? This is desirable if your language is meant to support interactive/explorative programming, but might also be generally useful for debugging purposes. If your language is meant for writing high-performance system applications, it is perhaps a better idea to make it more static.

For a static language, the compiler can generate better target code, as it does not have to allow for changes at run-time, and thus can apply more optimizations, such as inlining functions. Also, for a dynamic language, the compiler needs to be available at run-time and needs to be able to update existing definitions, which can be difficult to implement, depending on the target.

Programming constructs and language extensibility are two other design decisions. What are the core elements provided by the language? For example, if it is object-oriented, does it have constructs like classes, methods and interfaces? Or does the language only expose lower-level primitives, such as functions and closures, and users can build their own object-system on top of them? To begin with, it is better to choose a limited number of built-ins that are universal enough, which keeps the compiler simple. Later on you can provide further higher-level abstractions which make the language more practical.
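
To illustrate the latter approach, here is a minimal, hypothetical sketch (in JavaScript) of how an object could be built from nothing but functions and closures: a constructor function captures the state, and the returned closure dispatches on a method name.

function makeCounter(start) {
    var count = start;                  // private state captured by the closure
    return function (method) {          // the "object" is just a dispatch function
        if (method === "increment") { count = count + 1; return count; }
        if (method === "get") { return count; }
        throw new Error("no such method: " + method);
    };
}

var counter = makeCounter(0);
counter("increment");  // 1
counter("get");        // 1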

What mechanisms does the language provide to define solutions for problems of a certain domain? How expressive and extensible is the language? One option for extensibility is operator overloading. For example, the language may provide a built-in addition operator + for integers, but users might want to define a vector type and add vectors with the same operator. In dynamic languages it is common to build domain-specific languages using meta-programming facilities. For example, if a non-existent method is called in Ruby, the object’s fallback method method_missing is invoked with the original method name and arguments.
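
As a point of comparison outside Ruby, a similar fallback hook can be sketched in JavaScript with a Proxy, whose get trap intercepts accesses to missing properties (the vector example here is purely illustrative):

var vector = { x: 1, y: 2 };
var proxied = new Proxy(vector, {
    get: function (target, name) {
        if (name in target) return target[name];
        // roughly comparable to method_missing: called for unknown names
        return function () {
            throw new Error("undefined method " + String(name));
        };
    }
});

proxied.x;          // 1
proxied.magnitude;  // a fallback function instead of undefined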

In low-level languages like C it is common to have a preprocessor which performs text-based substitution and macro expansion. Some higher-level languages provide syntactic macro systems, which perform rule-based or even programmatic substitutions on the abstract syntax tree. Such macro systems are more powerful, less error-prone, and also make the compiler simpler: Instead of having specific syntax and compiler passes for built-ins, they can be defined as macros in the standard library. However, implementing a syntactic macro system requires more effort than implementing a few built-ins, especially if transformations should be safe, i.e., hygienic.
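
As a rough, hypothetical illustration of the idea (the node shapes are made up), a rule-based macro pass might rewrite an unless form in the abstract syntax tree into the built-in if with a negated condition before any further compilation:

function expandUnless(node) {
    if (node.type === "unless") {
        return {
            type: "if",
            test: { type: "not", operand: expandUnless(node.test) },
            consequent: expandUnless(node.body),
            alternate: null
        };
    }
    return node;  // a real pass would also recurse into all child nodes
}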

Another consideration is the complexity of the type system. Are there types at compile-time? Does the compiler use them to ensure safety? Does the user have to manually annotate every expression with a type, or does the compiler infer types for expressions without annotations? Are there types at run-time? Implementing a type system is not essential, so you can also add it later once the language and compiler are somewhat stable.

Turing tar-pit

The title is a reference to Alan Perlis’ “Epigrams on Programming”:

“Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy.”

Language design is hard, as one needs to make a tradeoff between creating a specific and a general-purpose language, and needs to decide exactly which features are enough to create something practical. It takes several iterations to find a sweet spot, and it is easy for the first versions of a new language to end up as a Turing tar-pit.

Host

A compiler will typically parse the source language into a data structure (the abstract syntax tree), transform it over several passes into an internal, lower-level language (the intermediate representation), and finally generate code for a target language.
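
In code, the overall shape is roughly the following (the function names are placeholders, not a specific API):

function compile(source) {
    var ast = parse(source);      // source text -> abstract syntax tree
    var ir = lower(ast);          // several passes -> intermediate representation
    return generateCode(ir);      // intermediate representation -> target code
}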

First, you have to decide which language you want to write the compiler in. At some point it might even be the language you are writing the compiler for, once your compiler is self-hosting, but until then you will have to work on a bootstrapping compiler. In general, it is a good idea to pick a host platform and language for which libraries for most of the compilation steps exist. Using libraries, especially for parsing, allows you to focus on getting a simple, working compiler done without having to learn about difficult algorithms in detail first. Later, once you have grasped the general idea of compilation, you can focus on the individual parts and replace the libraries with your own implementations.

For parsing, one option is to use a parser generator, which takes a grammar of your source language, and generates a parser which can be executed on the host. Usually, each parser generator tool has its own definition syntax and you will be writing an interleaved mix of the custom definition language and host language code, which might take some time to get used to. Another option is to use a parser combinator library. Parsers are simply higher-order functions in the host language: Simple parsers, for example for an identifier, can be combined into more complex parsers, for example for variable definitions.
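
A minimal sketch of the parser-combinator idea in JavaScript (the combinator names are invented for illustration, not taken from a particular library):

// a parser takes an input string and returns either null (failure)
// or the parsed value plus the remaining input
function regex(re) {
    return function (input) {
        var m = input.match(re);
        return m ? { value: m[0], rest: input.slice(m[0].length) } : null;
    };
}

function sequence() {
    var parsers = Array.prototype.slice.call(arguments);
    return function (input) {
        var values = [], rest = input;
        for (var i = 0; i < parsers.length; i++) {
            var result = parsers[i](rest);
            if (result === null) return null;
            values.push(result.value);
            rest = result.rest;
        }
        return { value: values, rest: rest };
    };
}

// combining simple parsers into a parser for variable definitions
var identifier = regex(/^[a-zA-Z_]\w*/);
var number = regex(/^\d+/);
var definition = sequence(regex(/^var\s+/), identifier, regex(/^\s*=\s*/), number);

definition("var x = 42");  // { value: ["var ", "x", " = ", "42"], rest: "" }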

For code generation, it is also advisable to use a library, as they produce error-free and readable code, which is easier to debug. Some code generation libraries take a tree as input, while others provide an API or even a high-level DSL.

One of the best host platforms to get started with is JavaScript. It is not necessary to write the compiler in JavaScript itself, as there is a great number of alternatives. The main reason is the vast ecosystem of third-party libraries. For parsing you could use the PEG.js parser generator, and for code generation you can produce a Mozilla Parser API AST, which escodegen will translate into JavaScript. Finally, you can improve the generated code by using an optimizer like the Google Closure Compiler.
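
For instance, a hand-built Mozilla Parser API (ESTree) fragment can be handed to escodegen like this (a minimal sketch of the idea):

var escodegen = require("escodegen");

// an AST for: var answer = 42;
var ast = {
    type: "VariableDeclaration",
    kind: "var",
    declarations: [{
        type: "VariableDeclarator",
        id: { type: "Identifier", name: "answer" },
        init: { type: "Literal", value: 42 }
    }]
};

console.log(escodegen.generate(ast));  // "var answer = 42;"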

Another great host is the JVM. There are also a number of host languages you can choose from, like Java, Scala or JRuby. For parsing in Java there is ANTLR, and Scala comes with a great parser combinator library. For generating JVM bytecode there is the de facto standard low-level ASM framework, for which higher-level DSLs like jitescript and bitescript exist.

If you are already familiar with Lisp, you might want to use the programmable programming language Racket, which provides very powerful and flexible parsing, transformation and debugging facilities.

To simplify things it is a good idea to implement the compiler on the platform on which the target code will be executed.

Source

After deciding what features the language should have, it is time to find a good way of exposing them to users. What should that look like? It is of course possible to invent a completely new syntax, but usually adopting an existing syntax helps make the language more approachable for users, and you get editor support, like syntax highlighting or indentation, for free.

S-expressions are not only applicable to Lisp-like languages. They are very simple in general, and having no keywords makes them extremely versatile; they are thus also very easy to parse.

An ALGOL-like syntax is more verbose, with dedicated, descriptive keywords, and is commonly found in statement-based imperative languages (e.g., Pascal) and expression-based declarative languages (e.g., ML). If you would like to have fewer keywords, you can choose a more C-like variant, e.g., with semicolons to terminate statements and curly braces for blocks of statements. If, on the other hand, you feel this is too noisy, you could also make indentation significant, like in Python.

Regardless of what syntax you choose: If it is too difficult to implement a parser for it, or you are spending too much time on it, the syntax is probably too complicated and should be simplified.

Intermediate Representation

Once the source code is parsed into an abstract syntax tree, it has to be gradually transformed into the final target code. The intermediate representation is the form between source and target, and should be fairly independent of both. Choosing a common intermediate representation eases the implementation of analysis and optimization passes.

One common IR is Continuation-Passing Style (CPS). In this form, every function takes an extra argument, the continuation: a function that is passed the result of the computation. Furthermore, all other arguments need to be simple expressions, i.e., variables or constants. CPS is widely used by compilers for functional languages. It is easy to translate into and it is easy to find material on.

For example, this complex expression would be transformed as follows:

// the original expression
f(g(x, y), h(z))

// after CPS transformation
g(x, y, function ($1) {
    h(z, function ($2) {
        // k is the final continuation
        f($1, $2, k);
    });
});

A-Normal Form (ANF) is a simpler alternative that is similar to CPS: instead of passing continuations, the result of every complex sub-expression is bound to a variable.

// the same expression in A-Normal Form
var $1 = g(x, y);
var $2 = h(z);
var $3 = f($1, $2);

Programs in CPS and ANF make the evaluation order clear and translate really well to statement-based target languages and also to lower-level languages like assembly. ANF is easier to get started with, but is not as widely used and described as CPS.

Static single assignment form (SSA) is the most common IR and is mainly used for compilers of imperative high-level and lower-level languages. In this form, variables are only assigned once, i.e., multiple assignments to the same variable are transformed into assignments to different versions of the variable. A special Phi function is required to join different versions after control flow blocks. SSA is lower level and harder to understand than ANF and CPS, but it is the IR with the most available material, and it offers the best way to generate highly optimized target code.
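
As a small illustration, consider a variable that is assigned on both branches of a conditional; in SSA form each assignment gets its own version, and a Phi function selects the right one where the branches join (the SSA part below is pseudocode, the names c and use are placeholders):

// source program: x is assigned on both branches
var x = 0;
if (c) { x = 1; } else { x = 2; }
use(x);

// in SSA form, each assignment gets a fresh version and a Phi
// function joins them after the branches (pseudocode):
//   x1 = 0
//   if c then x2 = 1 else x3 = 2
//   x4 = phi(x2, x3)
//   use(x4)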

No matter what variants of AST and IR you will use, it is a very good idea to implement a pretty-printer for them, as you will spend a lot of time debugging the output of various transformation passes.

Target

Finally, you have to choose a target. Assembly is at the lowest level and platform-dependent, which means you have to write a code generator (backend) for each architecture. The LLVM intermediate representation, on the other hand, is platform-independent, and the LLVM compiler project has optimizing backends for various architectures. However, you still have to generate fairly low-level register-based instructions, and implement higher-level features like garbage collection and closures yourself.

Another alternative is Java bytecode, the instruction set for the JVM. It is also platform-independent, but is stack-based, simpler, and contains many higher-level instructions, like dynamic method dispatch. Another advantage is the availability of high-performance JIT virtual machines with state-of-the-art garbage collectors. An even higher-level alternative is JavaScript: It is very dynamic, provides first-class functions and closures, and still runs with high performance on basically any mobile device, desktop and server.

As you can see, each successive option is more portable, has more features, and is easier to use, but is also less performant.

Now what?

After designing your language and implementing a compiler for it, there is still more work left to do to make it practical. Programs are rarely written from scratch. Instead, programmers rely on the bundled standard library and third-party libraries, which contain common data structures and algorithms.

As implementing a rich standard library is a lot of work, and creating a community and ecosystem is even harder, supporting interoperability is important: Your language should have a good foreign function interface so programs can make use of libraries written in other languages.

Pointers

One great way to learn about language design and compiler construction is to read about other languages and their implementations. Another is to read books: There is of course “Compilers: Principles, Techniques, and Tools”, also known as the “Dragon Book”, but I highly recommend reading “Essentials of Programming Languages” and “Modern Compiler Implementation”. For type systems you want to read “Types and Programming Languages”.

There is also lots of awesome material available online. Matt Might’s blog contains great well-written tutorials for smaller languages, such as implementing a simple interpreter, compiling Scheme to C or compiling to Java, but also fantastic explanations on how to parse S-expressions, implement closures, or perform CPS conversion and A-Normalization.

If you are interested in using LLVM, the tutorial “Implementing a language with LLVM” is a great place to start. It uses C++ as the host language, but versions for Python, OCaml and Haskell exist as well.

If you are interested in creating an ML-like language, follow the MinCaml tutorial, which describes a compiler for an ML subset written in OCaml, targeting x86, PowerPC, and SPARC assembly. The author also wrote a paper about it.

If you want to build a Lisp-like language, have a look at the paper “An Incremental Approach to Compiler Construction”, and its extended tutorial version “Compilers: Backend to Frontend and Back to Front Again”, which describe the step-by-step development of a Scheme compiler targeting x86 assembly. It starts out with a very simple language and compiler, which are then gradually extended with features like proper tail calls and a foreign function interface. Nada Amin put together a great implementation of it. If you are specifically interested in Clojure and ClojureScript, David Nolen has a concise walkthrough of ClojureScript’s compiler.

To learn about type systems you should definitely have a look at Tom Primožič’s great repository of implementations and detailed explanations of the Hindley–Milner type system and several of its extensions.

Becoming a hacker

In “How to Become a Hacker” Eric Raymond describes how learning Lisp is an enlightening experience, even if you will never actually use it. Growing a language and writing a compiler for it is even more enlightening. It will make you a better programmer, even if you are not going to use your new language after all: You will get a better understanding of languages you have been using, you will make better choices when choosing a language or language features for a specific problem, and you will make better choices when writing programs.

Steve Yegge put it this way:

“The disease, nay, the virus of programming-language religion has a simple cure: you just write a compiler. Or an interpreter. One for any language other than the one you know best. It’s as easy as that. After you write a compiler (which, to be sure, is a nontrivial task, but if there’s some valid program out there that you couldn’t ever write, then you’re not justified in calling yourself a programmer), the disease simply vanishes. In fact, for weeks afterwards, you can’t look at your code without seeing right through it, with exactly the same sensation you get when you stare long enough at a random-dot stereogram: you see your code unfold into a beautiful parse tree, with scopes winding like vines through its branches, the leaves flowering into assembly language or bytecode.

When you write a compiler, you lose your innocence. It’s fun to be a shaman, knowing that typing the right conjuration will invoke the gods of the machine and produce what you hope is the right computation. Writing a compiler slays the deities, after which you can no longer work true magic. But what you lose in excitement, you gain in power: power over languages and over language-related tools. You’ll be able to master new languages rapidly and fearlessly. You may lose your blind faith, but you gain insight into the hauntingly beautiful machinery of your programs. For many, it deepens their real faith. Regardless, it lets them sit at the table with their peers as equals.”