275 lines
No EOL
14 KiB
ReStructuredText
275 lines
No EOL
14 KiB
ReStructuredText
***************************************************
|
|
Chapter 8: Conclusion and other useful LLVM tidbits
|
|
***************************************************
|
|
|
|
Written by Chris Lattner
|
|
|
|
Tutorial Conclusion
|
|
===================
|
|
|
|
Welcome to the the final chapter of the "`Implementing a language with LLVM
|
|
<http://www.llvm.org/docs/tutorial/index.html>`_" tutorial.
|
|
In the course of this tutorial, we have grown our little Kaleidoscope language
|
|
from being a useless toy, to being a semi-interesting (but probably still useless)
|
|
toy. :)
|
|
|
|
It is interesting to see how far we've come, and how little code it has taken.
|
|
We built the entire lexer, parser, AST, code generator, and an interactive run-loop
|
|
(with a JIT!) by-hand in under 540 lines of (non-comment/non-blank) code.
|
|
|
|
Our little language supports a couple of interesting features: it supports user
|
|
defined binary and unary operators, it uses JIT compilation for immediate evaluation,
|
|
and it supports a few control flow constructs with SSA construction.
|
|
|
|
Part of the idea of this tutorial was to show you how easy and fun it can be to
|
|
define, build, and play with languages. Building a compiler need not be a scary
|
|
or mystical process! Now that you've seen some of the basics, I strongly encourage
|
|
you to take the code and hack on it. For example, try adding:
|
|
|
|
- ***global variables*** - While global variables have questional value in modern
|
|
software engineering, they are often useful when putting together quick
|
|
little hacks like the Kaleidoscope compiler itself. Fortunately, our
|
|
current setup makes it very easy to add global variables: just have value
|
|
lookup check to see if an unresolved variable is in the global variable
|
|
symbol table before rejecting it. To create a new global variable, make
|
|
an instance of the LLVM GlobalVariable class.
|
|
|
|
- ***typed variables*** - Kaleidoscope currently only supports variables of type
|
|
double. This gives the language a very nice elegance, because only supporting
|
|
one type means that you never have to specify types. Different languages have
|
|
different ways of handling this. The easiest way is to require the user to
|
|
specify types for every variable definition, and record the type of the variable
|
|
in the symbol table along with its Value*.
|
|
|
|
- ***arrays, structs, vectors, etc*** - Once you add types, you can
|
|
start extending the type system in all sorts of
|
|
interesting ways. Simple arrays are very easy and are quite useful
|
|
for many different applications. Adding them is mostly an
|
|
exercise in learning how the LLVM `getelementptr
|
|
<http://www.llvm.org/docs/LangRef.html#i_getelementptr>`_ instruction works:
|
|
it is so nifty/unconventional, it `has its own FAQ!
|
|
<http://www.llvm.org/docs/GetElementPtr.html>`_ If you add
|
|
support for recursive types (e.g. linked lists), make sure to
|
|
read the `section in the LLVM Programmer's Manual
|
|
<http://www.llvm.org/docs/ProgrammersManual.html#TypeResolve>`_ that describes
|
|
how to construct them.
|
|
|
|
- ***standard runtime*** - Our current language allows the user to
|
|
access arbitrary external functions, and we use it for things like "putchard".
|
|
As you extend the language to add higher-level constructs, often these
|
|
constructs make the most sense if they are lowered to calls into a
|
|
language-supplied runtime. For example, if you add hash tables to the
|
|
language, it would probably make sense to add the routines to a runtime,
|
|
instead of inlining them all the way.
|
|
|
|
- ***memory management*** - Currently we can only access the stack in Kaleidoscope.
|
|
It would also be useful to be able to allocate heap memory, either with calls
|
|
to the standard libc malloc/free interface or with a garbage collector.
|
|
If you would like to use garbage collection, note that LLVM fully supports
|
|
`Accurate Garbage Collection <http://www.llvm.org/docs/GarbageCollection.html>`_
|
|
including algorithms that move objects and need to scan/update the stack.
|
|
|
|
- ***debugger support*** - LLVM supports generation of `DWARF Debug info
|
|
<http://www.llvm.org/docs/SourceLevelDebugging.html>`_ which
|
|
is understood by common debuggers like GDB. Adding support for debug info is
|
|
fairly straightforward. The best way to understand it is to compile some C/C++
|
|
code with ``llvm-gcc -g -O0`` and taking a look at what it produces.
|
|
|
|
- ***exception handling support*** - LLVM supports generation of `zero cost exceptions
|
|
<http://www.llvm.org/docs/ExceptionHandling.html>`_ which interoperate
|
|
with code compiled in other languages. You could also generate code by
|
|
implicitly making every function return an error value and checking it.
|
|
You could also make explicit use of setjmp/longjmp. There are many different
|
|
ways to go here.
|
|
|
|
- ***object orientation, generics, database access, complex numbers, geometric
|
|
programming, ...*** - Really, there is no end of crazy features that you can
|
|
add to the language.
|
|
|
|
- ***unusual domains*** - We've been talking about applying LLVM to a domain that
|
|
many people are interested in: building a compiler for a specific language.
|
|
However, there are many other domains that can use compiler technology that are
|
|
not typically considered. For example, LLVM has been used to implement OpenGL
|
|
graphics acceleration, translate C++ code to ActionScript, and many other cute
|
|
and clever things. Maybe you will be the first to JIT compile a regular expression
|
|
interpreter into native code with LLVM?
|
|
|
|
- ***Have fun*** - try doing something crazy and unusual. Building a language like
|
|
everyone else always has, is much less fun than trying something a little crazy or
|
|
off the wall and seeing how it turns out. If you get stuck or want to talk about it,
|
|
feel free to email the `llvmdev mailing list
|
|
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_: it has lots of people who are
|
|
interested in languages and are often willing to help out.
|
|
|
|
Before we end this tutorial, I want to talk about some "tips and tricks" for
|
|
generating LLVM IR. These are some of the more subtle things that may not be obvious,
|
|
but are very useful if you want to take advantage of LLVM's capabilities.
|
|
|
|
Properties of the LLVM IR
|
|
=========================
|
|
|
|
We have a couple common questions about code in the LLVM IR form - let's
|
|
just get these out of the way right now, shall we?
|
|
|
|
--------------
|
|
|
|
Target Independence
|
|
-------------------
|
|
|
|
Kaleidoscope is an example of a "portable language": any program
|
|
written in Kaleidoscope will work the same way on any target that it
|
|
runs on. Many other languages have this property, e.g. LISP, Java, Haskell,
|
|
Javascript, Python, etc. (note that while these languages are portable,
|
|
not all their libraries are).
|
|
|
|
One nice aspect of LLVM is that it is often capable of preserving target
|
|
independence in the IR: you can take the LLVM IR for a Kaleidoscope-compiled
|
|
program and run it on any target that LLVM supports, even emitting C code and
|
|
compiling that on targets that LLVM doesn't support natively.
|
|
You can trivially tell that the Kaleidoscope compiler generates target-
|
|
independent code because it never queries for any target-specific
|
|
information when generating code.
|
|
|
|
The fact that LLVM provides a compact, target-independent,
|
|
representation for code gets a lot of people excited. Unfortunately,
|
|
these people are usually thinking about C or a language from the
|
|
C family when they are asking questions about language portability.
|
|
I say "unfortunately", because there is really no way to make (fully general) C
|
|
code portable, other than shipping the source code around (and of course, C
|
|
source code is not actually portable in general either - ever port a really old
|
|
application from 32- to 64-bits?).
|
|
|
|
The problem with C (again, in its full generality) is that it is heavily
|
|
laden with target specific assumptions. As one simple example, the
|
|
preprocessor often destructively removes target-independence from the code
|
|
when it processes the input text::
|
|
|
|
|
|
#ifdef __i386__
|
|
int X = 1;
|
|
#else
|
|
int X = 42;
|
|
#endif
|
|
|
|
While it is possible to engineer more and more complex solutions to problems like
|
|
this, it cannot be solved in full generality in a way that is better than
|
|
shipping the actual source code.
|
|
|
|
That said, there are interesting subsets of C that can be made portable.
|
|
If you are willing to fix primitive types to a fixed size (say int = 32-bits, and
|
|
long = 64-bits), don't care about ABI compatibility with existing binaries, and
|
|
are willing to give up some other minor features, you can have portable code.
|
|
This can make sense for specialized domains such as an in-kernel language.
|
|
|
|
--------------
|
|
|
|
Safety Guarantees
|
|
-----------------
|
|
|
|
Many of the languages above are also "safe" languages: it is
|
|
impossible for a program written in Java to corrupt its address space and
|
|
crash the process (assuming the JVM has no bugs). Safety is an
|
|
interesting property that requires a combination of language design,
|
|
runtime support, and often operating system support.
|
|
|
|
It is certainly possible to implement a safe language in LLVM, but LLVM
|
|
IR does not itself guarantee safety. The LLVM IR allows unsafe pointer casts,
|
|
use after free bugs, buffer over-runs, and a variety of other problems. Safety
|
|
needs to be implemented as a layer on top of LLVM and, conveniently, several groups
|
|
have investigated this. Ask on the `llvmdev mailing list
|
|
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_ if you are interested
|
|
in more details.
|
|
|
|
--------------
|
|
|
|
Language-Specific Optimizations
|
|
-------------------------------
|
|
|
|
One thing about LLVM that turns off many people is that it does not solve all
|
|
the world's problems in one system (sorry 'world hunger', someone else will
|
|
have to solve you some other day). One specific complaint is that people perceive
|
|
LLVM as being incapable of performing high-level language-specific optimization:
|
|
LLVM "loses too much information".
|
|
|
|
Unfortunately, this is really not the place to give you a full and unified
|
|
version of "Chris Lattner's theory of compiler design". Instead,
|
|
I'll make a few observations:
|
|
|
|
First, you're right that LLVM does lose information.
|
|
For example, as of this writing, there is no way to
|
|
distinguish in the LLVM IR whether an SSA-value came
|
|
from a C "int" or a C "long" on an ILP32 machine
|
|
(other than debug info). Both get compiled down to an 'i32'
|
|
value and the information about what it came from is lost.
|
|
The more general issue here, is that the LLVM type system
|
|
uses "structural equivalence" instead of "name equivalence".
|
|
Another place this surprises people is if you have two types
|
|
in a high-level language that have the same structure (e.g.
|
|
two different structs that have a single int field):
|
|
these types will compile down into a single LLVM type and it
|
|
will be impossible to tell what it came from.
|
|
|
|
Second, while LLVM does lose information, LLVM is not a
|
|
fixed target: we continue to enhance and improve it in many
|
|
different ways. In addition to adding new features (LLVM did not
|
|
always support exceptions or debug info), we also extend the IR to
|
|
capture important information for optimization (e.g. whether an argument
|
|
is sign or zero extended, information about pointers aliasing, etc). Many
|
|
of the enhancements are user-driven: people want LLVM to include some specific
|
|
feature, so they go ahead and extend it.
|
|
|
|
Third, it is possible and easy to add language-specific optimizations,
|
|
and you have a number of choices in how to do it. As one trivial example,
|
|
it is easy to add language-specific optimization passes that "know" things
|
|
about code compiled for a language. In the case of the C family, there is an
|
|
optimization pass that "knows" about the standard C library functions. If you
|
|
call "exit(0)" in main(), it knows that it is safe to optimize that into "return
|
|
0;" because C specifies what the 'exit' function does.
|
|
|
|
In addition to simple library knowledge, it is possible to embed a
|
|
variety of other language-specific information into the LLVM IR. If
|
|
you have a specific need and run into a wall, please bring the topic
|
|
up on the llvmdev list. At the very worst, you can always treat LLVM as
|
|
if it were a "dumb code generator" and implement the high-level optimizations
|
|
you desire in your front-end, on the language-specific AST.
|
|
|
|
--------------
|
|
|
|
Tips and Tricks
|
|
===============
|
|
|
|
There is a variety of useful tips and tricks that you come to
|
|
know after working on/with LLVM that aren't obvious at first glance.
|
|
Instead of letting everyone rediscover them, this section talks about
|
|
some of these issues.
|
|
|
|
--------------
|
|
|
|
Implementing portable offsetof/sizeof
|
|
-------------------------------------
|
|
|
|
One interesting thing that comes up, if you are trying to keep the
|
|
code generated by your compiler "target independent", is that you often
|
|
need to know the size of some LLVM type or the offset of some field in an
|
|
llvm structure. For example, you might need to pass the size of a type into
|
|
a function that allocates memory.
|
|
|
|
Unfortunately, this can vary widely across targets: for example the width
|
|
of a pointer is trivially target-specific. However, there is a `clever way
|
|
to use the getelementptr instruction
|
|
<http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt>`_ that
|
|
allows you to compute this in a portable way.
|
|
|
|
--------------
|
|
|
|
Garbage Collected Stack Frames
|
|
------------------------------
|
|
|
|
Some languages want to explicitly manage their stack frames, often
|
|
so that they are garbage collected or to allow easy implementation
|
|
of closures. There are often better ways to implement these features
|
|
than explicit stack frames, but `LLVM does support them
|
|
<http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt>`_, if you want.
|
|
It requires your front-end to convert the code into `Continuation Passing
|
|
Style <http://en.wikipedia.org/wiki/Continuation-passing_style>`_
|
|
and the use of tail calls (which LLVM also supports). |