Finished editting PythonLangImpl8.rst in docs.
This commit is contained in:
parent
f74f5129bd
commit
882c9e0271
2 changed files with 276 additions and 1 deletions
|
|
@ -194,7 +194,7 @@ numeric value of a number). First, we define the possibilities:
|
|||
|
||||
Each token yielded by our lexer will be of one of the above types. For
|
||||
simple tokens that are always the same, like the "def" keyword, the
|
||||
lexer will yield ``DefToken()``>. Identifiers, numbers and characters,
|
||||
lexer will yield ``DefToken()``. Identifiers, numbers and characters,
|
||||
on the other hand, have extra data, so when the lexer encounteres the
|
||||
number 123.45, it will emit it as ``NumberToken(123.45)``. An identifier
|
||||
``foo`` will be emitted as ``IdentifierToken('foo')``. And finally, an
|
||||
|
|
|
|||
|
|
@ -0,0 +1,275 @@
|
|||
*************************************************
|
||||
Chapter 8: Conclusion and other useful LLVM tidbits
|
||||
*************************************************
|
||||
|
||||
Written by Chris Lattner
|
||||
|
||||
Tutorial Conclusion
|
||||
===================
|
||||
|
||||
Welcome to the the final chapter of the "`Implementing a language with LLVM
|
||||
<http://www.llvm.org/docs/tutorial/index.html>`_" tutorial.
|
||||
In the course of this tutorial, we have grown our little Kaleidoscope language
|
||||
from being a useless toy, to being a semi-interesting (but probably still useless)
|
||||
toy. :)
|
||||
|
||||
It is interesting to see how far we've come, and how little code it has taken.
|
||||
We built the entire lexer, parser, AST, code generator, and an interactive run-loop
|
||||
(with a JIT!) by-hand in under 540 lines of (non-comment/non-blank) code.
|
||||
|
||||
Our little language supports a couple of interesting features: it supports user
|
||||
defined binary and unary operators, it uses JIT compilation for immediate evaluation,
|
||||
and it supports a few control flow constructs with SSA construction.
|
||||
|
||||
Part of the idea of this tutorial was to show you how easy and fun it can be to
|
||||
define, build, and play with languages. Building a compiler need not be a scary
|
||||
or mystical process! Now that you've seen some of the basics, I strongly encourage
|
||||
you to take the code and hack on it. For example, try adding:
|
||||
|
||||
- ***global variables*** - While global variables have questional value in modern
|
||||
software engineering, they are often useful when putting together quick
|
||||
little hacks like the Kaleidoscope compiler itself. Fortunately, our
|
||||
current setup makes it very easy to add global variables: just have value
|
||||
lookup check to see if an unresolved variable is in the global variable
|
||||
symbol table before rejecting it. To create a new global variable, make
|
||||
an instance of the LLVM GlobalVariable class.
|
||||
|
||||
- ***typed variables*** - Kaleidoscope currently only supports variables of type
|
||||
double. This gives the language a very nice elegance, because only supporting
|
||||
one type means that you never have to specify types. Different languages have
|
||||
different ways of handling this. The easiest way is to require the user to
|
||||
specify types for every variable definition, and record the type of the variable
|
||||
in the symbol table along with its Value*.
|
||||
|
||||
- ***arrays, structs, vectors, etc*** - Once you add types, you can
|
||||
start extending the type system in all sorts of
|
||||
interesting ways. Simple arrays are very easy and are quite useful
|
||||
for many different applications. Adding them is mostly an
|
||||
exercise in learning how the LLVM `getelementptr
|
||||
<http://www.llvm.org/docs/LangRef.html#i_getelementptr>`_ instruction works:
|
||||
it is so nifty/unconventional, it `has its own FAQ!
|
||||
<http://www.llvm.org/docs/GetElementPtr.html>`_ If you add
|
||||
support for recursive types (e.g. linked lists), make sure to
|
||||
read the `section in the LLVM Programmer's Manual
|
||||
<http://www.llvm.org/docs/ProgrammersManual.html#TypeResolve>`_ that describes
|
||||
how to construct them.
|
||||
|
||||
- ***standard runtime*** - Our current language allows the user to
|
||||
access arbitrary external functions, and we use it for things like "putchard".
|
||||
As you extend the language to add higher-level constructs, often these
|
||||
constructs make the most sense if they are lowered to calls into a
|
||||
language-supplied runtime. For example, if you add hash tables to the
|
||||
language, it would probably make sense to add the routines to a runtime,
|
||||
instead of inlining them all the way.
|
||||
|
||||
- ***memory management*** - Currently we can only access the stack in Kaleidoscope.
|
||||
It would also be useful to be able to allocate heap memory, either with calls
|
||||
to the standard libc malloc/free interface or with a garbage collector.
|
||||
If you would like to use garbage collection, note that LLVM fully supports
|
||||
`Accurate Garbage Collection <http://www.llvm.org/docs/GarbageCollection.html>`_
|
||||
including algorithms that move objects and need to scan/update the stack.
|
||||
|
||||
- ***debugger support*** - LLVM supports generation of `DWARF Debug info
|
||||
<http://www.llvm.org/docs/SourceLevelDebugging.html>`_ which
|
||||
is understood by common debuggers like GDB. Adding support for debug info is
|
||||
fairly straightforward. The best way to understand it is to compile some C/C++
|
||||
code with ``llvm-gcc -g -O0`` and taking a look at what it produces.
|
||||
|
||||
- ***exception handling support*** - LLVM supports generation of `zero cost exceptions
|
||||
<http://www.llvm.org/docs/ExceptionHandling.html>`_ which interoperate
|
||||
with code compiled in other languages. You could also generate code by
|
||||
implicitly making every function return an error value and checking it.
|
||||
You could also make explicit use of setjmp/longjmp. There are many different
|
||||
ways to go here.
|
||||
|
||||
- ***object orientation, generics, database access, complex numbers, geometric
|
||||
programming, ...*** - Really, there is no end of crazy features that you can
|
||||
add to the language.
|
||||
|
||||
- ***unusual domains*** - We've been talking about applying LLVM to a domain that
|
||||
many people are interested in: building a compiler for a specific language.
|
||||
However, there are many other domains that can use compiler technology that are
|
||||
not typically considered. For example, LLVM has been used to implement OpenGL
|
||||
graphics acceleration, translate C++ code to ActionScript, and many other cute
|
||||
and clever things. Maybe you will be the first to JIT compile a regular expression
|
||||
interpreter into native code with LLVM?
|
||||
|
||||
- ***Have fun*** - try doing something crazy and unusual. Building a language like
|
||||
everyone else always has, is much less fun than trying something a little crazy or
|
||||
off the wall and seeing how it turns out. If you get stuck or want to talk about it,
|
||||
feel free to email the `llvmdev mailing list
|
||||
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_: it has lots of people who are
|
||||
interested in languages and are often willing to help out.
|
||||
|
||||
Before we end this tutorial, I want to talk about some "tips and tricks" for
|
||||
generating LLVM IR. These are some of the more subtle things that may not be obvious,
|
||||
but are very useful if you want to take advantage of LLVM's capabilities.
|
||||
|
||||
Properties of the LLVM IR
|
||||
========================
|
||||
|
||||
We have a couple common questions about code in the LLVM IR form - let's
|
||||
just get these out of the way right now, shall we?
|
||||
|
||||
--------------
|
||||
|
||||
Target Independence
|
||||
-------------------
|
||||
|
||||
Kaleidoscope is an example of a "portable language": any program
|
||||
written in Kaleidoscope will work the same way on any target that it
|
||||
runs on. Many other languages have this property, e.g. LISP, Java, Haskell,
|
||||
Javascript, Python, etc. (note that while these languages are portable,
|
||||
not all their libraries are).
|
||||
|
||||
One nice aspect of LLVM is that it is often capable of preserving target
|
||||
independence in the IR: you can take the LLVM IR for a Kaleidoscope-compiled
|
||||
program and run it on any target that LLVM supports, even emitting C code and
|
||||
compiling that on targets that LLVM doesn't support natively.
|
||||
You can trivially tell that the Kaleidoscope compiler generates target-
|
||||
independent code because it never queries for any target-specific
|
||||
information when generating code.
|
||||
|
||||
The fact that LLVM provides a compact, target-independent,
|
||||
representation for code gets a lot of people excited. Unfortunately,
|
||||
these people are usually thinking about C or a language from the
|
||||
C family when they are asking questions about language portability.
|
||||
I say "unfortunately", because there is really no way to make (fully general) C
|
||||
code portable, other than shipping the source code around (and of course, C
|
||||
source code is not actually portable in general either - ever port a really old
|
||||
application from 32- to 64-bits?).
|
||||
|
||||
The problem with C (again, in its full generality) is that it is heavily
|
||||
laden with target specific assumptions. As one simple example, the
|
||||
preprocessor often destructively removes target-independence from the code
|
||||
when it processes the input text::
|
||||
|
||||
|
||||
#ifdef __i386__
|
||||
int X = 1;
|
||||
#else
|
||||
int X = 42;
|
||||
#endif
|
||||
|
||||
While it is possible to engineer more and more complex solutions to problems like
|
||||
this, it cannot be solved in full generality in a way that is better than
|
||||
shipping the actual source code.
|
||||
|
||||
That said, there are interesting subsets of C that can be made portable.
|
||||
If you are willing to fix primitive types to a fixed size (say int = 32-bits, and
|
||||
long = 64-bits), don't care about ABI compatibility with existing binaries, and
|
||||
are willing to give up some other minor features, you can have portable code.
|
||||
This can make sense for specialized domains such as an in-kernel language.
|
||||
|
||||
--------------
|
||||
|
||||
Safety Guarantees
|
||||
----------------
|
||||
|
||||
Many of the languages above are also "safe" languages: it is
|
||||
impossible for a program written in Java to corrupt its address space and
|
||||
crash the process (assuming the JVM has no bugs). Safety is an
|
||||
interesting property that requires a combination of language design,
|
||||
runtime support, and often operating system support.
|
||||
|
||||
It is certainly possible to implement a safe language in LLVM, but LLVM
|
||||
IR does not itself guarantee safety. The LLVM IR allows unsafe pointer casts,
|
||||
use after free bugs, buffer over-runs, and a variety of other problems. Safety
|
||||
needs to be implemented as a layer on top of LLVM and, conveniently, several groups
|
||||
have investigated this. Ask on the `llvmdev mailing list
|
||||
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_ if you are interested
|
||||
in more details.
|
||||
|
||||
--------------
|
||||
|
||||
Language-Specific Optimizations
|
||||
-------------------------------
|
||||
|
||||
One thing about LLVM that turns off many people is that it does not solve all
|
||||
the world's problems in one system (sorry 'world hunger', someone else will
|
||||
have to solve you some other day). One specific complaint is that people perceive
|
||||
LLVM as being incapable of performing high-level language-specific optimization:
|
||||
LLVM "loses too much information".
|
||||
|
||||
Unfortunately, this is really not the place to give you a full and unified
|
||||
version of "Chris Lattner's theory of compiler design". Instead,
|
||||
I'll make a few observations:
|
||||
|
||||
First, you're right that LLVM does lose information.
|
||||
For example, as of this writing, there is no way to
|
||||
distinguish in the LLVM IR whether an SSA-value came
|
||||
from a C "int" or a C "long" on an ILP32 machine
|
||||
(other than debug info). Both get compiled down to an 'i32'
|
||||
value and the information about what it came from is lost.
|
||||
The more general issue here, is that the LLVM type system
|
||||
uses "structural equivalence" instead of "name equivalence".
|
||||
Another place this surprises people is if you have two types
|
||||
in a high-level language that have the same structure (e.g.
|
||||
two different structs that have a single int field):
|
||||
these types will compile down into a single LLVM type and it
|
||||
will be impossible to tell what it came from.
|
||||
|
||||
Second, while LLVM does lose information, LLVM is not a
|
||||
fixed target: we continue to enhance and improve it in many
|
||||
different ways. In addition to adding new features (LLVM did not
|
||||
always support exceptions or debug info), we also extend the IR to
|
||||
capture important information for optimization (e.g. whether an argument
|
||||
is sign or zero extended, information about pointers aliasing, etc). Many
|
||||
of the enhancements are user-driven: people want LLVM to include some specific
|
||||
feature, so they go ahead and extend it.
|
||||
|
||||
Third, it is possible and easy to add language-specific optimizations,
|
||||
and you have a number of choices in how to do it. As one trivial example,
|
||||
it is easy to add language-specific optimization passes that "know" things
|
||||
about code compiled for a language. In the case of the C family, there is an
|
||||
optimization pass that "knows" about the standard C library functions. If you
|
||||
call "exit(0)" in main(), it knows that it is safe to optimize that into "return
|
||||
0;" because C specifies what the 'exit' function does.
|
||||
|
||||
In addition to simple library knowledge, it is possible to embed a
|
||||
variety of other language-specific information into the LLVM IR. If
|
||||
you have a specific need and run into a wall, please bring the topic
|
||||
up on the llvmdev list. At the very worst, you can always treat LLVM as
|
||||
if it were a "dumb code generator" and implement the high-level optimizations
|
||||
you desire in your front-end, on the language-specific AST.
|
||||
|
||||
--------------
|
||||
|
||||
Tips and Tricks
|
||||
==============
|
||||
|
||||
There is a variety of useful tips and tricks that you come to
|
||||
know after working on/with LLVM that aren't obvious at first glance.
|
||||
Instead of letting everyone rediscover them, this section talks about
|
||||
some of these issues.
|
||||
|
||||
--------------
|
||||
|
||||
Implementing portable offsetof/sizeof
|
||||
-------------------------------------
|
||||
|
||||
One interesting thing that comes up, if you are trying to keep the
|
||||
code generated by your compiler "target independent", is that you often
|
||||
need to know the size of some LLVM type or the offset of some field in an
|
||||
llvm structure. For example, you might need to pass the size of a type into
|
||||
a function that allocates memory.
|
||||
|
||||
Unfortunately, this can vary widely across targets: for example the width
|
||||
of a pointer is trivially target-specific. However, there is a `clever way
|
||||
to use the getelementptr instruction
|
||||
<http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt>`_ that
|
||||
allows you to compute this in a portable way.
|
||||
|
||||
--------------
|
||||
|
||||
Garbage Collected Stack Frames
|
||||
------------------------------
|
||||
|
||||
Some languages want to explicitly manage their stack frames, often
|
||||
so that they are garbage collected or to allow easy implementation
|
||||
of closures. There are often better ways to implement these features
|
||||
than explicit stack frames, but `LLVM does support them
|
||||
<http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt>`_, if you want.
|
||||
It requires your front-end to convert the code into `Continuation Passing
|
||||
Style <http://en.wikipedia.org/wiki/Continuation-passing_style>`_
|
||||
and the use of tail calls (which LLVM also supports).
|
||||
Loading…
Add table
Add a link
Reference in a new issue