From 882c9e0271d90e1d07b125cfb0aa9d89e6871f09 Mon Sep 17 00:00:00 2001 From: Maggie Mari Date: Fri, 17 Aug 2012 16:55:23 -0500 Subject: [PATCH] Finished editting PythonLangImpl8.rst in docs. --- .../doc/kaleidoscope/PythonLangImpl1.rst | 2 +- .../doc/kaleidoscope/PythonLangImpl8.rst | 275 ++++++++++++++++++ 2 files changed, 276 insertions(+), 1 deletion(-) diff --git a/docs/source/doc/kaleidoscope/PythonLangImpl1.rst b/docs/source/doc/kaleidoscope/PythonLangImpl1.rst index 0a10860..09d894e 100644 --- a/docs/source/doc/kaleidoscope/PythonLangImpl1.rst +++ b/docs/source/doc/kaleidoscope/PythonLangImpl1.rst @@ -194,7 +194,7 @@ numeric value of a number). First, we define the possibilities: Each token yielded by our lexer will be of one of the above types. For simple tokens that are always the same, like the "def" keyword, the -lexer will yield ``DefToken()``>. Identifiers, numbers and characters, +lexer will yield ``DefToken()``. Identifiers, numbers and characters, on the other hand, have extra data, so when the lexer encounteres the number 123.45, it will emit it as ``NumberToken(123.45)``. An identifier ``foo`` will be emitted as ``IdentifierToken('foo')``. And finally, an diff --git a/docs/source/doc/kaleidoscope/PythonLangImpl8.rst b/docs/source/doc/kaleidoscope/PythonLangImpl8.rst index e69de29..dd4abc1 100644 --- a/docs/source/doc/kaleidoscope/PythonLangImpl8.rst +++ b/docs/source/doc/kaleidoscope/PythonLangImpl8.rst @@ -0,0 +1,275 @@ +*************************************************** +Chapter 8: Conclusion and other useful LLVM tidbits +*************************************************** + +Written by Chris Lattner + +Tutorial Conclusion +=================== + +Welcome to the final chapter of the "`Implementing a language with LLVM +`_" tutorial. +In the course of this tutorial, we have grown our little Kaleidoscope language +from being a useless toy to being a semi-interesting (but probably still useless) +toy. 
:) + +It is interesting to see how far we've come, and how little code it has taken. +We built the entire lexer, parser, AST, code generator, and an interactive run-loop +(with a JIT!) by hand in under 540 lines of (non-comment/non-blank) code. + +Our little language supports a couple of interesting features: it supports user-defined +binary and unary operators, it uses JIT compilation for immediate evaluation, +and it supports a few control flow constructs with SSA construction. + +Part of the idea of this tutorial was to show you how easy and fun it can be to +define, build, and play with languages. Building a compiler need not be a scary +or mystical process! Now that you've seen some of the basics, I strongly encourage +you to take the code and hack on it. For example, try adding: + + - ***global variables*** - While global variables have questionable value in modern + software engineering, they are often useful when putting together quick + little hacks like the Kaleidoscope compiler itself. Fortunately, our + current setup makes it very easy to add global variables: just have value + lookup check to see if an unresolved variable is in the global variable + symbol table before rejecting it. To create a new global variable, make + an instance of the LLVM GlobalVariable class. + + - ***typed variables*** - Kaleidoscope currently only supports variables of type + double. This gives the language a very nice elegance, because only supporting + one type means that you never have to specify types. Different languages have + different ways of handling this. The easiest way is to require the user to + specify types for every variable definition, and record the type of the variable + in the symbol table along with its Value*. + + - ***arrays, structs, vectors, etc*** - Once you add types, you can + start extending the type system in all sorts of + interesting ways. Simple arrays are very easy and are quite useful + for many different applications. 
Adding them is mostly an + exercise in learning how the LLVM `getelementptr + `_ instruction works: + it is so nifty/unconventional, it `has its own FAQ! + `_ If you add + support for recursive types (e.g. linked lists), make sure to + read the `section in the LLVM Programmer's Manual + `_ that describes + how to construct them. + + - ***standard runtime*** - Our current language allows the user to + access arbitrary external functions, and we use it for things like "putchard". + As you extend the language to add higher-level constructs, often these + constructs make the most sense if they are lowered to calls into a + language-supplied runtime. For example, if you add hash tables to the + language, it would probably make sense to add the routines to a runtime, + instead of inlining them all the way. + + - ***memory management*** - Currently we can only access the stack in Kaleidoscope. + It would also be useful to be able to allocate heap memory, either with calls + to the standard libc malloc/free interface or with a garbage collector. + If you would like to use garbage collection, note that LLVM fully supports + `Accurate Garbage Collection `_ + including algorithms that move objects and need to scan/update the stack. + + - ***debugger support*** - LLVM supports generation of `DWARF Debug info + `_ which + is understood by common debuggers like GDB. Adding support for debug info is + fairly straightforward. The best way to understand it is to compile some C/C++ + code with ``llvm-gcc -g -O0`` and take a look at what it produces. + + - ***exception handling support*** - LLVM supports generation of `zero cost exceptions + `_ which interoperate + with code compiled in other languages. You could also generate code by + implicitly making every function return an error value and checking it. + You could also make explicit use of setjmp/longjmp. There are many different + ways to go here. 
+ + - ***object orientation, generics, database access, complex numbers, geometric + programming, ...*** - Really, there is no end of crazy features that you can + add to the language. + + - ***unusual domains*** - We've been talking about applying LLVM to a domain that + many people are interested in: building a compiler for a specific language. + However, there are many other domains that can use compiler technology that are + not typically considered. For example, LLVM has been used to implement OpenGL + graphics acceleration, translate C++ code to ActionScript, and many other cute + and clever things. Maybe you will be the first to JIT compile a regular expression + interpreter into native code with LLVM? + + - ***Have fun*** - try doing something crazy and unusual. Building a language like + everyone else always has is much less fun than trying something a little crazy or + off the wall and seeing how it turns out. If you get stuck or want to talk about it, + feel free to email the `llvmdev mailing list + `_: it has lots of people who are + interested in languages and are often willing to help out. + +Before we end this tutorial, I want to talk about some "tips and tricks" for +generating LLVM IR. These are some of the more subtle things that may not be obvious, +but are very useful if you want to take advantage of LLVM's capabilities. + +Properties of the LLVM IR +========================= + +We have a couple of common questions about code in the LLVM IR form - let's +just get these out of the way right now, shall we? + +-------------- + +Target Independence +------------------- + +Kaleidoscope is an example of a "portable language": any program +written in Kaleidoscope will work the same way on any target that it +runs on. Many other languages have this property, e.g. LISP, Java, Haskell, +JavaScript, Python, etc. (note that while these languages are portable, +not all their libraries are). 
+ +One nice aspect of LLVM is that it is often capable of preserving target +independence in the IR: you can take the LLVM IR for a Kaleidoscope-compiled +program and run it on any target that LLVM supports, even emitting C code and +compiling that on targets that LLVM doesn't support natively. +You can trivially tell that the Kaleidoscope compiler generates +target-independent code because it never queries for any target-specific +information when generating code. + +The fact that LLVM provides a compact, target-independent +representation for code gets a lot of people excited. Unfortunately, +these people are usually thinking about C or a language from the +C family when they are asking questions about language portability. +I say "unfortunately", because there is really no way to make (fully general) C +code portable, other than shipping the source code around (and of course, C +source code is not actually portable in general either - ever port a really old +application from 32- to 64-bits?). + +The problem with C (again, in its full generality) is that it is heavily +laden with target-specific assumptions. As one simple example, the +preprocessor often destructively removes target-independence from the code +when it processes the input text:: + + + #ifdef __i386__ + int X = 1; + #else + int X = 42; + #endif + +While it is possible to engineer more and more complex solutions to problems like +this, it cannot be solved in full generality in a way that is better than +shipping the actual source code. + +That said, there are interesting subsets of C that can be made portable. +If you are willing to fix primitive types to a fixed size (say int = 32-bits, and +long = 64-bits), don't care about ABI compatibility with existing binaries, and +are willing to give up some other minor features, you can have portable code. +This can make sense for specialized domains such as an in-kernel language. 
+ +-------------- + +Safety Guarantees +----------------- + +Many of the languages above are also "safe" languages: it is +impossible for a program written in Java to corrupt its address space and +crash the process (assuming the JVM has no bugs). Safety is an +interesting property that requires a combination of language design, +runtime support, and often operating system support. + +It is certainly possible to implement a safe language in LLVM, but LLVM +IR does not itself guarantee safety. The LLVM IR allows unsafe pointer casts, +use-after-free bugs, buffer overruns, and a variety of other problems. Safety +needs to be implemented as a layer on top of LLVM and, conveniently, several groups +have investigated this. Ask on the `llvmdev mailing list +`_ if you are interested +in more details. + +-------------- + +Language-Specific Optimizations +------------------------------- + +One thing about LLVM that turns off many people is that it does not solve all +the world's problems in one system (sorry 'world hunger', someone else will +have to solve you some other day). One specific complaint is that people perceive +LLVM as being incapable of performing high-level language-specific optimization: +LLVM "loses too much information". + +Unfortunately, this is really not the place to give you a full and unified +version of "Chris Lattner's theory of compiler design". Instead, +I'll make a few observations: + +First, you're right that LLVM does lose information. +For example, as of this writing, there is no way to +distinguish in the LLVM IR whether an SSA-value came +from a C "int" or a C "long" on an ILP32 machine +(other than debug info). Both get compiled down to an 'i32' +value and the information about what it came from is lost. +The more general issue here is that the LLVM type system +uses "structural equivalence" instead of "name equivalence". 
+Another place this surprises people is if you have two types +in a high-level language that have the same structure (e.g. +two different structs that have a single int field): +these types will compile down into a single LLVM type and it +will be impossible to tell what it came from. + +Second, while LLVM does lose information, LLVM is not a +fixed target: we continue to enhance and improve it in many +different ways. In addition to adding new features (LLVM did not +always support exceptions or debug info), we also extend the IR to +capture important information for optimization (e.g. whether an argument +is sign- or zero-extended, information about pointers aliasing, etc). Many +of the enhancements are user-driven: people want LLVM to include some specific +feature, so they go ahead and extend it. + +Third, it is possible and easy to add language-specific optimizations, +and you have a number of choices in how to do it. As one trivial example, +it is easy to add language-specific optimization passes that "know" things +about code compiled for a language. In the case of the C family, there is an +optimization pass that "knows" about the standard C library functions. If you +call "exit(0)" in main(), it knows that it is safe to optimize that into "return +0;" because C specifies what the 'exit' function does. + +In addition to simple library knowledge, it is possible to embed a +variety of other language-specific information into the LLVM IR. If +you have a specific need and run into a wall, please bring the topic +up on the llvmdev list. At the very worst, you can always treat LLVM as +if it were a "dumb code generator" and implement the high-level optimizations +you desire in your front-end, on the language-specific AST. + +-------------- + +Tips and Tricks +=============== + +There is a variety of useful tips and tricks that you come to +know after working on/with LLVM that aren't obvious at first glance. 
+Instead of letting everyone rediscover them, this section talks about +some of these issues. + +-------------- + +Implementing portable offsetof/sizeof +------------------------------------- + +One interesting thing that comes up, if you are trying to keep the +code generated by your compiler "target independent", is that you often +need to know the size of some LLVM type or the offset of some field in an +LLVM structure. For example, you might need to pass the size of a type into +a function that allocates memory. + +Unfortunately, this can vary widely across targets: for example, the width +of a pointer is trivially target-specific. However, there is a `clever way +to use the getelementptr instruction +`_ that +allows you to compute this in a portable way. + +-------------- + +Garbage Collected Stack Frames +------------------------------ + +Some languages want to explicitly manage their stack frames, often +so that they are garbage collected or to allow easy implementation +of closures. There are often better ways to implement these features +than explicit stack frames, but `LLVM does support them +`_, if you want. +It requires your front-end to convert the code into `Continuation Passing +Style `_ +and the use of tail calls (which LLVM also supports). \ No newline at end of file