Finished editting PythonLangImpl8.rst in docs.

2012-08-17 16:55:23 -05:00 · 2012-08-17 16:55:23 -05:00 · 882c9e0271
commit 882c9e0271
parent f74f5129bd
2 changed files with 276 additions and 1 deletions
--- a/docs/source/doc/kaleidoscope/PythonLangImpl1.rst
+++ b/docs/source/doc/kaleidoscope/PythonLangImpl1.rst
@ -194,7 +194,7 @@ numeric value of a number). First, we define the possibilities:

 Each token yielded by our lexer will be of one of the above types. For
 simple tokens that are always the same, like the "def" keyword, the
-lexer will yield ``DefToken()``>. Identifiers, numbers and characters,
+lexer will yield ``DefToken()``. Identifiers, numbers and characters,
 on the other hand, have extra data, so when the lexer encounteres the
 number 123.45, it will emit it as ``NumberToken(123.45)``. An identifier
 ``foo`` will be emitted as ``IdentifierToken('foo')``. And finally, an
--- a/docs/source/doc/kaleidoscope/PythonLangImpl8.rst
+++ b/docs/source/doc/kaleidoscope/PythonLangImpl8.rst
@ -0,0 +1,275 @@
+*************************************************
+Chapter 8: Conclusion and other useful LLVM tidbits
+*************************************************
+
+Written by Chris Lattner
+
+Tutorial Conclusion
+===================
+
+Welcome to the the final chapter of the "`Implementing a language with LLVM 
+<http://www.llvm.org/docs/tutorial/index.html>`_" tutorial. 
+In the course of this tutorial, we have grown our little Kaleidoscope language 
+from being a useless toy, to being a semi-interesting (but probably still useless) 
+toy. :)
+
+It is interesting to see how far we've come, and how little code it has taken. 
+We built the entire lexer, parser, AST, code generator, and an interactive run-loop 
+(with a JIT!) by-hand in under 540 lines of (non-comment/non-blank) code.
+
+Our little language supports a couple of interesting features: it supports user 
+defined binary and unary operators, it uses JIT compilation for immediate evaluation,
+and it supports a few control flow constructs with SSA construction.
+
+Part of the idea of this tutorial was to show you how easy and fun it can be to 
+define, build, and play with languages. Building a compiler need not be a scary 
+or mystical process! Now that you've seen some of the basics, I strongly encourage 
+you to take the code and hack on it. For example, try adding:
+
+	-	***global variables*** - While global variables have questional value in modern 
+		software engineering, they are often useful when putting together quick 
+		little hacks like the Kaleidoscope compiler itself. Fortunately, our 
+		current setup makes it very easy to add global variables: just have value 
+		lookup check to see if an unresolved variable is in the global variable 
+		symbol table before rejecting it. To create a new global variable, make 
+		an instance of the LLVM GlobalVariable class.
+
+	-	***typed variables*** - Kaleidoscope currently only supports variables of type 
+		double. This gives the language a very nice elegance, because only supporting 
+		one type means that you never have to specify types. Different languages have 
+		different ways of handling this. The easiest way is to require the user to 
+		specify types for every variable definition, and record the type of the variable 
+		in the symbol table along with its Value*.
+		
+	-	***arrays, structs, vectors, etc*** - Once you add types, you can 
+		start extending the type system in all sorts of 
+		interesting ways. Simple arrays are very easy and are quite useful 
+		for many different applications. Adding them is mostly an 
+		exercise in learning how the LLVM `getelementptr 
+		<http://www.llvm.org/docs/LangRef.html#i_getelementptr>`_ instruction works: 
+		it is so nifty/unconventional, it `has its own FAQ! 
+		<http://www.llvm.org/docs/GetElementPtr.html>`_ If you add 
+		support for recursive types (e.g. linked lists), make sure to 
+		read the `section in the LLVM Programmer's Manual 
+		<http://www.llvm.org/docs/ProgrammersManual.html#TypeResolve>`_ that describes 
+		how to construct them.
+
+	-	***standard runtime*** - Our current language allows the user to 
+		access arbitrary external functions, and we use it for things like "putchard". 
+		As you extend the language to add higher-level constructs, often these 
+		constructs make the most sense if they are lowered to calls into a 
+		language-supplied runtime. For example, if you add hash tables to the 
+		language, it would probably make sense to add the routines to a runtime, 
+		instead of inlining them all the way.
+
+	-	***memory management*** - Currently we can only access the stack in Kaleidoscope. 
+		It would also be useful to be able to allocate heap memory, either with calls 
+		to the standard libc malloc/free interface or with a garbage collector. 
+		If you would like to use garbage collection, note that LLVM fully supports 
+		`Accurate Garbage Collection <http://www.llvm.org/docs/GarbageCollection.html>`_ 
+		including algorithms that move objects and need to scan/update the stack.
+
+	-	***debugger support*** - LLVM supports generation of `DWARF Debug info 
+		<http://www.llvm.org/docs/SourceLevelDebugging.html>`_ which 
+		is understood by common debuggers like GDB. Adding support for debug info is 
+		fairly straightforward. The best way to understand it is to compile some C/C++ 
+		code with ``llvm-gcc -g -O0`` and taking a look at what it produces.
+
+	-	***exception handling support*** - LLVM supports generation of `zero cost exceptions 
+		<http://www.llvm.org/docs/ExceptionHandling.html>`_ which interoperate 
+		with code compiled in other languages. You could also generate code by 
+		implicitly making every function return an error value and checking it. 
+		You could also make explicit use of setjmp/longjmp. There are many different 
+		ways to go here.
+		
+	-	***object orientation, generics, database access, complex numbers, geometric 
+		programming, ...*** - Really, there is no end of crazy features that you can 
+		add to the language.
+
+	-	***unusual domains*** - We've been talking about applying LLVM to a domain that 
+		many people are interested in: building a compiler for a specific language. 
+		However, there are many other domains that can use compiler technology that are 
+		not typically considered. For example, LLVM has been used to implement OpenGL 
+		graphics acceleration, translate C++ code to ActionScript, and many other cute 
+		and clever things. Maybe you will be the first to JIT compile a regular expression 
+		interpreter into native code with LLVM?
+
+	-	***Have fun*** - try doing something crazy and unusual. Building a language like 
+		everyone else always has, is much less fun than trying something a little crazy or 
+		off the wall and seeing how it turns out. If you get stuck or want to talk about it, 
+		feel free to email the `llvmdev mailing list 
+		<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_: it has lots of people who are 
+		interested in languages and are often willing to help out.
+
+Before we end this tutorial, I want to talk about some "tips and tricks" for 
+generating LLVM IR. These are some of the more subtle things that may not be obvious, 
+but are very useful if you want to take advantage of LLVM's capabilities.
+
+Properties of the LLVM IR
+========================
+
+We have a couple common questions about code in the LLVM IR form - let's 
+just get these out of the way right now, shall we?
+
+--------------
+
+Target Independence
+-------------------
+
+Kaleidoscope is an example of a "portable language": any program 
+written in Kaleidoscope will work the same way on any target that it 
+runs on. Many other languages have this property, e.g. LISP, Java, Haskell,
+Javascript, Python, etc. (note that while these languages are portable, 
+not all their libraries are).
+
+One nice aspect of LLVM is that it is often capable of preserving target 
+independence in the IR: you can take the LLVM IR for a Kaleidoscope-compiled 
+program and run it on any target that LLVM supports, even emitting C code and 
+compiling that on targets that LLVM doesn't support natively. 
+You can trivially tell that the Kaleidoscope compiler generates target-
+independent code because it never queries for any target-specific 
+information when generating code.
+
+The fact that LLVM provides a compact, target-independent, 
+representation for code gets a lot of people excited. Unfortunately, 
+these people are usually thinking about C or a language from the 
+C family when they are asking questions about language portability. 
+I say "unfortunately", because there is really no way to make (fully general) C 
+code portable, other than shipping the source code around (and of course, C 
+source code is not actually portable in general either - ever port a really old 
+application from 32- to 64-bits?).
+
+The problem with C (again, in its full generality) is that it is heavily 
+laden with target specific assumptions. As one simple example, the 
+preprocessor often destructively removes target-independence from the code 
+when it processes the input text::
+
+
+	#ifdef __i386__
+	  int X = 1;
+	#else
+	  int X = 42;
+	#endif
+
+While it is possible to engineer more and more complex solutions to problems like 
+this, it cannot be solved in full generality in a way that is better than 
+shipping the actual source code.
+
+That said, there are interesting subsets of C that can be made portable. 
+If you are willing to fix primitive types to a fixed size (say int = 32-bits, and 
+long = 64-bits), don't care about ABI compatibility with existing binaries, and 
+are willing to give up some other minor features, you can have portable code. 
+This can make sense for specialized domains such as an in-kernel language.
+
+--------------
+
+Safety Guarantees
+----------------
+
+Many of the languages above are also "safe" languages: it is 
+impossible for a program written in Java to corrupt its address space and 
+crash the process (assuming the JVM has no bugs). Safety is an 
+interesting property that requires a combination of language design, 
+runtime support, and often operating system support.
+
+It is certainly possible to implement a safe language in LLVM, but LLVM 
+IR does not itself guarantee safety. The LLVM IR allows unsafe pointer casts, 
+use after free bugs, buffer over-runs, and a variety of other problems. Safety 
+needs to be implemented as a layer on top of LLVM and, conveniently, several groups 
+have investigated this. Ask on the `llvmdev mailing list 
+<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_ if you are interested 
+in more details.
+
+--------------
+
+Language-Specific Optimizations
+-------------------------------
+
+One thing about LLVM that turns off many people is that it does not solve all 
+the world's problems in one system (sorry 'world hunger', someone else will 
+have to solve you some other day). One specific complaint is that people perceive 
+LLVM as being incapable of performing high-level language-specific optimization: 
+LLVM "loses too much information".
+
+Unfortunately, this is really not the place to give you a full and unified 
+version of "Chris Lattner's theory of compiler design". Instead, 
+I'll make a few observations:
+
+First, you're right that LLVM does lose information. 
+For example, as of this writing, there is no way to 
+distinguish in the LLVM IR whether an SSA-value came 
+from a C "int" or a C "long" on an ILP32 machine 
+(other than debug info). Both get compiled down to an 'i32' 
+value and the information about what it came from is lost. 
+The more general issue here, is that the LLVM type system 
+uses "structural equivalence" instead of "name equivalence". 
+Another place this surprises people is if you have two types 
+in a high-level language that have the same structure (e.g. 
+two different structs that have a single int field): 
+these types will compile down into a single LLVM type and it 
+will be impossible to tell what it came from.
+
+Second, while LLVM does lose information, LLVM is not a 
+fixed target: we continue to enhance and improve it in many 
+different ways. In addition to adding new features (LLVM did not 
+always support exceptions or debug info), we also extend the IR to 
+capture important information for optimization (e.g. whether an argument 
+is sign or zero extended, information about pointers aliasing, etc). Many 
+of the enhancements are user-driven: people want LLVM to include some specific 
+feature, so they go ahead and extend it.
+
+Third, it is possible and easy to add language-specific optimizations, 
+and you have a number of choices in how to do it. As one trivial example, 
+it is easy to add language-specific optimization passes that "know" things 
+about code compiled for a language. In the case of the C family, there is an 
+optimization pass that "knows" about the standard C library functions. If you 
+call "exit(0)" in main(), it knows that it is safe to optimize that into "return 
+0;" because C specifies what the 'exit' function does.
+
+In addition to simple library knowledge, it is possible to embed a 
+variety of other language-specific information into the LLVM IR. If 
+you have a specific need and run into a wall, please bring the topic 
+up on the llvmdev list. At the very worst, you can always treat LLVM as 
+if it were a "dumb code generator" and implement the high-level optimizations 
+you desire in your front-end, on the language-specific AST.
+
+--------------
+
+Tips and Tricks
+==============
+
+There is a variety of useful tips and tricks that you come to 
+know after working on/with LLVM that aren't obvious at first glance. 
+Instead of letting everyone rediscover them, this section talks about 
+some of these issues.
+
+--------------
+
+Implementing portable offsetof/sizeof
+-------------------------------------
+
+One interesting thing that comes up, if you are trying to keep the 
+code generated by your compiler "target independent", is that you often 
+need to know the size of some LLVM type or the offset of some field in an 
+llvm structure. For example, you might need to pass the size of a type into 
+a function that allocates memory.
+
+Unfortunately, this can vary widely across targets: for example the width 
+of a pointer is trivially target-specific. However, there is a `clever way 
+to use the getelementptr instruction 
+<http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt>`_ that 
+allows you to compute this in a portable way.
+
+--------------
+
+Garbage Collected Stack Frames
+------------------------------
+
+Some languages want to explicitly manage their stack frames, often 
+so that they are garbage collected or to allow easy implementation 
+of closures. There are often better ways to implement these features 
+than explicit stack frames, but `LLVM does support them 
+<http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt>`_, if you want. 
+It requires your front-end to convert the code into `Continuation Passing 
+Style <http://en.wikipedia.org/wiki/Continuation-passing_style>`_ 
+and the use of tail calls (which LLVM also supports).