1087 lines
No EOL
37 KiB
ReStructuredText
1087 lines
No EOL
37 KiB
ReStructuredText
*******************************************
|
|
Chapter 3: Code generation to LLVM IR
|
|
*******************************************
|
|
|
|
Written by `Chris Lattner <mailto:sabre@nondot.org>`_ and `Max
|
|
Shawabkeh <http://max99x.com>`_
|
|
|
|
Introduction
|
|
=======================
|
|
|
|
Welcome to Chapter 3 of the `Implementing a language with
|
|
LLVM <http://www.llvm.org/docs/tutorial/index.html>`_ tutorial. This
|
|
chapter shows you how to transform the `Abstract Syntax
|
|
Tree <PythonLangImpl2.html>`_, built in Chapter 2, into LLVM IR. This
|
|
will teach you a little bit about how LLVM does things, as well as
|
|
demonstrate how easy it is to use. It's much more work to build a lexer
|
|
and parser than it is to generate LLVM IR code. :)
|
|
|
|
**Please note**: the code in this chapter and later requires llvmpy 0.6
|
|
and LLVM 2.7. Earlier versions will most likely not work with it. Also
|
|
note that you need to use a version of this tutorial that matches your
|
|
llvmpy release: If you are using an official llvmpy release, use the
|
|
version of the documentation on the `llvmpy examples
|
|
page <http://www.mdevan.org/llvmpy/examples.html>`_
|
|
|
|
--------------
|
|
|
|
Code Generation Setup
|
|
=================================
|
|
|
|
In order to generate LLVM IR, we want some simple setup to get started.
|
|
First we define code generation methods in each AST node class:
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
# Expression class for numeric literals like "1.0".
|
|
class NumberExpressionNode(ExpressionNode):
|
|
|
|
def __init__(self, value):
|
|
self.value = value
|
|
|
|
def CodeGen(self):
|
|
...
|
|
|
|
# Expression class for referencing a variable, like "a".
|
|
class VariableExpressionNode(ExpressionNode):
|
|
|
|
def __init__(self, name):
|
|
self.name = name
|
|
|
|
def CodeGen(self):
|
|
...
|
|
|
|
...
|
|
|
|
|
|
|
|
|
|
|
|
The ``CodeGen`` method says to emit IR for that AST node along with all
|
|
the things it depends on, and they all return an LLVM Value object.
|
|
"Value" is the class used to represent a "`Static Single Assignment
|
|
(SSA) <http://en.wikipedia.org/wiki/Static_single_assignment_form>`_
|
|
register" or "SSA value" in LLVM. The most distinct aspect of SSA values
|
|
is that their value is computed as the related instruction executes, and
|
|
it does not get a new value until (and if) the instruction re-executes.
|
|
In other words, there is no way to "change" an SSA value. For more
|
|
information, please read up on `Static Single
|
|
Assignment <http://en.wikipedia.org/wiki/Static_single_assignment_form>`_
|
|
- the concepts are really quite natural once you grok them.
|
|
|
|
We will also need to define some global variables which we will be used
|
|
during code generation:
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
# The LLVM module, which holds all the IR code.
|
|
g_llvm_module = Module.new('my cool jit')
|
|
|
|
# The LLVM instruction builder. Created whenever a new function is entered.
|
|
g_llvm_builder = None
|
|
|
|
# A dictionary that keeps track of which values are defined in the current scope
|
|
# and what their LLVM representation is.
|
|
g_named_values = {}
|
|
|
|
|
|
|
|
|
|
|
|
``g_llvm_module`` is the LLVM construct that contains all of the
|
|
functions and global variables in a chunk of code. In many ways, it is
|
|
the top-level structure that the LLVM IR uses to contain code.
|
|
|
|
``g_llvm_builder`` is a helper object that makes it easy to generate
|
|
LLVM instructions. Instances of the
|
|
`llvm.core.Builder <llvm.core.Builder.html>`_ class keep track of the
|
|
current place to insert instructions and have methods to create new
|
|
instructions. Note that we do not initialize this variable; instead, it
|
|
will be initialized whenever we start generating code for a function.
|
|
|
|
Finally, ``g_named_values`` is a dictionary that keeps track of which
|
|
values are defined in the current scope and what their LLVM
|
|
representation is. In other words, it is a symbol table for the code. In
|
|
this form of Kaleidoscope, the only things that can be referenced are
|
|
function parameters. As such, function parameters will be in this map
|
|
when generating code for their function body.
|
|
|
|
With these basics in place, we can start talking about how to generate
|
|
code for each expression. Note that this assumes that ``g_llvm_builder``
|
|
has been set up to generate code *into* something. For now, we'll assume
|
|
that this has already been done, and we'll just use it to emit code.
|
|
|
|
--------------
|
|
|
|
Expression Code Generation
|
|
=====================================
|
|
|
|
Generating LLVM code for expression nodes is very straightforward: less
|
|
than 35 lines of commented code for all four of our expression nodes.
|
|
First we'll do numeric literals:
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
def CodeGen(self):
|
|
return Constant.real(Type.double(), self.value)
|
|
|
|
|
|
|
|
|
|
|
|
In llvmpy, floating point numeric constants are represented with the
|
|
``llvm.core.ConstantFP`` class. To create one, we can use the static
|
|
``real()`` method in the ``llvm.core.Constant`` class. This code
|
|
basically just creates and returns a ``ConstantFP``. Note that in the
|
|
LLVM IR constants are all uniqued together and shared. For this reason,
|
|
we create the constant through a factory method instead of instantiating
|
|
one directly.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
def CodeGen(self):
|
|
if self.name in g_named_values:
|
|
return g_named_values[self.name]
|
|
else:
|
|
raise RuntimeError('Unknown variable name: ' + self.name)
|
|
|
|
|
|
|
|
References to variables are also quite simple using LLVM. In the simple
|
|
version of Kaleidoscope, we assume that the variable has already been
|
|
emitted somewhere and its value is available. In practice, the only
|
|
values that can be in the ``g_named_values`` dictionary are function
|
|
arguments. This code simply checks to see that the specified name is in
|
|
the map (if not, an unknown variable is being referenced) and returns
|
|
the value for it. In future chapters, we'll add support for `loop
|
|
induction variables <PythonLangImpl5.html#for>`_ in the symbol table,
|
|
and for `local variables <PythonLangImpl7.html#localvars>`_.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
def CodeGen(self):
|
|
left = self.left.CodeGen()
|
|
right = self.right.CodeGen()
|
|
|
|
if self.operator == '+':
|
|
return g_llvm_builder.fadd(left, right, 'addtmp')
|
|
elif self.operator == '-':
|
|
return g_llvm_builder.fsub(left, right, 'subtmp')
|
|
elif self.operator == '*':
|
|
return g_llvm_builder.fmul(left, right, 'multmp')
|
|
elif self.operator == '<':
|
|
result = g_llvm_builder.fcmp(FCMP_ULT, left, right, 'cmptmp')
|
|
# Convert bool 0 or 1 to double 0.0 or 1.0.
|
|
return g_llvm_builder.uitofp(result, Type.double(), 'booltmp')
|
|
else:
|
|
raise RuntimeError('Unknown binary operator.')
|
|
|
|
|
|
|
|
|
|
|
|
Binary operators start to get more interesting. The basic idea here is
|
|
that we recursively emit code for the left-hand side of the expression,
|
|
then the right-hand side, then we compute the result of the binary
|
|
expression depending on which operator is being used.
|
|
|
|
In the example above, the LLVM builder class is starting to show its
|
|
value. ``g_llvm_builder`` knows where to insert the newly created
|
|
instruction, all you have to do is specify what instruction to create
|
|
(e.g. with ``add``), which operands to use (``left`` and ``right`` here)
|
|
and optionally provide a name for the generated instruction.
|
|
|
|
One nice thing about LLVM is that the name is just a hint. For instance,
|
|
if the code above emits multiple "addtmp" variables, LLVM will
|
|
automatically provide each one with an increasing, unique numeric
|
|
suffix. Local value names for instructions are purely optional, but it
|
|
makes it much easier to read the IR dumps.
|
|
|
|
`LLVM instructions <http://www.llvm.org/docs/LangRef.html#instref>`_ are
|
|
constrained by strict rules: for example, the Left and Right operators
|
|
of an `add instruction <http://www.llvm.org/docs/LangRef.html#i_add>`_
|
|
must have the same type, and the result type of the add must match the
|
|
operand types. Because all values in Kaleidoscope are doubles, this
|
|
makes for very simple code for add, sub and mul.
|
|
|
|
On the other hand, LLVM specifies that the `fcmp
|
|
instruction <http://www.llvm.org/docs/LangRef.html#i_fcmp>`_ always
|
|
returns an 'i1' value (a one bit integer). The problem with this is that
|
|
Kaleidoscope wants the value to be a 0.0 or 1.0 value. In order to get
|
|
these semantics, we combine the fcmp instruction with a `uitofp
|
|
instruction <http://www.llvm.org/docs/LangRef.html#i_uitofp>`_. This
|
|
instruction converts its input integer into a floating point value by
|
|
treating the input as an unsigned value. In contrast, if we used the
|
|
`sitofp instruction <http://www.llvm.org/docs/LangRef.html#i_sitofp>`_,
|
|
the Kaleidoscope ``<`` operator would return 0.0 and -1.0, depending on
|
|
the input value.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
def CodeGen(self):
|
|
# Look up the name in the global module table.
|
|
callee = g_llvm_module.get_function_named(self.callee)
|
|
|
|
# Check for argument mismatch error.
|
|
if len(callee.args) != len(self.args):
|
|
raise RuntimeError('Incorrect number of arguments passed.')
|
|
|
|
arg_values = [i.CodeGen() for i in self.args]
|
|
|
|
return g_llvm_builder.call(callee, arg_values, 'calltmp')
|
|
|
|
|
|
|
|
|
|
|
|
Code generation for function calls is quite straightforward with LLVM.
|
|
The code above initially does a function name lookup in the LLVM
|
|
Module's symbol table. Recall that the LLVM Module is the container that
|
|
holds all of the functions we are JIT'ing. By giving each function the
|
|
same name as what the user specifies, we can use the LLVM symbol table
|
|
to resolve function names for us.
|
|
|
|
Once we have the function to call, we codegen each argument that is to
|
|
be passed in, and create an LLVM `call
|
|
instruction <http://www.llvm.org/docs/LangRef.html#i_call>`_. Note that
|
|
LLVM uses the native C calling conventions by default, allowing these
|
|
calls to also call into standard library functions like "sin" and "cos",
|
|
with no additional effort.
|
|
|
|
This wraps up our handling of the four basic expressions that we have so
|
|
far in Kaleidoscope. Feel free to go in and add some more. For example,
|
|
by browsing the `LLVM language
|
|
reference <http://www.llvm.org/docs/LangRef.html>`_ you'll find several
|
|
other interesting instructions that are really easy to plug into our
|
|
basic framework.
|
|
|
|
--------------
|
|
|
|
Function Code Generation
|
|
===================================
|
|
|
|
Code generation for prototypes and functions must handle a number of
|
|
details, which make their code less beautiful than expression code
|
|
generation, but allows us to illustrate some important points. First,
|
|
let's talk about code generation for prototypes: they are used both for
|
|
function bodies and external function declarations. The code starts
|
|
with:
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
def CodeGen(self):
|
|
# Make the function type, eg. double(double,double).
|
|
funct_type = Type.function(
|
|
Type.double(), [Type.double()] * len(self.args), False)
|
|
|
|
function = Function.new(g_llvm_module, funct_type, self.name)
|
|
|
|
|
|
|
|
|
|
|
|
The call to ``Type.function`` creates the ``FunctionType`` that should
|
|
be used for a given Prototype. Since all function arguments in
|
|
Kaleidoscope are of type double, the first line creates a list of "N"
|
|
LLVM double types. It then uses the ``Type.function`` method to create a
|
|
function type that takes "N" doubles as arguments, returns one double as
|
|
a result, and that is not vararg (the False parameter indicates this).
|
|
Note that Types in LLVM are uniqued just like Constants are, so you
|
|
don't instantiate them directly.
|
|
|
|
The final line above actually creates the function that the prototype
|
|
will correspond to. This indicates the type and name to use, as well as
|
|
which module to insert into. Note that by default, the function will
|
|
have `external
|
|
linkage <http://www.llvm.org/docs/LangRef.html#linkage>`_, which means
|
|
that the function may be defined outside the current module and/or that
|
|
it is callable by functions outside the module. The name passed in is
|
|
the name the user specified: since ``g_llvm_module`` is specified, this
|
|
name is registered in ``g_llvm_module``'s symbol table, which is used by
|
|
the function call code above.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
# If the name conflicted, there was already something with the same name.
|
|
# If it has a body, don't allow redefinition or reextern.
|
|
if function.name != self.name:
|
|
function.delete()
|
|
function = g_llvm_module.get_function_named(self.name)
|
|
|
|
|
|
|
|
The Module symbol table works just like the Function symbol table when
|
|
it comes to name conflicts: if a new function is created with a name was
|
|
previously added to the symbol table, it will get implicitly renamed
|
|
when added to the Module. The code above exploits this fact to determine
|
|
if there was a previous definition of this function.
|
|
|
|
In Kaleidoscope, we choose to allow redefinitions of functions in two
|
|
cases: first, we want to allow 'extern'ing a function more than once, as
|
|
long as the prototypes for the externs match (since all arguments have
|
|
the same type, we just have to check that the number of arguments
|
|
match). Second, we want to allow 'extern'ing a function and then
|
|
defining a body for it. This is useful when defining mutually recursive
|
|
functions.
|
|
|
|
In order to implement this, the code above first checks to see if there
|
|
is a collision on the name of the function. If so, it deletes the
|
|
function we just created (by calling ``delete``) and then calling
|
|
``get_function_named`` to get the existing function with the specified
|
|
name.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
# If the function already has a body, reject this.
|
|
if not function.is_declaration:
|
|
raise RuntimeError('Redefinition of function.')
|
|
|
|
# If F took a different number of args, reject.
|
|
if len(callee.args) != len(self.args):
|
|
raise RuntimeError('Redeclaration of a function with different number '
|
|
'of args.')
|
|
|
|
|
|
|
|
|
|
|
|
In order to verify the logic above, we first check to see if the
|
|
pre-existing function is a forward declaration. Since we don't allow
|
|
anything after a full definition of the function, the code rejects this
|
|
case. If the previous reference to a function was an 'extern', we simply
|
|
verify that the number of arguments for that definition and this one
|
|
match up. If not, we emit an error.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
# Set names for all arguments and add them to the variables symbol table.
|
|
for arg, arg_name in zip(function.args, self.args):
|
|
arg.name = arg_name
|
|
# Add arguments to variable symbol
|
|
table. g_named_values[arg_name] = arg
|
|
|
|
return function
|
|
|
|
|
|
|
|
|
|
|
|
The last bit of code for prototypes loops over all of the arguments in
|
|
the function, setting the name of the LLVM Argument objects to match,
|
|
and registering the arguments in the ``g_named_values`` map for future
|
|
use by the ``VariableExpressionNode``. Note that we don't check for
|
|
conflicting argument names here (e.g. "extern foo(a b a)"). Doing so
|
|
would be very straight-forward with the mechanics we have already used
|
|
above. Once this is all set up, it returns the Function object to the
|
|
caller.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
def CodeGen(self):
|
|
# Clear scope.
|
|
g_named_values.clear()
|
|
|
|
# Create a function object.
|
|
function = self.prototype.CodeGen()
|
|
|
|
|
|
|
|
|
|
|
|
Code generation for function definitions starts out simply enough: we
|
|
just clear out the ``g_named_values`` dictionary to make sure that there
|
|
isn't anything in it from the last function we compiled and codegen the
|
|
prototype. Code generation of the prototype ensures that there is an
|
|
LLVM Function object that is ready to go for us.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
# Create a new basic block to start insertion into.
|
|
block = function.append_basic_block('entry')
|
|
global g_llvm_builder
|
|
g_llvm_builder = Builder.new(block)
|
|
|
|
|
|
|
|
|
|
|
|
Now we get to the point where ``g_llvm_builder`` is set up. The first
|
|
line creates a new `basic
|
|
block <http://en.wikipedia.org/wiki/Basic_block>`_ (named "entry"),
|
|
which is inserted into the function. The second line declares that the
|
|
global ``g_llvm_builder`` object is to be changed. The last line creates
|
|
a new builder that is set up to insert new instructions into the basic
|
|
block we just created. Basic blocks in LLVM are an important part of
|
|
functions that define the `Control Flow
|
|
Graph <http://en.wikipedia.org/wiki/Control_flow_graph>`_. Since we
|
|
don't have any control flow, our functions will only contain one block
|
|
at this point. We'll fix this in `Chapter 5 <PythonLangImpl5.html>`_ :).
|
|
|
|
.. code-block:: python
|
|
|
|
# Finish off the function.
|
|
try:
|
|
return_value = self.body.CodeGen()
|
|
g_llvm_builder.ret(return_value)
|
|
|
|
# Validate the generated code, checking for consistency.
|
|
function.verify()
|
|
|
|
|
|
|
|
|
|
Once the insertion point is set up, we call the ``CodeGen`` method for
|
|
the root expression of the function. If no error happens, this emits
|
|
code to compute the expression into the entry block and returns the
|
|
value that was computed. Assuming no error, we then create an LLVM `ret
|
|
instruction <http://www.llvm.org/docs/LangRef.html#i_ret>`_, which
|
|
completes the function. Once the function is built, we call ``verify``,
|
|
which is provided by LLVM. This function does a variety of consistency
|
|
checks on the generated code, to determine if our compiler is doing
|
|
everything right. Using this is important: it can catch a lot of bugs.
|
|
Once the function is finished and validated, we return it.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
except:
|
|
function.delete()
|
|
raise
|
|
|
|
return function
|
|
|
|
|
|
|
|
|
|
|
|
The only piece left here is handling of the error case. For simplicity,
|
|
we handle this by merely deleting the function we produced with the
|
|
``delete`` method. This allows the user to redefine a function that they
|
|
incorrectly typed in before: if we didn't delete it, it would live in
|
|
the symbol table, with a body, preventing future redefinition.
|
|
|
|
This code does have a bug, though. Since the ``PrototypeNode::CodeGen``
|
|
can return a previously defined forward declaration, our code can
|
|
actually delete a forward declaration. There are a number of ways to fix
|
|
this bug; see what you can come up with! Here is a testcase:
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
extern foo(a b) # ok, defines foo.
|
|
def foo(a b) c # error, 'c' is invalid.
|
|
def bar() foo(1, 2) # error, unknown function "foo"
|
|
|
|
|
|
|
|
--------------
|
|
|
|
Driver Changes and Closing Thoughts
|
|
===============================================
|
|
|
|
For now, code generation to LLVM doesn't really get us much, except that
|
|
we can look at the pretty IR calls. The sample code inserts calls to
|
|
CodeGen into the ``Handle*`` functions, and then dumps out the LLVM IR.
|
|
This gives a nice way to look at the LLVM IR for simple functions. For
|
|
example:
|
|
|
|
|
|
.. code-block:: bash
|
|
|
|
ready> 4+5
|
|
Read a top-level expression:
|
|
define double @0() {
|
|
entry:
|
|
ret double 9.000000e+00
|
|
}
|
|
|
|
|
|
|
|
Note how the parser turns the top-level expression into anonymous
|
|
functions for us. This will be handy when we add JIT support in the next
|
|
chapter. Also note that the code is very literally transcribed, no
|
|
optimizations are being performed except simple constant folding done by
|
|
the Builder. We will add optimizations explicitly in the next chapter.
|
|
|
|
|
|
.. code-block:: bash
|
|
|
|
ready> def foo(a b) a *a + 2* a *b + b* b
|
|
Read a function definition:
|
|
define double @foo(double %a, double %b) {
|
|
entry:
|
|
%multmp = fmul double %a, %a ; <double> [#uses=1]
|
|
%multmp1 = fmul double 2.000000e+00, %a ; <double> [#uses=1]
|
|
%multmp2 = fmul double %multmp1, %b ; <double> [#uses=1]
|
|
%addtmp = fadd double %multmp, %multmp2 ; <double> [#uses=1]
|
|
%multmp3 = fmul double %b, %b ; <double> [#uses=1]
|
|
%addtmp4 = fadd double %addtmp, %multmp3 ; <double> [#uses=1]
|
|
ret double %addtmp4
|
|
}
|
|
|
|
|
|
|
|
This shows some simple arithmetic. Notice the striking similarity to the
|
|
LLVM builder calls that we use to create the instructions.
|
|
|
|
|
|
.. code-block:: bash
|
|
|
|
ready> def bar(a) foo(a, 4.0) + bar(31337)
|
|
Read a function definition:
|
|
define double @bar(double %a) {
|
|
entry:
|
|
%calltmp = call double @foo(double %a, double 4.000000e+00) ; <double> [#uses=1]
|
|
%calltmp1 = call double @bar(double 3.133700e+04) ; <double> [#uses=1]
|
|
%addtmp = fadd double %calltmp, %calltmp1 ; <double> [#uses=1]
|
|
ret double %addtmp
|
|
}
|
|
|
|
|
|
|
|
This shows some function calls. Note that this function will take a long
|
|
time to execute if you call it. In the future we'll add conditional
|
|
control flow to actually make recursion useful :).
|
|
|
|
|
|
.. code-block:: bash
|
|
|
|
ready> extern cos(x)
|
|
Read extern:
|
|
declare double @cos(double)
|
|
|
|
ready> cos(1.234)
|
|
Read a top-level expression:
|
|
define double @1() {
|
|
entry:
|
|
%calltmp = call double @cos(double 1.234000e+00) ; <double> [#uses=1]
|
|
ret double %calltmp
|
|
}
|
|
|
|
|
|
|
|
This shows an extern for the libm "cos" function, and a call to it.
|
|
|
|
|
|
.. code-block:: bash
|
|
|
|
ready> ^C
|
|
; ModuleID = 'my cool jit'
|
|
|
|
define double @0() {
|
|
entry:
|
|
ret double 9.000000e+00
|
|
}
|
|
|
|
define double @foo(double %a, double %b) {
|
|
entry:
|
|
%multmp = fmul double %a, %a ; <double> [#uses=1]
|
|
%multmp1 = fmul double 2.000000e+00, %a ; <double> [#uses=1]
|
|
%multmp2 = fmul double %multmp1, %b ; <double> [#uses=1]
|
|
%addtmp = fadd double %multmp, %multmp2 ; <double> [#uses=1]
|
|
%multmp3 = fmul double %b, %b ; <double> [#uses=1]
|
|
%addtmp4 = fadd double %addtmp, %multmp3 ; <double> [#uses=1]
|
|
ret double %addtmp4
|
|
}
|
|
|
|
define double @bar(double %a) {
|
|
entry:
|
|
%calltmp = call double @foo(double %a, double 4.000000e+00) ; <double> [#uses=1]
|
|
%calltmp1 = call double @bar(double 3.133700e+04) ; <double> [#uses=1]
|
|
%addtmp = fadd double %calltmp, %calltmp1 ; <double> [#uses=1]
|
|
ret double %addtmp
|
|
}
|
|
|
|
declare double @cos(double)
|
|
|
|
define double @1() {
|
|
entry:
|
|
%calltmp = call double @cos(double 1.234000e+00) ; <double> [#uses=1]
|
|
ret double %calltmp
|
|
}
|
|
|
|
|
|
|
|
When you quit the current demo, it dumps out the IR for the entire
|
|
module generated. Here you can see the big picture with all the
|
|
functions referencing each other.
|
|
|
|
This wraps up the third chapter of the Kaleidoscope tutorial. Up next,
|
|
we'll describe how to `add JIT codegen and optimizer
|
|
support <PythonLangImpl4.html>`_ to this so we can actually start
|
|
running code!
|
|
|
|
--------------
|
|
|
|
Full Code Listing
|
|
===========================
|
|
|
|
Here is the complete code listing for our running example, enhanced with
|
|
the LLVM code generator. Because this uses the llvmpy libraries, you
|
|
need to `download <../download.html>`_ and
|
|
`install <../userguide.html#install>`_ them.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
#!/usr/bin/env python
|
|
|
|
import re
|
|
from llvm.core import Module, Constant, Type, Function, Builder, FCMP_ULT
|
|
|
|
Globals
|
|
-------
|
|
|
|
.. code-block:: python
|
|
|
|
# The LLVM module, which holds all the IR code.
|
|
g_llvm_module = Module.new('my cool jit')
|
|
|
|
# The LLVM instruction builder. Created whenever a new function is entered.
|
|
g_llvm_builder = None
|
|
|
|
# A dictionary that keeps track of which values are defined in the current scope
|
|
# and what their LLVM representation is.
|
|
g_named_values = {}
|
|
|
|
Lexer
|
|
-----
|
|
|
|
.. code-block:: python
|
|
|
|
# The lexer yields one of these types for each token.
|
|
class EOFToken(object):
|
|
pass
|
|
|
|
class DefToken(object):
|
|
pass
|
|
|
|
class ExternToken(object):
|
|
pass
|
|
|
|
class IdentifierToken(object):
|
|
def __init__(self, name):
|
|
self.name = name
|
|
|
|
class NumberToken(object):
|
|
def __init__(self, value):
|
|
self.value = value
|
|
|
|
class CharacterToken(object):
|
|
def __init__(self, char):
|
|
self.char = char
|
|
def __eq__(self, other):
|
|
return isinstance(other, CharacterToken)and self.char == other.char
|
|
def __ne__(self, other):
|
|
return not self == other
|
|
|
|
# Regular expressions that tokens and comments of our language.
|
|
REGEX_NUMBER = re.compile('[0-9]+(?:\.[0-9]+)?')
|
|
REGEX_IDENTIFIER = re.compile('[a-zA-Z][a-zA-Z0-9]*')
|
|
REGEX_COMMENT = re.compile('#.*')
|
|
|
|
def Tokenize(string):
|
|
while string:
|
|
# Skip whitespace.
|
|
if string[0].isspace():
|
|
string = string[1:]
|
|
continue
|
|
|
|
# Run regexes.
|
|
comment_match = REGEX_COMMENT.match(string)
|
|
number_match = REGEX_NUMBER.match(string)
|
|
identifier_match = REGEX_IDENTIFIER.match(string)
|
|
|
|
# Check if any of the regexes matched and yield the appropriate result.
|
|
if comment_match:
|
|
comment = comment_match.group(0)
|
|
string = string[len(comment):]
|
|
elif number_match:
|
|
number = number_match.group(0)
|
|
yield NumberToken(float(number))
|
|
string = string[len(number):]
|
|
elif identifier_match:
|
|
identifier = identifier_match.group(0)
|
|
# Check if we matched a keyword.
|
|
if identifier == 'def':
|
|
yield DefToken()
|
|
elif identifier == 'extern':
|
|
yield ExternToken()
|
|
else:
|
|
yield IdentifierToken(identifier)
|
|
string = string[len(identifier):]
|
|
else:
|
|
# Yield the ASCII value of the unknown character.
|
|
yield CharacterToken(string[0])
|
|
string = string[1:]
|
|
|
|
yield EOFToken()
|
|
|
|
Abstract Syntax Tree (aka Parse Tree)
|
|
-------------------------------------
|
|
|
|
.. code-block:: python
|
|
|
|
# Base class for all expression nodes.
|
|
class ExpressionNode(object):
|
|
pass
|
|
|
|
# Expression class for numeric literals like "1.0".
|
|
class NumberExpressionNode(ExpressionNode):
|
|
|
|
def __init__(self, value):
|
|
self.value = value
|
|
|
|
def CodeGen(self):
|
|
return Constant.real(Type.double(), self.value)
|
|
|
|
# Expression class for referencing a variable, like "a".
|
|
class VariableExpressionNode(ExpressionNode):
|
|
|
|
def __init__(self, name):
|
|
self.name = name
|
|
|
|
def CodeGen(self):
|
|
if self.name in g_named_values:
|
|
return g_named_values[self.name]
|
|
else:
|
|
raise RuntimeError('Unknown variable name: ' + self.name)
|
|
|
|
# Expression class for a binary operator.
|
|
class BinaryOperatorExpressionNode(ExpressionNode):
|
|
|
|
def __init__(self, operator, left, right):
|
|
self.operator = operator
|
|
self.left = left
|
|
self.right = right
|
|
|
|
def CodeGen(self):
|
|
left = self.left.CodeGen()
|
|
right = self.right.CodeGen()
|
|
|
|
if self.operator == '+':
|
|
return g_llvm_builder.fadd(left, right, 'addtmp')
|
|
elif self.operator == '-':
|
|
return g_llvm_builder.fsub(left, right, 'subtmp')
|
|
elif self.operator == '*':
|
|
return g_llvm_builder.fmul(left, right, 'multmp')
|
|
elif self.operator == '<':
|
|
result = g_llvm_builder.fcmp(FCMP_ULT, left, right, 'cmptmp')
|
|
# Convert bool 0 or 1 to double 0.0 or 1.0.
|
|
return g_llvm_builder.uitofp(result, Type.double(), 'booltmp')
|
|
else:
|
|
raise RuntimeError('Unknown binary operator.')
|
|
|
|
# Expression class for function calls.
|
|
class CallExpressionNode(ExpressionNode):
|
|
|
|
def __init__(self, callee, args):
|
|
self.callee = callee
|
|
self.args = args
|
|
|
|
def CodeGen(self):
|
|
# Look up the name in the global module table.
|
|
callee = g_llvm_module.get_function_named(self.callee)
|
|
|
|
# Check for argument mismatch error.
|
|
if len(callee.args) != len(self.args):
|
|
raise RuntimeError('Incorrect number of arguments passed.')
|
|
|
|
arg_values = [i.CodeGen() for i in self.args]
|
|
|
|
return g_llvm_builder.call(callee, arg_values, 'calltmp')
|
|
|
|
# This class represents the "prototype" for a function, which captures its name,
|
|
# and its argument names (thus implicitly the number of arguments the function
|
|
# takes).
|
|
class PrototypeNode(object):
|
|
|
|
def __init__(self, name, args):
|
|
self.name = name
|
|
self.args = args
|
|
|
|
def CodeGen(self):
|
|
# Make the function type, eg. double(double,double).
|
|
funct_type = Type.function(
|
|
Type.double(), [Type.double()] * len(self.args), False)
|
|
|
|
function = Function.new(g_llvm_module, funct_type, self.name)
|
|
|
|
# If the name conflicted, there was already something with the same name.
|
|
# If it has a body, don't allow redefinition or reextern.
|
|
if function.name != self.name:
|
|
function.delete()
|
|
function = g_llvm_module.get_function_named(self.name)
|
|
|
|
# If the function already has a body, reject this.
|
|
if not function.is_declaration:
|
|
raise RuntimeError('Redefinition of function.')
|
|
|
|
# If F took a different number of args, reject.
|
|
if len(callee.args) != len(self.args):
|
|
raise RuntimeError('Redeclaration of a function with different number '
|
|
'of args.')
|
|
|
|
# Set names for all arguments and add them to the variables symbol table.
|
|
for arg, arg_name in zip(function.args, self.args):
|
|
arg.name = arg_name
|
|
# Add arguments to variable symbol table.
|
|
g_named_values[arg_name] = arg
|
|
|
|
return function
|
|
|
|
# This class represents a function definition itself.
|
|
class FunctionNode(object):
|
|
|
|
def __init__(self, prototype, body):
|
|
self.prototype = prototype
|
|
self.body = body
|
|
|
|
def CodeGen(self):
|
|
# Clear scope.
|
|
g_named_values.clear()
|
|
|
|
# Create a function object.
|
|
function = self.prototype.CodeGen()
|
|
|
|
# Create a new basic block to start insertion into.
|
|
block = function.append_basic_block('entry')
|
|
global g_llvm_builder
|
|
g_llvm_builder = Builder.new(block)
|
|
|
|
# Finish off the function.
|
|
try:
|
|
return_value = self.body.CodeGen()
|
|
g_llvm_builder.ret(return_value)
|
|
|
|
# Validate the generated code, checking for consistency.
|
|
function.verify()
|
|
except:
|
|
function.delete()
|
|
raise
|
|
|
|
return function
|
|
|
|
Parser
|
|
------
|
|
|
|
.. code-block:: python
|
|
|
|
class Parser(object):
|
|
|
|
def __init__(self, tokens, binop_precedence):
|
|
self.tokens = tokens
|
|
self.binop_precedence = binop_precedence
|
|
self.Next()
|
|
|
|
# Provide a simple token buffer. Parser.current is the current token the
|
|
# parser is looking at. Parser.Next() reads another token from the lexer and
|
|
# updates Parser.current with its results.
|
|
def Next(self):
|
|
self.current = self.tokens.next()
|
|
|
|
# Gets the precedence of the current token, or -1 if the token is not a binary
|
|
# operator.
|
|
def GetCurrentTokenPrecedence(self):
|
|
if isinstance(self.current, CharacterToken):
|
|
return self.binop_precedence.get(self.current.char, -1)
|
|
else:
|
|
return -1
|
|
|
|
# identifierexpr ::= identifier | identifier '(' expression* ')'
|
|
def ParseIdentifierExpr(self):
|
|
identifier_name = self.current.name
|
|
self.Next() # eat identifier.
|
|
|
|
if self.current != CharacterToken('('): # Simple variable reference.
|
|
return VariableExpressionNode(identifier_name)
|
|
|
|
# Call.
|
|
self.Next() # eat '('.
|
|
args = []
|
|
if self.current != CharacterToken(')'):
|
|
while True:
|
|
args.append(self.ParseExpression())
|
|
if self.current == CharacterToken(')'):
|
|
break
|
|
elif self.current != CharacterToken(','):
|
|
raise RuntimeError('Expected ")" or "," in argument list.')
|
|
self.Next()
|
|
|
|
self.Next() # eat ')'.
|
|
return CallExpressionNode(identifier_name, args)
|
|
|
|
# numberexpr ::= number
|
|
def ParseNumberExpr(self):
|
|
result = NumberExpressionNode(self.current.value)
|
|
self.Next() # consume the number.
|
|
return result
|
|
|
|
# parenexpr ::= '(' expression ')'
|
|
def ParseParenExpr(self):
|
|
self.Next() # eat '('.
|
|
|
|
contents = self.ParseExpression()
|
|
|
|
if self.current != CharacterToken(')'):
|
|
raise RuntimeError('Expected ")".')
|
|
self.Next() # eat ')'.
|
|
|
|
return contents
|
|
|
|
# primary ::= identifierexpr | numberexpr | parenexpr
|
|
def ParsePrimary(self):
|
|
if isinstance(self.current, IdentifierToken):
|
|
return self.ParseIdentifierExpr()
|
|
elif isinstance(self.current, NumberToken):
|
|
return self.ParseNumberExpr()
|
|
elif self.current == CharacterToken('('):
|
|
return self.ParseParenExpr()
|
|
else:
|
|
raise RuntimeError('Unknown token when expecting an expression.')
|
|
|
|
# binoprhs ::= (operator primary)*
|
|
def ParseBinOpRHS(self, left, left_precedence):
|
|
# If this is a binary operator, find its precedence.
|
|
while True:
|
|
precedence = self.GetCurrentTokenPrecedence()
|
|
|
|
# If this is a binary operator that binds at least as tightly as the
|
|
# current one, consume it; otherwise we are done.
|
|
if precedence < left_precedence:
|
|
return left
|
|
|
|
binary_operator = self.current.char
|
|
self.Next() # eat the operator.
|
|
|
|
# Parse the primary expression after the binary operator.
|
|
right = self.ParsePrimary()
|
|
|
|
# If binary_operator binds less tightly with right than the operator after
|
|
# right, let the pending operator take right as its left.
|
|
next_precedence = self.GetCurrentTokenPrecedence()
|
|
if precedence < next_precedence:
|
|
right = self.ParseBinOpRHS(right, precedence + 1)
|
|
|
|
# Merge left/right.
|
|
left = BinaryOperatorExpressionNode(binary_operator, left, right)
|
|
|
|
# expression ::= primary binoprhs
|
|
def ParseExpression(self):
|
|
left = self.ParsePrimary()
|
|
return self.ParseBinOpRHS(left, 0)
|
|
|
|
# prototype ::= id '(' id* ')'
|
|
def ParsePrototype(self):
|
|
if not isinstance(self.current, IdentifierToken):
|
|
raise RuntimeError('Expected function name in prototype.')
|
|
|
|
function_name = self.current.name
|
|
self.Next() # eat function name.
|
|
|
|
if self.current != CharacterToken('('):
|
|
raise RuntimeError('Expected "(" in prototype.')
|
|
self.Next() # eat '('.
|
|
|
|
arg_names = []
|
|
while isinstance(self.current, IdentifierToken):
|
|
arg_names.append(self.current.name)
|
|
self.Next()
|
|
|
|
if self.current != CharacterToken(')'):
|
|
raise RuntimeError('Expected ")" in prototype.')
|
|
|
|
# Success.
|
|
self.Next() # eat ')'.
|
|
|
|
return PrototypeNode(function_name, arg_names)
|
|
|
|
# definition ::= 'def' prototype expression
|
|
def ParseDefinition(self):
|
|
self.Next() # eat def.
|
|
proto = self.ParsePrototype()
|
|
body = self.ParseExpression()
|
|
return FunctionNode(proto, body)
|
|
|
|
# toplevelexpr ::= expression
|
|
def ParseTopLevelExpr(self):
|
|
proto = PrototypeNode('', [])
|
|
return FunctionNode(proto, self.ParseExpression())
|
|
|
|
# external ::= 'extern' prototype
|
|
def ParseExtern(self):
|
|
self.Next() # eat extern.
|
|
return self.ParsePrototype()
|
|
|
|
# Top-Level parsing
|
|
def HandleDefinition(self):
|
|
self.Handle(self.ParseDefinition, 'Read a function definition:')
|
|
|
|
def HandleExtern(self):
|
|
self.Handle(self.ParseExtern, 'Read an extern:')
|
|
|
|
def HandleTopLevelExpression(self):
|
|
self.Handle(self.ParseTopLevelExpr, 'Read a top-level expression:')
|
|
|
|
def Handle(self, function, message):
|
|
try:
|
|
print message, function().CodeGen()
|
|
except Exception, e:
|
|
print 'Error:', e
|
|
try:
|
|
self.Next() # Skip for error recovery.
|
|
except:
|
|
pass
|
|
|
|
Main driver code.
|
|
-----------------
|
|
|
|
.. code-block:: python
|
|
|
|
def main():
|
|
# Install standard binary operators.
|
|
# 1 is lowest possible precedence. 40 is the highest.
|
|
operator_precedence = {
|
|
'<': 10,
|
|
'+': 20,
|
|
'-': 20,
|
|
'*': 40
|
|
}
|
|
|
|
# Run the main "interpreter loop".
|
|
while True:
|
|
print 'ready>',
|
|
try:
|
|
raw = raw_input()
|
|
except KeyboardInterrupt:
|
|
break
|
|
|
|
parser = Parser(Tokenize(raw), operator_precedence)
|
|
while True:
|
|
# top ::= definition | external | expression | EOF
|
|
if isinstance(parser.current, EOFToken):
|
|
break
|
|
if isinstance(parser.current, DefToken):
|
|
parser.HandleDefinition()
|
|
elif isinstance(parser.current, ExternToken):
|
|
parser.HandleExtern()
|
|
else:
|
|
parser.HandleTopLevelExpression()
|
|
|
|
# Print out all of the generated code.
|
|
print '\n', g_llvm_module
|
|
|
|
if __name__ == '__main__':
|
|
main() |