<h2>Thoughts on the Insanity of C++ Parsing</h2>

<center>
<em>
"Parsing C++ is simply too complex to do correctly." -- Anonymous
</em>
</center>
<p>
Author: David Beazley (beazley@cs.uchicago.edu)

<p>
August 12, 2002

<p>
A central goal of the SWIG project is to generate extension modules by
parsing the contents of C++ header files. It's not too hard to come up
with reasons why this might be useful---after all, if you've got
several hundred class definitions, do you really want to go off and
write a bunch of hand-crafted wrappers? No, of course not---you're
busy and like everyone else, you've got better things to do with
your time.

<p>
Okay, so there are many reasons why parsing C++ would be nice.
However, parsing C++ is also a nightmare. In fact, C++ would
probably be the last language that any normal person would choose to
serve as an interface specification language. It's hard to parse,
hard to analyze, and it involves all sorts
of nasty little problems related to scoping, typenames, templates,
access, and so forth. Because of this, most of the tools that claim
to "parse" C++ don't. Instead, they parse a subset of the language
that happens to match the C++ programming style used by the tool's
creator (believe me, I know---this is how SWIG started). Not
surprisingly, these tools tend to break down when presented with code
that starts to challenge the capabilities of the C++ compiler.
Needless to say, critics see this as an opportunity to make bold claims
such as "writing a C++ parser is folly" or "this whole approach is too
hard to ever work correctly."

<p>
Well, one does have to give the critics a little credit---writing a
C++ parser certainly <em>is</em> hard and writing a parser that
actually works correctly is even harder. However, these tasks are
certainly not "impossible." After all, there would be no working C++
compiler if such claims were true! Therefore, the question of whether
or not a wrapper generator can parse C++ is clearly the wrong question
to ask. Instead, the real question is whether or not a wrapper
generation tool that parses C++ can actually do anything useful.

<h3>The problem with using C++ as an interface definition language</h3>

If you cut through all of the low-level details of parsing, the primary
problem of using C++ as a module specification language is that of
ambiguity. Consider a declaration like this:

<blockquote>
<pre>
void foo(double *x, int n);
</pre>
</blockquote>

If you look at this declaration, you can ask yourself the question,
what is "x"? Is it a single input value? Is it an output value
(modified by the function)? Is it an array? Is "n" somehow related?
Perhaps the real problem in this example is that of expressing the
programmer's intent. Yes, the function clearly accepts a pointer to
some object and an integer, but the declaration does not contain
enough additional information to determine the purpose of these
parameters---information that could be useful in generating a suitable
set of wrappers.

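<p>
To make the ambiguity concrete, here is a small C++ sketch. The two
functions and the <tt>FooLike</tt> alias are hypothetical names invented
for illustration; they are not part of SWIG. Both functions have exactly
the signature of <tt>foo()</tt> above, yet they use their arguments in
incompatible ways, and nothing in the declaration tells a wrapper
generator which is which.

```cpp
// Any function with the signature of foo(double *x, int n) fits this type.
// FooLike is a hypothetical alias used only for this illustration.
using FooLike = void (*)(double *, int);

// Interpretation 1: x is an array of n elements that is modified in place.
void scale_in_place(double *x, int n) {
    for (int i = 0; i < n; ++i) {
        x[i] *= 2.0;
    }
}

// Interpretation 2: x points to a single output value and n is an
// unrelated input.
void store_square(double *x, int n) {
    *x = static_cast<double>(n) * n;
}
```

Either function can be assigned to a <tt>FooLike</tt> pointer, so the
type alone cannot say whether a wrapper should convert a whole array or
return a single value to the caller.
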
<p>
IDL compilers associated with popular component frameworks (e.g.,
CORBA, COM, etc.) get around this problem by requiring interfaces to
be precisely specified---input and output values are clearly indicated
as such. Thus, one might adopt a similar approach and extend C++
syntax with some special modifiers or qualifiers. For example:

<blockquote>
<pre>
void foo(%output double *x, int n);
</pre>
</blockquote>

The problem with this approach is that it breaks from C++ syntax and
it requires the user to annotate their input files (a task that C++
wrapper generators are supposed to eliminate). Meanwhile, critics sit
back and say "Ha! I told you C++ parsing would never work."

<p>
Another problem with using C++ as an input language is that interface
building often involves more than just blindly wrapping declarations. For instance,
users might want to rename declarations, specify exception handling procedures,
add customized code, and so forth. This suggests that a
wrapper generator really needs to do
more than just parse C++---it must give users the freedom to customize
various aspects of the wrapper generation process. Again, things aren't
looking too good for C++.

<h3>The SWIG approach: pattern matching</h3>

SWIG takes a different approach to the C++ wrapping problem.
Instead of trying to modify C++ with all sorts of little modifiers and
add-ons, wrapping is largely controlled by a pattern matching mechanism that is
built into the underlying C++ type system.

<p>
One part of the pattern matcher is programmed to look for specific sequences of
datatypes and argument names. These patterns, known as typemaps, are
responsible for all aspects of data conversion. They work by simply attaching
bits of C conversion code to specific datatypes and argument names in the
input file. For example, a typemap might be used like this:

<blockquote>
<pre>
%typemap(in) <b>double *items</b> {
    // Get an array from the input
    ...
}
...
void foo(<b>double *items</b>, int n);
</pre>
</blockquote>

With this approach, type and argument names are used as
a basis for specifying customized wrapping behavior. For example, if a program
always uses an argument of <tt>double *items</tt> to refer to an
array, SWIG can latch onto that and use it to provide customized
processing. It is even possible to write pattern matching rules for
sequences of arguments. For example, you could write the following:

<blockquote>
<pre>
%typemap(in) (<b>double *items, int n</b>) {
    // Get an array of items. Set n to number of items
    ...
}
...
void foo(<b>double *items, int n</b>);
</pre>
</blockquote>

The precise details of typemaps are not so important (in fact, most of
this pattern matching is hidden from SWIG users). What is important
is that pattern matching allows customized data handling to be
specified without breaking C++ syntax---instead, a user merely has to
define a few patterns that get applied across the declarations that
appear in C++ header files. In some sense, you might view this
approach as providing customization through naming conventions rather than
having to annotate arguments with extra qualifiers.

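<p>
The idea of customization through naming conventions can be sketched
in a few lines of C++. This is only a toy model and not how SWIG is
implemented: <tt>RuleTable</tt>, <tt>Pattern</tt>, <tt>add()</tt>, and
<tt>lookup()</tt> are hypothetical names, types are compared as plain
strings, and a rule is just a label rather than conversion code. The
sketch assumes that a rule naming both a type and an argument name should
win over a rule that names the type alone.

```cpp
#include <map>
#include <string>
#include <utility>

// A pattern is a (type, argument name) pair. An empty argument name
// means "any argument of this type". Hypothetical, for illustration only.
using Pattern = std::pair<std::string, std::string>;

class RuleTable {
public:
    // Register a conversion rule (represented by a label) for a pattern.
    void add(const std::string &type, const std::string &name,
             const std::string &rule) {
        rules_[Pattern(type, name)] = rule;
    }

    // Prefer an exact (type, name) match, then fall back to a rule for
    // the type alone. Returns an empty string when nothing matches.
    std::string lookup(const std::string &type,
                       const std::string &name) const {
        auto it = rules_.find(Pattern(type, name));
        if (it != rules_.end()) {
            return it->second;
        }
        it = rules_.find(Pattern(type, ""));
        if (it != rules_.end()) {
            return it->second;
        }
        return "";
    }

private:
    std::map<Pattern, std::string> rules_;
};
```

With a rule registered for <tt>("double *", "items")</tt> and another for
<tt>("double *", "")</tt>, an argument declared as <tt>double *items</tt>
picks up the first rule while <tt>double *x</tt> falls back to the second,
and the header file itself never has to change.
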
<p>
The other pattern matching mechanism used by SWIG is a declaration annotator
that is used to attach properties to specific declarations. A simple example of declaration
annotation might be renaming. For example:

<blockquote>
<pre>
%rename(cprint) print; // Rename all occurrences of 'print' to 'cprint'
</pre>
</blockquote>

A more advanced form of declaration matching would be exception handling.
For example:

<blockquote>
<pre>
%exception Foo::getitem(int) {
    try {
        $action
    } catch (std::out_of_range&amp; e) {
        SWIG_exception(SWIG_IndexError,const_cast&lt;char*&gt;(e.what()));
    }
}

...
template&lt;class T&gt; class Foo {
public:
    ...
    T &amp;getitem(int index); // Exception handling code attached
    ...
};
</pre>
</blockquote>

Like typemaps, declaration matching does not break from C++ syntax.
Instead, a user merely specifies special processing rules in advance.
These rules are then attached to any matching C++
declaration that appears later in the input. This means that raw C++
header files can often be parsed and customized with few, if any,
modifications.

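<p>
As a rough picture of what attaching an exception rule amounts to, the
following C++ sketch shows a hand-written wrapper around
<tt>Foo&lt;int&gt;::getitem()</tt>. It is not SWIG output. The body of
<tt>Foo</tt>, the <tt>WrapStatus</tt> codes, and the function
<tt>wrap_Foo_int_getitem()</tt> are all assumptions made for illustration.
The call inside the <tt>try</tt> block plays the role that
<tt>$action</tt> plays in the rule above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for the templated Foo class from the text. The storage and
// the bounds check are assumed; the text elides the class body.
template <class T>
class Foo {
public:
    explicit Foo(std::size_t n) : items_(n) {}

    T &getitem(int index) {
        // vector::at() throws std::out_of_range for an invalid index.
        return items_.at(static_cast<std::size_t>(index));
    }

private:
    std::vector<T> items_;
};

// Hypothetical status codes standing in for a target-language error.
enum WrapStatus { WRAP_OK = 0, WRAP_INDEX_ERROR = 1 };

// Hand-written analogue of a wrapper with the exception rule attached:
// the wrapped call is surrounded by the try/catch from the rule.
WrapStatus wrap_Foo_int_getitem(Foo<int> &self, int index, int *result) {
    try {
        *result = self.getitem(index);
    } catch (const std::out_of_range &) {
        return WRAP_INDEX_ERROR;
    }
    return WRAP_OK;
}
```

The class definition is untouched; all of the error handling lives in the
wrapper, which is the point of specifying the rule separately from the
header file.
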
<h3>The SWIG difference</h3>

Pattern-based approaches to wrapper code generation are not unique to SWIG.
However, most prior efforts have based their pattern matching engines on simple
regular-expression matching. The key difference between SWIG and these systems
is that SWIG's customization features are fully integrated into the
underlying C++ type system. This means that SWIG is able to deal with very
complicated types of C/C++ code---especially code that makes heavy use of
<tt>typedef</tt>, namespaces, aliases, class hierarchies, and more. To
illustrate, consider some code like this:

<blockquote>
<pre>
// A simple SWIG typemap
%typemap(in) int {
    $1 = PyInt_AsLong($input);
}

...
// Some raw C++ code (included later)
namespace X {
    typedef int Integer;

    class _FooImpl {
    public:
        typedef Integer value_type;
    };
    typedef _FooImpl Foo;
}

namespace Y = X;
using Y::Foo;

class Bar : public Foo {
};

void spam(Bar::value_type x);
</pre>
</blockquote>

If you trace your way through this example, you will find that the
<tt>Bar::value_type</tt> argument to function <tt>spam()</tt> is
really an integer. What's more, if you take a close look at the
SWIG-generated wrappers, you will find that the typemap pattern defined for
<tt>int</tt> is applied to it---in other words, SWIG does exactly the right thing despite
our efforts to make the code confusing.

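<p>
If you would rather not trace the example by hand, a C++ compiler can
confirm the result. The sketch below repeats the C++ portion of the
example and adds a <tt>static_assert</tt>. Two details are assumptions
that are not in the original: the class is spelled <tt>FooImpl</tt>
instead of <tt>_FooImpl</tt> (names that begin with an underscore and a
capital letter are reserved for the implementation), and <tt>spam()</tt>
is given an empty body so that the code links.

```cpp
#include <type_traits>

namespace X {
    typedef int Integer;

    // Spelled _FooImpl in the text; renamed here (assumption) because
    // that spelling is reserved for the C++ implementation.
    class FooImpl {
    public:
        typedef Integer value_type;
    };
    typedef FooImpl Foo;
}

namespace Y = X;
using Y::Foo;

class Bar : public Foo {
};

// Compilation fails here unless Bar::value_type resolves to int.
static_assert(std::is_same<Bar::value_type, int>::value,
              "Bar::value_type should resolve to int");

// Empty body added for illustration; the text only declares spam().
void spam(Bar::value_type x) {
    (void)x;
}
```

Because the assertion holds, <tt>spam()</tt> can also be assigned to a
plain <tt>void (*)(int)</tt> function pointer. A matcher based on regular
expressions would see only the text <tt>Bar::value_type</tt> and would
have no way to connect it to a rule written for <tt>int</tt>.
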
<p>
Similarly, declaration annotation is integrated into the type system
and can be used to define properties that span inheritance hierarchies
and more (in fact, there are many similarities between the operation of
SWIG and tools developed for Aspect-Oriented Programming).

<h3>What does this mean?</h3>

Pattern-based approaches allow wrapper generation tools to parse C++
declarations and to provide a wide variety of high-level customization
features. Although this approach is quite different from that found
in a typical IDL, the use of patterns makes it possible to work from
existing header files without having to make many (if any) changes to
those files. Moreover, when the underlying pattern matching mechanism
is integrated with the C++ type system, it is possible to build
reliable wrappers to real software---even if that software is filled
with namespaces, templates, classes, <tt>typedef</tt> declarations,
pointers, and other bits of nastiness.

<h3>The bottom line</h3>

Not only is it possible to generate extension modules by parsing C++,
it is possible to do so with real software and with a high degree of
reliability. Don't believe me? Download SWIG-1.3.14 and try it for
yourself.