Re: A86: What C-compilers have we got? [82/83/83+/85/86]






Machine code and assembly language are similar, although there are some
differences.  Typically, assembly language means that the code can use
symbols, so the assembler has to be a symbolic assembler.  The difference
between assembly language and machine code in this instance is that the
code is not tied to hand-computed fixed addresses; the assembler works them
out from the symbols.  This means that inserting a byte doesn't mean redoing
every address below it.  To a human programmer, this is very important.  To
a compiler, it is less important, to a degree.
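
To make that concrete, here is a small sketch in Python.  The toy machine
(a one-byte `nop` and a two-byte `jmp`) and the `resolve` helper are made up
for illustration; they are not any real instruction set or tool.

```python
# Toy instruction sizes in bytes (hypothetical machine, for illustration only).
SIZES = {"nop": 1, "jmp": 2}

def resolve(program):
    """Return the program with every label operand replaced by its address."""
    labels, addr = {}, 0
    for item in program:
        if item.endswith(":"):          # a label names the current address
            labels[item[:-1]] = addr
        else:
            addr += SIZES[item.split()[0]]
    resolved = []
    for item in program:
        if item.endswith(":"):
            continue
        parts = item.split()
        if len(parts) == 2:             # e.g. "jmp done"
            parts[1] = str(labels[parts[1]])
        resolved.append(" ".join(parts))
    return resolved

before = ["jmp done", "nop", "done:", "nop"]
after = ["jmp done", "nop", "nop", "done:", "nop"]   # one byte inserted

print(resolve(before))   # ['jmp 3', 'nop', 'nop']
print(resolve(after))    # ['jmp 4', 'nop', 'nop', 'nop']
```

The source line `jmp done` is the same in both programs.  Only the resolved
address changes, and the tool does that work instead of the programmer.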

If you really want to understand this, I highly suggest you do some reading
and research on compiler design.  A book I recommend is the "dragon book",
Compilers: Principles, Techniques, and Tools, published by Addison-Wesley,
ISBN 0-201-10088-6.  Being published by Addison-Wesley, it reads like a
textbook (I would imagine that it has been, or still is, used by many college
courses as a text), so it's a bit terse.  However, writing a compiler is not
for the inexperienced or faint of heart, especially an optimizing compiler.  This
book is the best reference I have on compilers, especially optimizing
compilers.  The techniques used by the Watcom C/C++ compilers (probably the
best optimizing compilers ever for x86 hardware) were based on that book.
There are probably books that are easier to understand, and if you know
of one (that you have actually read), I would be interested in hearing about
it.  However, topics like parsing, grammars, compiler-compilers, etc., are
going to be complex no matter what.  In programming, you can either do
something correctly, or you can just sit down and write it and hope it turns
out.  If it's simple, you just might get it done.  With a compiler, unless
you are a programming genius, there is no way you will be able to make it
work without doing it "correctly".

The most common technique for writing a C-type compiler is to have a front
end and a back end.  The front end handles the lexical analysis and parsing,
while the back end handles code generation.  This is the reason that you can
(relatively) easily port a compiler like GCC to a new platform, without
having to rewrite the entire thing.  It is merely necessary to rewrite the
back end to generate code for the new platform.  The front end creates an
intermediate language that is used by the back end to generate the actual
code for the target platform.  Optimization is done both on the intermediate
code by the front end and on the output code by the back end.  With a
language like BASIC or Pascal, it would be possible to generate the output
code during the parsing stage, but this would result in very poor code, and
would be difficult to program (although this would probably be the most
intuitive way to do it).  For something such as C, it is not possible to do
it all in one step (compare the differences in compile times for Delphi and
C++Builder).
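
The front end / back end split can be sketched in a few lines of Python.
This is only an illustration under my own assumptions: the "language" is
bare arithmetic expressions, the intermediate language is a list of
stack-machine tuples, and both targets (`back_end_stack` and
`back_end_regs`) are invented, not real processors.

```python
import re

def front_end(source):
    """Parse an expression such as '2 + 3 * x' into stack-machine IR tuples."""
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", source)
    pos = 0
    ir = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        tok = advance()
        if tok == "(":
            expr()
            advance()                   # consume the closing ")"
        elif tok.isdigit():
            ir.append(("const", int(tok)))
        else:
            ir.append(("load", tok))

    def term():
        factor()
        while peek() in ("*", "/"):
            op = advance()
            factor()
            ir.append(("mul" if op == "*" else "div",))

    def expr():
        term()
        while peek() in ("+", "-"):
            op = advance()
            term()
            ir.append(("add" if op == "+" else "sub",))

    expr()
    return ir

def back_end_stack(ir):
    """Hypothetical target 1: a stack machine."""
    out = []
    for ins in ir:
        if ins[0] == "const":
            out.append(f"push {ins[1]}")
        elif ins[0] == "load":
            out.append(f"push [{ins[1]}]")
        else:
            out.append(ins[0])
    return out

def back_end_regs(ir):
    """Hypothetical target 2: a register machine, using registers as a stack."""
    out, depth = [], 0
    for ins in ir:
        if ins[0] == "const":
            out.append(f"mov r{depth}, {ins[1]}")
            depth += 1
        elif ins[0] == "load":
            out.append(f"mov r{depth}, [{ins[1]}]")
            depth += 1
        else:
            depth -= 1
            out.append(f"{ins[0]} r{depth - 1}, r{depth}")
    return out

ir = front_end("2 + 3 * x")
print(back_end_stack(ir))   # ['push 2', 'push 3', 'push [x]', 'mul', 'add']
print(back_end_regs(ir))    # ['mov r0, 2', 'mov r1, 3', 'mov r2, [x]', 'mul r1, r2', 'add r0, r1']
```

Supporting a new target here means writing one more `back_end_...` function;
`front_end` is not touched.  A real compiler is far larger, but the division
of labor is the same idea.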

When generating code, it is necessary to determine where in memory
everything will reside.  The actual address can be determined by the
compiler, or by the assembler.  Whether or not the assembler is integrated
into the code generator is a design decision.  The most common, and
simplest, design for an assembler is a two-pass assembler.  The first pass
consists of scanning the code and determining the sizes of all the
instructions and directives, and filling in as much of the symbol table as
possible using this information.  For example, a forward jump's target
cannot be known when the instruction is reached on the first pass, because
the assembler does not know how far away the destination is until it gets to
it.  The second pass consists of filling in the rest of the symbol table and
generating output code.  To simplify the design, most assemblers will
actually scan and parse the source code twice.  However, an alternative
technique is to generate the output code on the first pass, making notes
in the symbol table of where each label is used, and then patch in the
addresses on the second pass.
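
Here is a minimal sketch of the two-pass design in Python.  The instruction
set in `OPCODES`, the source syntax, and the one-byte addresses are all
assumptions of mine for the example, not any real assembler's format.

```python
# Hypothetical toy instruction set: mnemonic -> (opcode byte, operand bytes).
OPCODES = {"nop": (0x00, 0), "ld": (0x10, 1), "jp": (0x20, 1), "halt": (0xFF, 0)}

def parse(line):
    """Split a source line into (label, mnemonic, operand); each may be None."""
    line = line.split(";")[0].strip()   # drop comments
    label = None
    if ":" in line:
        label, line = (part.strip() for part in line.split(":", 1))
    fields = line.split()
    mnemonic = fields[0] if fields else None
    operand = fields[1] if len(fields) > 1 else None
    return label, mnemonic, operand

def assemble(source):
    # Pass 1: scan the source, add up instruction sizes, record label addresses.
    symbols, addr = {}, 0
    for line in source.splitlines():
        label, mnemonic, _ = parse(line)
        if label:
            symbols[label] = addr
        if mnemonic:
            addr += 1 + OPCODES[mnemonic][1]

    # Pass 2: scan the source again; every label is now known, so emit bytes.
    out = bytearray()
    for line in source.splitlines():
        _, mnemonic, operand = parse(line)
        if not mnemonic:
            continue
        opcode, nbytes = OPCODES[mnemonic]
        out.append(opcode)
        if nbytes:
            value = symbols[operand] if operand in symbols else int(operand, 0)
            out.append(value)
    return bytes(out), symbols

SOURCE = """
        jp end      ; forward reference, unknown until pass 1 finishes
loop:   ld 5
        jp loop
end:    halt
"""

code, symbols = assemble(SOURCE)
print(code.hex(" "))   # 20 06 10 05 20 02 ff
print(symbols)         # {'loop': 2, 'end': 6}
```

Note that `jp end` is assembled correctly even though `end` appears later in
the source, because pass 1 has already recorded its address by the time
pass 2 emits the jump.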

When a compiler is outputting code, it cannot know the address of a forward
jump, data block, etc., any more than an assembler can.  Thus, it must
somehow keep track of everything and fix the output code later, after the
initial code generation is complete.  If the compiler outputs assembly
language that is to be run through an assembler as a final step, then the
compiler will not need to keep track of this information internally.
However, if the compiler outputs machine code directly, then it will need to
track this information internally using a type of symbol table, in the same
way that an assembler would.  Doing this is going to make outputting code
more difficult, because it would involve calling routines to add this
information to the compiler's internal tables.  It is also going to force
the opcodes to be hard coded into the compiler.  There are ways to deal with
this, such as using macros, or by actually passing the instructions not to a
file but to an internal assembler.  Most compilers do actually generate
assembly code, but because the assembler is embedded in the compiler, you
are not aware of it being used.
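
As a rough sketch of what that internal bookkeeping looks like, here is a
code buffer in Python that emits machine code directly and patches jump
addresses afterwards.  The `CodeBuffer` class, its method names, and the two
opcodes are hypothetical, chosen only to show the idea.

```python
class CodeBuffer:
    """Collects machine code for a made-up target and patches jumps later."""
    OP_NOP = 0x00       # hard-coded, made-up opcode: do nothing
    OP_JUMP = 0x20      # hard-coded, made-up opcode: jump to 1-byte address

    def __init__(self):
        self.code = bytearray()
        self.labels = {}     # label name -> address, once defined
        self.fixups = []     # (offset of placeholder byte, label name)

    def nop(self):
        self.code.append(self.OP_NOP)

    def jump(self, label):
        self.code.append(self.OP_JUMP)
        self.fixups.append((len(self.code), label))   # remember where to patch
        self.code.append(0)                           # placeholder address

    def define(self, label):
        self.labels[label] = len(self.code)

    def finish(self):
        # All labels are known now, so fill in every placeholder.
        for offset, label in self.fixups:
            self.code[offset] = self.labels[label]
        return bytes(self.code)

buf = CodeBuffer()
buf.jump("end")      # forward jump: address not known yet
buf.define("top")
buf.nop()
buf.jump("top")      # backward jump
buf.define("end")
buf.nop()

code = buf.finish()
print(code.hex(" "))   # 20 05 00 20 02 00
```

Every code-generating routine in the compiler has to go through calls like
`jump` and `define` rather than just writing text, and the opcodes live
inside the compiler itself.  That is the extra complexity I am talking about.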

With something as complex as a compiler, I feel that the small amount of
time lost by needing to use an assembler (in whatever form) somewhere in the
process is made up for by the large reduction in code complexity.

> As I've read through all of this, I've noticed you've all left out a
> possibility, and I'm wondering why.
>
> Why can't you just make a C-compiler that compiles into machine code? If it
> compiles to asm, then it is only obvious that it will have to be compiled
> AGAIN to machine code, so why don't you just skip the middleman and go
> straight to machine code.
>
> Or, does this require too much time and/or programming experience?




