re: re: Re: A89: the a68k assembler and how it makes machine code



 > What I want to know is how the assembler gets the hex code for instructions that you write 
 > down.  Like how does "4e444e750000" stand for "Trap #4" and then "rts"?
 > Bryan

The CPU has a set of elementary actions that its hardware can carry
out. Broadly speaking, a CPU does the following: it reads a memory word
from the address held in the internal register PC, interprets that word
and initiates hardware activities accordingly. When all the associated
activities are finished, it reads a new word from the memory addressed
by PC.

These elementary action packs, each triggered by a certain bit
pattern, are called instructions. For example, when the hardware
sees the bit pattern $4e93, it will do this: 

  subtract 4 from the contents of register a7, write the result 
  back to a7
  add 2 to the PC register and write result to two consecutive words
  in memory starting at the address contained in a7
  copy the contents of register a3 to register PC
  read a word from memory address PC and start interpreting the bit
  pattern as an instruction 

This is called a subroutine call instruction with address register 3
indirect as the target address.
It is an atomic operation: you cannot execute only parts of it. 
When the CPU reads the triggering bit pattern it will do all the 
things above. If you want to execute, say, only the "copy a3 to PC"
part, then that is another instruction, namely jmp (a3), and it is
encoded by a different bit pattern ($4ed3).
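
The steps above can be sketched as a toy model in Python. This is
only an illustration, not a real 68000 emulator: the names step, regs
and mem are made up for this example, and only the two bit patterns
discussed here are handled.

```python
def step(regs, mem):
    """Fetch one instruction word at PC and carry out its whole action pack."""
    pc = regs["pc"]
    opcode = (mem[pc] << 8) | mem[pc + 1]       # 68000 words are big-endian
    if opcode == 0x4E93:                        # jsr (a3)
        regs["a7"] -= 4                         # make room on the stack
        ret = pc + 2                            # address of the next instruction
        sp = regs["a7"]
        mem[sp:sp + 4] = ret.to_bytes(4, "big") # push the return address
        regs["pc"] = regs["a3"]                 # continue at the subroutine
    elif opcode == 0x4ED3:                      # jmp (a3): only the last step
        regs["pc"] = regs["a3"]
    else:
        raise ValueError(f"opcode ${opcode:04x} is not modelled here")


# Hypothetical memory image: a single "jsr (a3)" at address 0.
mem = bytearray(64)
mem[0:2] = bytes.fromhex("4e93")
regs = {"pc": 0, "a3": 0x20, "a7": 0x40}
step(regs, mem)
```

After the call, regs["pc"] holds the value that was in a3, a7 has
dropped by 4, and the four bytes at the new a7 hold the return address.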

The Motorola book will tell you which bit patterns trigger which
instruction, and it also suggests a so-called mnemonic for each
instruction. That is, there is an elementary hardware action which
does the things described above, it has the bit pattern of .... it is
called a "subroutine call". Motorola suggests that you use the letter 
combination (mnemonic) "jsr" [Jump to SubRoutine] to identify it.
They also suggest a syntax: for example, when you want to say
"address register 3 indirect" you write "(a3)", and you use # to
indicate an immediate operand, and so on. 
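
To tie this back to the hex string in the question, here is a small
sketch that goes the other way, from bit patterns to mnemonics. The
function name decode and the FIXED table are invented for this
example. It relies on TRAP #n being encoded as $4e40 plus the vector
number in the low 4 bits; any word it does not know is printed as raw
data, so I make no claim about what the trailing 0000 was meant to be.

```python
# A few fixed bit patterns and their suggested mnemonics.
FIXED = {0x4E71: "nop", 0x4E73: "rte", 0x4E75: "rts"}


def decode(hex_string):
    """Split a hex dump into 16-bit words and name the ones we recognise."""
    data = bytes.fromhex(hex_string)
    result = []
    for i in range(0, len(data), 2):
        word = int.from_bytes(data[i:i + 2], "big")
        if word in FIXED:
            result.append(FIXED[word])
        elif word & 0xFFF0 == 0x4E40:            # trap: vector in low 4 bits
            result.append(f"trap #{word & 0xF}")
        else:
            result.append(f"dc.w ${word:04x}")   # not modelled in this sketch
    return result


print(decode("4e444e750000"))   # ['trap #4', 'rts', 'dc.w $0000']
```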

Now from that to an assembler the route is relatively straightforward. 
You build a big table in the assembler.  The table contains pairs of 
a string and a bit pattern (number):

  NOP        $4e71
  RTS        $4e75
  RTE        $4e73
  RTR        $4e77
  
and so on. Then the assembler reads your source, line by line. It cuts 
each line into fields, namely a label, an instruction mnemonic and
then operands. It looks up the mnemonic in its big table. If it finds
it, say it was an RTE, then it gets $4e73 from the table and writes it
out to the output file. It also increases its internal memory counter
by two, because the bit pattern for the RTE instruction occupies 2
bytes of memory.
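
A minimal sketch of that loop in Python, for operand-less instructions
only. The names OPCODES and assemble_simple are made up for this
example, and I assume the common convention that a label starts in
column 1 while the mnemonic is indented. It is not how a68k itself is
written.

```python
# Mnemonic -> 16-bit instruction word (instructions without operands).
OPCODES = {"NOP": 0x4E71, "RTS": 0x4E75, "RTE": 0x4E73, "RTR": 0x4E77}


def assemble_simple(source):
    """Return (machine code bytes, final value of the memory counter)."""
    out = bytearray()
    location = 0                          # the "internal memory counter"
    for line in source.splitlines():
        line = line.split(";")[0]         # drop a trailing comment
        if not line.strip():
            continue
        fields = line.split()
        if not line[0].isspace():         # text in column 1 is a label
            fields = fields[1:]           # labels are ignored in this sketch
        if not fields:
            continue
        mnemonic = fields[0].upper()
        if mnemonic not in OPCODES:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
        out += OPCODES[mnemonic].to_bytes(2, "big")
        location += 2                     # each of these takes 2 bytes
    return bytes(out), location


code, size = assemble_simple("start:\n    nop\n    rts   ; done\n")
print(code.hex(), size)                   # 4e714e75 4
```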

Now what if you have an operand, that is, something after the
instruction? Well, the assembler's table is somewhat more complex
than in the example above. Usually the operands for instructions and
the bit patterns representing them can be divided into groups. You then 
write routines that each process one group. Then you put those into
your table:

 Mnemonic    Code     Size   Oper1.       Oper2
-----------------------------------------------
  NOP        $4e71    -      -            -
  RTS        $4e75    -      -            -
  RTE        $4e73    -      -            -
  RTR        $4e77    -      -            -
  SWAP       $4840    -      Dreg1        -
  SUBQ       $5100    S76    Immed1_8     Alterable
  
and so on. Here Dreg1 means that the operand must be one of d0 to d7
and that it must be encoded in the bottom 3 bits of the instruction
word. Immed1_8 means a #X where X is between 1 and 8, and it should be
encoded by bitwise AND-ing X with 7 (so 8 is stored as 0) and then
shifting the result left by 9 bits. Alterable means a whole lot of
addressing modes which will be encoded in the bottom 6 bits and, if
they encode an operand which needs explicit addresses, these will be
stored in further words (called extension words in Motorola lingo).
S76 means that the instruction is available in byte, word and long
sizes and the size information is encoded in bits 7 and 6 of the
instruction's bit pattern.
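
The field rules above translate almost directly into code. In this
sketch the function names mirror the table columns but are otherwise
invented, and the 6-bit addressing mode field is simply passed in as a
number (0 for d0, $13 for (a3)) rather than parsed from text.

```python
SIZE_BITS = {"B": 0b00, "W": 0b01, "L": 0b10}


def dreg1(n):
    """Data register number goes into bits 2..0."""
    if not 0 <= n <= 7:
        raise ValueError("data register must be d0..d7")
    return n


def immed1_8(x):
    """Quick immediate 1..8 goes into bits 11..9; 8 is stored as 0."""
    if not 1 <= x <= 8:
        raise ValueError("immediate must be between 1 and 8")
    return (x & 7) << 9


def s76(size):
    """Operation size goes into bits 7 and 6."""
    return SIZE_BITS[size.upper()] << 6


def encode_swap(reg):
    return 0x4840 | dreg1(reg)


def encode_subq(size, value, ea_field):
    return 0x5100 | s76(size) | immed1_8(value) | (ea_field & 0x3F)


print(f"{encode_swap(3):04x}")              # 4843  swap d3
print(f"{encode_subq('W', 1, 0x00):04x}")   # 5340  subq.w #1,d0
print(f"{encode_subq('L', 8, 0x13):04x}")   # 5193  subq.l #8,(a3)
```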

So, when the assembler sees the mnemonic "SUBQ" it will look for a
size (.B, .W or .L). If it can't find one, it will assume the default,
which is .W. It generates the bit pattern for that. Then it will see
that there is a first operand and call the routine Immed1_8. The
routine will check that the first operand is in the form #<constant>
and that the constant is between 1 and 8. If so, it will generate the 
bit pattern for that and return to the line processor. The line
processor will again consult the table and call the Alterable routine
to check the second operand. Alterable, in turn, will try to match
patterns on the operand, like a[0-7] (that is a0 to a7) and (a[0-7])
and <constant>(a[0-7],[ad][0-7]{.[wl]}), that is, a constant expression
followed by an opening parenthesis followed by a0 to a7 then a comma
then a0 to a7 or d0 to d7, optionally followed by a period and the
letter w or l, then a closing paren. If it finds one of the patterns, 
it will generate the 6-bit encoding for that pattern and also all the
necessary extension words. (On a 68000 the longest instruction is 5 
words: one instruction word and two longwords containing addresses;
on a 68020 or 68030 it peaks at 11 words!)
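
Here is a rough sketch of such an Alterable routine using Python
regular expressions. It covers only a handful of the addressing modes,
accepts plain decimal numbers where a real assembler would accept a
constant expression, and does not reject combinations the CPU forbids
(such as a byte-sized subq on an address register). The names
alterable and subq are hypothetical.

```python
import re


def alterable(operand):
    """Return (6-bit mode/register field, list of extension words)."""
    op = operand.strip().lower()
    if m := re.fullmatch(r"d([0-7])", op):                  # dn
        return 0b000_000 | int(m[1]), []
    if m := re.fullmatch(r"a([0-7])", op):                  # an
        return 0b001_000 | int(m[1]), []
    if m := re.fullmatch(r"\(a([0-7])\)", op):              # (an)
        return 0b010_000 | int(m[1]), []
    if m := re.fullmatch(r"(-?\d+)\(a([0-7])\)", op):       # d16(an)
        disp = int(m[1])
        if not -32768 <= disp <= 32767:
            raise ValueError("displacement does not fit in 16 bits")
        return 0b101_000 | int(m[2]), [disp & 0xFFFF]
    m = re.fullmatch(r"(-?\d+)\(a([0-7]),([ad])([0-7])(?:\.([wl]))?\)", op)
    if m:                                                   # d8(an,xn.s)
        disp = int(m[1])
        if not -128 <= disp <= 127:
            raise ValueError("displacement does not fit in 8 bits")
        ext = ((m[3] == "a") << 15) | (int(m[4]) << 12)     # index register
        ext |= ((m[5] == "l") << 11) | (disp & 0xFF)        # index size, disp
        return 0b110_000 | int(m[2]), [ext]
    raise ValueError(f"not an alterable operand: {operand}")


def subq(size, value, operand):
    """Assemble subq.<size> #value,operand into a list of 16-bit words."""
    if not 1 <= value <= 8:
        raise ValueError("immediate must be between 1 and 8")
    ea, extension = alterable(operand)
    size_bits = {"b": 0, "w": 1, "l": 2}[size.lower()]
    first = 0x5100 | (size_bits << 6) | ((value & 7) << 9) | ea
    return [first] + extension


for words in (subq("w", 1, "d0"), subq("w", 2, "4(a0,d1.l)")):
    print(" ".join(f"{w:04x}" for w in words))
```

The first call produces a single word, the second produces the
instruction word followed by one extension word.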

Another issue is how it knows what bit pattern belongs to
"mylabel", in the context of

mylabel:
   do something
   jmp mylabel
   
Well, obviously the bit pattern depends on the actual memory 
address mylabel represents when the program is running. Suppose the
assembler knows, when it starts assembling, where your program will be
loaded, that is, what address corresponds to the first instruction of
your code. Since from that point on the assembler generates all the
instructions, it knows how much space they take up, and therefore it
can keep track of the address of every instruction (and thus label).
In reality, the assembler very rarely knows the absolute address of
your code. However, it can assume a starting address of 0, calculate
with that, and alongside the code belonging to your program it can also
generate a table (a so-called "relocation table") which lists 
all locations in the code which contain position-dependent bit
patterns, plus some attributes of each location. Then the linker or the 
loader can generate the final bit pattern for these places when the
real start address is known.
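
A small sketch of the idea, again in Python with made-up names
(assemble_with_labels, relocate). It assumes every jmp uses the
absolute long form, $4ef9 followed by a 32-bit address, and the
"relocation table" here is just a list of byte offsets. Real object
file formats carry more information than that.

```python
SIZES = {"nop": 2, "rts": 2, "jmp": 6}    # bytes taken by each instruction


def assemble_with_labels(lines):
    """Assemble at start address 0; return (code, relocation offsets)."""
    # Pass 1: walk the source and give every label an address.
    symbols, address = {}, 0
    for line in lines:
        text = line.strip()
        if text.endswith(":"):
            symbols[text[:-1]] = address
        elif text:
            address += SIZES[text.split()[0]]
    # Pass 2: emit code and remember where absolute addresses were stored.
    code, relocs = bytearray(), []
    for line in lines:
        text = line.strip()
        if not text or text.endswith(":"):
            continue
        op, *args = text.split()
        if op == "nop":
            code += bytes.fromhex("4e71")
        elif op == "rts":
            code += bytes.fromhex("4e75")
        elif op == "jmp":
            code += bytes.fromhex("4ef9")
            relocs.append(len(code))      # offset of the address field
            code += symbols[args[0]].to_bytes(4, "big")
    return bytes(code), relocs


def relocate(code, relocs, load_address):
    """What a loader does: add the real start address at each offset."""
    out = bytearray(code)
    for off in relocs:
        value = int.from_bytes(out[off:off + 4], "big") + load_address
        out[off:off + 4] = value.to_bytes(4, "big")
    return bytes(out)


code, relocs = assemble_with_labels(["    nop", "mylabel:", "    nop",
                                     "    jmp mylabel"])
print(code.hex(), relocs)                        # 4e714e714ef900000002 [6]
print(relocate(code, relocs, 0x10000).hex())     # 4e714e714ef900010002
```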

There are other issues with assemblers like segments, directives,
macros and so on, but they are not relevant to your question.

I hope the above clarifies a few things.

Regards,

Zoltan

