A89: Shifting/Multiplying. . .


[Prev][Next][Index][Thread]

A89: Shifting/Multiplying. . .




 > Another optimization (note: "optomization" is spelled with a _z_ =)

Actually, in British English both forms are correct, at least
according to the Longman dictionary (which is a British one :-).
 
In regards to the manual, you can download the 68000 User's Manual 
from Motorola:

   http://www.mot.com/SPS/HPESD/aesop/680X0/000/68000UM.pdf

It is about 1MB.

 > question:  If I'm multiplying one variable number by a constant that is the
 > sum of several powers of two, is it better to flat-out multiply or to
 > perform several left shifts (multiplying by the powers of two) and adding
 > those numbers?

It depends on the actual constant.
See below.

 > by 12.  Ten is the sum of 2 powers of two: 8 (2^3) and 4 (2^2).  I can use a
 > simple "mulu.w #12,d0" or I can use the following little algorithm:
 > 
 > mul_d0_12:
 >  move.w d0,d1    ;arbitrary value is a word
 >  lsl.w #2,d0     ;d0 = (4 * d0)
 >  lsl.w #3,d1     ;d1 = (8 * d0)
 >  add.w d1,d0     ;d0 = ((8 * d0) + (4 * d0))
 >                  ;   = (d0 * (8 + 4))
 >                  ;   = (d0 * 12)
 >                  ;[via distributive property of multiplication]

Well, your code created a 16 bit result, mulu generates 32 bit result, 
so your version is not fully compatible with the mulu insn. 
If you change all insns to long, you will be compatible, assuming that 
the top word of d0 was 0.

The code above, as is (i.e. using words instead of longs) takes
4+10+12+4 = 30 clocks. With longs it would take 36 clocks.
This algorithm can be sped up with a simple trick, based on two 
observations:

- a long reg-reg add is 6 clocks while a long shift is 8+2n where n
  is the shift count (word/byte add/shift are 4 and 6+2n, respectively)
   
- while 12 can be described as 8+4, it also can be decomposed as
  4*3 which, in turn, is 2*(2*(1+(1+1)))

Therefore, the revised method (with long and word version timing)
would be:  
                .x= .l  .w
                ----------				 
  move.x d0,d1    ;  4   4 
  add.x  d0,d0    ;  6   4   d0' = d0+d0 = 2*d0
  add.x  d1,d0    ;  6   4   d0' = (d0+d0)+d0 = 3*d0
  add.x  d0,d0    ;  6   4   d0' = 2*((d0+d0)+d0) =  6*d0
  add.x  d0,d0    ;  6   4   d0' = 2*(2*((d0+d0)+d0)) = 12*d0
                   --- ---    
                    28  20   clocks
 
Multiplication (mulu.w <ea>,Dn) takes 38+2N+EA clocks where N is the
number of '1'-s in <ea> and EA is the number of clocks needed to
process the effective address. 
In your case <ea> is #12, which has an EA=4 and N=2 resulting
46 clocks. The add/shift version is obviously much faster. 

 > Also, to what point is this more efficient?  If I were to expand upon this
 > to multiply by 31 -- 2^4 + 2^3 + 2^2 + 2^1 + 2^0 (w/ 2^0 not requiring any

If you were to multiply by 31, you wouldn't use this method. Rather,
you'd do this:

  move.l d0,d1
  lsl.l  #5,d0    ; d0' = 32*d0
  sub.l  d1,d0    ; d0' = 32*d0 - d0 = 31 * d0
  
 > shifting), would it still be more efficient than multiplying?  It's
 > beginning to seem to me like the 68k's multiplication/division operations
 > are more or less wasteful. . .

Well, if you multiply with a constant then for quite a few cases you
can be better off with add/sub/shift. On the other hand, when you have 
to multiply two *variables* together, you can't beat the HW multiply.

Division is a different kind of animal. It is *very* slow, divu takes
140 clocks, divs takes 158. On the other hand, even divide by constant 
is a tough cookie without doing an actual division (except for powers
of two, of course) and the div instruction will laughingly beat any
general purpose division routine you can come up with.

Regards,

Zoltan


References: