![]() |
|
Published 1999-04-19 Printer-friendly version
(Editor's note: Some of the source code on this page exceeds the page boundaries when printed from a browser. You may wish to download the April 19 PDF which takes care of this problem.)
Recently Clarion magazine ran a competition to see who could produce the fastest and tightest code for a given problem. The results are given elsewhere but David has asked me to write an article dealing not with what the developer was doing, but what the compiler was doing underneath.
The object is not to score points from one developer to another but simply to use the code that has been produced to illustrate some of the internals of the way the compiler works. These are not issues you would want to think about when you code but, if they are in the back of your mind, they may just help you keep an edge on your coding.
In order to make the analysis concrete I shall actually include snippets of disassembled code from the functions. Don't worry if you're not an assembler coder. The text has all the information required, and this may even ease you in to using some of the debugger disassembly functions. The 32 bit debugger is brilliant in mixed code/assembler mode. I have also given a byte count for each function. The byte count is often an indication of just how easy the compiler found the code to chew. Generally I look for around 20 bytes per line as a fair indication that the code is doing what you expect.
As well as analysing what the compiler has produced I shall make a few comments about how I might have altered the code provided. These comments may not actually help (they are not tested), but they are the comments I would have made to one of my team had they submitted the code as part of our code base.
JeffSlarve Procedure(Long StartTime,Long EndTime,
Short Increment,Byte Round=False)
Result Real
Code
Result = ((EndTime - StartTime + |
CHOOSE(StartTime > EndTime, 8640000, 1)) |
/ Increment / 60) * .01
Return CHOOSE(~Round,Result+CHOOSE|
(~(Result-Int(Result)),0,1),Round(Result,1))
Jeff is one of the many that went for minimum line count. The
Clarion CHOOSE function is central to this philosophy
and the fragment below shows why.
!39 C6! cmp esi,eax !7E 07! jle 7 !BB 00 D6 83 00! mov ebx,83D600H !EB 05! jmp 5 !BB 01 00 00 00! mov ebx,1
This is the code for the inner CHOOSE of the first
line. The compiler has the two incoming LONGs in
registers, compares them and stores 8640000 or
1 in ebx dependant on the equality.
Adding EndTime and then subtracting
StartTime becomes a short and sweet operation.
!29 F0! sub eax,esi ! eax = EndTime-StartTime !01 D8! add eax,ebx ! eax += CHOOSE result (in ebx)
Having got the body of the expression into eax the
compiler now has to divide by Increment. Here is the
first issue. In Clarion division actually works. A division (even
of an integer by an integer) is defined to produce a decimal
result. So now the compiler uses the decimal stack. The first point
to note is that the compiler didn't compute the result in
eax by accident. It was to make it easy to push the
value onto the decimal stack.
!E8 00 00 00 00! call Cla$DpushLong ! Push result !0F BF C1! movsx eax,cx ! Push increment onto stack !E8 00 00 00 00! call Cla$DPushLong !E8 00 00 00 00! call Cla$DecDivide ! Perform division, result on stack
To get increment onto the stack it has to work a little harder.
First increment ( a SHORT in
cx ) is moved into eax with sign
extension, then the push happens, then the divide. Next the
compiler has to divide by 60. Here is a nice feature of the decimal
stack: the result of the division was left on the stack so to
divide by 60 it's only necessary to push one of the operands
(not both).
!B8 3C 00 00 00! mov eax,3CH ! 60 onto stack !E8 00 00 00 00! call Cla$DPushLong !E8 00 00 00 00! call Cla$DecDivide
Next comes the multiplication by .01. Another little tweak: the compiler itself does not have a decimal library built in so cannot perform decimal arithmetic. Thus it stores decimals internally as strings and hands them on to the decimal library as such. Again the result of the division is still on the stack so the constant is pushed, then the multiply routine is called.
!B8 00 00 00 00! mov eax,$7 !B3 02! mov bl,2 !66 B9 02 00! mov cx,2 !E8 00 00 00 00! call Cla$DpushConstant ! .01 on stack !E8 00 00 00 00! call Cla$DecMul ! x top two items
Finally on the first line is the assignment itself. This is made
more complex (at least in the assembler) by the fact that the
result variable is a REAL. To store one of those the
floating point chip is used. DpopReal moves a result
from the decimal stack to register 0 of the FPU. This is then
stored as required.
!E8 00 00 00 00! call Cla$DpopReal !DD 5C 24 30! fstp qword [esp][30H],st0
In the second line a CHOOSE is again evaluated (the
outer one). Simply checking a byte for true is very simple. Note
that the compiler doesn't actually evaluate
~Round; it simply reverses the jump condition. The
other key here is that CHOOSE is guaranteed to perform
a short-circuit evaluation. In other words only one of the last two
parameters of CHOOSE will be evaluated. This is vital!
Without it all solutions using CHOOSE would have been
much slower.
!80 FA 00! cmp dl,0 !75 4C! jne 4CH
Going down the non-rounding arm the compiler first gets the
evaluation of ~(Result-INT(Result)). This can be done
almost entirely on the FPU although the Clarion INT
semantics have been encoded in a run-time library function. Note
too that it's necessary to use the Windows calling convention
which means floating point numbers are passed on the stack. (But
they are returned in FP0! Don't ask.)
!83 EC 08! sub esp,8 !DD 44 24 40! fld st0,qword [esp][40H] !DD 1C 24! fstp qword [esp][0],st0 !E8 00 00 00 00! call Cla$INT ! INT(Result) !DD 5C 24 28! fstp qword [esp][28H],st0 !DD 44 24 38! fld st0,qword [esp][38H] !DD 44 24 28! fld st0,qword [esp][28H] !DE E9! fsubp st1,st0 !Result-INT(Result) !DD 05 00 00 00 00! fld st0,qword [dword $8] ! Load up 0 !DE D9! fcompp st0,st1 ! Compare against zero
Having computed the result into the FPU the compiler now has to
transfer the results into the normal registers and then execute the
CHOOSE. The CHOOSE will return 0 or 1
(both of which are longs).
!DF E0! fstsw ax ! FPU status goes into eax !F6 C4 40! test ah,40H !74 07! je 7 !B8 00 00 00 00! mov eax,0 !EB 05! jmp 5 !B8 01 00 00 00! mov eax,1 ! EAX now holds 0 or 1 from CHOOSE
Having the result of the CHOOSE in eax
it now has to be cast back into a REAL to perform the
addition:
!DD 44 24 38! fld st0,qword [esp][38H] ! Result into fpu !89 44 24 24! mov [esp][24H],eax !DB 44 24 24! fild st0,dword [esp][24H] ! EAX into fpu !DE C1! faddp st1,st0 ! Addition performed !DD 5C 24 1C! fstp qword [esp][1CH],st0 ! Answer now ready for returning.
The computation of the other arm is simpler as the result of a
ROUND is deemed to be dependant upon the second (not
first) parameter. Thus the decimal stack is used.
!83 EC 08! sub esp,8 !DD 44 24 40! fld st0,qword [esp][40H] !DD 1C 24! fstp qword [esp][0],st0 !E8 00 00 00 00! call Cla$DpushReal ! Result onto decimal stack !B8 01 00 00 00! mov eax,1 !E8 00 00 00 00! call Cla$DpushLong ! 1 onto decimal stack !E8 00 00 00 00! call Cla$Dround ! Rounding performed !E8 00 00 00 00! call Cla$DpopReal ! Result back in st0
The last line may seem a bit odd. Why bring the result back as a
REAL? A CHOOSE function has to return the
same result type whichever arm is executed, so the compiler chooses
the lowest common denominator as the final result type. In the case
of REAL and DECIMAL this is deemed to be
REAL.
Having computed the result as a REAL it actually
has to be returned as a SHORT. This can be a
real performance hit as the FPU cannot be guaranteed to have
the correct truncation mode set. Don't ask what the lines
below do. I knew it five years ago but have long since forgotten. I
do know we need them.
!DD 44 24 1C! fld st0,qword [esp][1CH] !D9 7C 24 10! fstcw [esp][10H] !D9 7C 24 0E! fstcw [esp][0EH] !66 81 4C 24 0E 3F 0C! or word [esp][0EH],0C3FH !D9 6C 24 0E! fldcw [esp][0EH] !DF 5C 24 12! fistp word [esp][12H],st0 !DB E2! fclex !D9 6C 24 10! fldcw [esp][10H] !66 8B 44 24 12! mov ax,[esp][12H]
The compiler was quite happy with this code, and the
CHOOSE made the execution path short and quick. The
only thing that made the assembler ugly was the use of the real.
Generally I would use a decimal rather than a real when mixing with
longs. A small savings could also be achieved by using
,AUTO on the variable.
DavidBayliss PROCEDURE(s_time,e_time,incr,rnd)
dur long
CODE
dur = e_time-s_time
IF dur < 0 THEN
dur += 2 * 12 * 60 * 60 * 100
.
IF rnd
RETURN ROUND(dur / incr / ( 60 * 100 ),1)
ELSE
dur /= incr * 60 * 100
RETURN dur + 1
END
My main aim here was to stay at long precision for as much of
the time as I could. Hence my only intermediate is a
LONG. (Yuk, sloppy. This should have had
,AUTO on it). I compute the duration in the first
line; this can be done easily as the incoming parameters are in
registers.
!29 C3! sub ebx,eax ! ebx is now dur !89 5C 24 04! mov [esp][4],ebx ! stored in memory
The comparison of a LONG with zero is very cheap as
the compiler doesn't even bother to load the value from
memory. The compiler is also very good at folding integral
expressions so I didn't bother to fold the multiplication by
hand. I used += as this avoids using the value of dur
and hence avoids a load into a register. The whole of the first
IF construct is below.
!83 7C 24 04 00! cmp dword [esp][4],0 !7D 08! jge 8 !81 44 24 04 00 D6 83 00! add dword [esp][4],83D600H
Then comes the actual computation of the result. The rounding arm comes first. Here I missed a trick which cost me the poor rounding result.
!8B 44 24 04! mov eax,[esp][4] !E8 00 00 00 00! call Cla$DpushLong ! dur onto decimal stack !89 D8! mov eax,ebx !E8 00 00 00 00! call Cla$DpushLong ! increment onto decimal stack !E8 00 00 00 00! call Cla$DecDivide ! Divide and leave on stack !B8 70 17 00 00! mov eax,1770H !E8 00 00 00 00! call Cla$DpushLong ! 60 * 100 onto stack !E8 00 00 00 00! call Cla$DecDivide ! Divide and leave on stack !B8 01 00 00 00! mov eax,1 !E8 00 00 00 00! call Cla$DpushLong ! One onto stack !E8 00 00 00 00! call Cla$Dround ! Round it !E8 00 00 00 00! call Cla$DpopLong ! Return as short
The compiler has coped with the expression just fine, but the trick I missed is that I could have avoided the second division by using a multiplication instead.
RETURN ROUND(dur / (incr * ( 60 * 100 )),1)
The reason this matters is the dur/incr can often
leave a result with many decimal places. The division by 6000
(which the decimal library does to 31 decimal places) is thus very
costly.
The non-rounding arm actually missing a trick too.
!8B 44 24 04! mov eax,[esp][4] !E8 00 00 00 00! call Cla$DpushLong ! Dur onto decimal stack !6B C3 3C! imul eax,ebx,3CH ! Incr (in ebx) times 60, into eax !6B C0 64! imul eax,eax,64H ! Eax times 100 into eax !E8 00 00 00 00! call Cla$DpushLong ! Result onto decimal stack !E8 00 00 00 00! call Cla$DecDivide ! Division performed !E8 00 00 00 00! call Cla$DpopLong ! Result into eax !89 44 24 04! mov [esp][4],eax ! Result into dur
So how come the silly compiler didn't fold my constants? Well, it was actually being ultra-cautious. In a fixed-width number system (such as long or decimal) multiplication is actually not quite associative. If you rewrite the line as
dur /= incr * (60 * 100)
the compiler folds as expected. Moral: If you want to make sure constants are folded in a mixed expression then use some parenthesis.
Adding on one and return is then easy:
!66 8B 44 24 04! mov ax,[esp][4] ! Long to short cast is free !66 40! inc ax ! Adding one to a long is easy
A painfully sloppy piece of coding. I missed three tricks all of which cost time, one of which cost compactness. However, I actually find this quite encouraging. You see, when I sent this to David I said that I hadn't checked for speed or compactness; I was just trying to code the function "cleanly." The result was a function that performed reasonably well and can be cleaned up into a very nice performer.
NikJohnsonA Procedure(Long T1,Long T2,Short B,Byte R=False) Result DECIMAL(4) CODE Result = .5-R/2+(T2-T1-1+CHOOSE(T2-T1<1,8640000,0))/(B*6000) return(Result)
Here again the decimal stack is in full swing. First
R is converted to a LONG and divided by
two (which is a decimal operation).
!0F B6 C2! movzx eax,dl ! R (in dl) goes into EAX
!E8 00 00 00 00! call Cla$DpushLong ! And onto the decimal stack
!B8 02 00 00 00! mov eax,2
!E8 00 00 00 00! call Cla$DpushLong ! 2 onto decimal stack
!E8 00 00 00 00! call Cla$DecDivide ! Division performed,
! result on stack
Then the 0.5 is pushed onto the stack
!B8 00 00 00 00! mov eax,$5 ! 0.5 stored as string !B3 01! mov bl,1 !66 B9 01 00! mov cx,1 !E8 00 00 00 00! call Cla$DpushConstant ! 0.5 on stack
Then the subtraction. This shows a useful tweak in the stack system. Usually a stack subtraction would take the top of stack from the one below, the opposite of what we want. We could issue some "stack rolling" calls, however we have avoided this by producing some "reversed" functions (that actually want their parameters inside out).
!E8 00 00 00 00! call Cla$DecSubR ! Take 2nd on stack from top,
! result on stack.
The compiler leaves this result on the stack, and it will come
back to it much later. Now the compiler moves onto to the inner
CHOOSE (very similar to Jeff Slarve's code).
!89 E8! mov eax,ebp !29 F0! sub eax,esi ! EAX is T2-T1 !83 F8 01! cmp eax,1 ! Compare to 1 !7D 07! jge 7 !BB 00 D6 83 00! mov ebx,83D600H !EB 05! jmp 5 !BB 00 00 00 00! mov ebx,0 ! EBX is result of CHOOSE
The T2-T1-1 is all long arithmetic which you would
expect the compiler to chew easily. In fact it is ultra-clever.
Note that in computing the CHOOSE T2-T1
is computed into eax, it so happens that
T2-T1 is the first part of the expression it needs to
compute so it doesn't have too! It can get straight on with
the subtracting of one and then adding the CHOOSE
result.
!48! dec eax ! EAX = T2-T1-1 !01 D8! add eax,ebx ! Add on CHOOSE result.
The compiler has again chosen to compute the result into
eax. You'll see why in the next stage which is
the division of B * 6000.
!E8 00 00 00 00! call Cla$DpushLong ! Result onto decimal stack !0F BF C7! movsx eax,di ! B into eax !69 C0 70 17 00 00! imul eax,eax,1770H ! Multiply by 6000 !E8 00 00 00 00! call Cla$DpushLong ! Onto decimal stack !E8 00 00 00 00! call Cla$DecDivide ! Division
Finally this result has to be added to the result of the subtraction performed previously. Remember that had been left on the stack so the compiler simply does:
!E8 00 00 00 00! call Cla$DecAdd
This result is then placed into the local variable:
!8D 44 24 15! lea eax,[esp][15H] !B3 03! mov bl,3 !B1 00! mov cl,0 !E8 00 00 00 00! call Cla$DpopDec
The local variable is then converted to a SHORT
(via the stack) and returned.
!B3 03! mov bl,3 !B1 00! mov cl,0 !E8 00 00 00 00! call Cla$DPushDec !E8 00 00 00 00! call Cla$DpopLong
This function is very similar to Jeff's except the
computation stays with DECIMALs rather than diving
into REALs. The benefit is seen in terms of code size
and speed. I think there are two ways to improve it. Firstly
changing 0.5-R/2 to (1-R)/2 would save
the "constant push" and would allow the subtraction to happen at
LONG width. Secondly I would try ditching the
DECIMAL variable and simply returning a
ROUNDed result. The DpopDec contains an
implicit ROUND so I would guess this will not slow
things down and would save some code.
NikJohnsonB Procedure(Long T1,Long T2,Short B,Byte R=False) Result DECIMAL(4) CODE Result = ((((T2-T1)+CHOOSE((T2-T1)<0,8639999,-1))| /B/3000)+1-R)/2 return(Result)
The innermost expression (up to the first /) is a joy to behold.
Again the CHOOSE is neat and the compiler repeats the
trick of spotting the T2-T1 is used twice and so
computes it once.
!29 F0! sub eax,esi ! EAX = T2-T1 !83 F8 00! cmp eax,0 !7D 07! jge 7 !BB FF D5 83 00! mov ebx,83D5FFH !EB 05! jmp 5 !BB FF FF FF FF! mov ebx,-1 ! EBX is result of CHOOSE !01 D8! add eax,ebx ! Result of addition
Again the result is in eax which makes the
following two divisions very neat.
!E8 00 00 00 00! call Cla$DpushLong !Onto decimal stack !0F BF C1! movsx eax,cx ! B onto decimal stack !E8 00 00 00 00! call Cla$DPushLong !E8 00 00 00 00! call Cla$DecDivide ! Division done !B8 B8 0B 00 00! mov eax,0BB8H !E8 00 00 00 00! call Cla$DpushLong ! 3000 onto stack !E8 00 00 00 00! call Cla$DecDivide ! Division again
Now, here the compiler has a decimal result to which it wants to
add 1 (again addition & subtraction are not quite associative
so the compiler works left to right). Because it cannot cast from
DECIMAL to LONG without losing precision
it has to perform the addition (and then subtraction) upon the
decimal stack.
!B8 01 00 00 00! mov eax,1 !E8 00 00 00 00! call Cla$DpushLong ! 1 onto stack !E8 00 00 00 00! call Cla$DecAdd ! Add !0F B6 C2! movzx eax,dl ! R into EAX !E8 00 00 00 00! call Cla$DpushLong ! and onto stack !E8 00 00 00 00! call Cla$DecSub ! Subtract
On the plus side the decimal computation leaves the result on the stack, ripe for the division by 2, again on the decimal stack.
!B8 02 00 00 00! mov eax,2 !E8 00 00 00 00! call Cla$DPushLong !E8 00 00 00 00! call Cla$DecDivide
The result is then used as in Nik's A version.
The assembler actually looks quite nice, certainly very compact.
So why was it so slow? I suspect the culprit is the three
divisions. The problem of doing repeated divisions within one
expression is that the intermediate result gets very long which
slows the decimal library down. This could be reduced by combining
/B/3000 into /(B*3000).
The other little tweak would be to move the +1-R to
the beginning of the bracketed expression so that it is done at
long width.
BrianStaff Procedure(Long T1,Long T2,Short Block,Byte Rounding=False)
X LONG
Answer long
code
X = CHOOSE(T2 > T1,(T2-1) - (T1-1),|
(T2-1) + (24*60*60*100) - (T1-1))
EXECUTE(ROUNDING+1)
Answer = ( X / (Block*60*100) ) + |
CHOOSE(X % (Block*60*100) = 0,0,1)
Answer = ROUND( X / (Block*60*100), 1)
END
return(Answer)
The compiler has really bust a gut getting the code for
Brian's CHOOSE good. The jumping code is clean
(as usual) but it also spots that T1-1 is used down
both arms and thus pre-computed it. Here is the code for the whole
line:
!39 C3! cmp ebx,eax ! Test T2>T1 !9C! pushfd !48! dec eax ! Pre-Compute T1-1 !9D! popfd !7E 05! jle 5 ! Then pick an arm !4B! dec ebx ! EBX = T2-1 !29 C3! sub ebx,eax ! EBX = (T2-1) - (T1-1) !EB 08! jmp 8 !81 C3 FF D5 83 00! add ebx,83D5FFH ! EBX = T2-1+24*60*60*100 !29 C3! sub ebx,eax ! EBX is required result !89 5C 24 0C! mov [esp][0CH],ebx ! Store into X
Brian has also persuaded the compiler to pre-compute the
Block*60*100 outside of the execute statement.
!0F B6 D2! movzx edx,dl ! Move ROUNDING into edx !42! inc edx ! Add 1 !83 FA 01! cmp edx,1 ! Compare against 1 (first line) !9C! pushfd !0F BF C1! movsx eax,cx ! Move Block into EAX !6B C0 3C! imul eax,eax,3CH ! * 60 !6B D8 64! imul ebx,eax,64H ! EBX now holds Block*60*100 !9D! popfd !75 4E! jne 4EH ! Now split into the two arms of the execute.
The first part of the non-rounding case (the division) is handled in the way you should now expect.
!8B 44 24 0C! mov eax,[esp][0CH]! X into EAX !E8 00 00 00 00! call Cla$DPushLong !89 D8! mov eax,ebx ! Compiler has pre-computed Block*60*100 !E8 00 00 00 00! call Cla$DPushLong !E8 00 00 00 00! call Cla$DecDivide ! Result as required
A modulus operation between integers is handled by the compiler inline (although the code is a little hard to understand unless you sit down with pencil and paper!).
!8B 44 24 0C! mov eax,[esp][0CH] ! X into EAX !99! cdq ! Make X an INT64 !F7 FB! idiv ebx ! Modules with Block*60*100 (Pre-computed) !89 D0! mov eax,edx ! You now need to fiddle to deal with negatives !31 D8! xor eax,ebx !7D 06! jge 6 !85 D2! test edx,edx !74 02! je 2 !01 DA! add edx,ebx ! Get the result into edx by magic (Memory failure ...)
Having got the result magically into edx the CHOOSE
is its usual snappy self:
!83 FA 00! cmp edx,0 !75 07! jne 7 !B8 00 00 00 00! mov eax,0 !EB 05! jmp 5 !B8 01 00 00 00! mov eax,1
This leaves the addition. The result being added to is presently
on the decimal stack so decimal addition is used, and the result is
then stored in Answer.
!E8 00 00 00 00! call Cla$DpushLong ! CHOOSE result in EAX, now pushed. !E8 00 00 00 00! call Cla$DecAdd ! Add to division result already on stack !E8 00 00 00 00! call Cla$DpopLong ! Pop result into eax !89 44 24 08! mov [esp][8],eax ! And into answer
The other arm is fairly standard decimal stack stuff.
!8B 44 24 0C! mov eax,[esp][0CH] !X into eax !E8 00 00 00 00! call Cla$DpushLong! X onto stack !89 D8! mov eax,ebx ! Pre-computed Block*60*100 !E8 00 00 00 00! call Cla$DPushLong !E8 00 00 00 00! call Cla$DecDivide ! Division result on stack !B8 01 00 00 00! mov eax,1 ! 1 onto stack !E8 00 00 00 00! call Cla$DPushLong !E8 00 00 00 00! call Cla$Dround ! Perform the round !E8 00 00 00 00! call Cla$DpopLong ! Result into eax !89 44 24 08! mov [esp][8],eax ! And into answer
Brian has kept the computations at a suitable precision, avoided excessive divisions and somehow persuaded the compiler to perform some extremely useful pre-computations. A worthy winner.
I'm not sure I should comment given he has the best figures
but I think he could get a better result yet by putting the
60*100 into parenthesis (same as David Bayliss'
code). He can also put ,AUTO on his variables. The
other thing I would try is returning the result directly rather
than going through Answer. (Editor's note: Some
entries were provided as routines and the extra step in
RETURNing values may be mine.)
Part Two of this analysis provides some surprising information about how the compiler translates the code we write.
David Bayliss is a Systems Architect for The TopSpeed Development Center. He has worked upon TopSpeed's compiler and was the chief architect of the Application Builder Classes.
Copyright © 1999-2008 by CoveComm Inc. All Rights Reserved. Reproduction in any form without the express written consent of CoveComm Inc., except as described in the subscription agreement, is prohibited.
Clarion Magazine ISSN 1718-9942
One year: $159
(reg $189, save $30)
(includes all back issues since '99)
Renewals from $109
Two years: $249
(reg $289, save $40)
Renewals from $199