Assembly 80x86 - Zeroing a register

DaSilva · 26-01-2007 03:30PM #1

Kinda trivial, but at the same time has me wondering.

Which instruction is faster to zero out a register?

xor AX,AX
mov AX,0

I could check which one is faster on this machine, but it wouldn't tell me in general which instruction takes fewer cycles (I imagine both are very low anyway).

I see both methods being used all the time, so it just makes me wonder: are they really the same, and some people just prefer one over the other?

robfitz · 26-01-2007 06:27PM

The xor will be faster. It generates less code, taking up less space in memory and caches. The mov needs to encode the zero in the code, so you get something like this:

xor ax, ax # 31 c0
mov ax, 0 # b8 00 00 (mov eax, 0 in 32-bit code: b8 00 00 00 00)

Martyr · 26-01-2007 09:51PM

robfitz wrote:

The xor will be faster. It generates less code, taking up less space in memory and caches. The mov needs to encode the zero in the code, so you get something like this:

xor ax, ax # 31 c0
mov ax, 0 # b8 00 00 (mov eax, 0 in 32-bit code: b8 00 00 00 00)

it may interest you to know that on 32 and 64-bit cpus, bigger code can run faster than smaller code.

there are good examples here

for example, in zeroing a register, you could have

small but slow:
      push 0
      pop eax

small and fast on pre-p4 systems:
      xor eax,eax

small and fast on all systems:
      sub eax,eax

big, but faster than push/pop:
      mov eax,0

big but fast:
      and eax,0
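To put numbers on "small" and "big", here is a rough sketch in C that tabulates the machine code for the five idioms above. It assumes 32-bit mode and an assembler that picks the shortest encoding (some assemblers emit a longer immediate form for `and eax,0`). The names `zero_idiom`, `ZERO_IDIOMS` and `zero_idiom_length` are made up for this illustration. The table says nothing about speed, which depends on the processor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One way to zero EAX, with its 32-bit-mode machine code.
   Assumption: the assembler chooses the shortest available encoding. */
struct zero_idiom {
    const char *asm_text;    /* Intel-syntax source text */
    unsigned char bytes[5];  /* encoded instruction bytes */
    size_t length;           /* number of valid entries in bytes[] */
};

static const struct zero_idiom ZERO_IDIOMS[] = {
    { "push 0 / pop eax", { 0x6A, 0x00, 0x58 },             3 },
    { "xor eax, eax",     { 0x31, 0xC0 },                   2 },
    { "sub eax, eax",     { 0x29, 0xC0 },                   2 },
    { "mov eax, 0",       { 0xB8, 0x00, 0x00, 0x00, 0x00 }, 5 },
    { "and eax, 0",       { 0x83, 0xE0, 0x00 },             3 },
};

#define ZERO_IDIOM_COUNT (sizeof ZERO_IDIOMS / sizeof ZERO_IDIOMS[0])

/* Return the encoded length in bytes of the named idiom,
   or 0 if the text is not in the table. */
size_t zero_idiom_length(const char *asm_text)
{
    assert(asm_text != NULL);
    for (size_t i = 0; i < ZERO_IDIOM_COUNT; i++) {
        if (strcmp(ZERO_IDIOMS[i].asm_text, asm_text) == 0)
            return ZERO_IDIOMS[i].length;
    }
    return 0;
}
```

Calling `zero_idiom_length("mov eax, 0")` gives 5, against 2 for the xor and sub forms, which is the size difference rob pointed out.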

and i've seen some asm programmers use the loop instruction, which is small but terribly slow on post-pentium processors, as are similar convenient instructions like lodsb/scasb (for example in strlen()) and movsb (in memcpy() or strcpy() functions).

for loops, i've seen:

small but slower:
label:
      ; body of loop
      loop label

most compilers optimising for pre-p4 will use something like

big but faster:
label:
      ; body of loop
      dec ecx
      jnz label
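For anyone more at home in C, both loop shapes above behave like a count-down do/while: the body always runs at least once, and the counter is tested after the decrement. A minimal sketch, assuming a non-zero starting count. `countdown_sum` is a hypothetical helper whose loop body just adds the counter to a running total.

```c
#include <assert.h>
#include <stdint.h>

/* Sum count, count-1, ..., 1 using a loop shaped like
       label:  ; body
               dec ecx
               jnz label
   The variable is named ecx only to mirror the asm. */
uint32_t countdown_sum(uint32_t count)
{
    /* With ecx == 0 on entry, both LOOP and dec/jnz wrap around and
       iterate 2^32 times, so this sketch requires a non-zero count. */
    assert(count != 0);

    uint32_t ecx = count;
    uint32_t sum = 0;
    do {
        sum += ecx;       /* body of loop */
        ecx--;            /* dec ecx */
    } while (ecx != 0);   /* jnz label */
    return sum;
}
```

The one architectural difference the C version cannot show is the flags: loop leaves them untouched, while dec updates the arithmetic flags other than CF.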

this is because the pentium and later processors execute more than one instruction at once. code that "pairs" is much faster than some smaller code, but not on 16-bit processors obviously.

and (imho) p4s don't work as well with pentium-optimised code as amd64 does.

they (intel) recommend replacing DEC with SUB in loops for better performance. they also don't like using LEA, which is very useful in optimisation.

good source is the mark larson tutorial above

like you could have:

     mov eax, 12345678h                    ;5 bytes
     add eax, ebp                          ;2 bytes
     imul ecx, 4                           ;3 bytes
     add eax, ecx                          ;2 bytes

and optimise it into:

     lea eax, [ebp+ecx*4+12345678h]        ;7 bytes
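A quick way to convince yourself the two versions compute the same value is to model the registers as 32-bit unsigned integers in C, where arithmetic wraps modulo 2^32 just as it does in the registers. This is only a sketch: `address_stepwise`, `address_lea` and `check_forms_agree` are hypothetical names, and the parameters stand in for the register contents on entry.

```c
#include <assert.h>
#include <stdint.h>

/* The four-instruction sequence: mov, add, imul, add. */
uint32_t address_stepwise(uint32_t ebp, uint32_t ecx)
{
    uint32_t eax = 0x12345678u;  /* mov eax, 12345678h */
    eax += ebp;                  /* add eax, ebp */
    ecx *= 4u;                   /* imul ecx, 4 (overwrites ecx) */
    eax += ecx;                  /* add eax, ecx */
    return eax;
}

/* The single lea: base + index*scale + displacement in one step. */
uint32_t address_lea(uint32_t ebp, uint32_t ecx)
{
    return ebp + ecx * 4u + 0x12345678u;  /* lea eax, [ebp+ecx*4+12345678h] */
}

/* Abort if the two forms disagree for the given register values. */
void check_forms_agree(uint32_t ebp, uint32_t ecx)
{
    assert(address_stepwise(ebp, ecx) == address_lea(ebp, ecx));
}
```

Besides being 7 bytes instead of 12, the lea form leaves ecx and the flags unchanged, whereas the longer sequence overwrites ecx and sets the flags.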

the xor as rob says (or sub) would be best on 16-bit
