Lair Of The Multimedia Guru

September 27, 2006

X86 assembly instructions you always wanted but intel didnt give them to you

Everyone who wrote assembly code probably has cursed the lack of some instuctions in the instrcution set of their target architecture, be it due to the work needed to workaround it or the complexity and consequently speed loss of the resulting code

Heres a list of what i think is lacking in the x86 instruction set …

btw, i had to use shl and shr instead of 2 less than and 2 right than symbols as wordpress randomizes the html code if & lt ; or & gt ; occur in it (i dunno why, but that did work in the past)

cadd/csub (conditional add / sub)
While there are conditional move instructions there are no conditional add/subtract instructions even though code like if(...) x+=... is pretty common
max/min
While there are mmx and sse maximum and minimum instruction there are sadly no normal integer equivalents
negative index addressing
addressing stuff like base[constant+index] is possible base[constant-index] is not, not only would that be usefull on its own it would also allow to subtract a left shifted register from another like: lea eax, [eax-8*ebx]
imulfix
For fixed point calculations the mul and imul instructions are very hard or slow to use as using the low 32bit result often would overflow and the high 32bit limits the coefficient to -0.5 … 0.5, a instruction which does (a*b + (1 shl 23)) shr 24 would solve this
sarr (shift arithmetic right with rounding)
Shift right is nice but often you need to do it with rounding, a (a + (0.5 shl s)) shr s would come in handy in these cases
random
a pseudo random number generator and a true random number generator would be nice
abs
While there is pabs for mmx theres no equivalent for normal integers
select
A a&c | b&(~c) would come in handy for combining bits from 2 registers
readbits
a hardware based bitstream reader which takes a base pointer, index in bits and length of bits it should read would be very usefull for many audio and video formats, bitstream&vlc decoding generally takes 1/3 of the time spend decoding
nand/nor
(a&~b) and (a|~b) and maybe others
pavgb_nornd (packed byte average without rounding)
while theres a average with rounding theres none without sadly mpeg need the later too
integer instructions like sarr and select for mmx/sse too
pcmpneq / pcmplt / pcmple / pcmpge
Just the missing compare instructions for mmx
punpckh/l0
unpack instructions which unpack the source with an implicit 0, this would avoid 1 move per 2 unpack and unpacking is very common
packsrr (pack with shift right and rounding)
like pack* but with rounding and shift right integrated
plea
A mmx/sse instruction similar to lea which does the extreemly common operation of adding a left shifted value to another, and if this isnt able to do a=2*a+-b then an additional instruction for that as it would reduce the very common a-b, a+b butterfly operation to 2 instructions instead of 3
p(add/sub/mul)unpackh/l
having to unpack 2 source operands into 4 to perform 1 operation with them sucks (=needs 6-8 instructions) so combined foobar+unpack high/low would be pretty nice, of course only if its fast though
ps(r/l)lb (packed right/left shift unsigned byte)
the missing packed byte shifts
pdabs (packed absolute difference)
absolute value of the difference of 2 values
paddsusb
add a signed into a unsiged (byte) value with unsigned saturation, usefull for brightness correcture and loop/postprocessing filters
Filed under: Optimization — Michael @ 13:16

14 Comments »

  1. i386 had “readbits” but it was undocumented and removed. :( The name of the opcode was XBTS. There was also a “putbits” opcode, “IBTS”. See the NASM manual for (incomplete) documentation on them.

    Comment by Rich — September 27, 2006 @ 19:32

  2. A few notes on substitutions/why some of these aren’t so necessary (not to say they wouldn’t still be usedul):

    cadd/csub: add/sub followed by cmov.
    max/min: isn’t there a fast formula with bitwise ops? iirc libpostproc uses it for “c mmx”.
    negative address indexing: yes it’s annoying not having it, mainly because x86 has pathetically few registers..
    imulfix: you can use the 64bit shift opcodes (present since i386) but doing the add/carry is a pain and somewhat slow.
    abs/select/nand/nor/…: easy to do with standard bitwise arithmetic.

    IMO one of the main conclusions that should be drawn from your article is that cpu designers are spending way too much time on stupid float optimizations and neglecting fast integer arithmetic. If both were given equal attention, integer/fixedpoint would be several times faster than float (and several times faster than it is now) rather than roughly the same speed like it is on modern cpus…

    Comment by Rich — September 27, 2006 @ 22:16

  3. > A few notes on substitutions/why some of these aren’t so necessary (not to say
    > they wouldn’t still be usedul):

    […]
    > max/min: isn’t there a fast formula with bitwise ops? iirc libpostproc uses
    > it for “c mmx”.

    sub a,b
    sbb c,c
    and b,c
    add a,c
    or
    cmp a,b
    cmov a,b
    works for scalars,

    but i would prefer a fast min/max instead, mmx has min/max instructions too

    […]

    > imulfix: you can use the 64bit shift opcodes (present since i386) but doing the
    > add/carry is a pain and somewhat slow.

    of course and thats the problem, its slow, the shrd, the fact that you need to use eax/edx, … which is why id like a instruction to do that all at once :)

    without that code which needs lots of multiplies and which needs 16bit or more precission is faster as floating point code instead of integer code
    with less then 16bit precission mmx can be used, but audio for example always needs more precission internally, unless you are deaf

    > abs/select/nand/nor/…: easy to do with standard bitwise arithmetic.

    yes but that wansnt my point, a cpu with just nand, ror and jz or something like that will also be able to do everything, my point was that these operations are common and simple and a cpu which could execute them directly would be faster then one which would emulate them through a sequence of other instructions

    […]

    Comment by Michael — September 27, 2006 @ 23:10

  4. Some of these can be efficiently implemented with few more instructions

    SARR
    shr eax,imm8 // stores MSB of remainder in Carry Flag
    adc eax,0 // propagates rouding

    MAX/MIN

    if cmovcc are available:

    cmp eax,ebx // sets flags
    cmovlt eax,ebx

    otherwise, use :

    max(a,b) = b + ((a-b) & ~sign-extend(a-b))

    // eax holds a,
    // ebx holds b
    sub eax,ebx // eax is (a-b)
    cdq // sign extends eax (a-b) into edx
    not edx //
    and eax,edx
    add ebx,eax

    a similar hack exists for min

    ABS

    cdq // sign extends eax into edx
    xor eax,edx // two’s complement(a) = ~a + 1
    sub eax,edx // edx is -1 if eax was negative

    NAND/NOR

    “nand” means ~(a & b), not a & ~b. It should be andn as in mmx/sse mnemonics. Also,
    it’s only two assembly instructions, not a big deal.

    Comment by Steven — January 17, 2007 @ 14:59

  5. Some of these can be efficiently implemented with few instructions

    SARR
    shr eax,imm8 // stores MSB of remainder (shifted out bits) in Carry Flag
    adc eax,0 // propagates rouding

    mmx doesn’t help at all.

    MAX/MIN

    if CMOVs are available:

    cmp eax,ebx // sets flags
    cmovlt eax,ebx // eax is max(eax,ebx)

    otherwise, use :

    max(a,b) = b – ((b-a) & sign-extend(b-a))
    = a – ((a-b) & sign-extend(a-b));
    = b + ((a-b) & ~sign-extend(a-b));

    // eax holds b,
    // ebx holds a
    sub eax,ebx // eax is (a-b)
    cdq // sign extends eax (a-b) into edx
    not edx // ~sign-extend(a-b)
    and eax,edx // (a-b) & ~sign-extend(a-b)
    add ebx,eax // ebx is max(a,b)

    a similar hack exists for min. While not ONE instruction,
    it is branchless.

    ABS
    abs is not a very common operation in general (it is for video codecs, yes,
    but not very common otherwise.)

    abs(v) = v ^ sign-extend(v) – sign-extend(v)

    cdq // sign extends eax into edx
    xor eax,edx // two’s complement(a) = ~a + 1
    sub eax,edx // edx is -1 if eax was negative

    NAND/NOR

    “nand” means ~(a & b), not a & ~b. It should be pandn as in mmx/sse mnemonics.
    And it does exist ;)

    pandn mm0,mm1 // mm0 = mm1 & ~mm0

    the or-not version should give “porn”, for which I’m sure Intel isn’t quite ready yet :)
    a | ~b is same as ~(~a & b) (de Morgan’s law)

    PSEUDORANOM

    It’s fun to think that you could write something like

    prng eax,[seed] // loads seed, generates a PRNG, writes back seed

    But the problem with pseudorandom generators is that they’re easy to get wrong. The simpler
    linear congruential generators do a moderately good job and aren’t very hard to implement

    mul eax,some_immediate_value
    add eax,some_other_immediate_value

    Hardware generators (“true” random) are also nice, but suffers the problem of
    non-reproductibility. You have to store the sequence if for some reason you want
    to reuse it (eg, decryption)

    What I rather’d like is an incremental hash instruction:

    hash edx,al // updates hash in edx with al : edx = hash(edx,al)
    hash eax // generates a hash of eax (not necessarily the same algorithm)

    not a cryptographic hash, just a good checksum / lookup-table hash function

    SSE4 is supposed to have a CRC32 instruction, which is basically the same,
    but I don’t know how it’s supposed to work.

    punpckh/l0

    I was wondering too why there’s isn’t either a zero-extend instructions or a “zero register”
    as they’re needed all the time. More specifically, why there’sn’t a “load bytes as words”

    pmovb2dzx mm0,[mem32] // loads 4 bytes as 4 zero-extended words
    pmovb2dsx mm1,[mem32] // loads 4 bytes as 4 signed-extended words

    and all other combination of bytes/word/doublewords to word/doublewords/quadwords

    pdabs
    I agree! In the mean time, it’s not very hard to implement:

    movq mm2,mm0
    pmax mm0,mm1 // mm0 = min(mm0,mm1)
    pmin mm1,mm2 // mm1 = max(mm2,mm1) (mm2 a copy of the original mm0)
    psub mm0,mm1 // abs(a-b) = max(a,b)-min(a,b)

    also works for xmm

    If abs isn’t very frequent in scalar programming (normal integer instructions)
    it’s so important in video codecs that I wonder why they haven’t put this
    instruction in mmx/sse.

    Comment by Steven — January 17, 2007 @ 16:41

  6. Hi Steven!

    > What I rather’d like is an incremental hash instruction…

    A quite good scrambler (eax=>eax) is e.g. the following sequence:

    mov edx,0x965d25db
    mul edx
    xor eax,edx

    You can make a hash function by chaining scrambled values, e.g.:

    #define scramble_magic 0x965d25db
    mov eax,value1
    mov edx,scramble_magic
    mul edx
    xor eax,edx

    add eax,value2
    mov edx,scramble_magic
    mul edx
    xor eax,edx

    To force this hash value into a given range of 0..value-1 you can then apply this:
    mul eax,value
    mov edx,eax

    Comments?

    Comment by Jasper — June 14, 2007 @ 11:26

  7. For the average command I have the following:

    Unsigned average command (a+b+1)/2, similar to PAVGB/PAVGW:

    stc
    adc eax,edx
    rcr eax,1

    Signed average command (a+b+1)/2:

    mov ecx,$80000000
    xor eax,ecx
    xor edx,ecx
    stc
    adc eax,edx
    cmc
    rcr eax,1

    Simple unsigned average command (a+b)/2:

    add eax,edx
    rcr eax,1

    Simple signed average command (a+b)/2:

    mov ecx,eax
    xor ecx,edx
    sar ecx,1
    and eax,edx
    add eax,ecx

    The signed ones are the evil ones…

    Comment by Jasper — June 14, 2007 @ 11:38

  8. I wrote “To force this hash value into a given range of 0..value-1 …”.
    When value is a power of two of course “AND eax,value-1” is sufficient.

    Comment by Jasper — June 15, 2007 @ 11:45

  9. I would like to suggest the following commands:

    Return on condition like on good old 8080/8085/Z80

    subr like the one for float; replaces sub+neg

    An enhanced version of pmovmskb acting on every bit, i.e. transposing a complete 8*8 bit matrix

    An enhanced movs command for copying columns, i.e. a replacement of the following code fragment:
    for (i=0; i

    Comment by Jasper — June 18, 2007 @ 13:32

  10. It seems that here a less sign (now replaced with .lt.) is not allowed :-(

    I would like to suggest the following commands:

    Return on condition like on good old 8080/8085/Z80

    subr like the one for float; replaces sub+neg

    An enhanced version of pmovmskb acting on every bit, i.e. transposing a complete 8*8 bit matrix

    An enhanced movs command for copying columns, i.e. a replacement of the following code fragment:
    for (i=0; i .lt. count; i++) MoveMem(target[i*target_width],source[i*source_width],len);
    With this command one might as well fill a region of memory with a pattern (source_width==0; target_width==len) such as RGB values (len==3).
    Other uses include transposing matrixes (one line or column).
    This command would use whopping 6 parameters (e.g. eax,edx,ecx,ebx,esi,edi)…
    In our sources we often use this command currently implemented as a procedure.

    Well, and of course a tidied up version of movsb which should replace all the *** MoveMem routines (fast, automatic copy direction). Why should a programmer deal with all those stuff like alignment, moving 4 bytes as a dword etc.? Analog for stosb (see also paragraph above).

    Comment by Jasper — June 18, 2007 @ 13:33

  11. > It seems that here a less sign (now replaced with .lt.) is not allowed

    it is < as you can see :)
    if you want to know how, read:
    http://www.w3.org/TR/html401/html40.txt

    Comment by Michael — June 18, 2007 @ 19:42

  12. How about pushea, i.e. push effective address, a combination of lea and push without an auxiliary register? The NS32xxx has this command.

    Comment by Jasper — July 12, 2007 @ 7:00

  13. How about an enhanced setxx? Currently only a byte can be set to 0 (false) or 1 (true).
    Very often a word or dword is needed.
    The alternative value -1 for true is also often needed.
    And finally it sometime helps if the carry flag is set when the condition is met.
    These 3 enhancements would need 3 bits which are currently unused (i.e. ignored) in the setxx command’s reg field.

    Comment by Jasper — June 17, 2009 @ 8:12

  14. …and the enhanced setxx could also perform and/or/xor/add/sub/not_and/not_or in addition to store.
    sgn(eax) could be written as
    test eax,eax
    setg eax
    setl eax,sub

    sbb eax,eax could be substitutes as
    setc eax,-1

    …but now I need 6 bits…
    …one bit for creating the carry flag could be saved if this is to be done when esp is set since it normally makes no sense to do so.

    Comment by Jasper — June 24, 2009 @ 7:08

RSS feed for comments on this post. TrackBack URI

Leave a comment

Powered by WordPress