Lair Of The Multimedia Guru

2006-09-27

X86 assembly instructions you always wanted but intel didnt give them to you

Everyone who wrote assembly code probably has cursed the lack of some instuctions in the instrcution set of their target architecture, be it due to the work needed to workaround it or the complexity and consequently speed loss of the resulting code

Heres a list of what i think is lacking in the x86 instruction set …

btw, i had to use shl and shr instead of 2 less than and 2 right than symbols as wordpress randomizes the html code if & lt ; or & gt ; occur in it (i dunno why, but that did work in the past)

cadd/csub (conditional add / sub): While there are conditional move instructions there are no conditional add/subtract instructions even though code like if(...) x+=... is pretty common
max/min: While there are mmx and sse maximum and minimum instruction there are sadly no normal integer equivalents
negative index addressing: addressing stuff like base[constant+index] is possible base[constant-index] is not, not only would that be usefull on its own it would also allow to subtract a left shifted register from another like: lea eax, [eax-8*ebx]
imulfix: For fixed point calculations the mul and imul instructions are very hard or slow to use as using the low 32bit result often would overflow and the high 32bit limits the coefficient to -0.5 … 0.5, a instruction which does (a*b + (1 shl 23)) shr 24 would solve this
sarr (shift arithmetic right with rounding): Shift right is nice but often you need to do it with rounding, a (a + (0.5 shl s)) shr s would come in handy in these cases
random: a pseudo random number generator and a true random number generator would be nice
abs: While there is pabs for mmx theres no equivalent for normal integers
select: A a&c | b&(~c) would come in handy for combining bits from 2 registers
readbits: a hardware based bitstream reader which takes a base pointer, index in bits and length of bits it should read would be very usefull for many audio and video formats, bitstream&vlc decoding generally takes 1/3 of the time spend decoding
nand/nor: (a&~b) and (a|~b) and maybe others
pavgb_nornd (packed byte average without rounding): while theres a average with rounding theres none without sadly mpeg need the later too
integer instructions like sarr and select for mmx/sse too: …
pcmpneq / pcmplt / pcmple / pcmpge: Just the missing compare instructions for mmx
punpckh/l0: unpack instructions which unpack the source with an implicit 0, this would avoid 1 move per 2 unpack and unpacking is very common
packsrr (pack with shift right and rounding): like pack* but with rounding and shift right integrated
plea: A mmx/sse instruction similar to lea which does the extreemly common operation of adding a left shifted value to another, and if this isnt able to do a=2*a+-b then an additional instruction for that as it would reduce the very common a-b, a+b butterfly operation to 2 instructions instead of 3
p(add/sub/mul)unpackh/l: having to unpack 2 source operands into 4 to perform 1 operation with them sucks (=needs 6-8 instructions) so combined foobar+unpack high/low would be pretty nice, of course only if its fast though
ps(r/l)lb (packed right/left shift unsigned byte): the missing packed byte shifts
pdabs (packed absolute difference): absolute value of the difference of 2 values
paddsusb: add a signed into a unsiged (byte) value with unsigned saturation, usefull for brightness correcture and loop/postprocessing filters

Filed under: Optimization — Michael @ 13:16

Comments (14)

14 Comments »

i386 had “readbits” but it was undocumented and removed. :( The name of the opcode was XBTS. There was also a “putbits” opcode, “IBTS”. See the NASM manual for (incomplete) documentation on them.

Comment by Rich — 2006-09-27 @ 19:32
A few notes on substitutions/why some of these aren’t so necessary (not to say they wouldn’t still be usedul):

cadd/csub: add/sub followed by cmov.
max/min: isn’t there a fast formula with bitwise ops? iirc libpostproc uses it for “c mmx”.
negative address indexing: yes it’s annoying not having it, mainly because x86 has pathetically few registers..
imulfix: you can use the 64bit shift opcodes (present since i386) but doing the add/carry is a pain and somewhat slow.
abs/select/nand/nor/…: easy to do with standard bitwise arithmetic.

IMO one of the main conclusions that should be drawn from your article is that cpu designers are spending way too much time on stupid float optimizations and neglecting fast integer arithmetic. If both were given equal attention, integer/fixedpoint would be several times faster than float (and several times faster than it is now) rather than roughly the same speed like it is on modern cpus…

Comment by Rich — 2006-09-27 @ 22:16
> A few notes on substitutions/why some of these arenâ€™t so necessary (not to say
> they wouldnâ€™t still be usedul):

[…]
> max/min: isnâ€™t there a fast formula with bitwise ops? iirc libpostproc uses
> it for â€œc mmxâ€.

sub a,b
sbb c,c
and b,c
add a,c
or
cmp a,b
cmov a,b
works for scalars,

but i would prefer a fast min/max instead, mmx has min/max instructions too

[…]

> imulfix: you can use the 64bit shift opcodes (present since i386) but doing the
> add/carry is a pain and somewhat slow.

of course and thats the problem, its slow, the shrd, the fact that you need to use eax/edx, … which is why id like a instruction to do that all at once :)

without that code which needs lots of multiplies and which needs 16bit or more precission is faster as floating point code instead of integer code
with less then 16bit precission mmx can be used, but audio for example always needs more precission internally, unless you are deaf

> abs/select/nand/nor/â€¦: easy to do with standard bitwise arithmetic.

yes but that wansnt my point, a cpu with just nand, ror and jz or something like that will also be able to do everything, my point was that these operations are common and simple and a cpu which could execute them directly would be faster then one which would emulate them through a sequence of other instructions

[…]

Comment by Michael — 2006-09-27 @ 23:10
Some of these can be efficiently implemented with few more instructions

SARR
shr eax,imm8 // stores MSB of remainder in Carry Flag
adc eax,0 // propagates rouding

MAX/MIN

if cmovcc are available:

cmp eax,ebx // sets flags
cmovlt eax,ebx

otherwise, use :

max(a,b) = b + ((a-b) & ~sign-extend(a-b))

// eax holds a,
// ebx holds b
sub eax,ebx // eax is (a-b)
cdq // sign extends eax (a-b) into edx
not edx //
and eax,edx
add ebx,eax

a similar hack exists for min

ABS

cdq // sign extends eax into edx
xor eax,edx // two’s complement(a) = ~a + 1
sub eax,edx // edx is -1 if eax was negative

NAND/NOR

“nand” means ~(a & b), not a & ~b. It should be andn as in mmx/sse mnemonics. Also,
it’s only two assembly instructions, not a big deal.

Comment by Steven — 2007-01-17 @ 14:59
Some of these can be efficiently implemented with few instructions

SARR
shr eax,imm8 // stores MSB of remainder (shifted out bits) in Carry Flag
adc eax,0 // propagates rouding

mmx doesn’t help at all.

MAX/MIN

if CMOVs are available:

cmp eax,ebx // sets flags
cmovlt eax,ebx // eax is max(eax,ebx)

otherwise, use :

max(a,b) = b – ((b-a) & sign-extend(b-a))
= a – ((a-b) & sign-extend(a-b));
= b + ((a-b) & ~sign-extend(a-b));

// eax holds b,
// ebx holds a
sub eax,ebx // eax is (a-b)
cdq // sign extends eax (a-b) into edx
not edx // ~sign-extend(a-b)
and eax,edx // (a-b) & ~sign-extend(a-b)
add ebx,eax // ebx is max(a,b)

a similar hack exists for min. While not ONE instruction,
it is branchless.

ABS
abs is not a very common operation in general (it is for video codecs, yes,
but not very common otherwise.)

abs(v) = v ^ sign-extend(v) – sign-extend(v)

cdq // sign extends eax into edx
xor eax,edx // two’s complement(a) = ~a + 1
sub eax,edx // edx is -1 if eax was negative

NAND/NOR

“nand” means ~(a & b), not a & ~b. It should be pandn as in mmx/sse mnemonics.
And it does exist ;)

pandn mm0,mm1 // mm0 = mm1 & ~mm0

the or-not version should give “porn”, for which I’m sure Intel isn’t quite ready yet :)
a | ~b is same as ~(~a & b) (de Morgan’s law)

PSEUDORANOM

It’s fun to think that you could write something like

prng eax,[seed] // loads seed, generates a PRNG, writes back seed

But the problem with pseudorandom generators is that they’re easy to get wrong. The simpler
linear congruential generators do a moderately good job and aren’t very hard to implement

mul eax,some_immediate_value
add eax,some_other_immediate_value

Hardware generators (“true” random) are also nice, but suffers the problem of
non-reproductibility. You have to store the sequence if for some reason you want
to reuse it (eg, decryption)

What I rather’d like is an incremental hash instruction:

hash edx,al // updates hash in edx with al : edx = hash(edx,al)
hash eax // generates a hash of eax (not necessarily the same algorithm)

not a cryptographic hash, just a good checksum / lookup-table hash function

SSE4 is supposed to have a CRC32 instruction, which is basically the same,
but I don’t know how it’s supposed to work.

punpckh/l0

I was wondering too why there’s isn’t either a zero-extend instructions or a “zero register”
as they’re needed all the time. More specifically, why there’sn’t a “load bytes as words”

pmovb2dzx mm0,[mem32] // loads 4 bytes as 4 zero-extended words
pmovb2dsx mm1,[mem32] // loads 4 bytes as 4 signed-extended words

and all other combination of bytes/word/doublewords to word/doublewords/quadwords

pdabs
I agree! In the mean time, it’s not very hard to implement:

movq mm2,mm0
pmax mm0,mm1 // mm0 = min(mm0,mm1)
pmin mm1,mm2 // mm1 = max(mm2,mm1) (mm2 a copy of the original mm0)
psub mm0,mm1 // abs(a-b) = max(a,b)-min(a,b)

also works for xmm

If abs isn’t very frequent in scalar programming (normal integer instructions)
it’s so important in video codecs that I wonder why they haven’t put this
instruction in mmx/sse.

Comment by Steven — 2007-01-17 @ 16:41
Hi Steven!

> What I ratherâ€™d like is an incremental hash instruction…

A quite good scrambler (eax=>eax) is e.g. the following sequence:

mov edx,0x965d25db
mul edx
xor eax,edx

You can make a hash function by chaining scrambled values, e.g.:

#define scramble_magic 0x965d25db
mov eax,value1
mov edx,scramble_magic
mul edx
xor eax,edx

add eax,value2
mov edx,scramble_magic
mul edx
xor eax,edx
…

To force this hash value into a given range of 0..value-1 you can then apply this:
mul eax,value
mov edx,eax

Comments?

Comment by Jasper — 2007-06-14 @ 11:26
For the average command I have the following:

Unsigned average command (a+b+1)/2, similar to PAVGB/PAVGW:

stc
adc eax,edx
rcr eax,1

Signed average command (a+b+1)/2:

mov ecx,$80000000
xor eax,ecx
xor edx,ecx
stc
adc eax,edx
cmc
rcr eax,1

Simple unsigned average command (a+b)/2:

add eax,edx
rcr eax,1

Simple signed average command (a+b)/2:

mov ecx,eax
xor ecx,edx
sar ecx,1
and eax,edx
add eax,ecx

The signed ones are the evil ones…

Comment by Jasper — 2007-06-14 @ 11:38
I wrote “To force this hash value into a given range of 0..value-1 …”.
When value is a power of two of course “AND eax,value-1” is sufficient.

Comment by Jasper — 2007-06-15 @ 11:45
I would like to suggest the following commands:

Return on condition like on good old 8080/8085/Z80

subr like the one for float; replaces sub+neg

An enhanced version of pmovmskb acting on every bit, i.e. transposing a complete 8*8 bit matrix

An enhanced movs command for copying columns, i.e. a replacement of the following code fragment:
for (i=0; i

Comment by Jasper — 2007-06-18 @ 13:32
It seems that here a less sign (now replaced with .lt.) is not allowed :-(

I would like to suggest the following commands:

Return on condition like on good old 8080/8085/Z80

subr like the one for float; replaces sub+neg

An enhanced version of pmovmskb acting on every bit, i.e. transposing a complete 8*8 bit matrix

An enhanced movs command for copying columns, i.e. a replacement of the following code fragment:
for (i=0; i .lt. count; i++) MoveMem(target[i*target_width],source[i*source_width],len);
With this command one might as well fill a region of memory with a pattern (source_width==0; target_width==len) such as RGB values (len==3).
Other uses include transposing matrixes (one line or column).
This command would use whopping 6 parameters (e.g. eax,edx,ecx,ebx,esi,edi)…
In our sources we often use this command currently implemented as a procedure.

Well, and of course a tidied up version of movsb which should replace all the *** MoveMem routines (fast, automatic copy direction). Why should a programmer deal with all those stuff like alignment, moving 4 bytes as a dword etc.? Analog for stosb (see also paragraph above).

Comment by Jasper — 2007-06-18 @ 13:33
> It seems that here a less sign (now replaced with .lt.) is not allowed

it is < as you can see :)
if you want to know how, read:
http://www.w3.org/TR/html401/html40.txt

Comment by Michael — 2007-06-18 @ 19:42
How about pushea, i.e. push effective address, a combination of lea and push without an auxiliary register? The NS32xxx has this command.

Comment by Jasper — 2007-07-12 @ 07:00
How about an enhanced setxx? Currently only a byte can be set to 0 (false) or 1 (true).
Very often a word or dword is needed.
The alternative value -1 for true is also often needed.
And finally it sometime helps if the carry flag is set when the condition is met.
These 3 enhancements would need 3 bits which are currently unused (i.e. ignored) in the setxx command’s reg field.

Comment by Jasper — 2009-06-17 @ 08:12
…and the enhanced setxx could also perform and/or/xor/add/sub/not_and/not_or in addition to store.
sgn(eax) could be written as
test eax,eax
setg eax
setl eax,sub

sbb eax,eax could be substitutes as
setc eax,-1

…but now I need 6 bits…
…one bit for creating the carry flag could be saved if this is to be done when esp is set since it normally makes no sense to do so.

Comment by Jasper — 2009-06-24 @ 07:08

RSS feed for comments on this post. TrackBack URI

2006-09-27

X86 assembly instructions you always wanted but intel didnt give them to you

14 Comments »

Leave a comment

Links