Lair Of The Multimedia Guru

2005-11-22

Fast integer square root

How to calculate the square root of an integer? (unsigned int)sqrt(x) sure but that is slow on many CPUs (~127 cpu cycles on a P3) it also needs a FPU, which may or may not be a problem, how can we do this quicker? lets see

one obvious way is to use bisection:

static inline int sqrt1(int a){
    int ret=0;
    int s;
    int ret_sq=-a-1;
    for(s=30; s>=0; s-=2){
        int b;
        ret+= ret; 
        b=ret_sq + ((2*ret+1)<<s);
        if(b<0){
            ret_sq=b;
            ret++;
        }
    }
    return ret;
}

or variations, but these at least in my tests are slightly slower (~10 cpu cycles) then simply using sqrt(), one way to significantly speed this up if small numbers are more common is to check the size of the number and set the initial s accordingly, still it wont beat the later methods, we can also write this by hand and avoid the terrible mess gcc generates:

    asm volatile(
        "xorl   %0, %0          \n\t"
        "xorl   %%ebx, %%ebx    \n\t"
        "incl   %1              \n\t"
#define SQRT_CORE(s) \
        "leal   1(,%0,4), %%edx \n\t"\
        "shll   $" #s ", %%edx  \n\t"\
        "addl   %%ebx, %%edx    \n\t"\
        "cmpl   %1, %%edx       \n\t"\
        "cmovcl %%edx, %%ebx    \n\t"\
        "adcl   %0, %0          \n\t"
        SQRT_CORE(30)
        SQRT_CORE(28)
        SQRT_CORE(26)
        SQRT_CORE(24)
        SQRT_CORE(22)
        SQRT_CORE(20)
        SQRT_CORE(18)
        SQRT_CORE(16)
        SQRT_CORE(14)
        SQRT_CORE(12)
        SQRT_CORE(10)
        SQRT_CORE(8)
        SQRT_CORE(6)
        SQRT_CORE(4)
        SQRT_CORE(2)
        SQRT_CORE(0)
        : "=&a" (ret), "+r" (a)
        : 
        : "%ebx", "%edx"
    );

which needs just ~102 cpu cycles on my P3, the trick with smaller initial s could also easily be added here, but lets forget using bisection and try something else, after all no matter what we do for a 16bit result we need to do 16checks with bisection which is not exactly fast

Look up tables and refine:

for(i=0; i<=256; i++){
    sqrt_tab[i]= i==255 ? 255 : rintf(sqrt(i)*16+0.3);
}

static inline unsigned int sqrt3(unsigned int a)
{
    unsigned int b;

    if(a<(1<<16)){
        if(a<(1<<10)-3)      b=sqrt_tab[(a+ 3)>>2 ]>>3;
        else{
            if(a<(1<<14)-28) b=sqrt_tab[(a+28)>>6 ]>>1;
            else             b=sqrt_tab[ a    >>8 ];
        }
    }else{
        if(a<(1<<24)){
            if(a<(1<<20)){
                b=sqrt_tab[a>>12];
                b= (FASTDIV(a,b)>>3) + (b<<1);
            }else{
                b=sqrt_tab[a>>16];
                b= (FASTDIV(a,b)>>5) + (b<<3);
            }
        }else{
            if(a<(1<<28)){
                if(a<(1<<26)){
                    b=sqrt_tab[a>>18];
                    b= (FASTDIV(a,b)>>6) + (b<<4);
                }else{
                    b=sqrt_tab[a>>20];
                    b= (FASTDIV(a,b)>>7) + (b<<5);
                }
            }else{
                if(a<(1<<30)){
                    b=sqrt_tab[a>>22];
                    b= (FASTDIV(a,b)>>8) + (b<<6);
                }else{
                    b=sqrt_tab[a>>24];
                    b= (FASTDIV(a,b)>>9) + (b<<7);
                }
            }
        }
    }

    if(a<b*b) b--;

    return b;
}

FASTDIV() is from ffmpeg, and performs a quick x/8bit division by using a LUT and a multiplication, the whole sqrt needs just ~23 cpu cycles
if you think thats not obfuscated enough, then a quick search with google will help:

    int isqrt (long r) {
        float tempf, x, y, rr;
        int is;

        rr = (long) r;
        y = rr*0.5;
        *(unsigned long *) &tempf = (0xbe6f0000 - *(unsigned long *) &rr) >> 1;
        x = tempf;
        x = (1.5*x) - (x*x)*(x*y);
        if (r > 101123) x = (1.5*x) - (x*x)*(x*y);
        is = (int) (x*rr + 0.5);
        return is + ((signed int) (r - is*is)) >> 31;
    }

from Paul Hsieh’s Square Root page has a missing () in the last line, after thats fixed it still gives wrong values for some input (0-4), and needs ~93 cpu cycles

Filed under: Optimization — Michael @ 17:21

3 Comments »

  1. this doesnt fix my problems

    Comment by ugubuga — 2007-03-02 @ 00:30

  2. And what about Newton’s iteration?

    Comment by r0b0t — 2008-03-10 @ 23:27

  3. Check out quake’s fast inverse sqrt. Here is a snippet:

    float InvSqrt(float x){
    float xhalf = 0.5f * x;
    int i = *(int*)&x; // store floating-point bits in integer
    i = 0x5f3759d5 – (i >> 1); // initial guess for Newton’s method
    x = *(float*)&i; // convert new bits into float
    x = x*(1.5f – xhalf*x*x); // One round of Newton’s method
    return x;
    }

    Comment by ilyes — 2008-03-28 @ 15:35

RSS feed for comments on this post. TrackBack URI

Leave a comment

Powered by WordPress