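Counting the number of 1 bits in a word is sometimes called a "population count".
The examples here use a 32-bit int.
The most naive method: test the lowest bit, add it to the count, then do an unsigned right shift, repeating until the word is zero.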

    int bit_count(unsigned int i)
    {
        int count = 0;

        while (i) {
            count += i & 0x00000001;   /* add the lowest bit to the count */
            i >>= 1;                   /* logical shift, because i is unsigned */
        }

        return count;
    }

This algorithm is O(log i): the loop runs once for every bit up to the highest set bit, so the worst case is 32 iterations.
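An O(m) algorithm, where m is the number of 1 bits
Computing x-1 turns the rightmost 1 in the binary representation of x into 0 and every 0 to its right into 1, so x&(x-1) clears the rightmost 1 of x. This gives: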

    /* The parameter is unsigned so that i-1 cannot overflow
       (with a signed int, INT_MIN-1 would be undefined behavior). */
    int bit_count2(unsigned int i)
    {
        int count = 0;

        while (i) {
            count++;
            i &= i - 1;   /* clear the rightmost 1 bit */
        }

        return count;
    }

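The complexity is O(m), which pays off most when the word is sparsely populated. If the word is densely populated, count the 0 bits instead and subtract that count from 32.
Counting 0 bits is symmetric with counting 1 bits: x+1 turns the rightmost 0 into 1 and every 1 to its right into 0, so x|(x+1) sets the rightmost 0 of x and thereby removes it from the zeros still to be counted. The loop condition becomes a test for whether the value has reached -1 (all ones).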
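The zero-counting variant described above can be sketched as follows. The function name bit_count_dense is hypothetical, and the sketch assumes a 32-bit word, expressed here with uint32_t so that the all-ones test and the wrap-around of i+1 are well defined.

```c
#include <stdint.h>

/* Hypothetical helper: counts the 0 bits of a 32-bit word by repeatedly
   setting the rightmost 0, then returns 32 minus that count. */
int bit_count_dense(uint32_t i)
{
    int zeros = 0;

    while (i != 0xFFFFFFFFu) {   /* all ones corresponds to -1 in the text */
        zeros++;
        i |= i + 1;              /* set the rightmost 0 bit */
    }

    return 32 - zeros;
}
```

The loop runs once per 0 bit, so a word such as 0xFFFFFFFE needs a single iteration where bit_count2 would need 31.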
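Divide and conquer
The O(m) algorithm above is already clever, but there is an algorithm that does better in the general case.
First compute the sum of each pair of adjacent bits and store the result in those two bits. Then compute the sum of each pair of adjacent 2-bit fields and store it in the corresponding 4 bits, and so on. For example: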
1 0 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
0 1|1 0|1 0|0 0|0 1|0 1|0 0|1 0|0 1|1 0|1 0|0 1|1 0|1 0|1 0|1 0
0 0 1 1|0 0 1 0|0 0 1 0|0 0 1 0|0 0 1 1|0 0 1 1|0 1 0 0|0 1 0 0
0 0 0 0 0 1 0 1|0 0 0 0 0 1 0 0|0 0 0 0 0 1 1 0|0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1|0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1
The final value is the number of 1 bits, 23 in the example above.
Each step can be implemented with a mask, a shift, and an addition.

    int bit_count3(int i)
    {
        /* Right-shifting a negative int is implementation-defined in C;
           with an arithmetic shift the masks discard the copied sign bits.
           An unsigned parameter would be the portable choice. */
        i=(i&0x55555555)+((i>>1)&0x55555555);    /* 16 sums of adjacent bits   */
        i=(i&0x33333333)+((i>>2)&0x33333333);    /* 8 sums of 2-bit fields     */
        i=(i&0x0f0f0f0f)+((i>>4)&0x0f0f0f0f);    /* 4 sums of 4-bit fields     */
        i=(i&0x00ff00ff)+((i>>8)&0x00ff00ff);    /* 2 sums of 8-bit fields     */
        i=(i&0x0000ffff)+((i>>16)&0x0000ffff);   /* final sum of 16-bit fields */

        return i;
    }

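To tie the worked example to the code, here is a small tracing sketch. The function bit_count3_trace and its out parameter are hypothetical additions, not part of the original post; it performs the same five steps on a uint32_t and records the word after each one.

```c
#include <stdint.h>

/* Hypothetical tracing variant of bit_count3: stores the word after each
   of the five steps in out[0..4] and returns the final count. */
int bit_count3_trace(uint32_t i, uint32_t out[5])
{
    i = (i & 0x55555555u) + ((i >> 1)  & 0x55555555u);  out[0] = i;
    i = (i & 0x33333333u) + ((i >> 2)  & 0x33333333u);  out[1] = i;
    i = (i & 0x0f0f0f0fu) + ((i >> 4)  & 0x0f0f0f0fu);  out[2] = i;
    i = (i & 0x00ff00ffu) + ((i >> 8)  & 0x00ff00ffu);  out[3] = i;
    i = (i & 0x0000ffffu) + ((i >> 16) & 0x0000ffffu);  out[4] = i;

    return (int)i;
}
```

The 32-bit word in the example is 0xBC637DFF. Passing it in should fill out with 0x685269AA, 0x32223344, 0x05040608, 0x0009000E and 0x00000017, which are the five rows below the input word written in hexadecimal.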
Here is the disassembly of the first step of bit_count3, as produced by VC6.0:

    i=(i&0x55555555)+((i>>1)&0x55555555);
    00401108 8B 45 08             mov eax,dword ptr [ebp+8]
    0040110B 25 55 55 55 55       and eax,55555555h
    00401110 8B 4D 08             mov ecx,dword ptr [ebp+8]
    00401113 D1 F9                sar ecx,1
    00401115 81 E1 55 55 55 55    and ecx,55555555h
    0040111B 03 C1                add eax,ecx
    0040111D 89 45 08             mov dword ptr [ebp+8],eax

Each step takes 7 instructions, so the five steps take 35 instructions in total.
Here is the disassembly of bit_count2:

    int count=0;
    004010A8 C7 45 FC 00 00 00 00 mov dword ptr [ebp-4],0

    while(i){
    004010AF 83 7D 08 00          cmp dword ptr [ebp+8],0
    004010B3 74 19                je bit_count2+3Eh (004010ce)
    count++;
    004010B5 8B 45 FC             mov eax,dword ptr [ebp-4]
    004010B8 83 C0 01             add eax,1
    004010BB 89 45 FC             mov dword ptr [ebp-4],eax
    i&=i-1;
    004010BE 8B 4D 08             mov ecx,dword ptr [ebp+8]
    004010C1 83 E9 01             sub ecx,1
    004010C4 8B 55 08             mov edx,dword ptr [ebp+8]
    004010C7 23 D1                and edx,ecx
    004010C9 89 55 08             mov dword ptr [ebp+8],edx
    }
    004010CC EB E1                jmp bit_count2+1Fh (004010af)

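Each loop iteration takes 11 instructions (counting the cmp/je test and the closing jmp), so by instruction count bit_count2 is already no better than the 35 instructions of bit_count3 once three bits are set, and clearly worse from four bits on.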
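Optimization never ends
A closer look at bit_count3 shows that there is still plenty to optimize.
First, there is a formula for the sum of the binary digits of a number: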
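pop(x) = x - (x>>1) - (x>>2) - ... - (x>>31)
Proof: for a 32-bit unsigned int x, bit i of the binary representation can be written as
b_i = (x>>i) - ((x>>(i+1))<<1)
Summing b_i for i from 0 to 31, and treating x>>32 as 0, gives the formula.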
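The summation step can be written out as follows. This is a sketch in which the floor of x/2^i stands for x>>i, and the term for i = 32 vanishes because x < 2^32:

```latex
\begin{aligned}
\sum_{i=0}^{31} b_i
  &= \sum_{i=0}^{31} \left\lfloor \frac{x}{2^{i}} \right\rfloor
     - 2 \sum_{i=0}^{31} \left\lfloor \frac{x}{2^{i+1}} \right\rfloor \\
  &= \sum_{i=0}^{31} \left\lfloor \frac{x}{2^{i}} \right\rfloor
     - 2 \sum_{i=1}^{32} \left\lfloor \frac{x}{2^{i}} \right\rfloor \\
  &= x - \sum_{i=1}^{31} \left\lfloor \frac{x}{2^{i}} \right\rfloor .
\end{aligned}
```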
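The number of 1 bits is exactly the sum of the binary digits. In general, evaluating the formula above is less efficient than bit_count3. For a 2-bit number, however, the count is simply x-(x>>1), and this can be used to optimize bit_count3, whose first step computes the digit sum of each pair of adjacent bits.
i=(i&0x55555555)+((i>>1)&0x55555555);
can therefore be rewritten as
i=i-((i>>1)&0x55555555);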
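As a quick check that the two forms of the first step agree, here is a minimal sketch. The function names pair_sums_masked and pair_sums_sub are hypothetical, and uint32_t is used so that the shift is a logical one.

```c
#include <stdint.h>

/* First step as in bit_count3: two masks and an addition. */
uint32_t pair_sums_masked(uint32_t i)
{
    return (i & 0x55555555u) + ((i >> 1) & 0x55555555u);
}

/* Rewritten first step: one mask and a subtraction.
   For a 2-bit field with value 2a+b this computes (2a+b) - a = a + b. */
uint32_t pair_sums_sub(uint32_t i)
{
    return i - ((i >> 1) & 0x55555555u);
}
```

Within each 2-bit field the subtracted value a never exceeds the field value 2a+b, so no borrow crosses a field boundary and both functions return the same word for every input.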
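In addition, when an addition cannot carry into the neighbouring field, the operands do not need to be masked before they are added.
Two 2-bit fields: each holds at most 10 (binary), and 10+10 = 100 carries out of the 2-bit field, so the masks in the second step cannot be dropped.
Two 4-bit fields: each holds at most 100, and 100+100 = 1000 still fits in 4 bits, so the third step can add first; but because the fourth step adds 8-bit fields, the sum still has to be masked with 0x0f0f0f0f afterwards.
Two 8-bit fields: each holds at most 1000, and 1000+1000 = 10000 fits in 8 bits, so the masks can be dropped; the leftover bits do not affect the result, so no mask is needed at all.
The 16-bit step works the same way.
The count never exceeds 32, so the final result only needs to be masked with 0x0000003F.
This produces bit_count4: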

    int bit_count4(int i)
    {
        /* As in bit_count3, the signed >> relies on the masks to drop any
           copied sign bits; after the third step i is non-negative. */
        i-=((i>>1)&0x55555555);                  /* 2-bit sums via x-(x>>1)      */
        i=(i&0x33333333)+((i>>2)&0x33333333);    /* 4-bit sums, masks required   */
        i=(i+(i>>4))&0x0f0f0f0f;                 /* 8-bit sums, one mask after   */
        i+=i>>8;                                 /* no mask: garbage stays high  */
        i+=i>>16;
        return i&0x0000003f;                     /* the count fits in 6 bits     */
    }

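The carry argument behind bit_count4 can be checked on concrete values. The sketch below uses hypothetical helper names and compares "mask both operands, then add" with "add, then mask once" for the second and third steps.

```c
#include <stdint.h>

/* Second step, masks applied before the addition (correct). */
uint32_t step2_masked(uint32_t i)
{
    return (i & 0x33333333u) + ((i >> 2) & 0x33333333u);
}

/* Second step with a single mask after the addition: a sum of 10+10
   carries out of its 2-bit field and corrupts the neighbouring sum. */
uint32_t step2_add_first(uint32_t i)
{
    return (i + (i >> 2)) & 0x33333333u;
}

/* Third step, masks applied before the addition. */
uint32_t step3_masked(uint32_t i)
{
    return (i & 0x0f0f0f0fu) + ((i >> 4) & 0x0f0f0f0fu);
}

/* Third step with a single mask after the addition: 100+100 = 1000
   still fits in 4 bits, so the result is the same. */
uint32_t step3_add_first(uint32_t i)
{
    return (i + (i >> 4)) & 0x0f0f0f0fu;
}
```

For the all-ones word, every 2-bit field holds 2 after the first step (0xAAAAAAAA). The two versions of the second step then disagree (0x44444444 against 0x11111110), while both versions of the third step turn 0x44444444 into 0x08080808.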
The disassembly of bit_count4 is as follows:

    i-=((i>>1)&0x55555555);
    004011D8 8B 45 08             mov eax,dword ptr [ebp+8]
    004011DB D1 F8                sar eax,1
    004011DD 25 55 55 55 55       and eax,55555555h
    004011E2 8B 4D 08             mov ecx,dword ptr [ebp+8]
    004011E5 2B C8                sub ecx,eax
    004011E7 89 4D 08             mov dword ptr [ebp+8],ecx
    i=(i&0x33333333)+((i>>2)&0x33333333);
    004011EA 8B 55 08             mov edx,dword ptr [ebp+8]
    004011ED 81 E2 33 33 33 33    and edx,33333333h
    004011F3 8B 45 08             mov eax,dword ptr [ebp+8]
    004011F6 C1 F8 02             sar eax,2
    004011F9 25 33 33 33 33       and eax,33333333h
    004011FE 03 D0                add edx,eax
    00401200 89 55 08             mov dword ptr [ebp+8],edx
    i=(i+(i>>4))&0x0f0f0f0f;
    00401203 8B 4D 08             mov ecx,dword ptr [ebp+8]
    00401206 C1 F9 04             sar ecx,4
    00401209 8B 55 08             mov edx,dword ptr [ebp+8]
    0040120C 03 D1                add edx,ecx
    0040120E 81 E2 0F 0F 0F 0F    and edx,0F0F0F0Fh
    00401214 89 55 08             mov dword ptr [ebp+8],edx
    i+=i>>8;
    00401217 8B 45 08             mov eax,dword ptr [ebp+8]
    0040121A C1 F8 08             sar eax,8
    0040121D 8B 4D 08             mov ecx,dword ptr [ebp+8]
    00401220 03 C8                add ecx,eax
    00401222 89 4D 08             mov dword ptr [ebp+8],ecx
    i+=i>>16;
    00401225 8B 55 08             mov edx,dword ptr [ebp+8]
    00401228 C1 FA 10             sar edx,10h
    0040122B 8B 45 08             mov eax,dword ptr [ebp+8]
    0040122E 03 C2                add eax,edx
    00401230 89 45 08             mov dword ptr [ebp+8],eax
    return i&0x0000003f;
    00401233 8B 45 08             mov eax,dword ptr [ebp+8]
    00401236 83 E0 3F             and eax,3Fh

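This takes 31 instructions.
Suppose instead that we compute the digit sum of each 4-bit group directly,
so that the first two steps become
i-=((i>>1)&0x77777777)+((i>>2)&0x33333333)+((i>>3)&0x11111111);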

    004011A8 8B 45 08             mov eax,dword ptr [ebp+8]
    004011AB D1 E8                shr eax,1
    004011AD 25 77 77 77 77       and eax,77777777h
    004011B2 8B 4D 08             mov ecx,dword ptr [ebp+8]
    004011B5 C1 E9 02             shr ecx,2
    004011B8 81 E1 33 33 33 33    and ecx,33333333h
    004011BE 03 C1                add eax,ecx
    004011C0 8B 55 08             mov edx,dword ptr [ebp+8]
    004011C3 C1 EA 03             shr edx,3
    004011C6 81 E2 11 11 11 11    and edx,11111111h
    004011CC 03 C2                add eax,edx
    004011CE 8B 4D 08             mov ecx,dword ptr [ebp+8]
    004011D1 2B C8                sub ecx,eax
    004011D3 89 4D 08             mov dword ptr [ebp+8],ecx

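This does not really help: these 14 instructions replace the 13 instructions of the first two steps of bit_count4. Written directly in assembly, the efficiency could probably still be improved.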
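For reference, here is how the 4-bit first step might fit into a complete function. This is only a sketch: the name bit_count_nibble is hypothetical, and the remaining steps are assumed to be the same as the last three steps of bit_count4.

```c
#include <stdint.h>

/* Hypothetical variant: computes each 4-bit digit sum in one statement
   using x - (x>>1) - (x>>2) - (x>>3) per nibble, then folds as before. */
int bit_count_nibble(uint32_t i)
{
    i -= ((i >> 1) & 0x77777777u)
       + ((i >> 2) & 0x33333333u)
       + ((i >> 3) & 0x11111111u);      /* each nibble now holds 0..4 */
    i = (i + (i >> 4)) & 0x0f0f0f0fu;   /* each byte now holds 0..8   */
    i += i >> 8;
    i += i >> 16;
    return (int)(i & 0x3fu);
}
```

The masks 0x77777777, 0x33333333 and 0x11111111 keep the shifted bits of one nibble from leaking into the nibble below it, which is what lets a single subtraction handle all eight nibbles at once.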
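It really never ends
If space is not a tight constraint, a lookup table can be used to push the time efficiency as far as possible.
For example, here is the code for an 8-bit lookup table: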

    /* a[v] is the number of 1 bits in the byte v; 16 entries per row */
    static char a[256] = {
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
        3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
        4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
    };

    int bit_count5(unsigned int i)
    {
        return a[i&0xffu]+a[(i>>8)&0xffu]+a[(i>>16)&0xffu]+a[i>>24];
    }

For even more speed, use a 16-bit lookup table, which needs 65536 bytes of storage.
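Rather than typing a table by hand, it can be filled in at start-up. The sketch below is an assumption about how one might do that, not part of the original post; pop8, init_pop8 and bit_count_table are hypothetical names. It relies on the fact that a byte v has the same bits as v>>1 plus its own lowest bit.

```c
#include <stdint.h>

static unsigned char pop8[256];   /* hypothetical table, filled by init_pop8 */

/* Fill the table with pop8[v] = pop8[v >> 1] + (v & 1). */
void init_pop8(void)
{
    pop8[0] = 0;
    for (int v = 1; v < 256; v++)
        pop8[v] = (unsigned char)(pop8[v >> 1] + (v & 1));
}

/* Same lookup as the 8-bit table method, one table access per byte. */
int bit_count_table(uint32_t i)
{
    return pop8[i & 0xffu] + pop8[(i >> 8) & 0xffu]
         + pop8[(i >> 16) & 0xffu] + pop8[i >> 24];
}
```

init_pop8 must be called once before the first lookup. The same recurrence would also fill a 65536-entry table for the 16-bit variant.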
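Population count over a bitmap
To count the bits of an array, we can apply any of the methods above to each element in turn and add up the results.
Note, however, that a 4-bit field can count up to 15, while after the first two steps of the divide-and-conquer algorithm each 4-bit field holds at most 4. So we can apply the first two steps to each element, add three elements together (3 × 4 = 12 fits in 4 bits), and only then run the remaining steps on the sum. In the 8-bit case each 8-bit field then holds at most 24, so an 8-bit field can accumulate 255/24 = 10 of them, and so on. Such an algorithm does very little arithmetic per word, but the extra loop-control statements eat up much of what it saves.
For this reason only one intermediate level is used: once the four 8-bit sums of a word have been computed, each 8-bit field holds at most 8, so up to 255/8 = 31 such words can be added together without overflow. The following code comes from Hacker's Delight.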

    int bit_count_array(unsigned *a,int n)
    {
        int i,j,lim;
        unsigned x,s8,s=0;

        for(i=0;i<n;i+=31){                  /* process chunks of up to 31 words */
            lim=(n<(i+31)?n:(i+31));
            s8=0;
            for(j=i;j<lim;j++){
                x=a[j];
                x-=((x>>1)&0x55555555);
                x=(x&0x33333333)+((x>>2)&0x33333333);
                x=(x+(x>>4))&0x0f0f0f0f;     /* four byte counts, each at most 8 */
                s8+=x;                       /* each byte of s8 stays <= 248     */
            }
            x=(s8&0x00ff00ff)+((s8>>8)&0x00ff00ff);
            x=(x&0x0000ffff)+(x>>16);
            s+=x;
        }
        return s;
    }
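The limit of 31 words per chunk can be checked directly. The sketch below uses hypothetical helper names and accumulates the packed byte counts of k all-ones words, the worst case, to show where the 8-bit fields overflow.

```c
#include <stdint.h>

/* First three divide-and-conquer steps: four byte counts packed in a word. */
static uint32_t byte_counts(uint32_t x)
{
    x -= (x >> 1) & 0x55555555u;
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    return (x + (x >> 4)) & 0x0f0f0f0fu;
}

/* Hypothetical experiment: add up the packed byte counts of k words that
   are all ones and return the packed sums without folding them. */
uint32_t packed_sum_all_ones(int k)
{
    uint32_t s8 = 0;

    for (int j = 0; j < k; j++)
        s8 += byte_counts(0xFFFFFFFFu);   /* adds 0x08080808 each time */

    return s8;
}
```

With k = 31 every byte holds 31 × 8 = 248, giving 0xF8F8F8F8. With k = 32 every byte would need to hold 256, so the carries spill into the neighbouring bytes and the packed sums come out as 0x01010100, which no longer folds to the right total.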