我的博客:http://blog.striveforfreedom.net
Table of Contents
- 1 简介
- 2 实现
- 2.1 简洁版
- 2.2 循环展开版
- 2.3 SSE版
- 3 小结
1 简介
索引压缩算法New PForDelta基于算法PForDelta,它消除了算法PForDelta对异常数位置距离的限制。算法PForDelta请看论文[Super-Scalar RAM-CPU Cache Compression],New PForDelta算法介绍请 点击这里 。本文其实只给出了pack/unpack函数,选择较优位宽和异常数处理请 点击这里 。下面给出了pack/unpack函数的三个版本的实现。
2 实现
2.1 简洁版
1: #include2: 3: static const uint32_t BIT_MASK[] = 4: { 5: 0U, 6: 1U, 7: ( 1U << 2 ) - 1, 8: ( 1U << 3 ) - 1, 9: ( 1U << 4 ) - 1, 10: ( 1U << 5 ) - 1, 11: ( 1U << 6 ) - 1, 12: ( 1U << 7 ) - 1, 13: ( 1U << 8 ) - 1, 14: ( 1U << 9 ) - 1, 15: ( 1U << 10 ) - 1, 16: ( 1U << 11 ) - 1, 17: ( 1U << 12 ) - 1, 18: ( 1U << 13 ) - 1, 19: ( 1U << 14 ) - 1, 20: ( 1U << 15 ) - 1, 21: ( 1U << 16 ) - 1, 22: ( 1U << 17 ) - 1, 23: ( 1U << 18 ) - 1, 24: ( 1U << 19 ) - 1, 25: ( 1U << 20 ) - 1, 26: ( 1U << 21 ) - 1, 27: ( 1U << 22 ) - 1, 28: ( 1U << 23 ) - 1, 29: ( 1U << 24 ) - 1, 30: ( 1U << 25 ) - 1, 31: ( 1U << 26 ) - 1, 32: ( 1U << 27 ) - 1, 33: ( 1U << 28 ) - 1, 34: ( 1U << 29 ) - 1, 35: ( 1U << 30 ) - 1, 36: ( 1U << 31) - 1, 37: 0xFFFFFFFFU, 38: }; 39: 40: //data指向count个bit_width位数,packed指向的内存不得少于count * bit_width / 8字节 41: void pack_general(const uint32_t* data, uint32_t count, uint32_t bit_width, uint32_t* packed) 42: { 43: uint32_t pos = 0U; 44: uint32_t start_bit = 0U; 45: 46: for(uint32_t i = 0; i < count; ++i){ 47: packed[pos] = (packed[pos] & BIT_MASK[start_bit]) | (data[i] << start_bit); 48: const uint32_t free_bit = 32U - start_bit; 49: start_bit += bit_width; 50: start_bit &= 0x1F; 51: if(free_bit < bit_width){ 52: packed[++pos] = data[i] >> free_bit; 53: }else if(free_bit == bit_width){ 54: ++pos; 55: } 56: } 57: } 58: 59: //packed指向count个bit_width位数调用函数pack_general后的输出,data指向的内存不得少于count * sizeof(uint32_t)字节 60: void unpack_general(const uint32_t* packed, uint32_t count, uint32_t bit_width, uint32_t* data) 61: { 62: uint32_t pos = 0U; 63: uint32_t start_bit = 0U; 64: 65: for(uint32_t i = 0; i < count; ++i){ 66: data[i] = (packed[pos] >> start_bit) & BIT_MASK[bit_width]; 67: const uint32_t free_bit = 32U - start_bit; 68: start_bit += bit_width; 69: start_bit &= 0x1F; 70: if(free_bit < bit_width){ 71: data[i] |= (packed[++pos] & BIT_MASK[bit_width - free_bit]) << free_bit; 72: }else if(free_bit == bit_width){ 73: ++pos; 74: } 75: } 76: }
pack_general/unpack_general的优点是代码很简洁,一个函数便处理了不同位宽数的pack或unpack。然而缺点也是显而易见的,pack_general/unpack_general性能无疑是很差的,因为每循环一次,只pack/unpack一个数,并且循环体内有分支语句,如果分支预测失败会带来很大的性能开销。因此,可以通过这两方面来提高性能,一是降低循环开销,这可以通过循环展开来达到;二是消除循环体内的分支语句,这可以通过为每种位宽的数写一个特定的函数来达到。下面给出了按照这种思路写出的位宽为5的pack/unpack函数。
2.2 循环展开版
1: #include2: 3: //data指向128个5位数,packed所指内存不能小于128 * 5 / 8字节 4: void pack_5(const uint32_t* data, uint32_t* packed) 5: { 6: packed[0] = (data[6] << 30) | (data[5] << 25) | (data[4] << 20) | (data[3] << 15) | (data[2] << 10) | (data[1] << 5) | data[0]; 7: packed[1] = (data[12] << 28) | (data[11] << 23) | (data[10] << 18) | (data[9] << 13) | (data[8] << 8) | (data[7] << 3) | (data[6] >> 2); 8: packed[2] = (data[19] << 31) | (data[18] << 26) | (data[17] << 21) | (data[16] << 16) | (data[15] << 11) | (data[14] << 6) | (data[13] << 1) | (data[12] >> 4); 9: packed[3] = (data[25] << 29) | (data[24] << 24) | (data[23] << 19) | (data[22] << 14) | (data[21] << 9) | (data[20] << 4) | (data[19] >> 1); 10: packed[4] = (data[31] << 27) | (data[30] << 22) | (data[29] << 17) | (data[28] << 12) | (data[27] << 7) | (data[26] << 2) | (data[25] >> 3); 11: //以上为第1组,pack 32个数 12: 13: packed[5] = (data[38] << 30) | (data[37] << 25) | (data[36] << 20) | (data[35] << 15) | (data[34] << 10) | (data[33] << 5) | data[32]; 14: packed[6] = (data[44] << 28) | (data[43] << 23) | (data[42] << 18) | (data[41] << 13) | (data[40] << 8) | (data[39] << 3) | (data[38] >> 2); 15: packed[7] = (data[51] << 31) | (data[50] << 26) | (data[49] << 21) | (data[48] << 16) | (data[47] << 11) | (data[46] << 6) | (data[45] << 1) | (data[44] >> 4); 16: packed[8] = (data[57] << 29) | (data[56] << 24) | (data[55] << 19) | (data[54] << 14) | (data[53] << 9) | (data[52] << 4) | (data[51] >> 1); 17: packed[9] = (data[63] << 27) | (data[62] << 22) | (data[61] << 17) | (data[60] << 12) | (data[59] << 7) | (data[58] << 2) | (data[57] >> 3); 18: //以上为第2组,pack 32个数 19: 20: packed[10] = (data[70] << 30) | (data[69] << 25) | (data[68] << 20) | (data[67] << 15) | (data[66] << 10) | (data[65] << 5) | data[64]; 21: packed[11] = (data[76] << 28) | (data[75] << 23) | (data[74] << 18) | (data[73] << 13) | (data[72] << 8) | (data[71] << 3) | (data[70] >> 2); 22: packed[12] = (data[83] << 31) | (data[82] << 26) | (data[81] << 21) | (data[80] << 16) | (data[79] << 11) | (data[78] << 6) | (data[77] << 1) | (data[76] >> 4); 23: packed[13] = (data[89] << 29) | (data[88] << 24) | (data[87] << 19) | (data[86] << 14) | (data[85] << 9) | (data[84] << 4) | (data[83] >> 1); 24: packed[14] = (data[95] << 27) | (data[94] << 22) | (data[93] << 17) | (data[92] << 12) | (data[91] << 7) | (data[90] << 2) | (data[89] >> 3); 25: //以上为第3组,pack 32个数 26: 27: packed[15] = (data[102] << 30) | (data[101] << 25) | (data[100] << 20) | (data[99] << 15) | (data[98] << 10) | (data[97] << 5) | data[96]; 28: packed[16] = (data[108] << 28) | (data[107] << 23) | (data[106] << 18) | (data[105] << 13) | (data[104] << 8) | (data[103] << 3) | (data[102] >> 2); 29: packed[17] = (data[115] << 31) | (data[114] << 26) | (data[113] << 21) | (data[112] << 16) | (data[111] << 11) | (data[110] << 6) | (data[109] << 1) | (data[108] >> 4); 30: packed[18] = (data[121] << 29) | (data[120] << 24) | (data[119] << 19) | (data[118] << 14) | (data[117] << 9) | (data[116] << 4) | (data[115] >> 1); 31: packed[19] = (data[127] << 27) | (data[126] << 22) | (data[125] << 17) | (data[124] << 12) | (data[123] << 7) | (data[122] << 2) | (data[121] >> 3); 32: //以上为第4组,pack 32个数 33: } 34: 35: //packed指向128个5位数调用函数pack_5后的输出,data指向的内存不得少于128 * sizeof(uint32_t)字节 36: void unpack_5(const uint32_t* packed, uint32_t* data) 37: { 38: data[0] = packed[0] & 0x1F; 39: data[1] = (packed[0] >> 5) & 0x1F; 40: data[2] = (packed[0] >> 10) & 0x1F; 41: data[3] = (packed[0] >> 15) & 0x1F; 42: data[4] = (packed[0] >> 20) & 0x1F; 43: data[5] = (packed[0] >> 25) & 0x1F; 44: data[6] = (packed[0] >> 30) | ((packed[1] & 0x7) << 2); 45: 46: data[7] = (packed[1] >> 3) & 0x1F; 47: data[8] = (packed[1] >> 8) & 0x1F; 48: data[9] = (packed[1] >> 13) & 0x1F; 49: data[10] = (packed[1] >> 18) & 0x1F; 50: data[11] = (packed[1] >> 23) & 0x1F; 51: data[12] = (packed[1] >> 28) | ((packed[2] & 0x1) << 4); 52: 53: data[13] = (packed[2] >> 1) & 0x1F; 54: data[14] = (packed[2] >> 6) & 0x1F; 55: data[15] = (packed[2] >> 11) & 0x1F; 56: data[16] = (packed[2] >> 16) & 0x1F; 57: data[17] = (packed[2] >> 21) & 0x1F; 58: data[18] = (packed[2] >> 26) & 0x1F; 59: data[19] = (packed[2] >> 31) | ((packed[3] & 0xF) << 1); 60: 61: data[20] = (packed[3] >> 4) & 0x1F; 62: data[21] = (packed[3] >> 9) & 0x1F; 63: data[22] = (packed[3] >> 14) & 0x1F; 64: data[23] = (packed[3] >> 19) & 0x1F; 65: data[24] = (packed[3] >> 24) & 0x1F; 66: data[25] = (packed[3] >> 29) | ((packed[4] & 0x3) << 3); 67: 68: data[26] = (packed[4] >> 2) & 0x1F; 69: data[27] = (packed[4] >> 7) & 0x1F; 70: data[28] = (packed[4] >> 12) & 0x1F; 71: data[29] = (packed[4] >> 17) & 0x1F; 72: data[30] = (packed[4] >> 22) & 0x1F; 73: data[31] = (packed[4] >> 27) & 0x1F; 74: //以上为第1组,unpack 32个数 75: 76: data[32] = packed[5] & 0x1F; 77: data[33] = (packed[5] >> 5) & 0x1F; 78: data[34] = (packed[5] >> 10) & 0x1F; 79: data[35] = (packed[5] >> 15) & 0x1F; 80: data[36] = (packed[5] >> 20) & 0x1F; 81: data[37] = (packed[5] >> 25) & 0x1F; 82: data[38] = (packed[5] >> 30) | ((packed[6] & 0x7) << 2); 83: 84: data[39] = (packed[6] >> 3) & 0x1F; 85: data[40] = (packed[6] >> 8) & 0x1F; 86: data[41] = (packed[6] >> 13) & 0x1F; 87: data[42] = (packed[6] >> 18) & 0x1F; 88: data[43] = (packed[6] >> 23) & 0x1F; 89: data[44] = (packed[6] >> 28) | ((packed[7] & 0x1) << 4); 90: 91: data[45] = (packed[7] >> 1) & 0x1F; 92: data[46] = (packed[7] >> 6) & 0x1F; 93: data[47] = (packed[7] >> 11) & 0x1F; 94: data[48] = (packed[7] >> 16) & 0x1F; 95: data[49] = (packed[7] >> 21) & 0x1F; 96: data[50] = (packed[7] >> 26) & 0x1F; 97: data[51] = (packed[7] >> 31) | ((packed[8] & 0xF) << 1); 98: 99: data[52] = (packed[8] >> 4) & 0x1F; 100: data[53] = (packed[8] >> 9) & 0x1F; 101: data[54] = (packed[8] >> 14) & 0x1F; 102: data[55] = (packed[8] >> 19) & 0x1F; 103: data[56] = (packed[8] >> 24) & 0x1F; 104: data[57] = (packed[8] >> 29) | ((packed[9] & 0x3) << 3); 105: 106: data[58] = (packed[9] >> 2) & 0x1F; 107: data[59] = (packed[9] >> 7) & 0x1F; 108: data[60] = (packed[9] >> 12) & 0x1F; 109: data[61] = (packed[9] >> 17) & 0x1F; 110: data[62] = (packed[9] >> 22) & 0x1F; 111: data[63] = (packed[9] >> 27) & 0x1F; 112: //以上为第2组,unpack 32个数 113: 114: data[64] = packed[10] & 0x1F; 115: data[65] = (packed[10] >> 5) & 0x1F; 116: data[66] = (packed[10] >> 10) & 0x1F; 117: data[67] = (packed[10] >> 15) & 0x1F; 118: data[68] = (packed[10] >> 20) & 0x1F; 119: data[69] = (packed[10] >> 25) & 0x1F; 120: data[70] = (packed[10] >> 30) | ((packed[11] & 0x7) << 2); 121: 122: data[71] = (packed[11] >> 3) & 0x1F; 123: data[72] = (packed[11] >> 8) & 0x1F; 124: data[73] = (packed[11] >> 13) & 0x1F; 125: data[74] = (packed[11] >> 18) & 0x1F; 126: data[75] = (packed[11] >> 23) & 0x1F; 127: data[76] = (packed[11] >> 28) | ((packed[12] & 0x1) << 4); 128: 129: data[77] = (packed[12] >> 1) & 0x1F; 130: data[78] = (packed[12] >> 6) & 0x1F; 131: data[79] = (packed[12] >> 11) & 0x1F; 132: data[80] = (packed[12] >> 16) & 0x1F; 133: data[81] = (packed[12] >> 21) & 0x1F; 134: data[82] = (packed[12] >> 26) & 0x1F; 135: data[83] = (packed[12] >> 31) | ((packed[13] & 0xF) << 1); 136: 137: data[84] = (packed[13] >> 4) & 0x1F; 138: data[85] = (packed[13] >> 9) & 0x1F; 139: data[86] = (packed[13] >> 14) & 0x1F; 140: data[87] = (packed[13] >> 19) & 0x1F; 141: data[88] = (packed[13] >> 24) & 0x1F; 142: data[89] = (packed[13] >> 29) | ((packed[14] & 0x3) << 3); 143: 144: data[90] = (packed[14] >> 2) & 0x1F; 145: data[91] = (packed[14] >> 7) & 0x1F; 146: data[92] = (packed[14] >> 12) & 0x1F; 147: data[93] = (packed[14] >> 17) & 0x1F; 148: data[94] = (packed[14] >> 22) & 0x1F; 149: data[95] = (packed[14] >> 27) & 0x1F; 150: //以上为第3组,unpack 32个数 151: 152: data[96] = packed[15] & 0x1F; 153: data[97] = (packed[15] >> 5) & 0x1F; 154: data[98] = (packed[15] >> 10) & 0x1F; 155: data[99] = (packed[15] >> 15) & 0x1F; 156: data[100] = (packed[15] >> 20) & 0x1F; 157: data[101] = (packed[15] >> 25) & 0x1F; 158: data[102] = (packed[15] >> 30) | ((packed[16] & 0x7) << 2); 159: 160: data[103] = (packed[16] >> 3) & 0x1F; 161: data[104] = (packed[16] >> 8) & 0x1F; 162: data[105] = (packed[16] >> 13) & 0x1F; 163: data[106] = (packed[16] >> 18) & 0x1F; 164: data[107] = (packed[16] >> 23) & 0x1F; 165: data[108] = (packed[16] >> 28) | ((packed[17] & 0x1) << 4); 166: 167: data[109] = (packed[17] >> 1) & 0x1F; 168: data[110] = (packed[17] >> 6) & 0x1F; 169: data[111] = (packed[17] >> 11) & 0x1F; 170: data[112] = (packed[17] >> 16) & 0x1F; 171: data[113] = (packed[17] >> 21) & 0x1F; 172: data[114] = (packed[17] >> 26) & 0x1F; 173: data[115] = (packed[17] >> 31) | ((packed[18] & 0xF) << 1); 174: 175: data[116] = (packed[18] >> 4) & 0x1F; 176: data[117] = (packed[18] >> 9) & 0x1F; 177: data[118] = (packed[18] >> 14) & 0x1F; 178: data[119] = (packed[18] >> 19) & 0x1F; 179: data[120] = (packed[18] >> 24) & 0x1F; 180: data[121] = (packed[18] >> 29) | ((packed[19] & 0x3) << 3); 181: 182: data[122] = (packed[19] >> 2) & 0x1F; 183: data[123] = (packed[19] >> 7) & 0x1F; 184: data[124] = (packed[19] >> 12) & 0x1F; 185: data[125] = (packed[19] >> 17) & 0x1F; 186: data[126] = (packed[19] >> 22) & 0x1F; 187: data[127] = (packed[19] >> 27) & 0x1F; 188: //以上为第4组,unpack 32个数 189: }
函数pack_5非常冗长,手工写很容易出错,仔细观察一下就会发现这128个数可以分成4组,每组pack 32个数,写完第1组之后,只要对第1组的packed数组下标分别加5/10/15以及对data数组下标分别加32/64/96,就可以得到第2/3/4组,但这样对数组下标手工进行加操作还是很容易出错,emacs里有个函数query-replace-regexp可以帮我们自动完成这项操作:举个例子,我们可以复制一份第1组的代码并选定,运行这个函数,参数依次输入data\[\([0-9]+\)\]和data[\,(+ 32 \#1)]就可以自动得到第2组代码了,其中,紧接着\,的是一个Lisp表达式,这个表达式的求值结果会作为替换字符串的一部分,而\#1则表示第一匹配的括号里的内容转换成的数字。函数unpack_5同样可以利用emacs函数query-replace-regexp来对data下标进行加32/64/96和对packed下标进行加5/10/15。
2.3 SSE版
普通指令一次只能操作一份数据,而SSE指令一次能操作多份数据,SSE的这种这种特性无疑可以利用来提高程序性能,代码如下:
1: #include2: #include 3: 4: //data指向128个5位数,packed所指内存不能小于128 * 5 / 8字节 5: //函数pack_5_sse并没有使用sse指令,命名为pack_5_sse是为了和函数unpack_5_sse对应 6: void pack_5_sse(const uint32_t* data, uint32_t* packed) 7: { 8: packed[0] = (data[120] << 30) | (data[20] << 25) | (data[16] << 20) | (data[12] << 15) | (data[8] << 10) | (data[4] << 5) | data[0]; 9: packed[1] = (data[121] << 30) | (data[21] << 25) | (data[17] << 20) | (data[13] << 15) | (data[9] << 10) | (data[5] << 5) | data[1]; 10: packed[2] = (data[122] << 30) | (data[22] << 25) | (data[18] << 20) | (data[14] << 15) | (data[10] << 10) | (data[6] << 5) | data[2]; 11: packed[3] = (data[123] << 30) | (data[23] << 25) | (data[19] << 20) | (data[15] << 15) | (data[11] << 10) | (data[7] << 5) | data[3]; 12: 13: packed[4] = ((data[120] >> 2) << 30) | (data[44] << 25) | (data[40] << 20) | (data[36] << 15) | (data[32] << 10) | (data[28] << 5) | data[24]; 14: packed[5] = ((data[121] >> 2) << 30) | (data[45] << 25) | (data[41] << 20) | (data[37] << 15) | (data[33] << 10) | (data[29] << 5) | data[25]; 15: packed[6] = ((data[122] >> 2) << 30) | (data[46] << 25) | (data[42] << 20) | (data[38] << 15) | (data[34] << 10) | (data[30] << 5) | data[26]; 16: packed[7] = ((data[123] >> 2) << 30) | (data[47] << 25) | (data[43] << 20) | (data[39] << 15) | (data[35] << 10) | (data[31] << 5) | data[27]; 17: 18: packed[8] = ((data[124] & 0x1) << 31) | ((data[120] >> 4) << 30) | (data[68] << 25) | (data[64] << 20) | (data[60] << 15) | (data[56] << 10) | (data[52] << 5) | data[48]; 19: packed[9] = ((data[125] & 0x1) << 31) | ((data[121] >> 4) << 30) | (data[69] << 25) | (data[65] << 20) | (data[61] << 15) | (data[57] << 10) | (data[53] << 5) | data[49]; 20: packed[10] = ((data[126] & 0x1) << 31) | ((data[122] >> 4) << 30) | (data[70] << 25) | (data[66] << 20) | (data[62] << 15) | (data[58] << 10) | (data[54] << 5) | data[50]; 21: packed[11] = ((data[127] & 0x1) << 31) | ((data[123] >> 4) << 30) | (data[71] << 25) | (data[67] << 20) | (data[63] << 15) | (data[59] << 10) | (data[55] << 5) | data[51]; 22: 23: packed[12] = (((data[124] >> 1) & 0x3) << 30) | (data[92] << 25) | (data[88] << 20) | (data[84] << 15) | (data[80] << 10) | (data[76] << 5) | data[72]; 24: packed[13] = (((data[125] >> 1) & 0x3) << 30) | (data[93] << 25) | (data[89] << 20) | (data[85] << 15) | (data[81] << 10) | (data[77] << 5) | data[73]; 25: packed[14] = (((data[126] >> 1) & 0x3) << 30) | (data[94] << 25) | (data[90] << 20) | (data[86] << 15) | (data[82] << 10) | (data[78] << 5) | data[74]; 26: packed[15] = (((data[127] >> 1) & 0x3) << 30) | (data[95] << 25) | (data[91] << 20) | (data[87] << 15) | (data[83] << 10) | (data[79] << 5) | data[75]; 27: 28: packed[16] = ((data[124] >> 3) << 30) | (data[116] << 25) | (data[112] << 20) | (data[108] << 15) | (data[104] << 10) | (data[100] << 5) | data[96]; 29: packed[17] = ((data[125] >> 3) << 30) | (data[117] << 25) | (data[113] << 20) | (data[109] << 15) | (data[105] << 10) | (data[101] << 5) | data[97]; 30: packed[18] = ((data[126] >> 3) << 30) | (data[118] << 25) | (data[114] << 20) | (data[110] << 15) | (data[106] << 10) | (data[102] << 5) | data[98]; 31: packed[19] = ((data[127] >> 3) << 30) | (data[119] << 25) | (data[115] << 20) | (data[111] << 15) | (data[107] << 10) | (data[103] << 5) | data[99]; 32: } 33: 34: //packed指向128个5位数调用函数pack_5_sse后的输出,data指向的内存不得少于128 * sizeof(uint32_t)字节 35: //packed和data指向的内存地址需要16字节对齐 36: void unpack_5_sse(const uint32_t* packed, uint32_t* data) 37: { 38: const __m128i mask = _mm_set_epi32(0x1F, 0x1F, 0x1F, 0x1F); 39: __m128i code, code1; 40: __m128i unpacked, unpacked1, unpacked2, unpacked3, unpacked4, unpacked5; 41: 42: code = _mm_load_si128((__m128i*)&packed[0]); 43: unpacked = _mm_and_si128(code, mask); 44: _mm_store_si128((__m128i*)&data[0], unpacked); 45: 46: unpacked1 = _mm_srli_epi32(code, 5); 47: unpacked1 = _mm_and_si128(unpacked1, mask); 48: _mm_store_si128((__m128i*)&data[4], unpacked1); 49: 50: unpacked2 = _mm_srli_epi32(code, 10); 51: unpacked2 = _mm_and_si128(unpacked2, mask); 52: _mm_store_si128((__m128i*)&data[8], unpacked2); 53: 54: unpacked3 = _mm_srli_epi32(code, 15); 55: unpacked3 = _mm_and_si128(unpacked3, mask); 56: _mm_store_si128((__m128i*)&data[12], unpacked3); 57: 58: unpacked4 = _mm_srli_epi32(code, 20); 59: unpacked4 = _mm_and_si128(unpacked4, mask); 60: _mm_store_si128((__m128i*)&data[16], unpacked4); 61: 62: unpacked5 = _mm_srli_epi32(code, 25); 63: unpacked5 = _mm_and_si128(unpacked5, mask); 64: _mm_store_si128((__m128i*)&data[20], unpacked5); 65: 66: code1 = _mm_srli_epi32(code, 30); 67: 68: 69: code = _mm_load_si128((__m128i*)&packed[4]); 70: unpacked = _mm_and_si128(code, mask); 71: _mm_store_si128((__m128i*)&data[24], unpacked); 72: 73: unpacked1 = _mm_srli_epi32(code, 5); 74: unpacked1 = _mm_and_si128(unpacked1, mask); 75: _mm_store_si128((__m128i*)&data[28], unpacked1); 76: 77: unpacked2 = _mm_srli_epi32(code, 10); 78: unpacked2 = _mm_and_si128(unpacked2, mask); 79: _mm_store_si128((__m128i*)&data[32], unpacked2); 80: 81: unpacked3 = _mm_srli_epi32(code, 15); 82: unpacked3 = _mm_and_si128(unpacked3, mask); 83: _mm_store_si128((__m128i*)&data[36], unpacked3); 84: 85: unpacked4 = _mm_srli_epi32(code, 20); 86: unpacked4 = _mm_and_si128(unpacked4, mask); 87: _mm_store_si128((__m128i*)&data[40], unpacked4); 88: 89: unpacked5 = _mm_srli_epi32(code, 25); 90: unpacked5 = _mm_and_si128(unpacked5, mask); 91: _mm_store_si128((__m128i*)&data[44], unpacked5); 92: 93: code = _mm_srli_epi32(code, 30); 94: code = _mm_slli_epi32(code, 2); 95: code1 = _mm_or_si128(code1, code); 96: 97: 98: code = _mm_load_si128((__m128i*)&packed[8]); 99: unpacked = _mm_and_si128(code, mask); 100: _mm_store_si128((__m128i*)&data[48], unpacked); 101: 102: unpacked1 = _mm_srli_epi32(code, 5); 103: unpacked1 = _mm_and_si128(unpacked1, mask); 104: _mm_store_si128((__m128i*)&data[52], unpacked1); 105: 106: unpacked2 = _mm_srli_epi32(code, 10); 107: unpacked2 = _mm_and_si128(unpacked2, mask); 108: _mm_store_si128((__m128i*)&data[56], unpacked2); 109: 110: unpacked3 = _mm_srli_epi32(code, 15); 111: unpacked3 = _mm_and_si128(unpacked3, mask); 112: _mm_store_si128((__m128i*)&data[60], unpacked3); 113: 114: unpacked4 = _mm_srli_epi32(code, 20); 115: unpacked4 = _mm_and_si128(unpacked4, mask); 116: _mm_store_si128((__m128i*)&data[64], unpacked4); 117: 118: unpacked5 = _mm_srli_epi32(code, 25); 119: unpacked5 = _mm_and_si128(unpacked5, mask); 120: _mm_store_si128((__m128i*)&data[68], unpacked5); 121: 122: code = _mm_srli_epi32(code, 30); 123: code = _mm_slli_epi32(code, 4); 124: code1 = _mm_or_si128(code1, code); 125: 126: 127: code = _mm_load_si128((__m128i*)&packed[12]); 128: unpacked = _mm_and_si128(code, mask); 129: _mm_store_si128((__m128i*)&data[72], unpacked); 130: 131: unpacked1 = _mm_srli_epi32(code, 5); 132: unpacked1 = _mm_and_si128(unpacked1, mask); 133: _mm_store_si128((__m128i*)&data[76], unpacked1); 134: 135: unpacked2 = _mm_srli_epi32(code, 10); 136: unpacked2 = _mm_and_si128(unpacked2, mask); 137: _mm_store_si128((__m128i*)&data[80], unpacked2); 138: 139: unpacked3 = _mm_srli_epi32(code, 15); 140: unpacked3 = _mm_and_si128(unpacked3, mask); 141: _mm_store_si128((__m128i*)&data[84], unpacked3); 142: 143: unpacked4 = _mm_srli_epi32(code, 20); 144: unpacked4 = _mm_and_si128(unpacked4, mask); 145: _mm_store_si128((__m128i*)&data[88], unpacked4); 146: 147: unpacked5 = _mm_srli_epi32(code, 25); 148: unpacked5 = _mm_and_si128(unpacked5, mask); 149: _mm_store_si128((__m128i*)&data[92], unpacked5); 150: 151: code = _mm_srli_epi32(code, 30); 152: code = _mm_slli_epi32(code, 6); 153: code1 = _mm_or_si128(code1, code); 154: 155: 156: code = _mm_load_si128((__m128i*)&packed[16]); 157: unpacked = _mm_and_si128(code, mask); 158: _mm_store_si128((__m128i*)&data[96], unpacked); 159: 160: unpacked1 = _mm_srli_epi32(code, 5); 161: unpacked1 = _mm_and_si128(unpacked1, mask); 162: _mm_store_si128((__m128i*)&data[100], unpacked1); 163: 164: unpacked2 = _mm_srli_epi32(code, 10); 165: unpacked2 = _mm_and_si128(unpacked2, mask); 166: _mm_store_si128((__m128i*)&data[104], unpacked2); 167: 168: unpacked3 = _mm_srli_epi32(code, 15); 169: unpacked3 = _mm_and_si128(unpacked3, mask); 170: _mm_store_si128((__m128i*)&data[108], unpacked3); 171: 172: unpacked4 = _mm_srli_epi32(code, 20); 173: unpacked4 = _mm_and_si128(unpacked4, mask); 174: _mm_store_si128((__m128i*)&data[112], unpacked4); 175: 176: unpacked5 = _mm_srli_epi32(code, 25); 177: unpacked5 = _mm_and_si128(unpacked5, mask); 178: _mm_store_si128((__m128i*)&data[116], unpacked5); 179: 180: code = _mm_srli_epi32(code, 30); 181: code = _mm_slli_epi32(code, 8); 182: code1 = _mm_or_si128(code1, code); 183: 184: unpacked = _mm_and_si128(code1, mask); 185: _mm_store_si128((__m128i*)&data[120], unpacked); 186: 187: code1 = _mm_srli_epi32(code1, 5); 188: _mm_store_si128((__m128i*)&data[124], code1); 189: }
在函数pack_5_sse里,声明使用了9个XMM寄存器(mask,code,code1,unpacked,unpacked1~unpacked5),本意是想通过降低数据冲突(data hazard)提高CPU的并行执行度,然而gcc(version 4.1.2, 优化选项-O2)却不买账,用objdump查看汇编代码,gcc优化后只使用了4个XMM寄存器,看来要让编译器按意图来生成代码只能手写汇编了(当然,这只是一个想法,具体能不能提高并行执行度还依赖于其它因素,并且需要测试验证)。后来又一想,CPU内部使用寄存器重命名(register renaming),我这么做应该纯属多余。函数pack_5_sse/unpack_5_sse同样可以利用emacs函数query-replace-regexp来减少敲入的代码量。
3 小结
代码我只是简单地测试了一下,可能存在bug。循环展开版我这里是全部展开了,相反地可以只展开一部分比如一次pack/unpack 32个数,循环4次,比较起来,全部展开消除了分支语句,但也引起了代码膨胀,前者消除了可能的分支预测失败带来的开销,后者可能会引起code cache的缺失(cache miss),到底孰优孰劣,需要实际测试,我这里只是随手给出了一个版本,并没有实际测试过。