问题描述
我有以下代码,它比 inet_addr 快约 7 倍.我想知道是否有办法改进它以使其更快,或者是否存在更快的替代方案.
I have the following code which is about 7 times faster than inet_addr . I was wondering if there is a way to improve this to make it even faster or if a faster alternative exists.
这段代码要求提供一个有效的空终止 IPv4 地址,没有空格,在我的情况下总是这样,所以我针对这种情况进行了优化.通常你会有更多的错误检查,但如果有一种方法可以使以下操作更快或存在更快的替代方案,我将非常感激.
This code requires that a valid null terminated IPv4 address is supplied with no whitespace, which in my case is always the way, so I optimized for that case. Usually you would have more error checking, but if there is a way to make the following even faster or a faster alternative exists I would really appreciate it.
UINT32 GetIP(const char *p)
{
UINT32 dwIP=0,dwIP_Part=0;
while(true)
{
if(p[0] == 0)
{
dwIP = (dwIP << 8) | dwIP_Part;
break;
}
if(p[0]=='.')
{
dwIP = (dwIP << 8) | dwIP_Part;
dwIP_Part = 0;
p++;
}
dwIP_Part = (dwIP_Part*10)+(p[0]-'0');
p++;
}
return dwIP;
}
推荐答案
既然我们在谈论最大化 IP 地址解析的吞吐量,我建议使用矢量化解决方案.
Since we are speaking about maximizing throughput of IP address parsing, I suggest using a vectorized solution.
这是特定于 x86 的快速解决方案(需要 SSE4.1,或者至少需要 SSSE3):
Here is x86-specific fast solution (needs SSE4.1, or at least SSSE3 for poor):
__m128i shuffleTable[65536]; //can be reduced 256x times, see @IwillnotexistIdonotexist
UINT32 MyGetIP(const char *str) {
__m128i input = _mm_lddqu_si128((const __m128i*)str); //"192.167.1.3"
input = _mm_sub_epi8(input, _mm_set1_epi8('0')); //1 9 2 254 1 6 7 254 1 254 3 208 245 0 8 40
__m128i cmp = input; //...X...X.X.XX... (signs)
UINT32 mask = _mm_movemask_epi8(cmp); //6792 - magic index
__m128i shuf = shuffleTable[mask]; //10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1
__m128i arr = _mm_shuffle_epi8(input, shuf); //3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0
__m128i coeffs = _mm_set_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
__m128i prod = _mm_maddubs_epi16(coeffs, arr); //3 0 | 1 0 | 67 100 | 92 100
prod = _mm_hadd_epi16(prod, prod); //3 | 1 | 167 | 192 | ? | ? | ? | ?
__m128i imm = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 4, 2, 0);
prod = _mm_shuffle_epi8(prod, imm); //3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
return _mm_extract_epi32(prod, 0);
// return (UINT32(_mm_extract_epi16(prod, 1)) << 16) + UINT32(_mm_extract_epi16(prod, 0)); //no SSE 4.1
}
这是shuffleTable
所需的预计算:
void MyInit() {
memset(shuffleTable, -1, sizeof(shuffleTable));
int len[4];
for (len[0] = 1; len[0] <= 3; len[0]++)
for (len[1] = 1; len[1] <= 3; len[1]++)
for (len[2] = 1; len[2] <= 3; len[2]++)
for (len[3] = 1; len[3] <= 3; len[3]++) {
int slen = len[0] + len[1] + len[2] + len[3] + 4;
int rem = 16 - slen;
for (int rmask = 0; rmask < 1<<rem; rmask++) {
// { int rmask = (1<<rem)-1; //note: only maximal rmask is possible if strings are zero-padded
int mask = 0;
char shuf[16] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
int pos = 0;
for (int i = 0; i < 4; i++) {
for (int j = 0; j < len[i]; j++) {
shuf[(3-i) * 4 + (len[i]-1-j)] = pos;
pos++;
}
mask ^= (1<<pos);
pos++;
}
mask ^= (rmask<<slen);
_mm_store_si128(&shuffleTable[mask], _mm_loadu_si128((__m128i*)shuf));
}
}
}
带测试的完整代码在此处可用.在 Ivy Bridge 处理器上打印:
Full code with testing is avaliable here. On Ivy Bridge processor it prints:
C0A70103
Time = 0.406 (1556701184)
Time = 3.133 (1556701184)
这意味着建议的解决方案在吞吐量方面比 OP 代码快 7.8 倍.它每秒处理3.36 亿个地址(单核 3.4 Ghz).
It means that the suggested solution is 7.8 times faster in terms of throughput than the code by OP. It processes 336 millions of addresses per second (single core of 3.4 Ghz).
现在我将尝试解释它是如何工作的.请注意,在列表的每一行中,您都可以看到刚刚计算出的值的内容.所有数组都以小端顺序打印(尽管 set
内部函数使用大端).
Now I'll try to explain how it works. Note that on each line of the listing you can see contents of the value just computed. All the arrays are printed in little-endian order (though set
intrinsics use big-endian).
首先,我们通过lddqu
指令从未对齐的地址加载16个字节.请注意,在 64 位模式下,内存是由 16 字节块分配的,因此这会自动运行良好.在 32 位上,理论上它可能会导致超出范围访问的问题.虽然我不相信它真的可以.无论后尾字节中的值如何,后续代码都可以正常工作.无论如何,您最好确保每个 IP 地址至少占用 16 个字节的存储空间.
First of all, we load 16 bytes from unaligned address by lddqu
instruction. Note that in 64-bit mode memory is allocated by 16-byte chunks, so this works well automatically. On 32-bit it may theoretically cause issues with out of range access. Though I do not believe that it really can. The subsequent code would work properly regardless of the values in the after-the-end bytes. Anyway, you'd better ensure that each IP address takes at least 16 bytes of storage.
然后我们从所有字符中减去0".在那之后 '.'变成-2,零变成-48,所有数字保持非负.现在我们用 _mm_movemask_epi8
取所有字节符号的位掩码.
Then we subtract '0' from all the chars. After that '.' turns into -2, and zero turns into -48, all the digits remain nonnegative. Now we take bitmask of signs of all the bytes with _mm_movemask_epi8
.
根据此掩码的值,我们从查找表 shuffleTable
中获取一个重要的 16 字节改组掩码.该表非常大:总共 1Mb.并且预先计算需要相当长的时间.但是,它不会占用 CPU 缓存中的宝贵空间,因为实际上只使用了该表中的 81 个元素.那是因为 IP 地址的每一部分都可以是一、二、三位数 => 因此总共有 81 个变体.请注意,字符串末尾的随机垃圾字节原则上可能会导致查找表中的内存占用增加.
Depending on the value of this mask, we fetch a nontrivial 16-byte shuffling mask from lookup table shuffleTable
. The table is quite large: 1Mb total. And it takes quite some time to precompute. However, it does not take precious space in CPU cache, because only 81 elements from this table are really used. That is because each part of IP address can be either one, two, three digits long => hence 81 variants in total.
Note that random trashy bytes after the end of the string may in principle cause increased memory footprint in the lookup table.
EDIT:你可以在评论中找到@IwillnotexistIdonotexist修改的版本,它使用只有4Kb大小的查找表(虽然有点慢).
EDIT: you can find a version modified by @IwillnotexistIdonotexist in comments, which uses lookup table of only 4Kb size (it is a bit slower, though).
巧妙的_mm_shuffle_epi8
内在函数允许我们使用随机掩码对字节重新排序.因此,XMM 寄存器包含四个 4 字节块,每个块包含以小端顺序排列的数字.我们通过 _mm_maddubs_epi16
后跟 _mm_hadd_epi16
将每个块转换为 16 位数字.然后我们对寄存器的字节进行重新排序,使整个IP地址占据低4个字节.
The ingenious _mm_shuffle_epi8
intrinsic allows us to reorder the bytes with our shuffle mask. As a result XMM register contains four 4-byte blocks, each block contains digits in little-endian order. We convert each block into a 16-bit number by _mm_maddubs_epi16
followed by _mm_hadd_epi16
. Then we reorder bytes of the register, so that the whole IP address occupies the lower 4 bytes.
最后,我们将 XMM 寄存器的低 4 个字节提取到 GP 寄存器.它是通过 SSE4.1 内在 (_mm_extract_epi32
) 完成的.如果没有,请使用 _mm_extract_epi16
将其替换为其他行,但运行速度会慢一些.
Finally, we extract the lower 4 bytes from the XMM register to GP register. It is done with SSE4.1 intrinsic (_mm_extract_epi32
). If you don't have it, replace it with other line using _mm_extract_epi16
, but it will run a bit slower.
最后,这里是生成的程序集(MSVC2013),以便您可以检查您的编译器是否没有生成任何可疑内容:
Finally, here is the generated assembly (MSVC2013), so that you can check that your compiler does not generate anything suspicious:
lddqu xmm1, XMMWORD PTR [rcx]
psubb xmm1, xmm6
pmovmskb ecx, xmm1
mov ecx, ecx //useless, see @PeterCordes and @IwillnotexistIdonotexist
add rcx, rcx //can be removed, see @EvgenyKluev
pshufb xmm1, XMMWORD PTR [r13+rcx*8]
movdqa xmm0, xmm8
pmaddubsw xmm0, xmm1
phaddw xmm0, xmm0
pshufb xmm0, xmm7
pextrd eax, xmm0, 0
附言如果您还在阅读,请务必查看评论 =)
P.S. If you are still reading it, be sure to check out comments =)
这篇关于从字符串中获取 IPv4 地址的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!