Problem description
When I execute the following code in release mode on my machine, invoking a delegate with a non-null target is always slightly faster than invoking a delegate with a null target (I expected it to be equivalent or slower).
I'm really not looking for micro-optimizations, but I was wondering why this is the case.
using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        // Warmup code

        long durationWithTarget =
            MeasureDuration(() => new DelegatePerformanceTester(withTarget: true).Run());
        Console.WriteLine($"With target: {durationWithTarget}");

        long durationWithoutTarget =
            MeasureDuration(() => new DelegatePerformanceTester(withTarget: false).Run());
        Console.WriteLine($"Without target: {durationWithoutTarget}");
    }

    /// <summary>
    /// Measures the duration of an action.
    /// </summary>
    /// <param name="action">Action whose duration is to be measured.</param>
    /// <returns>The duration in milliseconds.</returns>
    private static long MeasureDuration(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        return stopwatch.ElapsedMilliseconds;
    }
}

class DelegatePerformanceTester
{
    public DelegatePerformanceTester(bool withTarget)
    {
        if (withTarget)
        {
            // Instance method: the delegate's Target is this object.
            _func = AddNotStatic;
        }
        else
        {
            // Static method: the delegate's Target is null.
            _func = AddStatic;
        }
    }

    private readonly Func<double, double, double> _func;

    private double AddNotStatic(double x, double y) => x + y;

    private static double AddStatic(double x, double y) => x + y;

    public void Run()
    {
        const int loops = 1000000000;
        for (int i = 0; i < loops; i++)
        {
            double funcResult = _func.Invoke(1d, 2d);
        }
    }
}
Recommended answer
I'll write this one up, since there is pretty decent programming advice behind it that ought to matter to any C# programmer who cares about writing fast code. In general I caution against relying on micro-benchmarks: differences of 15% or less are usually not statistically significant, due to the unpredictability of code execution speed on a modern CPU core. A good way to reduce the odds of measuring something that is not there is to repeat a test at least 10 times to remove caching effects, and to swap the order of the tests so that code alignment effects can be eliminated.
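The repeat-and-swap advice could be sketched roughly as follows. This is only an illustration: `BenchmarkSketch`, `Compare` and the `rounds` parameter are hypothetical names that do not appear in the question, and alternating the order inside one process is just one reading of "swap a test".

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of the original question or answer.
static class BenchmarkSketch
{
    // Times a single run of the action, in milliseconds.
    private static long MeasureDuration(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        return stopwatch.ElapsedMilliseconds;
    }

    // Runs both actions `rounds` times and records every timing.
    // The order is alternated each round so that neither action
    // always benefits from (or pays for) running first.
    public static void Compare(Action a, Action b, int rounds,
                               out long[] timesA, out long[] timesB)
    {
        timesA = new long[rounds];
        timesB = new long[rounds];
        for (int round = 0; round < rounds; round++)
        {
            if (round % 2 == 0)
            {
                timesA[round] = MeasureDuration(a);
                timesB[round] = MeasureDuration(b);
            }
            else
            {
                timesB[round] = MeasureDuration(b);
                timesA[round] = MeasureDuration(a);
            }
        }
    }
}
```

With the question's class, this might be called as `BenchmarkSketch.Compare(() => new DelegatePerformanceTester(true).Run(), () => new DelegatePerformanceTester(false).Run(), 10, out withTarget, out withoutTarget)`. Comparing the full set of timings, rather than a single pair, makes it easier to tell a real difference from noise.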
But what you saw is real: delegates that invoke a static method are in fact slower. The effect is quite small in x86 code but significantly worse in x64 code. Be sure to tinker with the Project > Properties > Build tab > Prefer 32-bit and Platform target settings to try both.
Knowing why it is slower requires looking at the machine code that the jitter generates. In the case of delegates, that code is very well hidden. You will not see it when you look at the code with Debug > Windows > Disassembly, and you can't even single-step through it: the managed debugger was written to hide it and completely refuses to show it. I'll have to describe a technique to put the "visual" back into Visual Studio.
I have to talk a bit about "stubs". A stub is a little sliver of machine code that the CLR creates dynamically, in addition to the code that the jitter generates. Stubs are used to implement interfaces, where they provide the flexibility that the order of the methods in a class's method table does not have to match the order of the interface methods. They matter for delegates, the subject of this question. Stubs also matter to just-in-time compilation: the initial code in a stub points to an entry point into the jitter, so that a method gets compiled when it is first invoked. After that the stub is replaced and now calls the jitted target method. It is the stub that makes the static method call slower, because the stub for a static method target is more elaborate than the stub for an instance method.
To see the stubs, you have to wrangle the debugger into showing their code. Some setup is required. First use Tools > Options > Debugging > General and untick both the "Just My Code" and the "Suppress JIT optimization" checkboxes. If you use VS2015, also tick "Use Managed Compatibility Mode"; the VS2015 debugger is very buggy and gets seriously in the way of this kind of debugging, and this option provides a workaround by forcing the VS2010 managed debugger engine to be used. Switch to the Release configuration. Then use Project > Properties > Debug and tick the "Enable native code debugging" checkbox. Finally use Project > Properties > Build, untick the "Prefer 32-bit" checkbox and make sure "Platform target" is AnyCPU.
Set a breakpoint on the Run() method. Beware that breakpoints are not very accurate in optimized code, so setting it on the method header is best. Once it hits, use Debug > Windows > Disassembly to see the machine code that the jitter generated. The delegate invoke call looks like this on a Haswell core; it might not match what you see if you have an older processor that doesn't support AVX:
funcResult += _func.Invoke(1d, 2d);
0000001a mov rax,qword ptr [rsi+8] ; rax = _func
0000001e mov rcx,qword ptr [rax+8] ; rcx = _func._methodBase (?)
00000022 vmovsd xmm2,qword ptr [0000000000000070h] ; arg3 = 2d
0000002b vmovsd xmm1,qword ptr [0000000000000078h] ; arg2 = 1d
00000034 call qword ptr [rax+18h] ; call stub
A 64-bit method call passes the first 4 arguments in registers; any additional arguments are passed on the stack (there are none here). The XMM registers are used because the arguments are floating point. At this point the jitter cannot yet know whether the method is static or instance; that can't be found out until this code actually executes. It is the job of the stub to hide the difference. The jitter assumes it will be an instance method, which is why I annotated the two doubles as arg2 and arg3.
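From the C# side, the two kinds of delegate have the same type and are invoked through the same call site; the difference is only visible through `Delegate.Target` and `Delegate.Method`. A minimal sketch, where `TargetSketch` and its factory methods are hypothetical names chosen for illustration:

```csharp
using System;

// Hypothetical class for illustration; mirrors the two methods in the question.
class TargetSketch
{
    private double AddNotStatic(double x, double y) => x + y;

    private static double AddStatic(double x, double y) => x + y;

    // Delegate bound to an instance method: Target is the instance.
    public static Func<double, double, double> CreateInstanceDelegate()
    {
        return new TargetSketch().AddNotStatic;
    }

    // Delegate bound to a static method: Target is null.
    public static Func<double, double, double> CreateStaticDelegate()
    {
        return AddStatic;
    }
}
```

Both delegates return the same result for the same arguments, and code that only holds a `Func<double, double, double>` cannot tell them apart without inspecting `Target` or `Method.IsStatic`. That is why the call site shown above has to be identical for both cases.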
Set a breakpoint on the CALL instruction. The second time it hits (so after the stub no longer points into the jitter) you can have a look at it. That has to be done by hand. Use Debug > Windows > Registers and copy the value of the RAX register. Then use Debug > Windows > Memory > Memory1, paste the value, put "0x" in front of it and add 0x18. Right-click that window, select "8-byte Integer" and copy the first displayed value. That is the address of the stub code.
Now the trick: at this point the managed debugging engine is still being used, and it will not allow you to look at the stub code. You have to force a mode switch so that the unmanaged debugging engine is in control. Use Debug > Windows > Call Stack and double-click a method call at the bottom, like RtlUserThreadStart. That forces the debugger to switch engines. Now you are good to go and can paste the address into the Address box of the Disassembly window, with "0x" in front of it. Out pops the stub code:
00007FFCE66D0100 jmp 00007FFCE66D0E40
A very simple one, a straight jump to the delegate target method. This will be fast code. The jitter guessed correctly at an instance method, and the delegate object already provided the this argument in the RCX register, so nothing special needs to be done.
Proceed to the second test and do the exact same thing to look at the stub for the static method call. Now the stub is very different:
000001FE559F0850 mov rax,rsp ; ?
000001FE559F0853 mov r11,rcx ; r11 = _func (?)
000001FE559F0856 movaps xmm0,xmm1 ; shuffle arg2 (1d) into right register
000001FE559F0859 movaps xmm1,xmm2 ; shuffle arg3 (2d) into right register
000001FE559F085C mov r10,qword ptr [r11+20h] ; r10 = _func.Method
000001FE559F0860 add r11,20h ; ?
000001FE559F0864 jmp r10 ; jump to _func.Method
The code is a bit wonky and not optimal; Microsoft could probably do a better job here, and I'm not 100% sure I annotated it correctly. I guess that the unnecessary mov rax,rsp instruction is only relevant for stubs to methods with more than 4 arguments. I have no idea why the add instruction is necessary. The detail that matters most is the XMM register moves: the stub has to reshuffle the arguments because the static method does not have the this argument. It is this reshuffling requirement that makes the code slower.
You can do the same exercise with the x86 jitter. The static method stub now looks like:
04F905B4 mov eax,ecx
04F905B6 add eax,10h
04F905B9 jmp dword ptr [eax] ; jump to _func.Method
This is much simpler than the 64-bit stub, which is why 32-bit code does not suffer from the slowdown nearly as much. One reason it is so different is that 32-bit code passes floating point values on the FPU stack, so they don't have to be reshuffled. The stub won't necessarily be this cheap when you use integral or object arguments.
Very arcane, I hope I didn't put everybody to sleep yet. Beware that I might have gotten some annotations wrong; I don't fully understand stubs and the way the CLR cooks delegate object members to make code as fast as possible. But there is certainly decent programming advice here. You really should favor instance methods as delegate targets; making them static is not an optimization.
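If a static method has to be used as a delegate target on a hot path, one way to apply this advice is to route the call through an instance method. The following is only a sketch under assumptions: `MathHelpers`, `AddForwarder` and `DelegateTargets` are hypothetical names, and whether the extra forwarding call pays off depends on the jitter (for example on whether the forwarder gets inlined), so it should be measured rather than assumed.

```csharp
using System;

// Hypothetical static helper that we would like to call through a delegate.
static class MathHelpers
{
    public static double Add(double x, double y) => x + y;
}

// Hypothetical forwarder: an instance method that simply calls the static one.
sealed class AddForwarder
{
    public static readonly AddForwarder Instance = new AddForwarder();

    public double Add(double x, double y) => MathHelpers.Add(x, y);
}

static class DelegateTargets
{
    // Target is null: invoked through the more elaborate static-method stub.
    public static readonly Func<double, double, double> StaticTarget = MathHelpers.Add;

    // Target is AddForwarder.Instance: invoked like any instance-method delegate.
    public static readonly Func<double, double, double> InstanceTarget = AddForwarder.Instance.Add;
}
```

Both delegates compute the same result; only the kind of target differs. The forwarder is a shared singleton, so creating the instance-method delegate does not allocate a new target object per delegate.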