With C# 7 (version 7.2 or later), you can create by-ref extension methods, which gets rid of the busywork of constantly storing the helper function's return value back into the variable.
This streamlines the rotate functions nicely and eliminates a common class of bug where you forget to re-store the function's return value. It may, however, introduce a new and completely different type of bug, where ValueTypes are inadvertently modified in-situ when you didn't want or expect them to be.
public static void Rol(ref this ulong ul) => ul = (ul << 1) | (ul >> 63);
public static void Rol(ref this ulong ul, int N) => ul = (ul << N) | (ul >> (64 - N));
public static void Ror(ref this ulong ul) => ul = (ul << 63) | (ul >> 1);
public static void Ror(ref this ulong ul, int N) => ul = (ul << (64 - N)) | (ul >> N);
/// note: ---^ ^---^--- extension method can now use 'ref' for ByRef semantics
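To make the two bug classes mentioned above concrete, here is a minimal sketch. The names RolByVal, RotatePitfalls, ForgotToStore and MutatedUnexpectedly are hypothetical, invented only for this illustration; only the body of Rol is taken from the code above.

```csharp
using System;

public static class RotateExtensions
{
    // By-value helper (hypothetical name): returns the rotated value,
    // so the caller has to store the result somewhere.
    public static ulong RolByVal(this ulong ul) => (ul << 1) | (ul >> 63);

    // By-ref helper, same body as Rol above: rewrites the caller's variable.
    public static void Rol(ref this ulong ul) => ul = (ul << 1) | (ul >> 63);
}

public static class RotatePitfalls
{
    // Old bug class: the by-value result is silently discarded.
    public static ulong ForgotToStore()
    {
        ulong x = 1;
        x.RolByVal();     // compiles fine, but x is never updated
        return x;         // still 1
    }

    // New bug class: a call that looks like a harmless query mutates the variable.
    public static ulong MutatedUnexpectedly()
    {
        ulong original = 1;
        original.Rol();   // nothing at the call site says "ref"
        return original;  // now 2
    }
}
```

Note that the call site of the by-ref version carries no 'ref' keyword, which is exactly why the second kind of bug is easy to overlook.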
Usually I would be sure to put [MethodImpl(MethodImplOptions.AggressiveInlining)] on small methods like these, but after some investigation (on x64) I found that it's not necessary at all here. If the JIT determines that a method is eligible, it will be inlined regardless, and that is the case for these methods. (Eligibility requires JIT optimization to be active, for example by unchecking the Visual Studio debugger checkbox 'Suppress JIT Optimization', which is enabled by default.)
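For reference, this is roughly what applying the attribute would look like if you did want to request inlining explicitly. It is only a sketch: the class name InlinedRotateExtensions and the method name RolInlined are made up for this example, and the attribute is a hint to the JIT rather than a guarantee.

```csharp
using System.Runtime.CompilerServices;

public static class InlinedRotateExtensions
{
    // Same rotate-left body as before; the attribute asks the JIT to inline
    // the method where possible (the JIT may still decline).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void RolInlined(ref this ulong ul) => ul = (ul << 1) | (ul >> 63);
}
```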
To demonstrate the use of a by-ref extension method, I'll focus on just the first method shown above, "rotate left", and compare the JIT output of the traditional by-value extension method with that of the newer by-ref approach. Here are the two test methods, to be compared on x64 Release in .NET 4.7 on Windows 10. As noted above, JIT optimization is not suppressed here, so under these test conditions, as you'll see, the functions disappear completely into inline code.
static ulong Rol_ByVal(this ulong ul) => (ul <<1) | (ul >> 63);
static void Rol_ByRef(ref this ulong ul) => ul = (ul <<1) | (ul >> 63);
// notice reassignment here ---^ (c̲a̲l̲l̲e̲e̲ doing it instead of caller)
And here is the C# code for each corresponding call site. Since the fully JIT-optimized AMD64 code is so small, I can just include it here as well. This is the optimal case:
static ulong x = 1; // static so it won't be optimized away in this simple test
// ------- ByVal extension method; c̲a̲l̲l̲e̲r̲ must reassign 'x' with the result -------
x = x.Rol_ByVal();
// 00007FF969CC0481 mov eax,dword ptr [7FF969BA4888h]
// 00007FF969CC0487 rol rax,1
// 00007FF969CC048A mov qword ptr [7FF969BA4888h],rax
// ------- New in C#7, ByRef extension method can directly alter 'x' in-situ -------
x.Rol_ByRef();
// 00007FF969CC0491 rol qword ptr [7FF969BA4888h],1
Wow. Yes, that's no joke. Right off the bat we can see that the glaring lack of an OpCodes.Rot-family of instructions in the ECMA CIL (.NET) intermediate language is a total non-issue; the jitter was able to see through our pile of C# workaround code to divine its simple and pure intention, and the x64 JIT implements it with great code. Impressively, the ByRef version uses a single instruction to perform the rotation directly on the main-memory target address without even loading it into a register.
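The JIT can only make this substitution because the shift-and-or expression really is a rotation. The following sketch checks that claim against a deliberately naive bit-at-a-time reference. All names here (RotateCheck, RolIdiom, RolReference, IdiomMatchesReference) are hypothetical and exist only for this illustration.

```csharp
using System;

public static class RotateCheck
{
    // The shift-and-or idiom from the text, parameterized by the rotation count.
    // C# uses only the low six bits of the count when shifting a 64-bit operand,
    // so n == 0 (where 64 - n == 64) is still well defined.
    public static ulong RolIdiom(ulong ul, int n) => (ul << n) | (ul >> (64 - n));

    // Naive reference: rotate left one bit at a time, n times.
    public static ulong RolReference(ulong ul, int n)
    {
        for (int i = 0; i < n; i++)
        {
            ulong top = ul >> 63;   // the bit about to fall off the left edge
            ul = (ul << 1) | top;   // shift left and wrap that bit around
        }
        return ul;
    }

    // Returns true if both versions agree for every count from 0 to 63.
    public static bool IdiomMatchesReference(ulong sample)
    {
        for (int n = 0; n < 64; n++)
            if (RolIdiom(sample, n) != RolReference(sample, n))
                return false;
        return true;
    }
}
```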
You can still see a residual trace of the excess copying that was necessary in the by-val case. Here it's just 8 bytes of shuffling, which won't be much of a problem, but remember that a ValueType could just as easily be several thousand bytes. Obviously, passing those around by value all the time would likely indicate a fundamental design flaw, but the point here is that just four simple lines of native code clearly show not only the potential for disaster--but the solution as well.
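The same pattern applies to larger value types. Below is a small sketch using a hypothetical 64-byte struct; Block64, WithRotatedA and RotateA are invented names standing in for whatever large ValueType you actually have. The by-value version has copy-in, copy-out semantics, while the by-ref version touches a single field in place.

```csharp
using System;

// Hypothetical 64-byte value type, standing in for a much larger struct.
public struct Block64
{
    public ulong A, B, C, D, E, F, G, H;
}

public static class Block64Extensions
{
    // By-value: semantically the callee works on its own copy of the struct
    // and hands a whole struct back; the caller's instance is untouched.
    public static Block64 WithRotatedA(this Block64 b)
    {
        b.A = (b.A << 1) | (b.A >> 63);
        return b;
    }

    // By-ref: the callee updates one field of the caller's instance directly.
    public static void RotateA(ref this Block64 b) => b.A = (b.A << 1) | (b.A >> 63);
}
```

With the by-value version the caller writes `blk = blk.WithRotatedA();`, while the by-ref version is just `blk.RotateA();`.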
To investigate further, we have to re-suppress JIT optimizations in the debugging session. Doing so will make our helper extension methods come back, with full bodies and stack frames. These clunkier versions will further exaggerate--and exacerbate--the problem shown so minimalistically above. First, let's look at the call sites. Here we really start to see the effect of traditional ValueType semantics, in other words, the lengths required to ensure that no stack frame can manipulate its parent's ValueType copies:
by-value:
x = x.Rol_ByVal();
// 00007FF969CE049C mov rcx,qword ptr [7FF969BC4888h]
// 00007FF969CE04A3 call 00007FF969CE00A8
// 00007FF969CE04A8 mov qword ptr [rbp-8],rax
// 00007FF969CE04AC mov rcx,qword ptr [rbp-8]
// 00007FF969CE04B0 mov qword ptr [7FF969BC4888h],rcx
by-reference:
x.Rol_ByRef();
// 00007FF969CE04B7 mov rcx,7FF969BC4888h
// 00007FF969CE04C1 call 00007FF969CE00B0
// ...all done, nothing to do here; the callee did everything in-place for us
As we might expect from the C# code associated with each of these two fragments, we see that the by-val caller has a bunch of work to do after the call returns. This is the process of overwriting the parent copy of the ulong value 'x' with the completely independent ulong value that's returned in the rax register.
Last but not least, it's instructive to look at the native code the x64 Release JIT emits for the Rol_ByVal and Rol_ByRef functions themselves. As noted, this requires forcing the JIT to "suppress" the optimizations it would normally apply during its once-per-app-launch (so-called "just-in-time") conversion of our .NET IL instructions into speedy native code, tuned for the detected platform.
In order to focus on the tiny but crucial difference between the two, I've stripped away some of the administrative boilerplate. (I left the stack frame setup and teardown for context, and to show how, in this example, that ancillary stuff pretty much dwarfs the actual contentful instructions.) Can you see the ByRef's indirection at work? Well, it helps that I pointed it out :-/
static ulong Rol_ByVal(this ulong ul) => (ul <<1) | (ul >> 63);
// 00007FF969CD0760 push rbp
// 00007FF969CD0761 sub rsp,20h
// 00007FF969CD0765 lea rbp,[rsp+20h]
// ...
// 00007FF969CE0E4C mov rax,qword ptr [rbp+10h]
// 00007FF969CE0E50 rol rax,1
// 00007FF969CD0798 lea rsp,[rbp]
// 00007FF969CD079C pop rbp
// 00007FF969CD079D ret
static void Rol_ByRef(ref this ulong ul) => ul = (ul <<1) | (ul >> 63);
// 00007FF969CD0760 push rbp
// 00007FF969CD0761 sub rsp,20h
// 00007FF969CD0765 lea rbp,[rsp+20h]
// ...
// 00007FF969CE0E8C mov rax,qword ptr [rbp+10h]
// 00007FF969CE0E90 rol qword ptr [rax],1 <--- !
// 00007FF969CD0798 lea rsp,[rbp]
// 00007FF969CD079C pop rbp
// 00007FF969CD079D ret
You might notice that both calls are in fact passing the parent's instance of the ulong value by reference--both examples are identical in this regard. The parent indicates the address where its private copy of ul resides in the upper stack frame. It turns out that it's not necessary to insulate callees from reading those instances where they lie, as long as we can be sure they never write to those pointers. This is a "lazy" or deferred approach, which assigns to each lower (child) stack frame the responsibility for preserving the ValueType semantics of its higher-up callers. There's no need for a caller to proactively copy any ValueType passed down to a child frame if the child never ends up overwriting it. To avoid as much unnecessary copying as possible, the decision is left to the child, which is the only party able to make it at the latest possible moment.
Also interesting is that we might have an explanation here for the clunky use of rax in the first 'ByVal' example I showed. After the by-value method had been completely reduced via inlining, why did the rotation still need to happen in a register?
Well, in these latest two full-method-body versions it's clear that the first method returns ulong and the second is void. Since a return value is passed in rax, the ByVal method here has to fetch it into that register anyway, so it's a no-brainer to rotate it there too. Because the ByRef method doesn't need to return any value, it doesn't need to stick anything anywhere for its caller, let alone in rax. It seems likely that "not having to bother with rax" liberates the ByRef code to target the ulong instance its parent has shared 'where it lies', using the fancy qword ptr to indirect into the parent's stack frame memory instead of using a register. So that's my speculative, but perhaps credible, explanation for the "residual rax" mystery we saw earlier.