Faster 16-bit multiplication algorithm for 8-bit MCU

Author: 猛儿187888 | Source: Internet | 2023-05-17 11:42

I'm searching for an algorithm to multiply two integer numbers that is better than the one below. Do you have a good idea about that? (The MCU where this code runs, an ATtiny84/85 or similar, has no mul/div instruction.)

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint16_t res=0;

    while (b) {
        if ( (b & 1) )
            res+=a;
        b>>=1;
        a+=a;
    }

    return res;
}

This algorithm, when compiled for the ATtiny85/84 with the avr-gcc compiler, is almost identical to the __mulhi3 routine that avr-gcc generates.

avr-gcc algorithm:

00000106 <__mulhi3>:
 106:   00 24           eor r0, r0
 108:   55 27           eor r21, r21
 10a:   04 c0           rjmp    .+8         ; 0x114 <__mulhi3+0xe>
 10c:   08 0e           add r0, r24
 10e:   59 1f           adc r21, r25
 110:   88 0f           add r24, r24
 112:   99 1f           adc r25, r25
 114:   00 97           sbiw    r24, 0x00   ; 0
 116:   29 f0           breq    .+10        ; 0x122 <__mulhi3+0x1c>
 118:   76 95           lsr r23
 11a:   67 95           ror r22
 11c:   b8 f3           brcs    .-18        ; 0x10c <__mulhi3+0x6>
 11e:   71 05           cpc r23, r1
 120:   b9 f7           brne    .-18        ; 0x110 <__mulhi3+0xa>
 122:   80 2d           mov r24, r0
 124:   95 2f           mov r25, r21
 126:   08 95           ret

umul16_ algorithm:

00000044 <umul16_>:
  44:   20 e0           ldi r18, 0x00   ; 0
  46:   30 e0           ldi r19, 0x00   ; 0
  48:   61 15           cp  r22, r1
  4a:   71 05           cpc r23, r1
  4c:   49 f0           breq    .+18        ; 0x60 
  4e:   60 ff           sbrs    r22, 0
  50:   02 c0           rjmp    .+4         ; 0x56 
  52:   28 0f           add r18, r24
  54:   39 1f           adc r19, r25
  56:   76 95           lsr r23
  58:   67 95           ror r22
  5a:   88 0f           add r24, r24
  5c:   99 1f           adc r25, r25
  5e:   f4 cf           rjmp    .-24        ; 0x48 
  60:   c9 01           movw    r24, r18
  62:   08 95           ret

Edit: The instruction set is available here.

6 solutions

#1

Summary

Consider swapping a and b (Original proposal)
Trying to avoid conditional jumps (Not successful optimization)
Reshaping of the input formula (estimated 35% gain)
Removing duplicated shift
Unrolling the loop: The "optimal" assembly
Convincing the compiler to give the optimal assembly

1. Consider swapping a and b

One improvement would be to first compare a and b, and swap them if a < b: you should use the smaller of the two as b, so that you have the minimum number of iterations. Note that you can avoid swapping by duplicating the code (if (a < b), jump to a mirrored code section), but I doubt it's worth it.
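
A minimal sketch of the swap, assuming a plain comparison in front of the original loop. The name umul16_swapped is hypothetical, and whether the extra comparison pays off depends on the inputs:

```cpp
#include <cstdint>

// Hypothetical variant of umul16_: loop over the smaller operand.
uint16_t umul16_swapped(uint16_t a, uint16_t b)
{
    if (a < b) {            // make b the smaller of the two
        uint16_t t = a;
        a = b;
        b = t;
    }

    uint16_t res = 0;
    while (b) {             // one iteration per significant bit of b
        if (b & 1)
            res += a;
        b >>= 1;
        a += a;
    }
    return res;
}
```

With the swap, umul16_swapped(3, 1000) loops 2 times (the bits of 3) instead of 10 times (the bits of 1000).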

 
2. Trying to avoid conditional jumps (Not successful optimization) 
 
Try: 
uint16_t umul16_(uint16_t a, uint16_t b)
{
    ///Here swap if necessary
    uint16_t accum=0;

    while (b) {
        accum += ((b&1) * uint16_t(0xffff)) & a; //Hopefully this multiplication is optimized away
        b>>=1;
        a+=a;
    }

    return accum;
}
 
From Sergio's feedback, this didn't bring any improvement.
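
For reference, the same mask can be built without any multiplication by negating the low bit. This is only a sketch of the idea under the hypothetical name umul16_mask, not a measured improvement; on AVR it may well compile to code no better than the branching version:

```cpp
#include <cstdint>

// Branch-free accumulate: add (a & mask), where mask is all ones or all zeros.
uint16_t umul16_mask(uint16_t a, uint16_t b)
{
    uint16_t accum = 0;
    while (b) {
        // (b & 1) is 0 or 1; 0 - 1 in unsigned arithmetic gives all ones,
        // so the mask is 0x0000 or 0xffff after truncation to 16 bits
        uint16_t mask = (uint16_t)(0u - (b & 1u));
        accum += mask & a;
        b >>= 1;
        a += a;
    }
    return accum;
}
```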
 
 
3. Reshaping of the input formula 
 
Considering that the target architecture has basically only 8-bit instructions, if you separate the upper and lower 8 bits of the input variables, you can write:
a = a1 * 0x100 + a0;
b = b1 * 0x100 + b0;

a * b = a1 * b1 * 0x10000 + a0 * b1 * 0x100 + a1 * b0 * 0x100 + a0 * b0

Now, the cool thing is that we can throw away the term a1 * b1 * 0x10000, because the factor 0x10000 pushes it entirely out of the 16-bit result register.
(16bit) a * b = a0 * b1 * 0x100 + a1 * b0 * 0x100 + a0 * b0

Furthermore, the a0*b1 and a1*b0 terms can be treated as 8-bit multiplications, because of the factor 0x100: only the low 8 bits of each product survive, and anything above 255 is shifted out of the register.
So far exciting! ... But here reality strikes: a0 * b0 has to be treated as a 16-bit multiplication, as you have to keep all resulting bits. a0 will have to be kept in 16 bits to allow the left shifts. This multiplication has half of the iterations of a * b, and it is partly 8-bit (because of b0), but you still have to take into account the two 8-bit multiplications mentioned before, and the final composition of the result. We need further reshaping!
So now I factor out b0.
(16bit) a * b = a0 * b1 * 0x100 + b0 * (a0 + a1 * 0x100)

But
(a0 + a1 * 0x100) = a

So we get:
(16bit) a * b = a0 * b1 * 0x100 + b0 * a

If N is the number of iterations of the original a * b, the first term is now an 8-bit multiplication with N/2 iterations, and the second a 16-bit * 8-bit multiplication with N/2 iterations. With M the number of instructions per iteration in the original a * b, the 8-bit * 8-bit iteration has half of the instructions, and the 16-bit * 8-bit one about 80% of M (one shift instruction less for b0 compared to b). Putting it together we have N/2*M/2 + N/2*M*0.8 = N*M*0.65 complexity, so an expected saving of ~35% with respect to the original N*M. Sounds promising.
This is the code: 
uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //This effectively needs to copy the data
    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Here a0 and b1 could be swapped (to have b1 < a0)
    while (b1) {///Maximum 8 cycles
        if ( (b1 & 1) )
            res1+=a0;
        b1>>=1;
        a0+=a0;
    }

    uint16_t res = (uint16_t) res1 * 256; //Should be optimized away, it's not even a copy!

    //Here swapping wouldn't make much sense
    while (b0) {///Maximum 8 cycles
        if ( (b0 & 1) )
            res+=a;
        b0>>=1;
        a+=a;
    }

    return res;
}
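
The split that this code relies on, a 16-bit product equal to the low byte of a times the high byte of b (shifted into the high byte) plus the low byte of b times the full a, can be sanity-checked on a host machine. The sketch below uses native multiplication purely to verify the algebra; split_product is a hypothetical name and is not meant to run on the MCU:

```cpp
#include <cstdint>

// (a0 * b1) shifted into the high byte, plus b0 * a, truncated to 16 bits.
uint16_t split_product(uint16_t a, uint16_t b)
{
    uint8_t a0 = a & 0xff;
    uint8_t b0 = b & 0xff;
    uint8_t b1 = b >> 8;

    // Only the low 8 bits of a0 * b1 survive the shift by 8.
    uint8_t high = (uint8_t)(a0 * b1);
    // 8-bit x 16-bit product, truncated to 16 bits.
    uint16_t low = (uint16_t)((uint32_t)b0 * a);

    return (uint16_t)(((uint16_t)high << 8) + low);
}
```

For any inputs, split_product(a, b) should equal (uint16_t)(a * b) computed natively, which is what the two-loop version computes with shifts and adds.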
 
Also, splitting into 2 loops should, in theory, double the chance of skipping some iterations: N/2 might be a slight overestimate.
A tiny further improvement consists in avoiding the last, unnecessary shift of the a variables. Small side note: if either b0 or b1 is zero, this costs 2 extra instructions. But it also saves the first check of b0 and b1, which is the most expensive one because it cannot reuse the zero flag left by the shift operation for the loop's conditional jump.
uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //This effectively needs to copy the data
    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Here a0 and b1 could be swapped (to have b1 < a0)
    if ( (b1 & 1) )
        res1+=a0;
    b1>>=1;
    while (b1) {///Maximum 7 cycles
        a0+=a0;
        if ( (b1 & 1) )
            res1+=a0;
        b1>>=1;
    }

    uint16_t res = (uint16_t) res1 * 256; //Should be optimized away, it's not even a copy!

    //Here swapping wouldn't make much sense
    if ( (b0 & 1) )
        res+=a;
    b0>>=1;
    while (b0) {///Maximum 7 cycles
        a+=a;
        if ( (b0 & 1) )
            res+=a;
        b0>>=1;
    }

    return res;
}
 
 
 
4. Removing duplicated shift 
 
Is there still room for improvement? Yes, as the bytes in a0 get shifted twice. So there should be a benefit in combining the two loops. It might be a little tricky to convince the compiler to do exactly what we want, especially with the result register.
So, we process b0 and b1 in the same loop. The first thing to settle is the loop exit condition. So far, testing whether b0/b1 has been cleared has been convenient because it avoids using a counter. Furthermore, after the shift right, the zero flag may already be set if the result of the operation is zero, and this flag allows a conditional jump without further evaluation.
Now the loop exit condition could be the failure of (b0 || b1). However, this could require expensive computation. One solution is to compare b0 and b1 and jump to 2 different code sections: if b1 > b0 I test the condition on b1, else I test the condition on b0. I prefer another solution with 2 loops: the first exits when b0 is zero, the second when b1 is zero. There will be cases in which the b1 loop does zero iterations. The point is that in the second loop I know b0 is zero, so I can reduce the number of operations performed.
Now, let's forget about the exit condition and try to join the 2 loops of the previous section.
uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint16_t res = 0;

    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Swapping probably doesn't make much sense anymore
    if ( (b1 & 1) )
        res+=(uint16_t)((uint8_t)(a & 0xff))*256;
    //Hopefully the compiler understands it has simply to add the low 8bit register of a to the high 8bit register of res

    if ( (b0 & 1) )
        res+=a;

    b1>>=1;
    b0>>=1;
    while (b0) {///N cycles, maximum 7
        a+=a;
        if ( (b1 & 1) )
            res+=(uint16_t)((uint8_t)(a & 0xff))*256;
        if ( (b0 & 1) )
            res+=a;
        b1>>=1;
        b0>>=1; //I try to put as last the one that will leave the carry flag in the desired state
    }

    uint8_t a0 = a & 0xff; //Again, not a real copy but a register selection

    while (b1) {///P cycles, maximum 7 - N cycles
        a0+=a0;
        if ( (b1 & 1) )
            res+=(uint16_t) a0 * 256;
        b1>>=1;
    }
    return res;
}
 
Thanks to Sergio for providing the generated assembly (-Ofast). At first glance, considering the outrageous number of mov instructions in the code, it seems the compiler did not take the hints I gave it about how to use the registers.
Inputs are: r22,r23 and r24,r25.
AVR Instruction Set: Quick reference, Detailed documentation
sbrs //Tests a single bit in a register and skips the next instruction if the bit is set. Skip takes 2 clocks. 
ldi // Load immediate, 1 clock
sbiw // Subtracts immediate to *word*, 2 clocks

    00000010 :
      10:    70 ff           sbrs    r23, 0
      12:    39 c0           rjmp    .+114        ; 0x86 <__SREG__+0x47>
      14:    41 e0           ldi    r20, 0x01    ; 1
      16:    00 97           sbiw    r24, 0x00    ; 0
      18:    c9 f1           breq    .+114        ; 0x8c <__SREG__+0x4d>
      1a:    34 2f           mov    r19, r20
      1c:    20 e0           ldi    r18, 0x00    ; 0
      1e:    60 ff           sbrs    r22, 0
      20:    07 c0           rjmp    .+14         ; 0x30 
      22:    28 0f           add    r18, r24
      24:    39 1f           adc    r19, r25
      26:    04 c0           rjmp    .+8          ; 0x30 
      28:    e4 2f           mov    r30, r20
      2a:    45 2f           mov    r20, r21
      2c:    2e 2f           mov    r18, r30
      2e:    34 2f           mov    r19, r20
      30:    76 95           lsr    r23
      32:    66 95           lsr    r22
      34:    b9 f0           breq    .+46         ; 0x64 <__SREG__+0x25>
      36:    88 0f           add    r24, r24
      38:    99 1f           adc    r25, r25
      3a:    58 2f           mov    r21, r24
      3c:    44 27           eor    r20, r20
      3e:    42 0f           add    r20, r18
      40:    53 1f           adc    r21, r19
      42:    70 ff           sbrs    r23, 0
      44:    02 c0           rjmp    .+4          ; 0x4a <__SREG__+0xb>
      46:    24 2f           mov    r18, r20
      48:    35 2f           mov    r19, r21
      4a:    42 2f           mov    r20, r18
      4c:    53 2f           mov    r21, r19
      4e:    48 0f           add    r20, r24
      50:    59 1f           adc    r21, r25
      52:    60 fd           sbrc    r22, 0
      54:    e9 cf           rjmp    .-46         ; 0x28 
      56:    e2 2f           mov    r30, r18
      58:    43 2f           mov    r20, r19
      5a:    e8 cf           rjmp    .-48         ; 0x2c 
      5c:    95 2f           mov    r25, r21
      5e:    24 2f           mov    r18, r20
      60:    39 2f           mov    r19, r25
      62:    76 95           lsr    r23
      64:    77 23           and    r23, r23
      66:    61 f0           breq    .+24         ; 0x80 <__SREG__+0x41>
      68:    88 0f           add    r24, r24
      6a:    48 2f           mov    r20, r24
      6c:    50 e0           ldi    r21, 0x00    ; 0
      6e:    54 2f           mov    r21, r20
      70:    44 27           eor    r20, r20
      72:    42 0f           add    r20, r18
      74:    53 1f           adc    r21, r19
      76:    70 fd           sbrc    r23, 0
      78:    f1 cf           rjmp    .-30         ; 0x5c <__SREG__+0x1d>
      7a:    42 2f           mov    r20, r18
      7c:    93 2f           mov    r25, r19
      7e:    ef cf           rjmp    .-34         ; 0x5e <__SREG__+0x1f>
      80:    82 2f           mov    r24, r18
      82:    93 2f           mov    r25, r19
      84:    08 95           ret
      86:    20 e0           ldi    r18, 0x00    ; 0
      88:    30 e0           ldi    r19, 0x00    ; 0
      8a:    c9 cf           rjmp    .-110        ; 0x1e 
      8c:    40 e0           ldi    r20, 0x00    ; 0
      8e:    c5 cf           rjmp    .-118        ; 0x1a 
 
 
 
5. Unrolling the loop: The "optimal" assembly 
 
With all this information, let's try to understand what the "optimal" solution would be, given the architecture constraints. "Optimal" is quoted because what is "optimal" depends a lot on the input data and on what we want to optimize. Let's assume we want to optimize the number of cycles in the worst case. If we go for the worst case, loop unrolling is a reasonable choice: we know we have 8 iterations, and we remove all the tests that check whether we have finished (whether b0 and b1 are zero). So far we used the trick "we shift, and we check the zero flag" to decide whether to exit a loop. With this requirement removed, we can use a different trick: we shift, and we check the carry bit (the bit pushed out of the register by the shift) to decide whether the result should be updated. Given the instruction set, in assembly "narrative" code the instructions become the following.
//Input: a = a1 * 256 + a0, b = b1 * 256 + b0
//Output: r = r1 * 256 + r0

Preliminary:
P0 r0 = 0 (CLR)
P1 r1 = 0 (CLR)

Main block:
0 Shift right b0 (LSR)
1 If carry is not set skip 2 instructions = jump to 4 (BRCC)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 Shift right b1 (LSR)
5 If carry is not set skip 1 instruction = jump to 7 (BRCC)
6 r1 = r1 + a0 (ADD)
7 a0 = a0 + a0 (ADD)  
8 a1 = a1 + a1 + carry from prev. (ADC)

[Repeat same instructions for another 7 times]
 
A branch takes 1 cycle if no jump is taken, 2 otherwise. All other instructions take 1 cycle. So the state of b1 has no influence on the number of cycles, while an iteration takes 9 cycles if the current bit of b0 is 1, and 8 cycles if it is 0. Counting the initialization, 8 iterations, and skipping the last update of a0 and a1, in the worst case (b0 = 11111111b) we have a total of 8 * 9 + 2 - 2 = 72 cycles. I wouldn't know which C++ implementation would convince the compiler to generate it. Maybe:
 void iterate(uint8_t& b0,uint8_t& b1,uint16_t& a, uint16_t& r) {
     const uint8_t temp0 = b0;
     b0 >>=1;
     if (temp0 & 0x01) {//Will this convince him to use the carry flag?
         r += a;
     }
     const uint8_t temp1 = b1;
     b1 >>=1;
     if (temp1 & 0x01) {
         r+=(uint16_t)((uint8_t)(a & 0xff))*256;
     }
     a += a;
 }

 uint16_t umul16_(uint16_t a, uint16_t b) {
     uint16_t r = 0;
     uint8_t b0 = b & 0xff;
     uint8_t b1 = b >>8;

     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r); //Hopefully he understands he doesn't need the last update for variable a
     return r;
 }
 
But, given the previous result, to really obtain the desired code one should switch to assembly!
 
Finally, one could also consider a more extreme interpretation of loop unrolling: the sbrc/sbrs instructions allow testing a specific bit of a register. We can therefore avoid shifting b0 and b1, and check a different bit in each iteration. The only problem is that those instructions can only skip the next instruction, not jump to an arbitrary target. So, in "narrative code" it will look like this:
Main block:
0 Test Nth bit of b0 (SBRS). If set jump to 2 (+ 1cycle) otherwise continue with 1
1 Jump to 4 (RJMP)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 Test Nth bit of b1 (SBRC). If cleared jump to 6 (+ 1cycle) otherwise continue with 5
5 r1 = r1 + a0 (ADD)
6 a0 = a0 + a0 (ADD)  
7 a1 = a1 + a1 + carry from prev. (ADC)
 
While the second substitution saves 1 cycle, there's no clear advantage in the first substitution. However, I believe the C++ code might be easier for the compiler to interpret. Considering 8 iterations, the initialization, and skipping the last update of a0 and a1, we now have 64 cycles.
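
The two worst-case totals quoted in this section (72 and 64 cycles) follow from simple per-iteration sums. A small host-side model makes the arithmetic explicit; the function names are hypothetical, and the cycle costs are the ones assumed in the text (1 cycle per instruction, 2 for a taken branch, a skip, or an rjmp):

```cpp
#include <cstdint>

// Number of set bits in an 8-bit value.
static uint8_t popcount8(uint8_t v)
{
    uint8_t n = 0;
    while (v) {
        n += v & 1;
        v >>= 1;
    }
    return n;
}

// LSR/BRCC variant: each iteration costs 8 cycles, plus 1 if the b0 bit is set.
// 2 cycles of initialization, minus 2 for the skipped last doubling of a.
unsigned cycles_shift_variant(uint8_t b0)
{
    return 2 + 8 * 8 + popcount8(b0) - 2;
}

// SBRS/SBRC variant: each iteration costs 7 cycles, plus 1 if the b0 bit is set.
unsigned cycles_bit_test_variant(uint8_t b0)
{
    return 2 + 8 * 7 + popcount8(b0) - 2;
}
```

In this model the high byte b1 never changes the count, which matches the observation that only b0 influences the timing.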
C++ code: 
 template <uint8_t mask>
 void iterateWithMask(const uint8_t& b0,const uint8_t& b1, uint16_t& a, uint16_t& r) {
     if (b0 & mask)
         r += a;
     if (b1 & mask)
         r+=(uint16_t)((uint8_t)(a & 0xff))*256;
     a += a;
 }

 uint16_t umul16_(uint16_t a, const uint16_t b) {
     uint16_t r = 0;
     const uint8_t b0 = b & 0xff;
     const uint8_t b1 = b >>8;

     iterateWithMask<0x01>(b0,b1,a,r);
     iterateWithMask<0x02>(b0,b1,a,r);
     iterateWithMask<0x04>(b0,b1,a,r);
     iterateWithMask<0x08>(b0,b1,a,r);
     iterateWithMask<0x10>(b0,b1,a,r);
     iterateWithMask<0x20>(b0,b1,a,r);
     iterateWithMask<0x40>(b0,b1,a,r);
     iterateWithMask<0x80>(b0,b1,a,r);

     //Hopefully he understands he doesn't need the last update for a
     return r;
 }
 
Note that in this implementation 0x01, 0x02, ... are not real runtime values, but just a hint telling the compiler which bit to test. Therefore, the mask cannot be obtained by shifting at run time: unlike all other functions seen so far, this one really has no equivalent loop version.
One big problem is the line
r+=(uint16_t)((uint8_t)(a & 0xff))*256;
 
It should be just a sum of the upper register of r and the lower register of a, but it does not get interpreted as I would like. Another option:
r+=(uint16_t) 256 *((uint8_t)(a & 0xff));
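
Another way to express "add the low byte of a to the high byte of r" is to keep the result as two explicit bytes, so no 16-bit multiplication by 256 appears in the source at all. This is only a sketch with hypothetical helper names (add16, umul16_bytes); whether avr-gcc maps it to a single add on the high register is an assumption that would have to be checked in the generated assembly:

```cpp
#include <cstdint>

// Add the 16-bit value a into the byte pair r1:r0, propagating the carry by hand.
static inline void add16(uint8_t &r0, uint8_t &r1, uint16_t a)
{
    uint8_t a0 = a & 0xff;
    uint8_t a1 = a >> 8;
    uint8_t old = r0;
    r0 = (uint8_t)(r0 + a0);
    r1 = (uint8_t)(r1 + a1 + (r0 < old));   // (r0 < old) is the carry out of the low byte
}

uint16_t umul16_bytes(uint16_t a, uint16_t b)
{
    uint8_t r0 = 0, r1 = 0;                 // result kept as two separate bytes
    uint8_t b0 = b & 0xff;
    uint8_t b1 = b >> 8;

    for (uint8_t i = 0; i < 8; ++i) {
        if (b0 & 1)
            add16(r0, r1, a);               // full 16-bit add
        if (b1 & 1)
            r1 = (uint8_t)(r1 + (uint8_t)a); // low byte of a into high byte of r
        b0 >>= 1;
        b1 >>= 1;
        a = (uint16_t)(a + a);
    }
    return (uint16_t)(((uint16_t)r1 << 8) | r0);
}
```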
 
 
 
6. Convincing the compiler to give the optimal assembly 
 
We can also keep a constant and shift the result r instead. In this case we process b starting from the most significant bit. The complexity is equivalent, but it might be easier for the compiler to digest. Also, this time we have to be careful to write the last iteration explicitly, since it must not shift r left again.
 template <uint8_t mask>
 void inverseIterateWithMask(const uint8_t& b0,const uint8_t& b1,const uint16_t& a, const uint8_t& a0, uint16_t& r) {
     if (b0 & mask)
         r += a;
     if (b1 & mask)
         r+=(uint16_t)256*a0; //Hopefully easier to understand for the compiler?
     r += r;
 }

 uint16_t umul16_(const uint16_t a, const uint16_t b) {
     uint16_t r = 0;
     const uint8_t b0 = b & 0xff;
     const uint8_t b1 = b >>8;
     const uint8_t a0 = a & 0xff;

     inverseIterateWithMask<0x80>(b0,b1,a,a0,r);
     inverseIterateWithMask<0x40>(b0,b1,a,a0,r);
     inverseIterateWithMask<0x20>(b0,b1,a,a0,r);
     inverseIterateWithMask<0x10>(b0,b1,a,a0,r);
     inverseIterateWithMask<0x08>(b0,b1,a,a0,r);
     inverseIterateWithMask<0x04>(b0,b1,a,a0,r);
     inverseIterateWithMask<0x02>(b0,b1,a,a0,r);

     //Last iteration:
     if (b0 & 0x01)
         r += a;
     if (b1 & 0x01)
         r+=(uint16_t)256*a0;

     return r;
 }
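
To make the order of operations in the most-significant-bit-first scheme easier to follow, here is a loop form of the same idea. It is a sketch under the hypothetical name umul16_msb_first, written for readability rather than speed; the unrolled template version above is the one aimed at the compiler:

```cpp
#include <cstdint>

// MSB-first shift-and-add: a stays constant, the partial result r is doubled.
uint16_t umul16_msb_first(uint16_t a, uint16_t b)
{
    const uint8_t a0 = a & 0xff;
    uint8_t b0 = b & 0xff;
    uint8_t b1 = b >> 8;
    uint16_t r = 0;

    for (uint8_t i = 0; i < 8; ++i) {
        if (i != 0)
            r += r;                     // double the partial result before each new bit
        if (b0 & 0x80)
            r += a;                     // contribution of the low byte of b
        if (b1 & 0x80)
            r += (uint16_t)(a0 << 8);   // contribution of the high byte of b
        b0 <<= 1;                       // bring the next lower bit to the top
        b1 <<= 1;
    }
    return r;
}
```

Doubling before each bit except the first is equivalent to doubling after each bit except the last, which is why the unrolled version writes its final iteration without the r += r step.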

                        
                           
							  
#2

The compiler might be able to produce shorter code if you use the ternary operator to choose whether to add 'a' to your accumulator; this depends on the cost of test and branch and on how your compiler generates code.
Swap the arguments to reduce the loop count.
uint16_t umul16_(uint16_t op1, uint16_t op2)
{
    uint16_t accum=0;
    uint16_t a, b;
    a=op1; b=op2;
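    if( op1<op2 ) { a=op2; b=op1; } //swap operands to loop on smaller
    while (b) {
        accum += (b&1)?a:0; //ternary operator picks a or 0
        b>>=1;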
        a+=a;
    }

    return accum;
}
 
Many years ago, I wrote "forth", which promoted a compute-rather-than-branch approach, and that suggests picking which value to use.
You can use an array to avoid the test completely, which your compiler can likely turn into a load from an offset. Define an array containing two values, 0 and a, and update the value for a at the end of the loop:
uint16_t umul16_(uint16_t op1, uint16_t op2)
{
    uint16_t accum=0;
    uint16_t pick[2];
    uint16_t a, b;
    a=op1; b=op2;
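    if( op1<op2 ) { a=op2; b=op1; } //swap operands to loop on smaller
    pick[0]=0;
    pick[1]=a;
    while (b) {
        accum += pick[(b&1)]; //picks either 0 or a
        b>>=1;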
        pick[1] += pick[1]; //(a+=a);
    }

    return accum;
}
 
Yeah, evil. But I don't normally write code like that.
Edit - revised to add a swap so the loop runs on the smaller of op1 or op2 (fewer passes). That would eliminate the usefulness of testing for an argument = 0.
							     
							                          
                           
							  
#3

Well, a mix of LUT and shift usually works.
Something along these lines, multiplying 8-bit entities. Let's consider them as made up of two quads (4-bit halves):
// uint4_t is not a standard type; it stands for a 4-bit value held in a uint8_t
uint4_t u1, l1, u2, l2;
uint8_t a = 16*u1 + l1;
uint8_t b = 16*u2 + l2;

product = 256*u1*u2 + 16*u1*l2 + 16*u2*l1 + l1*l2;

inline uint4_t hi( uint8_t v ) { return v >> 4; }
inline uint4_t lo( uint8_t v ) { return v & 15; }

inline uint8_t LUT( uint4_t x, uint4_t y ) {
    static uint8_t lut[256] = ...; // lut[x | y << 4] holds x * y
    return lut[x | y << 4];
}

uint16_t multiply(uint8_t a, uint8_t b) {
    // parentheses are needed because + binds tighter than <<
    return ((uint16_t)LUT(hi(a), hi(b)) << 8) +
           (((uint16_t)LUT(hi(a), lo(b)) + (uint16_t)LUT(lo(a), hi(b))) << 4) +
           (uint16_t)LUT(lo(a), lo(b));
}
 
Just fill lut[] with the results of the multiplication. In your case, depending on memory, you could go with quads (256-entry LUT) or with bytes (65536-entry LUT) or anything in between.
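
A self-contained sketch of the quad (4-bit) variant follows. The names nibble_lut, init_nibble_lut and multiply8 are hypothetical. The table is filled at start-up with native multiplication so the sketch can be checked on a host; on an MCU without a multiplier the 256 values would more plausibly be precomputed and stored as a constant:

```cpp
#include <cstdint>

// nibble_lut[x | (y << 4)] = x * y for x, y in 0..15
static uint8_t nibble_lut[256];

// Fill the table once; every entry fits in a byte (at most 15 * 15 = 225).
void init_nibble_lut()
{
    for (uint8_t y = 0; y < 16; ++y)
        for (uint8_t x = 0; x < 16; ++x)
            nibble_lut[x | (y << 4)] = (uint8_t)(x * y);
}

static inline uint8_t lut(uint8_t x, uint8_t y)
{
    return nibble_lut[x | (y << 4)];
}

// 8 x 8 -> 16 bit product assembled from four 4 x 4 lookups.
uint16_t multiply8(uint8_t a, uint8_t b)
{
    uint8_t ah = a >> 4, al = a & 15;
    uint8_t bh = b >> 4, bl = b & 15;

    return (uint16_t)(((uint16_t)lut(ah, bh) << 8)
                    + (((uint16_t)lut(ah, bl) + lut(al, bh)) << 4)
                    + lut(al, bl));
}
```

A 16 x 16 -> 16 bit product would then need three such 8 x 8 products (low by low, and the two cross terms), so the lookup count grows quickly.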
							     
							                          
                           
							  
#4

One approach is to unroll the loop. I don't have a compiler for the platform you're using so I can't look at the generated code, but an approach like this could help.
The performance of this code is less data-dependent: you go faster in the worst case by not checking whether you're in the best case. Code size is a bit bigger, but nowhere near the size of a lookup table.
(Note: code untested, off the top of my head. I'm curious about what the generated code looks like!)
#define UMUL16_STEP(a, b, shift) \
    if ((b) & (1U <<(shift))) result += ((a) <<(shift));

uint16_t umul16(uint16_t a, uint16_t b)
{
    uint16_t result = 0;

    UMUL16_STEP(a, b, 0);
    UMUL16_STEP(a, b, 1);
    UMUL16_STEP(a, b, 2);
    UMUL16_STEP(a, b, 3);
    UMUL16_STEP(a, b, 4);
    UMUL16_STEP(a, b, 5);
    UMUL16_STEP(a, b, 6);
    UMUL16_STEP(a, b, 7);
    UMUL16_STEP(a, b, 8);
    UMUL16_STEP(a, b, 9);
    UMUL16_STEP(a, b, 10);
    UMUL16_STEP(a, b, 11);
    UMUL16_STEP(a, b, 12);
    UMUL16_STEP(a, b, 13);
    UMUL16_STEP(a, b, 14);
    UMUL16_STEP(a, b, 15);

    return result;
}
 
Update:
Depending on what your compiler does, the UMUL16_STEP macro can change. An alternative might be:
#define UMUL16_STEP(a, b, shift) \
    if ((b) & (1U <<(shift))) result += (a); (a) <<= 1;
 
With this approach the compiler might be able to use the sbrc instruction to avoid branches.
My guess for how the assembler should look per bit, where r0:r1 is the result, r2:r3 is a and r4:r5 is b:
sbrc r4, 0
add r0, r2
sbrc r4, 0
adc r1, r3
lsl r2
rol r3
 
This should execute in constant time without a branch. Test the bits in r4 and then test the bits in r5 for the higher eight bits. Based on my reading of the instruction set manual, this should execute the multiplication in 96 cycles (16 bits at 6 cycles each).
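
For completeness, here is the alternative macro as a full function that can be compiled and checked on a host. The macro name UMUL16_STEP_SHIFT and the do/while wrapper are additions for this sketch (the wrapper only makes the two statements behave as one); whether the compiler really emits sbrc for it is untested:

```cpp
#include <cstdint>

// One step: add a if the given bit of b is set, then double a for the next bit.
#define UMUL16_STEP_SHIFT(a, b, bit) \
    do { if ((b) & (1U << (bit))) result += (a); (a) <<= 1; } while (0)

uint16_t umul16_unrolled(uint16_t a, uint16_t b)
{
    uint16_t result = 0;

    UMUL16_STEP_SHIFT(a, b, 0);
    UMUL16_STEP_SHIFT(a, b, 1);
    UMUL16_STEP_SHIFT(a, b, 2);
    UMUL16_STEP_SHIFT(a, b, 3);
    UMUL16_STEP_SHIFT(a, b, 4);
    UMUL16_STEP_SHIFT(a, b, 5);
    UMUL16_STEP_SHIFT(a, b, 6);
    UMUL16_STEP_SHIFT(a, b, 7);
    UMUL16_STEP_SHIFT(a, b, 8);
    UMUL16_STEP_SHIFT(a, b, 9);
    UMUL16_STEP_SHIFT(a, b, 10);
    UMUL16_STEP_SHIFT(a, b, 11);
    UMUL16_STEP_SHIFT(a, b, 12);
    UMUL16_STEP_SHIFT(a, b, 13);
    UMUL16_STEP_SHIFT(a, b, 14);
    UMUL16_STEP_SHIFT(a, b, 15);

    return result;
}
```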
							     
							                          
                           
							  
#5

A non-answer, tinyARM assembler (web doc) instead of C++ or C. I modified a pretty generic multiply-by-squares-lookup for speed (<50 cycles excluding call&return overhead) at the cost of only fitting into AVRs with no less than 1KByte of RAM, using 512 aligned bytes for a table of the lower half of squares. At 20 MHz, that would nicely meet the 2 max 3 usec time limit still not showing up in the question proper - but Sergio Formiggini wanted 16 MHz. As of 2015/04, there is just one ATtiny from Atmel with that much RAM, and that is specified up to 8 MHz … (Rolling your "own" (e.g., from OpenCores) your FPGA probably has a bunch of fast multipliers (18×18 bits seems popular), if not processor cores.)
 For a stab at fast shift-and-add, have a look at shift and add, factor shifting left, unrolled 16×16→16 and/or improve on it (wiki post). (You might well create that community wiki answer begged for in the question.) 
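
The multiply-by-squares lookup mentioned above rests on the quarter-square identity a * b = floor((a + b)^2 / 4) - floor((a - b)^2 / 4). The sketch below illustrates it on a host with a single table of 16-bit entries; the names are hypothetical, and the layout is deliberately simpler than the assembly that follows, which splits the table into low and high bytes:

```cpp
#include <cstdint>

// quarter_sq[n] = floor(n * n / 4) for n = 0..510 (the largest sum of two bytes).
static uint16_t quarter_sq[511];

void init_quarter_squares()
{
    for (uint16_t n = 0; n < 511; ++n)
        quarter_sq[n] = (uint16_t)(((uint32_t)n * n) / 4);   // at most 65025
}

// 8 x 8 -> 16 bit product from two table lookups and one subtraction.
uint16_t mul8_quarter_squares(uint8_t a, uint8_t b)
{
    uint16_t sum  = (uint16_t)(a + b);
    uint16_t diff = (a >= b) ? (uint16_t)(a - b) : (uint16_t)(b - a);
    return (uint16_t)(quarter_sq[sum] - quarter_sq[diff]);
}

// 16 x 16 -> 16 bit product: low-by-low term plus the two cross terms in the high byte.
uint16_t umul16_quarter_squares(uint16_t a, uint16_t b)
{
    uint8_t a0 = a & 0xff, a1 = a >> 8;
    uint8_t b0 = b & 0xff, b1 = b >> 8;

    uint8_t cross = (uint8_t)(mul8_quarter_squares(a1, b0) + mul8_quarter_squares(a0, b1));
    return (uint16_t)(mul8_quarter_squares(a0, b0) + ((uint16_t)cross << 8));
}
```

The floors cancel because a + b and a - b always have the same parity, so both squares lose the same fraction when divided by 4.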
.def    a0  = r16   ; factor low byte
.def    a1  = r17
#warning two warnings about preceding definitions of
#warning  r16 and r17 are due and may as well be ignored
.def    a   = r16   ; 8-bit factor
.def    b   = r17   ; 8-bit factor ; or r18, rather?
.def    b0  = r18   ; factor low byte
.def    b1  = r19
.def    p0  = r20   ; product low byte
.def    p1  = r21

; "squares table" SqTab shall be two 512 Byte tables of
;  squares of 9-bit natural numbers, divided by 4

; Idea: exploit p = a * b = Squares[a+b] - Squares[a-b]

init:
    ldi     r16, 0x73
    ldi     r17, 0xab
    ldi     r18, 23
    ldi     r19, 1
    ldi     r20, HIGH(SRAM_SIZE)
    cpi     r20, 2
    brsh    fillSqTable ; ATtiny 1634?
    rjmp    mpy16T16
fillSqTable:
    ldi     r20, SqTabH
    subi    r20, -2
    ldi     zh, SqTabH
    clr     zl
; generate squares by adding up odd numbers starting at 1 += -1
    ldi     r22, 1
    clr     r23
    ser     r26
    ser     r27
fillLoop:
    add     r22, r26
    adc     r23, r27
    adiw    r26, 2
    mov     r21, r23
    lsr     r21         ; get bits 9:2
    mov     r21, r22
    ror     r21
    lsr     r21
    bst     r23, 1
    bld     r21, 7
    st      z+, r21
    cp      zh, r20
    brne    fillLoop
    rjmp    mpy16F16

; assembly lines are marked up with cycle count
;  and (latest) start cycle in block.
; If first line in code block, the (latest) block start cycle
;  follows; else if last line, the (max) block cycle total

;**************************************************************
;*
;* "mpy16F16" - 16x16->16 Bit Unsigned Multiplication
;*                        using table lookup
;* Sergio Formiggini special edition
;* Multiplies  two 16-bit register values a1:a0 and b1:b0.
;* The result is placed in p1:p0.
;*
;* Number of flash words: 318 + return = 
;*                       (40 + 256(flash table) + 22(RAM init))
;* Number of cycles     : 49 + return
;* Low  registers used  : None
;* High registers used  : 7+2 (a1:a0, b1:b0, p1:p0, sq;
;*                             + Z(r31:r30))
;* RAM bytes used       : 512 (squares table)
;*
;**************************************************************
mpy16F16:
    ldi     ZH, SqTabH>>1;1 0   0   squares table>>1
    mov     ZL, a0      ; 1 1
    add     ZL, b0      ; 1 2       a0+b0
    rol     ZH          ; 1 3       9 bit offset
    ld      p0, Z       ; 2 4       a0+b0l          1
    lpm     p1, Z       ; 3 6   9   a0+b0h          2

    ldi     ZH, SqTabH  ; 1 0   9   squares table

    mov     ZL, a1      ; 1 0   10
    sub     ZL, b0      ; 1 1       a1-b0
    brcc    noNegF10    ; 1 2
    neg     ZL          ; 1 3
noNegF10:
    ld      sq, Z       ; 2 4       a1-b0l          3
    sub     p1, sq      ; 1 6   7

    mov     ZL, a0      ; 1 0   17
    sub     ZL, b1      ; 1 1       a0-b1
    brcc    noNegF01    ; 1 2
    neg     ZL          ; 1 3
noNegF01:
    ld      sq, Z       ; 2 4       a0-b1l          4
    sub     p1, sq      ; 1 6   7

    mov     ZL, a0      ; 1 0   24
    sub     ZL, b0      ; 1 1       a0-b0
    brcc    noNegF00    ; 1 2
    neg     ZL          ; 1 3
noNegF00:
    ld      sq, Z       ; 2 4       a0-b0l          5
    sub     p0, sq      ; 1 6
    lpm     sq, Z       ; 3 7       a0-b0h          6*
    sbc     p1, sq      ; 1 10  11

    ldi     ZH, SqTabH>>1;1 0   35
    mov     ZL, a1      ; 1 1
    add     ZL, b0      ; 1 2       a1+b0
    rol     ZH          ; 1 3
    ld      sq, Z       ; 2 4       a1+b0l          7
    add     p1, sq      ; 1 6   7

    ldi     ZH, SqTabH>>1;1 0   42
    mov     ZL, a0      ; 1 1
    add     ZL, b1      ; 1 2       a0+b1
    rol     ZH          ; 1 3
    ld      sq, Z       ; 2 4       a0+b1l          8
    add     p1, sq      ; 1 6   7

    ret                 ;       49

.CSEG
.org 256; words?!
SqTableH:
.db   0,   0,   0,   0,   0,   0,   0,   0,   0,   0
.db   0,   0,   0,   0,   0,   0,   0,   0,   0,   0
.db   0,   0,   0,   0,   0,   0,   0,   0,   0,   0
.db   0,   0,   1,   1,   1,   1,   1,   1,   1,   1
.db   1,   1,   1,   1,   1,   1,   2,   2,   2,   2
.db   2,   2,   2,   2,   2,   2,   3,   3,   3,   3
.db   3,   3,   3,   3,   4,   4,   4,   4,   4,   4
.db   4,   4,   5,   5,   5,   5,   5,   5,   5,   6
.db   6,   6,   6,   6,   6,   7,   7,   7,   7,   7
.db   7,   8,   8,   8,   8,   8,   9,   9,   9,   9
.db   9,   9,  10,  10,  10,  10,  10,  11,  11,  11
.db  11,  12,  12,  12,  12,  12,  13,  13,  13,  13
.db  14,  14,  14,  14,  15,  15,  15,  15,  16,  16
.db  16,  16,  17,  17,  17,  17,  18,  18,  18,  18
.db  19,  19,  19,  19,  20,  20,  20,  21,  21,  21
.db  21,  22,  22,  22,  23,  23,  23,  24,  24,  24
.db  25,  25,  25,  25,  26,  26,  26,  27,  27,  27
.db  28,  28,  28,  29,  29,  29,  30,  30,  30,  31
.db  31,  31,  32,  32,  33,  33,  33,  34,  34,  34
.db  35,  35,  36,  36,  36,  37,  37,  37,  38,  38
.db  39,  39,  39,  40,  40,  41,  41,  41,  42,  42
.db  43,  43,  43,  44,  44,  45,  45,  45,  46,  46
.db  47,  47,  48,  48,  49,  49,  49,  50,  50,  51
.db  51,  52,  52,  53,  53,  53,  54,  54,  55,  55
.db  56,  56,  57,  57,  58,  58,  59,  59,  60,  60
.db  61,  61,  62,  62,  63,  63,  64,  64,  65,  65
.db  66,  66,  67,  67,  68,  68,  69,  69,  70,  70
.db  71,  71,  72,  72,  73,  73,  74,  74,  75,  76
.db  76,  77,  77,  78,  78,  79,  79,  80,  81,  81
.db  82,  82,  83,  83,  84,  84,  85,  86,  86,  87
.db  87,  88,  89,  89,  90,  90,  91,  92,  92,  93
.db  93,  94,  95,  95,  96,  96,  97,  98,  98,  99
.db 100, 100, 101, 101, 102, 103, 103, 104, 105, 105
.db 106, 106, 107, 108, 108, 109, 110, 110, 111, 112
.db 112, 113, 114, 114, 115, 116, 116, 117, 118, 118
.db 119, 120, 121, 121, 122, 123, 123, 124, 125, 125
.db 126, 127, 127, 128, 129, 130, 130, 131, 132, 132
.db 133, 134, 135, 135, 136, 137, 138, 138, 139, 140
.db 141, 141, 142, 143, 144, 144, 145, 146, 147, 147
.db 148, 149, 150, 150, 151, 152, 153, 153, 154, 155
.db 156, 157, 157, 158, 159, 160, 160, 161, 162, 163
.db 164, 164, 165, 166, 167, 168, 169, 169, 170, 171
.db 172, 173, 173, 174, 175, 176, 177, 178, 178, 179
.db 180, 181, 182, 183, 183, 184, 185, 186, 187, 188
.db 189, 189, 190, 191, 192, 193, 194, 195, 196, 196
.db 197, 198, 199, 200, 201, 202, 203, 203, 204, 205
.db 206, 207, 208, 209, 210, 211, 212, 212, 213, 214
.db 215, 216, 217, 218, 219, 220, 221, 222, 223, 224
.db 225, 225, 226, 227, 228, 229, 230, 231, 232, 233
.db 234, 235, 236, 237, 238, 239, 240, 241, 242, 243
.db 244, 245, 246, 247, 248, 249, 250, 251, 252, 253
.db 254, 255
; word addresses, again?!
.equ SqTabH = (high(SqTableH) <<1)

.DSEG
RAMTab .BYTE 512
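
The table above appears to hold the high bytes of floor(n*n/4) for n = 0..511, which is the quarter-square identity a*b = floor((a+b)^2/4) - floor((a-b)^2/4). As a rough host-side sketch of the same idea in C (not the author's routine): the names `init_quarter_sq`, `mul8_qs`, `mul16_qs` and `quarter_sq_selfcheck` are made up for illustration, and a single 16-bit table stands in for the split low-byte (RAM) and high-byte (flash) tables used by the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* quarter_sq[n] = floor(n*n / 4) for n = 0..511; the largest entry is 65280 */
static uint16_t quarter_sq[512];

void init_quarter_sq(void)
{
    for (uint16_t n = 0; n < 512; n++)
        quarter_sq[n] = (uint16_t)(((uint32_t)n * n) >> 2);
}

/* 8x8 -> 16 bit product from two table lookups.
 * x+y and |x-y| have the same parity, so the truncated fractions cancel. */
uint16_t mul8_qs(uint8_t x, uint8_t y)
{
    uint16_t sum  = (uint16_t)(x + y);
    uint8_t  diff = (uint8_t)(x > y ? x - y : y - x);
    return (uint16_t)(quarter_sq[sum] - quarter_sq[diff]);
}

/* Low 16 bits of a 16x16 product:
 * a*b mod 2^16 = a0*b0 + ((a1*b0 + a0*b1) << 8) mod 2^16 */
uint16_t mul16_qs(uint16_t a, uint16_t b)
{
    uint8_t a0 = (uint8_t)a, a1 = (uint8_t)(a >> 8);
    uint8_t b0 = (uint8_t)b, b1 = (uint8_t)(b >> 8);
    uint16_t cross = (uint16_t)(mul8_qs(a1, b0) + mul8_qs(a0, b1));
    return (uint16_t)(mul8_qs(a0, b0) + (uint16_t)(cross << 8));
}

/* Quick host-side sanity check; not meant for the target. */
void quarter_sq_selfcheck(void)
{
    init_quarter_sq();
    assert(mul8_qs(255, 255) == 65025u);
    assert(mul16_qs(0xFFFFu, 0xFFFFu) == 1u);
}
```

Only the low byte of each cross product survives the shift by 8, which is presumably why the assembly loads just the low-byte table for the a1*b0 and a0*b1 terms.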

							     
							                          
                           
							  
#6
At long last, an answer, if a cheeky one: I couldn't (yet) get GCC's AVR C compiler to fit it into 8K of code. (For an assembler rendition, see AVR multiplication: No Holds Barred.)
The approach is what everyone who has used Duff's device tries on the second attempt: use a switch. With macros, the source code looks entirely harmless, if massaged:
#define low(mp)     case mp: p = a0 * (uint8_t)(mp) <<8; break
#define low4(mp)    low(mp); low(mp + 1); low(mp + 2); low(mp + 3)
#define low16(mp)   low4(mp); low4(mp + 4); low4(mp + 8); low4(mp + 12)
#define low64(mp)   low16(mp); low16(mp + 16); low16(mp + 32); low16(mp + 48)
#if preShift
# define CASE(mp)   case mp: return p + a * (mp)
#else
# define CASE(mp)   case mp: return (p0<<8) + a * (mp)
#endif
#define case4(mp)   CASE(mp); CASE(mp + 1); CASE(mp + 2); CASE(mp + 3)
#define case16(mp)  case4(mp); case4(mp + 4); case4(mp + 8); case4(mp + 12)
#define case64(mp)  case16(mp); case16(mp + 16); case16(mp + 32); case16(mp + 48)

extern "C" __attribute__ ((noinline))
 uint16_t mpy16NHB16(uint16_t a, uint16_t b)
{
    uint16_t p = 0;
    uint8_t b0 = (uint8_t)b, b1 = (uint8_t)(b>>8);
    uint8_t a0 = (uint8_t)a, p0;

    switch (b1) {
        case64(0);
        case64(64);
        case64(128);
        case64(192);
    }
#if preShift
    p = p0 <<8;
#endif
#if preliminaries
    if (0 == b0) {
        p = -a;
        if (b & 0x8000)
            p += a <<9;
        if (b & 0x4000)
            p += a <<8;
        return p;
    }
    while (b0 & 1) {
        a <<= 1;
        b0 >>= 1;
    }
#endif
    switch (b0) {
        low64(0);
        low64(64);
        low64(128);
        low64(192);
    }
    return ~0;
}
int main(int ac, char const *const av[])
{
    char buf[22];
    for (uint16_t a = 0 ; a
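
For readers who want something that compiles on a host, here is a minimal sketch of the switch idea, assuming the decomposition a*b mod 2^16 = a*b0 + ((a0*b1) << 8). It is not the listing above: `mul_by_byte`, `mul16_switch` and `mul16_switch_selfcheck` are hypothetical names, and the `preShift` and `preliminaries` variants are left out. Each case multiplies by a compile-time constant, which a compiler can lower to shifts and adds; whether the result fits a small part is a separate question, as the answer itself notes.

```c
#include <assert.h>
#include <stdint.h>

/* One case per multiplier value; the multiplier is a constant in each case. */
#define CASE1(m)   case (m): return (uint16_t)(x * (uint16_t)(m));
#define CASE4(m)   CASE1(m) CASE1((m) + 1) CASE1((m) + 2) CASE1((m) + 3)
#define CASE16(m)  CASE4(m) CASE4((m) + 4) CASE4((m) + 8) CASE4((m) + 12)
#define CASE64(m)  CASE16(m) CASE16((m) + 16) CASE16((m) + 32) CASE16((m) + 48)

/* Low 16 bits of x * m, dispatched through a 256-way switch on m. */
uint16_t mul_by_byte(uint16_t x, uint8_t m)
{
    switch (m) {
        CASE64(0) CASE64(64) CASE64(128) CASE64(192)
    }
    return 0; /* not reached: all 256 values of m are covered */
}

/* Low 16 bits of a * b: a*b0 plus the low byte of a0*b1 shifted up by 8. */
uint16_t mul16_switch(uint16_t a, uint16_t b)
{
    uint8_t b0 = (uint8_t)b, b1 = (uint8_t)(b >> 8);
    uint16_t hi = mul_by_byte((uint8_t)a, b1);
    return (uint16_t)(mul_by_byte(a, b0) + (uint16_t)(hi << 8));
}

/* Quick host-side sanity check; not meant for the target. */
void mul16_switch_selfcheck(void)
{
    assert(mul16_switch(1234u, 56u) == 3568u);
    assert(mul16_switch(0xFFFFu, 0xFFFFu) == 1u);
}
```

The term a1*b1 never appears because it only affects bits 16 and above, which a 16-bit result discards.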




    
        