How easy is it to multiply two numbers?
Recently, I've discovered that our fast emulator handles the multiplication in a completely wrong way. Well, multiplication is a very common operation, and even trivial testing should have found the bug. But there's one detail -- the processor does not have a multiplication instruction at all.
Instead, there are two instructions -- "first multiplication step" (MLF) and "next multiplication step" (MLS). To multiply two 32-bit numbers one has to execute MLF, and then to execute MLS 15 times. The documentation says this, C compiler generates this sequence, and everywhere I looked it was implemented this way. So, I decided to ignore MLS, and implement MLF as a regular multiplication.
And there were just one place where this approach failed, because there were 11 MLS instructions, not 15. To explain why this "incorrect" multiplication is need, I need to first explain the underlying algorithm.
Say you need to multiply 101 (binary) by 11100. Recalling school math that would be:
101 * 11100 -------------- 101 * 0 + 1010 * 0 + 10100 * 1 + 101000 * 1 + 1010000 * 1
However, 11100 can be represented as 100000 - 100. So:
101 * 11100 ------------ - 10100 * 1 + 10100000 * 1
Both approaches can be implemented as a loop over bits of the second operand. In the first approach you look at the lowest bit of the second operand, and if it's non-zero, add the first operand to the result. The first operand is then shifted left by one bit.
In the second approach when the lowest bit of the second operand is non-zero and the previous bit was zero, we substract the first operand. When we see 1->0 transition, we add the first operand. And the first operand is shifted after each bit, too.
This is called Booth algorithm, and it's primary purpose was to reduce the number of additions and substractions, because on older computers those operations took more time then shifts. In the example above, we save one add operation. The second advantage, more relevant today, is that Booth algorithm works fine with negative numbers. This is not obvious, but it does. See here.
The MLF and MLS instruction implement the modified Booth algorithm, that handles two bits at one step -- so we need 16 steps for 32 bit numbers. In the specific processor, each step shifts the second operator right by 2 bits, and new bits of the result appear from the left. And now the key trick -- suppose that input operands are unsigned and limited to 8 bits. Then the result will contain at most 16 bits. After 8 steps will will compute those 16 bits, but they'll be in the upper half of the register. Next 8 steps will just slowly shift them to the right position. So if you don't mind getting shifted result, you can get it in half the time. As an example, if you want to compare a*b with c*d, the extra shift does not matter.
Cool, isn't it? But don't ask how long it took me to find the single place where this trick was used!