For those of you who don’t know, Project Euler is a series of challenging mathematical/computer programming problems. Though the problem is more on figuring out the mathematical insight, this post will focus on the engineering side. We are going to look at problem #820, and try to solve it on various kinds of hardwares, ranging from CPU to FPGA, to see how much speed gains we can get from engineering alone.

Also, it is fun to see how different hardware affects on performance of the same codebase.

All the source code is available at Github.

This is part 1 of a 2-part series.

## The Solution ๐

At the heart of the problem is how to calculate $n^{th}$ digit of $\frac{1}{x}$. To do so, we multiply $\frac{1}{x}$ with $10^n$, and get the last digit of the integral part.

$$ digit = \left\lfloor \frac{10^n}{x} \right\rfloor \; mod \; 10 $$

$$ digit = \left\lfloor \frac{(10^{n-1} \; mod \; x) * 10}{x} \right\rfloor $$

(if you do division like this, the formula above probably comes more naturally)

To calculate $10^{n - 1} \; mod \; x$ quickly, we can use binary exponentiation. The time complexity would be $O(n log_n)$. With the problem asking for result of $n = 10^7$, this algorithm should be able to produce an answer within a blink.

## CPU ๐

### Recursive ๐

The most obvious implementation would be to calculate the exponential modulo recursively. At each step, you reduce `power`

by half, until it reaches 0.

```
constexpr int32_t ExpModuloRecursive(int32_t power, int32_t modulo) {
if (power == 0)
return 1;
auto res = ExpModuloRecursive(power / 2, modulo);
if (power % 2 == 0) {
return (static_cast<int64_t>(res) * res) % modulo;
} else {
return ((static_cast<int64_t>(res) * res) % modulo * 10) % modulo;
}
}
```

### Looping through the bit ๐

Another way is to decompose `N`

into

$$10'000'000 = 2^7 + 2^9 + 2^{10} + 2^{12} + 2^{15} + 2^{19} + 2^{20} + 2^{23}$$

To calculate $10^{2^x}$, we just need to repeatedly square itself by $x$ times. Multiply all of them together and you get the result.

The nice aspect about this is that you don’t need recursion, which is usually costly, especially for non-tail recursion.

```
constexpr int GetKBitOfN(int32_t n, int k) {
return (n >> k) & 1;
}
// Get index of highest set bit in a 32-bit integer.
// In 1-index.
constexpr int GetHighestSetBit(int32_t n) {
return sizeof(int32_t) * 8 - __builtin_clz(n);
}
constexpr int32_t ExpModuloBit(int32_t power, int32_t modulo) {
int64_t res = 1;
int64_t curr_power = 10;
for (int i = 0; i < GetHighestSetBit(power); ++i) {
if (GetKBitOfN(power, i)) {
res = (res * curr_power) % modulo;
}
curr_power = (curr_power * curr_power) % modulo;
}
return res;
}
```

### Faster Modulo ๐

Another thing to look for is modulo operation. Modulo and integer division are among the slowest arithmetic operations (even slower than floating point operation). You could probably execute a hundred integer addition/subtraction before finishing a single integer division.

There are various methods to speed up modulo reductions: Barrett Reduction, Montgomery, etc. In fact, your compiler uses a lot of tricks to not have to divide integers; e.g. converting division by power of 2 to left shift, convert division by constant to multiplication and left shift, etc.

This blog post provides a pretty good summary.

Below is an example of the code snippet using `fastmod`

.

```
constexpr int32_t ExpModuloBitWithFastMod(int32_t power, int32_t modulo) {
int64_t res = 1;
int64_t curr_power = 10;
auto M = fastmod::computeM_u64(modulo);
for (int i = 0; i < GetHighestSetBit(power); ++i) {
if (GetKBitOfN(power, i)) {
res = fastmod::fastmod_u64(res * curr_power, M, modulo);
}
curr_power = fastmod::fastmod_u64(curr_power * curr_power, M, modulo);
}
return res;
}
```

In the GitHub repos, I tested it with `fastmod`

and `libdivide`

.

### Loop Unroll ๐

This is a pretty common technique to optimize stuff. The idea is to run multiple iterations at 1 go, to avoid the cost of checking the loop conditions, and also allow more instructions to be pipelined.

Since `GetNthDigitFromFracX`

is quite a complex function, compilers won’t be able to unroll it. We will have to help ourselves in this case,

```
// Assume N % 4 == 0
for (int i = 0; i < N / 4; ++i) {
std::array<int64_t, 4> digits{1, 1, 1, 1};
std::array<int64_t, 4> curr_power{10, 10, 10, 10};
for (int k = 0; k < GetHighestSetBit(N - 1); ++k) {
if (GetKBitOfN(N - 1, k)) {
for (int j = 0; j < 4; ++j) {
digits[j] = (digits[j] * curr_power[j]) % (4 * i + j + 1);
}
}
for (int j = 0; j < 4; ++j) {
curr_power[j] = (curr_power[j] * curr_power[j]) % (4 * i + j + 1);
}
}
for (int j = 0; j < 4; ++j) {
digits[j] *= 10;
digits[j] /= 4 * i + j + 1;
}
for (int j = 0; j < 4; ++j) {
res += digits[j];
}
}
```

### Multi-Thread ๐

Since each digit can be computed independently, multi-thread is an obvious choice. We can just split up the work between different threads, and then add them up together at the end.

There will be a bit of overhead to managing these threads, and there could be lower performance if they have to compete for the same cores; but overall, this should be pretty straightforward speed up.

```
for (int i = 0; i < NUM_THREADS; ++i) {
threads.push_back(std::thread([&sum, i]() {
// Run from i * N / NUM_THREADS + 1
auto start = i * N / NUM_THREADS + 1;
auto end = (i + 1) * N / NUM_THREADS;
sum.fetch_add(CalculatePartialS(
N, start, end,
[](int32_t n, int32_t x) { return ExpModuloBit(n, x); }
));
}));
}
```

You can also combine with above optimization to yield even better result.

## Results and Explanation ๐

The code is compiled under GCC 10 and Clang 15 under 3 CPUs:

- Apple M2 - with Clang
- AMD Ryzen 9 (Zen) - with GCC
- Intel Xeon Gold (Cascade Lake) - with GCC
- Intel Core i9 10th gen (Comet Lake) - with GCC

Each cell has 2 numbers, the first one for $N = 10^7$ and another for $N = 10^8$. All numbers are in seconds. They are measured with `time`

command, and the runtime environment is not very well-controlled, so don’t expect very accurate results.

Also, don’t look too much into absolute latency between different CPUs, as I don’t really choose the same equivalent CPUs. I found numbers of different variants in the same CPU tell us more interesting things in this case.

Variant | M2 | Zen | Cascade Lake | Comet Lake |
---|---|---|---|---|

Recursive | 1.29 / 11.16 | 1.29 / 15.9 | 3.10 / 35.62 | 2.27 / 26.10 |

Loop | 1.23 / 10.04 | 1.94 / 24.17 | 3.49 / 40.47 | 2.57 / 29.88 |

With fastmod | 1.54 / 13.21 | 1.29 / 15.12 | 1.76 / 20.03 | 1.29 / 14.90 |

With libdivide | 1.44 / 12.41 | 0.82 / 9.3 | 1.29 / 14.42 | 0.96 / 10.72 |

Loop Unroll | 0.70 / 4.21 | 1.83 / 23.79 | 2.82 / 34.02 | 2.12 / 25.51 |

Multi-Thread | 0.53 / 2.54 | 0.65 / 7.63 | - | - |

(I forgot to record the time of some Multi-Thread programs, but they are pretty predictable so we can safely ignore it)

M2 seems to be fastest here. But please note that this workload is not a common one, since there is pretty much no memory access. So don’t generalize this result.

GCC partially unrolls and inlines the recursive functions (last 8 steps are unrolled in my test), while Clang keeps the recursive structure. However, when we loop through each bit, GCC fails to do this optimization, leading to faster code in `recursive`

compared to `loop`

. This is possible because in my code, `N`

is a compiled-time constant so GCC knows exactly how the recursion is gonna be.

```
# Asm generated by GCC for the recursive case.
400987: e8 34 02 00 00 call 400bc0 <_Z18ExpModuloRecursiveii.part.0>
return (static_cast<int64_t>(res) * res) % modulo;
40098c: 48 98 cdqe
40098e: 48 0f af c0 imul rax,rax
400992: 48 99 cqo
400994: 48 f7 f9 idiv rcx
return (static_cast<int64_t>(res) * res * 10) % modulo;
400997: 48 0f af d2 imul rdx,rdx
40099b: 48 8d 04 92 lea rax,[rdx+rdx*4]
40099f: 48 01 c0 add rax,rax
4009a2: 48 99 cqo
4009a4: 48 f7 f9 idiv rcx
4009a7: 48 0f af d2 imul rdx,rdx
4009ab: 48 8d 04 92 lea rax,[rdx+rdx*4]
4009af: 48 01 c0 add rax,rax
4009b2: 48 99 cqo
4009b4: 48 f7 f9 idiv rcx
4009b7: 48 0f af d2 imul rdx,rdx
4009bb: 48 8d 04 92 lea rax,[rdx+rdx*4]
4009bf: 48 01 c0 add rax,rax
...
```

The latency differences between ways of doing modulo mostly due to different latency/throughput of arithmetic operations on each hardware.

`fastmod`

is slower than `libdivide`

here, but note that this is 64-bit modulo; it could be faster on 32-bit.

`fastmod`

/`libdivide`

is slower in M2, suggesting faster integer division in Apple hardware. This benchmark also shows similar results. This is actually pretty impressive. It is unclear on how Apple does it, but likely because there are multiple units for integer division.

The loop unroll version is much faster in M2 compared to other architectures. This should be due to inability to pipeline integer division efficiently in x86. A bit tricky to be 100% sure though as I don’t know any tooling or document to give more insight on performance implication of M2.

As mentioned above, you can combine them to get faster result, depending on the arch. For example, in M2, you can run multiple threads with loop unroll. The test on my machine show that it can run with $N = 10^8$ in ~ 0.8 seconds, more than 10 times faster than the original solution.

## Future Works ๐

SIMD, GPU and FPGA coming soon…