### A floating problem

Floating point precision problems are notoriously annoying, and personally I was somewhat surprised they are still an issue in this day and age, but they are.

For instance, by the end of the following code:

For instance, by the end of the following code:

float x = 0; float f = 1e-6; for (unsigned i = 0; i < 1e6; ++i) x = x + f;

x's value will be 1.00904 and not 1.

Even much simpler code reveals the issue:

float x = 1e6; float y = 1e-6; float z = x + y;

z's value here will be exactly 1,000,000 and not 1,000,000.000001.

The reason for this is that 32 bit (the number of bits used for a float type) can represent a number to a limited precision, and so it makes sense to use them for high precision when representing small numbers, and to sacrifice that precision gradually as the integral part of the number increases.

### Why use floats when you can use doubles?

Well, in many cases, doubles simply solve the problem. Double is also a finite representation, but is far more precise than float. The smallest positive number a float can represent is about 1e-38, whereas for a double it's something around 2e-308. To emphasize, the ratio between the two is something like 1e270 which is considerably more than the number of atoms in the universe raised to the third power.

The reason is that floats occupy half the space as doubles, and when writing a code that is meant to be vectorized by the compiler, which one often does in high performance computing, this can translate to a factor of 2 speed up. Additionally, if your data is basically a bunch of real numbers, it would also mean your data in floats is around half the size of your data in doubles, so transmitting it over a network may be twice as fast, and so on.

The reason is that floats occupy half the space as doubles, and when writing a code that is meant to be vectorized by the compiler, which one often does in high performance computing, this can translate to a factor of 2 speed up. Additionally, if your data is basically a bunch of real numbers, it would also mean your data in floats is around half the size of your data in doubles, so transmitting it over a network may be twice as fast, and so on.

### Kahan's summation algorithm

For running sums, William Kahan suggested the following algorithm, cleverly using a second variable to keep track of the deviation from the "true" sum (the following snippet is from Wikipedia, and not in C++, but straightforward nevertheless):

functionKahanSum(input)varsum = 0.0vary,t // Temporary values.varc = 0.0 // A running compensation for lost low-order bits.fori = 1toinput.lengthdoy = input[i] - c // So far, so good:cis zero. t = sum + y // Alas,sumis big,ysmall, so low-order digits ofyare lost. c = (t - sum) - y //(t - sum)recovers the high-order part ofy; subtractingyrecovers -(low part ofy) sum = t // Algebraically,cshould always be zero. Beware overly-aggressive optimizing compilers!nexti // Next time around, the lost low part will be added toyin a fresh attempt.returnsum

And this is, in my opinion, really beautiful.

However, as it turns out:

### Casting to double is faster

As it turns out, the dumber, less beautiful solution, is faster:

```
```

double x = 0; float f = 1e-6; for (unsigned i = 0; i < 1e6; ++i) { x = x + f; // f is implicitly cast to double here }

I ran some experiments, and for a very large set of numbers (adding 1e10 numbers, each equal to 1e-7), casting to double was twice as fast as using Kahan's summation algorithm.

A cool algorithm, though.

## No comments:

## Post a Comment