Frama-C news and ideas

Original post: http://blog.frama-c.com/index.php?post/2013/05/02/nearbyintf1


Harder than it looks: rounding float to nearest integer, part 1

This post is the first in a series on the difficult task of rounding a floating-point number to an integer. Laugh not! The easiest-looking questions can hide unforeseen difficulties, and the most widely accepted solutions can be wrong.

Problem

Consider the task of rounding a float to the nearest integer. The answer is expected as a float, same as the input. In other words, we are looking at the work done by standard C99 function nearbyintf() when the rounding mode is the default round-to-nearest.
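
To fix intuitions, here is a minimal sketch of the reference behavior, calling nearbyintf() itself (C99, link with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
  /* in the default round-to-nearest mode, ties go to even */
  printf("%f\n", nearbyintf(2.5f));  /* 2.000000 */
  printf("%f\n", nearbyintf(2.6f));  /* 3.000000 */
  return 0;
}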

For the sake of simplicity, in this series of posts, we assume that the argument is positive and we allow the function to round any which way if the float argument is exactly in-between two integers. In other words, we are looking at the ACSL specification below.

/*@ requires 0 ≤ f ≤ FLT_MAX ;
  ensures -0.5 ≤ \result - f ≤ 0.5 ;
  ensures \exists integer n; \result == n;
*/
float myround(float f);

In the second ensures clause, integer is an ACSL type (think of it as a super-long long long). The formula \exists integer n; \result == n simply means that \result, the float returned by function myround(), is a mathematical integer.

Via truncation

A first idea is to convert the argument f to unsigned int, and then again to float, since that's the expected type for the result:

float myround(float f)
{
  return (float) (unsigned int) f;
}

Obvious overflow issue

One does not need Frama-C's value analysis to spot the very first issue, an overflow for large float arguments, but it's there, so we might as well use it:

$ frama-c -pp-annot  -val r.c -lib-entry -main myround
...
r.c:9:[kernel] warning: overflow in conversion of f ([-0. .. 3.40282346639e+38])
   from floating-point to integer. assert -1 < f < 4294967296;

This overflow can be fixed by testing for large arguments. Large floats are all integers, so the function can simply return f in this case.

float myround(float f)
{
  if (f >= UINT_MAX) return f;
  return (float) (unsigned int) f;
}

Obvious rounding issue

The cast from float to unsigned int does not round to the nearest integer. It “truncates”, that is, it rounds towards zero. And if you already know this, you probably also know the universally used trick to obtain the nearest integer instead of the immediately smaller one, namely adding 0.5:

float myround(float f)
{
  if (f >= UINT_MAX) return f;
  return (float) (unsigned int) (f + 0.5f);
}

This universally used trick is wrong.

An issue when the ULP of the argument is exactly one

The Unit in the Last Place, or ULP for short, of a floating-point number is its distance to the floats immediately above and immediately below it. For large enough floats, this distance is one. In that range, floats behave as integers.
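
One way to observe this is with the standard function nextafterf(); a minimal sketch:

#include <math.h>
#include <stdio.h>

int main(void)
{
  /* gap to the float immediately above: 1.0 throughout [2^23 .. 2^24) */
  printf("%g\n", nextafterf(8388608.0f, INFINITY) - 8388608.0f);  /* 1 */
  /* near 1.0 the gap is much smaller: 2^-23 */
  printf("%g\n", nextafterf(1.0f, INFINITY) - 1.0f);  /* 1.19209e-07 */
  return 0;
}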

There is an ambiguity in the above definition for powers of two: the distances to the float immediately above and the float immediately below are not the same. If you know of a usual convention for which one is called the ULP of a power of two, please leave a note in the comments.

#include <stdio.h>

int main()
{
  float f = 8388609.0f;
  printf("%f -> %f\n", f, myround(f));
}

With a strict IEEE 754 compiler, the simple test above produces the result below:

8388609.000000 -> 8388610.000000

The value passed as argument is obviously representable as a float, since that's the type of f. However, the mathematical sum f + 0.5 does not have to be representable as a float. In the worst case for us, when the argument is in a range of floats separated by exactly one, the floating-point sum f + 0.5 falls exactly in-between the two representable floats f and f + 1. Half the time, it is rounded to the latter, although f was already an integer and was the correct answer for function myround(). This causes bugs such as the one displayed above.

The range of floating-point numbers spaced every 1.0 goes from 2^23 to 2^24. Half of these 2^23 values, that is, nearly 4 million of them, exhibit the problem.
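
A brute-force count, as a sketch (assuming strict IEEE 754 single-precision arithmetic, for instance with GCC options -msse2 -mfpmath=sse):

#include <stdio.h>

int main(void)
{
  /* count the floats in [2^23 .. 2^24) that f + 0.5f sends to f + 1:
     exactly the odd integers of the range */
  unsigned int count = 0;
  for (float f = 8388608.0f; f < 16777216.0f; f += 1.0f)
    if (f + 0.5f != f) count++;
  printf("%u\n", count);  /* 4194304 */
  return 0;
}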

Since the 0.5 trick is nearly universally accepted as the solution to implement rounding to nearest from truncation, this bug is likely to be found in lots of places. Nicolas Cellier identified it in Squeak. He offered a solution, too: switch the FPU to round-downward for the time of the addition f + 0.5. But let us not fix the problem just yet, there is another interesting input for the function as it currently stands.

An issue when the argument is exactly the predecessor of 0.5f

#include <stdio.h>

int main()
{
  float f = 0.49999997f;
  printf("%.9f -> %.9f\n", f, myround(f));
}

This test produces the output 0.499999970 -> 1.000000000, although the input 0.49999997 is clearly closer to 0 than to 1.

Again, the issue is that the floating-point addition is not exact. The argument 0.49999997f is the last float of the interval [0.25 … 0.5). The mathematical result of f + 0.5 falls exactly midway between the last float of the interval [0.5 … 1.0) and 1.0. The rule that ties must be rounded to the even choice means that 1.0 is chosen.
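
A minimal sketch to see this rounding happen (hexadecimal float constants are C99; 0x1.fffffep-2 is 0.49999997f):

#include <stdio.h>

int main(void)
{
  float f = 0x1.fffffep-2f;  /* the predecessor of 0.5 */
  /* f + 0.5 is 1 - 2^-25, exactly midway between 0x1.fffffep-1 and 1.0;
     the ties-to-even rule picks 1.0 */
  printf("%.9f\n", f + 0.5f);  /* 1.000000000 */
  return 0;
}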

A function that works

The overflow issue and the first non-obvious issue (when ulp(f)=1) can be fixed by the same test: as soon as the ULP of the argument is one, the argument is an integer and can be returned as-is.

The second non-obvious issue, with input 0.49999997f, can be fixed similarly.

float myround(float f)
{
  if (f >= 0x1.0p23) return f;
  if (f <= 0.5) return 0;
  return (float) (unsigned int) (f + 0.5f);
}
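
A quick spot check of this version, as a sketch (the expected outputs assume strict IEEE 754 single precision):

#include <stdio.h>

/* myround() as defined just above */

int main(void)
{
  printf("%f\n", myround(8388609.0f));     /* 8388609.000000, returned as-is */
  printf("%.9f\n", myround(0.49999997f));  /* 0.000000000 */
  printf("%f\n", myround(2.6f));           /* 3.000000 */
  return 0;
}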

A better function that works

Changing the FPU rounding mode, the suggestion in the Squeak bug report, is slightly unpalatable for such a simple function, but it suggests adding the predecessor of 0.5f instead of 0.5f, so as to avoid the sum rounding up when it shouldn't:

float myround(float f)
{
  if (f >= 0x1.0p23) return f;
  return (float) (unsigned int) (f + 0.49999997f);
}

It turns out that this works, too. It solves the problem with the input 0.49999997f without making the function fail its specification for other inputs.

To be continued

The next post will approach the same question from a different angle. It should not be without its difficulties either.

Comments

1. On Friday, May 3 2013, 02:30 by  Jared Armstrong

This is definitely harder than it seems that it should be. I have fought this battle with several languages including C. This is a great share.

2. On Friday, May 3 2013, 03:15 by shjk

Even the last myround function does not work for all inputs. myround(0.5f) gives 0.0f here (GCC 4.2, Mac OS X). Also, negative numbers break. The tests should be on the absolute value of f, and it also needs to subtract 0.5/0.49999997f when negative. See https://gist.github.com/sheijk/4e8b...

3. On Friday, May 3 2013, 08:38 by Joe

Is this really a problem? Given that a floating point number is not an exact representation, and can't even represent all integer values in its range, does it even mean anything to talk about rounding to the nearest integer in the general case?

E.g.

float f = 123456789;
int i = (int) f; // 123456792

I'd say most applications that use this sort of rounding implicitly restrict values to a range where it makes sense (with double-precision it will work for values with fewer than 15 digits).

4. On Friday, May 3 2013, 09:22 by Collin

Would this work?

float RoundedFloat (float rawFloat){
if(rawFloat - rawFloat.floor >= 0.5){
return rawFloat.floor+1;
}else{
return rawfloat.floor;
}
}

5. On Friday, May 3 2013, 10:01 by pascal

@shjk

The contract says that in case of ties, the function is allowed to return any of the floats within 0.5 distance. Always rounding numbers with 0.5 fractional part up is not such a good solution anyway: it introduces a statistical bias. The widely used solution is to round ties to the nearest even number. Function myround() is not trying to implement this rule because we specified that for ties, any of the two nearest numbers were fine, but if it were, myround(0.5) -> 0 would be the correct answer.

Also, negative numbers are not handled because there was plenty to say while only focusing on positive numbers. It would make each example one line longer to handle negative numbers, and that would only show that -0.49999997f and a bunch of numbers around -8388609.000000 need to be handled with care.

The information about ties and the information that negative numbers can be ignored is all contained in the contract:

/*@ requires 0 <= f <= FLT_MAX ;
  ensures -0.5 <= \result - f <= 0.5 ;
  ensures \exists integer n; \result == n;
*/
float myround(float f);

If you need a real function that handles all cases, do not base one on this article. Use nearbyintf(). The function in this article is for showing the subtleties of floating-point computations, not necessarily for including in your programs.

6. On Friday, May 3 2013, 11:30 by pascal

@shjk

I used exhaustive testing with nearbyintf() as reference function to test the various proposals in this post:

  union { float f; unsigned int u; } u;

  for (u.f = 0.0; u.f < +1.0/0.0; u.u++)
    {
      float m = myround(u.f);
      float ref = nearbyintf(u.f);
      if (m != ref && u.f - ref != m - u.f)
        printf("XXX %a %a %a\n", u.f, m, ref);
    }

7. On Friday, May 3 2013, 11:36 by Collin

I think this works:

float RoundedFloat (float rawFloat){
if(rawFloat - rawFloat.floor >= 0.5){
return rawFloat.floor+1;
}else{
return rawFloat.floor;
}
}

Edit: Here's an actual tested implementation:

#include <iostream>
#include <math.h>
using namespace std;

float RoundedFloat(float rawFloat){
if(rawFloat - floor(rawFloat) >= 0.5){
return floor(rawFloat)+1;
}else{
return floor(rawFloat);
}
}

int main(){
float funnyNum=0.4999995f;
cout << funnyNum << endl;
cout << RoundedFloat(funnyNum) << endl;
return 0;
}

Which produces:

cos@DEADBEEF:~/floatRound$ g++ RoundFloat.cpp -lm
cos@DEADBEEF:~/floatRound$ ./a.out
0.499999
0
cos@DEADBEEF:~/floatRound$

The interesting thing is I couldn't use 0.49999997 as it gets rounded to 0.5 the second it gets placed in a float. 0.4999995 was the largest number I tested that didn't automatically get rounded up to 0.5 before it even got in the variable.

I think the obvious lesson we should take away from this is that if you need more precision than that you should use double precision floats.

interestingly python doesn't seem to suffer from the same issue until much later:

>>> def RoundFloat(num):
...     if num-math.floor(num) >= 0.5:
...         return math.floor(num)+1
...     else:
...         return math.floor(num)


>>> import math
>>> RoundFloat (1.49999999999999999999999)
2.0
>>> RoundFloat (1.49999999999999)
1.0
>>> RoundFloat (1.499999999999999999)
2.0
>>> RoundFloat (1.4999999999999999)
2.0
>>> RoundFloat (1.499999999999999)
1.0
>>>

8. On Friday, May 3 2013, 11:43 by pascal

@Joe

It is a good point you make. In many usages, floating-point numbers do not represent exact values, and in that case it may not make sense to round a number in a range where the ULP is, say, 16, to the nearest integer.

But this is only one possible use of floating-point numbers.

This is even worse for periodic trigonometric functions:

in the same way, it does not make sense for most people that function sin() is accurate for an argument such as 1.0e21. But the standard makes no special provision for these functions, so math libraries have to implement costly argument reduction that few programmers will ever use. This does not hurt programmers who only call sin() with arguments between -π and π very much: it only means that the library contains a thousand decimals for π that they never use.

The specification we set out to satisfy was to return an integer (represented as a float) that was 0.5 away from the float argument. We could loosen the specification, but that would only make it more complex. The simple way out is to implement a function that works for all positive arguments.

9. On Friday, May 3 2013, 11:51 by pascal

@Collin

> The interesting thing is I couldn't use 0.49999997 as it gets rounded to 0.5 the second it gets placed in a float.

Are you sure that it is not just the pretty-printing that is showing 0.49999997f as 0.5? The last float before 0.5 is exactly 0.4999999701976776123046875, or in hexadecimal 0x1.fffffep-1. I hear that C++ does not accept hexadecimal for floating-point numbers, an unfortunate oversight, but perhaps your compiler will accept it as an extension.

Regarding double precision, that's true, but the same issues found here happen with different numbers (following the same idea) in double-precision. However, if the floating-point sum is computed at a higher precision than the argument and result types, for instance 80-bit if the argument and results are doubles, then the problem with adding 0.5 disappears.

10. On Friday, May 3 2013, 13:33 by  r-lyeh

float myround(float f) {
return (float) ( (unsigned) ( double(f) + 0.5f ) );
}

11. On Friday, May 3 2013, 13:59 by pascal

@r-lyeh

Exactly. For rounding a double in the same conditions, if there is a wider long double type available, one can also do:

double myround(double f) {
  return (double) ( (unsigned long long) (f + 0.5L) );
}

But for a function to round a long double to the nearest integer, we have to get back to hackish solution(s), another one of which is reserved for the next post.

12. On Friday, May 3 2013, 18:09 by bodangly

I'd avoid this sort of conversion if at all possible.

13. On Friday, May 3 2013, 22:29 by  Bruce Dawson

> Is this really a problem?

Sometimes, for some people, yes. Floating-point is complicated, and adding in the complexity of round-to-nearest-integer functions that don't actually work just layers on additional complexity -- that is unnecessary.

Note that one of the failure cases from the naive method is 0.49999997, which is likely to be in the active range for most users.

Note that the recommended trick is float specific -- it will fail on doubles. That also suggests that it may fail if intermediate calculations are done at double precision, which is entirely IEEE compliant.

14. On Friday, May 3 2013, 22:41 by  Bruce Dawson

I believe you missed a simplification. The bad rounding on numbers that are already integers only happens above 2^23, and since all floats above 2^23 are already integers the large-number check can just be moved down. Then the constant 0.5 can safely be used.

Simple. And, it becomes easier to make the code work with doubles -- just compare against 2^53 and cast to uint64 -- 0.5 can still be used, whereas 0.49999997 would not have worked.

Relevant series of posts:
http://randomascii.wordpress.com/ca...

15. On Friday, May 3 2013, 22:59 by pascal

@Bruce

The recommended trick fails for the conversion of doubles if you think of it as “add 0.49999997f”, but not if you think of it as “add the appropriate predecessor of 0.5”. The double type has the same issue with adding 0.5, the problematic input is the same (the predecessor of 0.5, in this case 0.49999999999999994) and the same fix of adding the predecessor of 0.5 instead so that for the problematic input, the result of the addition is the predecessor of 1.0.
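
In code, a sketch of this transposition (assuming strict IEEE 754 double-precision evaluation; 0x1.fffffffffffffp-2 is the predecessor of 0.5, that is, 0.49999999999999994):

double myround(double f)
{
  if (f >= 0x1.0p52) return f;  /* all doubles at or above 2^52 are integers */
  return (double) (unsigned long long) (f + 0x1.fffffffffffffp-2);
}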

You are right that the code (unsigned int) (f + 0.49999997f) assumes that the C compiler computes exactly at the precision of the types. The C standard (standardized by ISO) allows this, rather than the IEEE 754 standard for which this would be such a heresy that they did not even think of forbidding it.

(unsigned int) (float) (f + 0.49999997f) would be more portable, and since computing a single basic operation at precision higher than double before rounding to single is harmless (1), it would work with the very large majority of compilers (C doesn't even mandate IEEE 754 floating-point, so it is impossible to claim it would work with all of them).

(1) “When is double rounding innocuous?”, Samuel A. Figueroa

16. On Friday, May 3 2013, 23:15 by pascal

@Bruce

> The bad rounding on numbers that are already integers only happens above 2^23, and since all floats above 2^23 are already integers the large-number check can just be moved down.

Hmm, no, I did not miss this optimization, this is what section “A function that works” is all about. Is the use of the hexadecimal floating-point constant 0x1.0p23 confusing? It's a convenient input format introduced in C99 but sorely missing from C++.

17. On Monday, May 6 2013, 15:17 by  TropicalCoder

I never simply use a cast to convert floats to ints, because years, maybe decades ago, I found that it can produce rare errors. I don't know if that was an issue with CPUs back in the day or with Windows' floating-point package, but I have tracked bugs down to the use of a cast at least twice in my life, seeing the actual failure under a debugger. I am talking here about a bug that may show up in an app that was released and has been in the field for several months. Now, I don't recall if that was a situation where a bad value exceeded the range of an integer, and I never considered that. As posted above, quite often we have some range checks in our code and don't expect huge values in the final sum.

Many years ago a wise programmer would never use a cast to convert from float to integer. Instead he would do something like this...

if(fMyFloat >= 0)
{
IntValue = (int)floor(fMyFloat + 0.5);
}else
{
IntValue = (int)ceil(fMyFloat - 0.5);
}

In my audio applications, when converting double precision floating point samples to their final integer form, after checking for clipping I use truncation to convert double precision samples to shorts...

short sample = (short)_copysign(floor(fabs(dRawSample)), dRawSample);

This seems to result in the least amount of harmonic distortion.

We may pick up certain habits early in our career as a programmer and never consider them again. Your discussion has made me examine my habits, or even superstitions that may have arisen from experiences gained decades ago.

18. On Monday, May 6 2013, 22:59 by OldETC

The varying accuracy is also affected by the number format. This includes whether the floating format is hex, octal or binary. It may in some circumstances be affected by whether or not the number system in storage is big endian or little endian, as that may change the algorithms used to convert the decimal entered into the internal storage format.

Therefore, there is not likely to be a single means to convert from float to integer. For the combinations of 3 for the format, and 2 for the endian, and maybe 3 for the conversion from decimal entry to internal representation, you end up with 24 possibilities, some of which may overlap. Add the variation of internal storage size, as 32, 48, 64, 80, 96, and 128, and possibly 256 bit sizes, and you get 168 different combinations.

Is it getting to be fun yet? Then add the various "arbitrary length" math models and it gets even more wonderful.

Anyone care to try to work them all out? And by the way, an integer always is considered to be accurate +/- 1 anyway, so why is this even a problem? You have to consider the loss of precision between integers anyway.

19. On Tuesday, May 7 2013, 08:12 by Matlabcody

Use Fixed Point for all the calculations. This will solve all the problems!

20. On Friday, May 10 2013, 00:55 by  gary knott

You might enjoy seeing the anatomy of a floating-point number
and some programming tricks in intpow.c
(at www.civilized.com/files/intpow.c)

Rounding float to nearest integer, part 2

The previous post offered to round a positive float to the nearest integer, represented as a float, through a conversion back and forth to 32-bit unsigned int. There was also the promise of at least one other method. Thanks to reader feedback, there will be two. What was intended to be the second post in the series is hereby relegated to the position of third post.

Rounding through bit-twiddling

Several readers seemed disappointed that the implementation proposed in the last post was not accessing the bits of float f directly. This is possible, of course:

  assert (sizeof(unsigned int) == sizeof(float));
  unsigned int u;
  memcpy(&u, &f, sizeof(float));

In the previous post I forgot to say that we were assuming 32-bit unsigned ints. From now on we are in addition assuming that floats and unsigned ints have the same endianness, so that it is convenient to work on the bit representation of one by using the other.

Let us special-case the inputs that can be mapped to zero or one immediately. We are going to need it. We could do the comparisons to 0.5 and 1.5 on u, because positive floats increase with their unsigned integer representation, but there is no reason to: it is more readable to work on f:

  if (f <= 0.5) return 0.;
  if (f <= 1.5) return 1.;

Now, to business. The actual exponent of f is:

  int exp = ((u>>23) & 255) - 127;

The explicit bits of f's significand are u & 0x7fffff, but there is no need to take them out: we will manipulate them directly inside u. Actually, at one point we will cheat and manipulate a bit of the exponent at the same time, but it will all be for the best.

A hypothetical significand for the number 1, aligned with the existing significand for f, would be 1U << (23 - exp). But this is hypothetical, because 23 - exp can be negative. If this happens, it means that f is in a range where all floating-point numbers are integers.

  if (23 - exp < 0) return f;
  unsigned int one = 1U << (23 - exp);

You may have noticed that since we special-cased the inputs below 1.5, variable one may be up to 1 << 23, which almost, but not quite, aligns with the explicit bits of f's significand. Let us make a note of this for later. For now, we are interested in the bits that represent the fractional part of f, and these are always:

  unsigned int mask = one - 1;
  unsigned int frac = u & mask;

If these bits represent less than one half, the function must round down. If this is the case, we can zero all the bits that represent the fractional part of f to obtain the integer immediately below f.

  if (frac <= one / 2)
  {
    u &= ~mask;
    float r;
    memcpy(&r, &u, sizeof(float));
    return r;
  }

And we are left with the difficult exercise of finding the integer immediately above f. If this computation stays in the same binade, this means finding the multiple of one immediately above u.

“binade” is not a word, according to my dictionary. It should be one. It designates a range of floating-point numbers such as [0.25 … 0.5) or [0.5 … 1.0). I needed it in the last post, but I made do without it. I shouldn't have. Having words to designate things is the most important wossname towards clear thinking.

And if the computation does not stay in the same binade, such as 3.75 rounding up to 4.0? Well, in this case it seems we again only need to find the multiple of one immediately above u, which is in this case the power of two immediately above f, and more to the point, the number the function must return.

  u = (u + mask) & ~mask;
  float r;
  memcpy(&r, &u, sizeof(float));
  return r;
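
To see the cheat in action, here is a sketch tracing the binade-crossing case 3.75 → 4.0, where the carry out of the significand bits lands in the exponent field and produces exactly the power of two we need:

#include <stdio.h>
#include <string.h>

int main(void)
{
  float f = 3.75f;
  unsigned int u;
  memcpy(&u, &f, sizeof(float));  /* u == 0x40700000, exp == 1 */
  unsigned int one = 1U << 22;    /* 23 - exp == 22 */
  unsigned int mask = one - 1;
  u = (u + mask) & ~mask;         /* carry ripples into the exponent: 0x40800000 */
  float r;
  memcpy(&r, &u, sizeof(float));
  printf("%a -> %a\n", f, r);     /* 0x1.ep+1 -> 0x1p+2 */
  return 0;
}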


To summarize, a function for rounding a float to the nearest integer by bit-twiddling is as follows. I am not sure what is so interesting about that. I like the function in the previous post or the function in the next post better.

float myround(float f)
{
  assert (sizeof(unsigned int) == sizeof(float));
  unsigned int u;
  memcpy(&u, &f, sizeof(float));
  if (f <= 0.5) return 0.;
  if (f <= 1.5) return 1.;
  int exp = ((u>>23) & 255) - 127;
  if (23 - exp < 0) return f;
  unsigned int one = 1U << (23 - exp);
  unsigned int mask = one - 1;
  unsigned int frac = u & mask;
  if (frac <= one / 2)
    u &= ~mask;
  else
    u = (u + mask) & ~mask;
  float r;
  memcpy(&r, &u, sizeof(float));
  return r;
}
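
As a sketch, one might spot-check this function against nearbyintf() on the inputs discussed in this series; the assertion allows ties to go either way, as in the contract:

#include <assert.h>
#include <math.h>
#include <stdio.h>

/* to be compiled together with myround() above and linked with -lm */
int main(void)
{
  float tests[] = { 0.49999997f, 0.5f, 1.5f, 3.75f, 8388609.0f, 1e30f };
  for (unsigned int i = 0; i < sizeof tests / sizeof tests[0]; i++)
  {
    float m = myround(tests[i]);
    float ref = nearbyintf(tests[i]);
    assert(m == ref || m - tests[i] == tests[i] - ref);
    printf("%a -> %a\n", tests[i], m);
  }
  return 0;
}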

To be continued again

The only salient point of the method in this post is how we pretend not to notice when significand arithmetic overflows over the exponent, for inputs between 1.5 and 2.0, 3.5 and 4.0, and so on. The method in the next post will be so much more fun than this one.

Rounding float to nearest integer, part 3

Two earlier posts showed two different approaches in order to round a float to the nearest integer. The first was to truncate to integer after having added the right quantity (either 0.5 if the programmer is willing to take care of a few dangerous inputs beforehand, or the predecessor of 0.5 so as to have fewer dangerous inputs to watch for).

The second approach was to mess with the representation of the float input, trying to recognize where the bits for the fractional part were, deciding whether they represented less or more than one half, and either zeroing them (in the first case), or sending the float up to the nearest integer (in the second case) which was simple for complicated reasons.

Variations on the first method

Several persons have suggested smart variations on the first theme, included here for the sake of completeness. The first suggestion is as follows (remembering that the input f is assumed to be positive, and ignoring overflow issues for simplicity):

float myround(float f)
{
  float candidate = (float) (unsigned int) f;
  if (f - candidate <= 0.5) return candidate;
  return candidate + 1.0f;
}

Other suggestions were to use modff(), which separates a floating-point number into its integral and fractional components, or fmodf(f, 1.0f), which computes the remainder of f in the division by 1.
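
For completeness, here is a sketch of the modff() variant (positive input assumed and overflow ignored, as above; the name myround_modf is mine):

#include <math.h>

float myround_modf(float f)
{
  float ipart;
  /* modff() splits f exactly into integral and fractional parts */
  float frac = modff(f, &ipart);
  if (frac <= 0.5f) return ipart;
  return ipart + 1.0f;
}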


These three solutions work better than adding 0.5 for a reason that is simple if one only looks at it superficially: floating-point numbers are denser around zero. Adding 0.5 takes us away from zero, whereas the operations f - candidate, modff(f, iptr) and fmodf(f, 1.0) take us closer to zero, in a range where the answer can be exactly represented, so it is. (Note: this is a super-superficial explanation.)

A third method

General idea: the power of two giveth, and the power of two taketh away

The third and generally most efficient method for rounding f to the nearest integer is to take advantage of this marvelous rounding machine that is IEEE 754 arithmetic. But for this to work, the exact right machine is needed, that is, a C compiler that implements strict IEEE 754 arithmetic and rounds each operation to the precision of the type. If you are using GCC, consider using options -msse2 -mfpmath=sse.

We already noticed that single-precision floats between 2^23 and 2^24 are all the integers in this range. If we add some quantity to f so that the result ends up in this range, wouldn't it follow that the result obtained will be rounded to the integer? And it would be rounded in round-to-nearest. Exactly what we are looking for:

  f                 f + 8388608.0f
_____________________________

0.0f                8388608.0f
0.1f                8388608.0f
0.5f                8388608.0f
0.9f                8388609.0f
1.0f                8388609.0f
1.1f                8388609.0f
1.5f                8388610.0f
1.9f                8388610.0f
2.0f                8388610.0f
2.1f                8388610.0f

The rounding part goes well, but now we are stuck with large numbers far from the input and from the expected output. Let us try to get back close to zero by subtracting 8388608.0f again:

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f
____________________________________________________________________

0.0f                8388608.0f                              0.0f
0.1f                8388608.0f                              0.0f
0.5f                8388608.0f                              0.0f
0.9f                8388609.0f                              1.0f
1.0f                8388609.0f                              1.0f
1.1f                8388609.0f                              1.0f
1.5f                8388610.0f                              2.0f
1.9f                8388610.0f                              2.0f
2.0f                8388610.0f                              2.0f
2.1f                8388610.0f                              2.0f

It works! The subtraction is exact, for the same kind of reason that was informally sketched for f - candidate. Adding 8388608.0f causes the result to be rounded to the unit, and then subtracting it is exact, producing a float that is exactly the original rounded to the nearest integer.

For these inputs anyway. For very large inputs, the situation is different.

Very large inputs: absorption

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f
____________________________________________________________________

1e28f                  1e28f                               1e28f
1e29f                  1e29f                               1e29f
1e30f                  1e30f                               1e30f
1e31f                  1e31f                               1e31f

When f is large enough, adding 8388608.0f to it does nothing, and then subtracting 8388608.0f from it does nothing again. This is good news, because we are dealing with very large single-precision floats that are already integers, and can be returned directly as the result of our function myround().


In fact, since we entirely avoided converting to a range-challenged integer type, and since adding 8388608.0f to FLT_MAX does not make it overflow (we have been assuming the FPU was in round-to-nearest mode all this time, remember?), we could even caress the dream of a straightforward myround() with a single execution path. Small floats rounded to the nearest integer and taken back near zero where they belong, large floats returned unchanged by the addition and the subtraction of a comparatively small quantity (with respect to them).

Dreams crushed

Unfortunately, although adding and subtracting 2^23 almost always does what we expect (it does for inputs up to 2^23 and above 2^47), there is a range of values for which it does not work. An example:

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f
____________________________________________________________________

8388609.0f          16777216.0f                        8388608.0f

In order for function myround() to work correctly for all inputs, it still needs a conditional. The simplest is to put aside inputs larger than 2^23 that are all integers, and to use the addition-subtraction trick for the others:

float myround(float f)
{
  if (f >= 0x1.0p23)
    return f;
  return f + 0x1.0p23f - 0x1.0p23f;
}

The function above, in round-to-nearest mode, satisfies the contract we initially set out to fulfill. Interestingly, if the rounding mode is other than round-to-nearest, then it still rounds to a nearby integer, but according to the FPU rounding mode. This is a consequence of the fact that the only inexact operation is the addition. The subtraction, being exact, is not affected by the rounding mode.

For instance, if the FPU is set to round downwards and the argument f is 0.9f, then f + 8388608.0f produces 8388608.0f, and f + 8388608.0f - 8388608.0f produces zero.
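
This can be observed with fenv.h, as in the sketch below (the FENV_ACCESS pragma and the degree of compiler cooperation are glossed over; the volatile discourages constant folding):

#include <fenv.h>
#include <stdio.h>

int main(void)
{
  volatile float f = 0.9f;
  fesetround(FE_DOWNWARD);
  printf("%f\n", f + 8388608.0f - 8388608.0f);  /* 0.000000 */
  fesetround(FE_TONEAREST);
  printf("%f\n", f + 8388608.0f - 8388608.0f);  /* 1.000000 */
  return 0;
}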

Conclusion

This post concludes the “rounding float to nearest integer” series. The method highlighted in this third post is actually the method generally used for function rintf(), because the floating-point addition has the effect of setting the “inexact” FPU flag when it is inexact, which is exactly when the function returns an output other than its input, which is when rintf() is specified as setting the “inexact” flag.

Function nearbyintf() is specified as not touching the FPU flags and would typically be implemented with the method from the second post.

Comments

1. On Saturday, May 4 2013, 18:01 by John Regehr

My preferred method for converting a float to an int is called cvttsd2siq.

2. On Saturday, May 4 2013, 18:10 by pascal

Hello, John.

cvttsd2siq is the instruction to convert a double. The single-precision version would be cvttss2siq. And they truncate. Funnily, they truncate because cast truncates in C. The 8087 instruction set converted according to FPU rounding mode, and it was a pain to implement C's cast from float to int with it. Intel avoided the mistake when designing SSE2, but then according to this Hacker News comment they introduced conversion according to rounding mode back in SSE4: https://news.ycombinator.com/item?id=5648045

Since we are on the subject, two amusing facts:

cvttss2siq is the instruction to which a smart compiler translates the function below:

unsigned int f(float x)
{
  return x;
}

Because it can! The conversion to (32-bit) int would rely on the parsimonious cvttss2si instead.

On the other hand, to convert to unsigned long long, there is no single instruction that does the trick: 

unsigned long long f(float x)
{
  return x;
}

It compiles to:

~ $ gcc -O -S tr.c
~ $ cat tr.s
LCPI1_0:
    .long    1593835520 ## float 9.223372e+18
_f:
Leh_func_begin1:
    pushq    %rbp
Ltmp0:
    movq    %rsp, %rbp
Ltmp1:
    movss    LCPI1_0(%rip), %xmm1
    movaps    %xmm0, %xmm2
    subss    %xmm1, %xmm2
    cvttss2siq    %xmm2, %rax
    movabsq    $-9223372036854775808, %rcx
    xorq    %rax, %rcx
    ucomiss    %xmm1, %xmm0
    cvttss2siq    %xmm0, %rax
    cmovaeq    %rcx, %rax
    popq    %rbp
    ret
I actually re-discovered and implemented this algorithm by hand at a time I needed it (in OCaml, towards multi-precision integers that only needed to be as large as unsigned long longs) before recognizing it later in my compiler's output.

One last funny fact is how C compilers sometimes try to recognize patterns to map them to efficient instructions. The blog The Shape of Code contains an example with the compilation of a multiplication by 6 or somesuch, that a programmer too smart for eir own good wrote as 4 * x + 2 * x, and that the compiler reverse-translates to a single multiplication.

Well, the compiler will not be able to do that with (int) (f + 0.5f), since it does not always produce the same result as an instruction to round to the nearest.

3. On Saturday, May 4 2013, 18:47 by John Regehr

I can't read anything about FP accuracy without going back to this bit of black humor:

http://gcc.gnu.org/bugzilla/show_bu...

