Every byte counts - Floating-point in less than 1 KB

How expensive, in terms of code size, are floating-point operations if the CPU does not have a floating-point unit (FPU)?
In this article, I will investigate, based on Embedded Studio for ARM and a generic Cortex-M3 device, how big (or small) an entire application using basic float operations, add, sub, mul, and div, can be.

Since we started licensing our Floating-Point Library outside of Embedded Studio and outside of the SEGGER RunTime Library, we have seen a lot of interest. We have published performance values, but for some people size matters more than speed. Here is a quick look at how small our floating point code is, based on a small Embedded Studio project, which is also provided so the results are easy to reproduce. The same approach can be used to benchmark other tool chains, so this post also serves as a tutorial for finding out how small the components of a floating point library are.
(For a closer look at performance, see the blog article Floating-point face-off, part 2: Comparing performance.)

Start project

As the starting point, I used the small project I generated in my previous blog article, titled Every byte counts – Smallest Hello World. It generates a 117-byte program that runs on any Cortex-M3, M4 or M7 CPU or in a simulator, printing “Hello World!”. It looks as follows:

#include <stdio.h>
 
/*********************************************************************
*
*       main()
*
*  Function description
*    Application entry point.
*/
int main(void) {
  printf("Hello world!");
}

To be on the safe side, I decided to rebuild it, by pressing ALT-F7.

The result is as expected: 117 bytes.

Adding floating point code

How do we now add floating point code?
Actually, this is quite easy. We simply add a floating point computation and make sure the result is used by the program, so the compiler does not optimize it away.
In order to do that, I use the computed result in the string to output. This also gives me a chance to verify the result.
To test multiplication, I use the following code:

/*********************************************************************
*
*       main()
*
*  Function description
*    Application entry point.
*/
int main(void) {
  float f;
    
  f = 1.0;
  while (1) {
    printf("Result = %f\n", f);
    f *= 3.0;
  }
}

Pressing F5 starts the debugger. After setting a breakpoint and hitting it a few times, I see the following:

Looking good, just as expected.

What is the size of the floating point library code?

Now let’s look at the program size, which Embedded Studio reports to be 401 bytes. When we subtract the original 117 bytes for “Hello World!”, we get a difference of 284 bytes. To be fair, we also need to take into account that main is now bigger, up 28 bytes from 29 to 57, so the size of the multiplication code is really just 256 bytes. Not bad for a floating point multiplication done completely in software!

But to be sure, let’s look at the disassembly of main.

We can see that __aeabi_fmul() is called for the multiplication, but before the printf(), __aeabi_f2d() is called. This is because printf() receives its floating-point arguments as double precision (64-bit) rather than single precision (32-bit): C promotes a float passed to a variadic function to double. So as the name implies, __aeabi_f2d() converts from float to double.
We therefore also need to subtract its code size, as it is not multiplication related, just related to outputting the number.
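The promotion itself can be observed with any variadic function, not just printf(). The following is a minimal sketch in standard C; the function name sum_promoted() is made up for this illustration and is not part of any SEGGER library:

```c
#include <stdarg.h>

/* Hypothetical helper: sums `count` floating-point arguments.
 * A float passed through "..." arrives as a double (default argument
 * promotion), so va_arg() has to read it as double, never as float. */
double sum_promoted(int count, ...) {
  va_list args;
  double  sum = 0.0;
  int     i;

  va_start(args, count);
  for (i = 0; i < count; i++) {
    sum += va_arg(args, double);  /* The caller's float was converted to double */
  }
  va_end(args);
  return sum;
}
```

A call such as sum_promoted(2, 1.5f, 2.5f) passes two floats, yet the function reads two doubles. On a CPU without FPU, each of these conversions at the call site is a call to a float-to-double routine such as __aeabi_f2d().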
Looking at the ELF file or the map file, we find that the size of __aeabi_f2d() is 52 bytes. Subtracting this from the 256 bytes computed above brings us to 204 bytes.
For verification purposes, let’s look at the map file.

__aeabi_f2d  000000d5 0x34 4 Code Wk floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)
__aeabi_fmul 00000009 0xcc 4 Code Wk floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)

It confirms what we have computed: 0x34 is 52 bytes for __aeabi_f2d() and 0xcc is 204 bytes for __aeabi_fmul(). For floating point multiplication, only a single routine is required. This routine is blazingly fast, but what we are looking at here is the code size: An amazing 204 bytes!
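The bookkeeping used here can be written down as a small sketch. The constants are the figures measured in this article; the helper name arithmetic_code_size() is made up for illustration, and it assumes a program of the same shape (one arithmetic routine, one printf() of a float):

```c
/* Figures measured in this article, all in bytes. */
enum {
  HELLO_WORLD_SIZE = 117,  /* Program before any floating point code was added */
  MAIN_GROWTH      = 28,   /* main() grew from 29 to 57 bytes */
  F2D_SIZE         = 52    /* __aeabi_f2d(), needed only for printf() */
};

/* Hypothetical helper: size of the arithmetic routine alone.
 * Assumes program_size is at least 197 bytes (117 + 28 + 52). */
unsigned arithmetic_code_size(unsigned program_size) {
  return program_size - HELLO_WORLD_SIZE - MAIN_GROWTH - F2D_SIZE;
}
```

For the 401-byte multiplication program, arithmetic_code_size(401) gives 204, the same value as the 0xcc the map file lists for __aeabi_fmul().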

Note that this is the size of the multiplication code in the library. Every call now only adds a few bytes. A quick test shows us that a call is about 8 bytes.
By simply multiplying a second time in our small program we see that code size for the entire application goes up to 409 bytes:

int main(void) {
  float f;
    
  f = 1.0;
  while (1) {
    printf("Result = %f\n", f);
    f *= 3.0;
    f *= 3.0;
  }
}

To be on the safe side, and to better understand what code the compiler has generated, let’s look at the disassembly by opening the ELF file.
We find the following:

We can see that in this case, the call to the multiplication routine actually requires 8 bytes.

Testing Add and Subtract

In order to do the same thing for add, I simply use addition instead of multiplication. The output looks as expected:

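Program size is 481 bytes, so 80 bytes more than when multiplying.
For subtracting: Program size is 489 bytes, so the code size of the subtraction code is (489 - 197) bytes = 292 bytes = (284 + 8) bytes.
The 197 bytes are the size of the program without any arithmetic routine: 117 bytes for “Hello World!”, 28 bytes for the bigger main and 52 bytes for __aeabi_f2d().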

Why is code for adding and subtracting bigger than code for multiplying floats?
The answer is quite simple:
In floating point, the mantissa is always normalized, so multiplying two values basically means multiplying the mantissas and adding the exponents. For adding and subtracting, an extra step is required: shifting the mantissas to the same position (aligning them to the same exponent) before adding.
That all sounds expensive and complicated, but it can actually be done very efficiently.
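To make the difference concrete, here is a heavily simplified sketch of both operations on IEEE 754 single-precision bit patterns. It is not the SEGGER implementation: the function names are made up, the addition only handles positive operands, both functions assume normal (non-zero, finite) inputs and results, and results are truncated rather than rounded.

```c
#include <stdint.h>
#include <string.h>

static uint32_t float_to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    bits_to_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Multiply: XOR the signs, add the exponents, multiply the mantissas. */
float sketch_fmul(float a, float b) {
  uint32_t ua   = float_to_bits(a);
  uint32_t ub   = float_to_bits(b);
  uint32_t sign = (ua ^ ub) & 0x80000000u;
  uint32_t exp  = ((ua >> 23) & 0xFFu) + ((ub >> 23) & 0xFFu) - 127u;  /* Remove one bias */
  uint64_t prod = (uint64_t)((ua & 0x7FFFFFu) | 0x800000u)             /* Restore hidden bit */
                * ((ub & 0x7FFFFFu) | 0x800000u);

  if (prod & (1ull << 47)) {  /* Product of the mantissas is 2.0 or more: renormalize */
    prod >>= 1;
    exp++;
  }
  return bits_to_float(sign | (exp << 23) | ((uint32_t)(prod >> 23) & 0x7FFFFFu));
}

/* Add (positive operands only): the extra step is aligning the mantissas. */
float sketch_fadd(float a, float b) {
  uint32_t ua = float_to_bits(a);
  uint32_t ub = float_to_bits(b);
  uint32_t ea, eb, ma, mb, shift;

  if (ua < ub) {              /* Make a the larger operand (valid for positive floats) */
    uint32_t t = ua; ua = ub; ub = t;
  }
  ea    = ua >> 23;
  eb    = ub >> 23;
  ma    = (ua & 0x7FFFFFu) | 0x800000u;
  mb    = (ub & 0x7FFFFFu) | 0x800000u;
  shift = ea - eb;
  mb    = (shift < 24u) ? (mb >> shift) : 0u;  /* Align smaller mantissa to the larger exponent */
  ma   += mb;
  if (ma & 0x1000000u) {      /* Carry out of bit 23: renormalize */
    ma >>= 1;
    ea++;
  }
  return bits_to_float((ea << 23) | (ma & 0x7FFFFFu));
}
```

Even in this reduced form, the addition needs the compare, swap and align steps that the multiplication does not. A complete addition also has to handle mixed signs, where the result can lose leading bits and must be shifted back up, which is part of why the add routine is the larger one.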

Adding and subtracting in the same program

Let’s see what happens when we add and subtract in the same program. We use the below and look at the result:

int main(void) {
  float f;
 
  f = 1.0;
  while (1) {
    printf("Result = %f\n", f);
    f += 3;
    f -= f;   // Make sure the compiler does not use add instead of subtract
  }
}

The compiler now generates a call to __aeabi_fsub(). The map file tells us the subtract code needs only 8 bytes:

__aeabi_f2d  0000012d  0x34 4 Code Wk floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)
__aeabi_fadd 00000009 0x11c 4 Code Wk floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)
__aeabi_fsub 00000125   0x8 4 Code Wk floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)

How does this work?
Stepping into the subtraction routine reveals the trick:

Subtract simply inverts the sign of the second operand and uses the addition code. So in the presence of floating point addition, floating point subtraction “costs” only 8 bytes.
(Note: In speed optimized variants, subtraction actually has its own block of code to avoid the 2 instruction penalty that occurs when jumping to the addition code.)
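In C, the same idea can be sketched as follows. This is an illustration only: the library routine is written in assembly, the names below are made up, and add_routine() is merely a stand-in for the shared addition code.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the library's addition routine (assumption for this sketch). */
static float add_routine(float a, float b) {
  return a + b;
}

/* Subtract by flipping the sign of the second operand and reusing the add code. */
float sub_via_add(float a, float b) {
  uint32_t bits;

  memcpy(&bits, &b, sizeof bits);
  bits ^= 0x80000000u;             /* Toggle the IEEE 754 sign bit: b becomes -b */
  memcpy(&b, &bits, sizeof bits);
  return add_routine(a, b);
}
```

The only work specific to subtraction is one bit flip followed by a jump into the addition code, which fits the 8 bytes reported for __aeabi_fsub().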

Division

Division in floating point is actually easy to implement using shift and subtract.
Unfortunately, the performance of such a simple implementation is poor, so we use a fast algorithm based on the UDIV instruction of the Cortex-M3 (and up).
This brings up the code size for divide, but not too much:
We end up with a program size of 421 bytes, so a library code size of (421 – 197) bytes = 224 bytes.
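For comparison, the simple shift-and-subtract approach mentioned above could look like the sketch below for the mantissa part of a division. This is not the UDIV-based code used in the library; the function name is made up, and both arguments are assumed to be 24-bit mantissas with the hidden bit set, that is, in the range 0x800000 to 0xFFFFFF.

```c
#include <stdint.h>

/* Restoring (shift-and-subtract) division of two 24-bit mantissas.
 * Returns floor(num * 2^23 / den), so 0x800000 represents a quotient of 1.0.
 * Assumes 0x800000 <= num, den <= 0xFFFFFF. */
uint32_t mantissa_div_shift_subtract(uint32_t num, uint32_t den) {
  uint32_t quotient  = 0;
  uint32_t remainder = num;
  int      i;

  for (i = 0; i < 24; i++) {  /* One iteration per quotient bit */
    quotient <<= 1;
    if (remainder >= den) {   /* Does the divisor fit? Then this quotient bit is 1 */
      remainder -= den;
      quotient  |= 1u;
    }
    remainder <<= 1;          /* Move on to the next lower bit */
  }
  return quotient;
}
```

The loop body is tiny, but it runs once per quotient bit, which is why this approach is small and slow. A full divide would additionally subtract the exponents, XOR the signs and normalize the quotient.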
I looked at the output window after a few loops, showing correct values:

All in!

Let’s write a program that uses all 4 basic operations in a single program.
To avoid giving the compiler a chance to optimize some of the computation, we use a second variable and declare it as volatile.

volatile float b = 2;
 
/*********************************************************************
*
* main()
*
* Function description
* Application entry point.
*/
int main(void) {
  float a;
 
  a = 1.0;
  while (1) {
    printf("Result = %f\n", a);
    a += b;
    a *= b;
    a -= b;
    a /= b;
  }
}

Running it, the output looks good:

The result is quite impressive.
983 bytes for the entire program! This includes everything: The startup code, printf code (host side evaluation), the part of the floating point library that does addition, subtraction, multiplication, division, as well as a single-precision (32-bit) to double precision (64-bit) conversion routine and our small application program including the short string.
Incredible!
Try this with any other tool chain!

Actually, it was so incredible that I had to verify it.
Turns out this is correct. Our application does actually call all 4 arithmetic functions used in the program.

The map file confirms that all 5 functions are linked in: the 4 arithmetic routines plus __aeabi_f2d().

.text.__aeabi_fadd Code 00000008 0x11c 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)
.text.__aeabi_fsub Code 00000124 0x8 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)
.text.__aeabi_fmul Code 0000012c 0xcc 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)
.text.__aeabi_fdiv Code 000001f8 0xe0 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)
.text.__aeabi_f2d Code 000002d8 0x34 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)
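To see how much of the program these five routines account for, the size column can be summed. The sketch below does this for the entries quoted above; it assumes the column layout shown here (section name, type, address, size), which may differ in other tool chains or linker versions, and the function name is made up.

```c
#include <stddef.h>
#include <stdio.h>

/* The five map-file entries quoted in this article. */
static const char *const kMapLines[] = {
  ".text.__aeabi_fadd Code 00000008 0x11c 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)",
  ".text.__aeabi_fsub Code 00000124 0x8 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)",
  ".text.__aeabi_fmul Code 0000012c 0xcc 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)",
  ".text.__aeabi_fdiv Code 000001f8 0xe0 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)",
  ".text.__aeabi_f2d Code 000002d8 0x34 4 floatasmops_arm.o (libc_v7em_t_le_eabi_small.a)"
};

/* Hypothetical helper: adds up the size column of the given map-file lines. */
unsigned sum_map_sizes(const char *const lines[], size_t count) {
  unsigned total = 0;
  size_t   i;

  for (i = 0; i < count; i++) {
    char     section[64];
    char     type[16];
    unsigned address;
    unsigned size;

    /* Columns: section name, type, address (hex), size (hex, with 0x prefix) */
    if (sscanf(lines[i], "%63s %15s %x %x", section, type, &address, &size) == 4) {
      total += size;
    }
  }
  return total;
}
```

For the entries above, the sum is 284 + 8 + 204 + 224 + 52 = 772 bytes, which would leave 211 of the 983 bytes for startup code, main, the printf code and the string.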

Complete floating point operations and output to terminal in less than 1kByte of ROM!

Quite impressive, I find…
But I think I already said that. 🙂

Conclusion

Floating point operations can be performed very efficiently in software.
The code in Embedded Studio and the SEGGER library is highly optimized.
A typical Cortex-M CPU can do several million floating point operations per second, so using floating point on a CPU without an FPU is perfectly reasonable, from both a performance and a code size perspective. For me, dedicated FPUs really only make sense for applications that use floating point operations intensively.
Our developers, who have developed this great code over a period of more than 20 years, have done a great job!

Note that the SEGGER floating point library we are looking at has been hand optimized for ARM processors. There are different variants for the different CPUs, such as ARM (including THUMB-2, legacy ARM-V4 and modern 32- and 64-bit CPUs), as well as RISC-V, including 64-bit and RISC-V E cores.

In another post, I might look at high level functions, such as sin(), cos(), ln().
For now, I end this and encourage you to try this yourself, with SEGGER Embedded Studio and / or any other tool chain.
I’d be very surprised if you can achieve the same level as Embedded Studio (less than 1kByte) for the same application program when using another tool chain.
And keep in mind that we can do even better. This code is using a speed optimized variant of float division. With size optimized code, the program could be even smaller.
The entire project used is here.

24-08-2020
