Indes.com

One of the issues faced by RISC-V developers is that the code density of the RISC-V instruction set for deeply embedded processors does not match that of Cortex-M with existing tools. That is changing with the product innovations SEGGER have developed, such as the recently-announced SEGGER Linker, capable of reducing code size by up to 15%, and the SEGGER Runtime Library, performance and size optimized for RISC-V.

I’ve written about the SEGGER Linker before, for Arm. Now it also targets the 32-bit RISC-V instruction sets, RV32I and RV32E. It brings with it all the features of the SEGGER Linker for Arm, and adds new capabilities that make best use of the RISC-V instruction set.

How do we do it?

Here is a table that shows the effectiveness of the SEGGER Linker for RISC-V, for a small application reproduced below.

The GNU linker and library come from SiFive’s GNU Embedded for RISC-V; the SEGGER linker and library are SEGGER’s equivalent products built for RISC-V.

The SEGGER Linker is doing a lot better than the GNU linker. And the SEGGER Runtime Library is performing great on RISC-V! The combination of both (as present in Embedded Studio) brings an incredible 76.4% reduction in flash size, from 47784 to 11270 bytes, for the same application. But even simply re-linking with the SEGGER Linker saves an impressive 13%.
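If you want to check the headline figure, or run the same comparison on your own builds, the arithmetic is simple. The helper below is only an illustration; the name percent_saved is made up for this sketch and is not part of any SEGGER tool.

```c
#include <assert.h>

/* Percentage of flash saved when an image shrinks from 'before' to 'after'
   bytes. Hypothetical helper, for illustration only. */
double percent_saved(unsigned long before, unsigned long after) {
  assert(before > 0 && after <= before);  /* sizes must describe a reduction */
  return 100.0 * (double)(before - after) / (double)before;
}
```

Feeding it the two sizes quoted above, 47784 and 11270 bytes, gives a value that rounds to 76.4.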

Relocation: Where code and data go to execute

The new linker is very much focused on reducing code size, to make RISC-V applications smaller. The compiler lays down code and, because object modules are compiled separately, relocations that enable those modules to be linked together. As the compiler doesn’t know where functions and data objects will finally be placed (that is controlled by the linker script), it makes worst-case assumptions for all function calls and global data accesses (to both read-write and read-only data), laying down relatively long code sequences.

The RISC-V architecture offers smaller, more compact instructions that can be used, but the fact that an address is unknown at compile time makes it impossible for the compiler to use them, so it defers the job to the linker to fix up instead.
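To make this concrete, here is a small sketch of what those worst-case sequences typically look like. The symbols sensor_gain and scale are hypothetical, and the assembly in the comments is the usual form for RV32; the exact instructions and relocation names depend on the compiler, code model, and options.

```c
#include <stdint.h>

/* Hypothetical symbols. In a real project they would be defined in another
   object file, so the compiler cannot know their final addresses. */
extern int32_t sensor_gain;
extern int32_t scale(int32_t raw);

int32_t read_scaled(int32_t raw) {
  /* Global read, typical worst-case form (two 32-bit instructions):
       lui   a5, %hi(sensor_gain)       ; relocation: R_RISCV_HI20
       lw    a5, %lo(sensor_gain)(a5)   ; relocation: R_RISCV_LO12_I  */
  int32_t gain = sensor_gain;
  /* External call, typical worst-case form (two 32-bit instructions):
       auipc ra, 0                      ; relocation: R_RISCV_CALL
       jalr  ra, 0(ra)                  ;   (or R_RISCV_CALL_PLT)     */
  return scale(raw) * gain;
}

/* Definitions placed here only so that the sketch links on its own. */
int32_t sensor_gain = 3;
int32_t scale(int32_t raw) { return raw * 2; }
```

Each access costs eight bytes before the linker gets involved, whether or not the target ends up close by.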

Relaxation: Making things smaller and faster

It’s the linker’s job to resolve these inter-module references laid down by the compiler. A simple linker would patch up the reference and be done with it. A more capable linker would opportunistically “relax” that instruction sequence to a smaller one, according to the resolved target address when known. A fully capable linker would arrange code and data such that nothing happens by good fortune, but by careful consideration of section layout.

That’s exactly what the SEGGER Linker does. It uses well-tuned heuristics to lay out code and data, in concert with the linker script, to maximize the number of “relaxations” that apply.

So, for instance, function groups that are closely-related are placed close together, even if they originate from different object files. This placement enables the two 32-bit instructions laid down by the compiler for a function call to be contracted to a single 32-bit instruction or, better still, to a 16-bit compact instruction.
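The decision behind that contraction can be sketched in a few lines. This is not the SEGGER Linker’s implementation, only the range check that any relaxing linker has to make; the names pick_call_form and call_form_size are invented for the example. The ranges come from the instruction encodings: c.jal (RV32 with the C extension) reaches -2048 to +2046 bytes, and jal reaches -1 MiB to +1 MiB minus 2 bytes.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { CALL_C_JAL, CALL_JAL, CALL_AUIPC_JALR } call_form_t;

/* Pick the shortest call encoding that reaches 'target' from 'site'.
   Hypothetical helper; addresses are byte addresses on RV32. */
call_form_t pick_call_form(uint32_t site, uint32_t target, int has_c_ext) {
  assert(((site | target) & 1u) == 0);        /* code is 2-byte aligned      */
  int32_t offset = (int32_t)(target - site);  /* signed displacement         */
  if (has_c_ext && offset >= -2048 && offset <= 2046) {
    return CALL_C_JAL;                        /* 16-bit compact call         */
  }
  if (offset >= -1048576 && offset <= 1048574) {
    return CALL_JAL;                          /* single 32-bit call          */
  }
  return CALL_AUIPC_JALR;                     /* unrelaxed two-instruction   */
}

/* Size in bytes of each call form. */
unsigned call_form_size(call_form_t form) {
  switch (form) {
    case CALL_C_JAL: return 2;
    case CALL_JAL:   return 4;
    default:         return 8;
  }
}
```

A real linker has more to do than this, because every call that shrinks moves the code after it, which changes the displacements of other calls. That is why layout matters: the closer related functions sit, the more calls fall into the short ranges.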

Relaxation is good for data too!

The same happens for data. Global data is typically accessed by forming a 32-bit base address in a register and using an offset-load to read the item. The GNU linker can relax this by employing a global pointer so that one instruction can be eliminated, with “short” data items clustered together. But there is no intelligence in how the global pointer is placed, forcing the user to group data and assign a global pointer manually. We now live in the 21st century and have powerful computers, so why not let the computer find the best layout of data and position for the global pointer? Well, that’s what the SEGGER Linker will do for you, with no messing about yourself. Of course, you can still go old-style and group sections and set the global pointer manually: the choice is yours.
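To see why the position of the global pointer matters, remember that a gp-relative load such as lw a0, offset(gp) carries a 12-bit signed offset, so gp can reach only a 4 KiB window, from gp - 2048 to gp + 2047. The sketch below picks the window that covers the most accesses for a fixed data layout. It is a simplified illustration, not SEGGER’s heuristic: the type data_item_t and the function choose_gp are hypothetical, each item is treated as a single word, and a real linker can rearrange the data as well as move the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint32_t addr;  /* address of a data word                     */
  unsigned refs;  /* number of instructions that access the word */
} data_item_t;

/* Return the gp value whose 4 KiB window covers the most references.
   The number of covered references is written to *covered.
   Brute force: an optimal window can always be slid up until its lowest
   byte sits on some item, so only those n candidates need checking. */
uint32_t choose_gp(const data_item_t *items, size_t n, unsigned *covered) {
  assert(items != NULL && covered != NULL);
  uint32_t best_gp = 0;
  unsigned best = 0;
  for (size_t i = 0; i < n; i++) {
    uint32_t gp = items[i].addr + 2048u;  /* item i at offset -2048 */
    unsigned sum = 0;
    for (size_t j = 0; j < n; j++) {
      int32_t off = (int32_t)(items[j].addr - gp);
      if (off >= -2048 && off <= 2047) {
        sum += items[j].refs;             /* reachable in one instruction */
      }
    }
    if (sum > best) {
      best = sum;
      best_gp = gp;
    }
  }
  *covered = best;
  return best_gp;
}
```

Every reference that lands inside the window loses its lui instruction, so a well-placed pointer over frequently accessed data pays off many times over.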

Two is generally better than one

The RISC-V register file has a thread-pointer register, tp, that is designed for thread-local data. If your application does not use thread-local data, then tp is unused. (To avoid confusion: you can have a multi-tasking application that does not use per-thread copies of data, and therefore does not require a thread pointer.) The SEGGER Linker enables that register to be unlocked for use and become a second global base: all the transformations applied for the global pointer are applied for the thread pointer too. Of course, you can specify the model for the thread pointer to use: reserved, use as a global base, use for thread-local data, or automatically assign its model based on input files.

And if you link your application at location zero, the SEGGER Linker will automatically transform the code to use the “zero” register x0 with all those great optimizations. So now the SEGGER Linker can juggle code and data and lay it out for best use by three register-based pointers.
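With three bases in play, the question for each data access becomes: which register, if any, is within reach? Here is a minimal sketch of that check, assuming the usual 12-bit signed offset for loads and stores. The names base_reg_t and pick_base are invented for the example, and the fixed priority order (zero, then gp, then tp) is an arbitrary choice for the sketch rather than anything the SEGGER Linker is documented to do.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { BASE_NONE, BASE_ZERO, BASE_GP, BASE_TP } base_reg_t;

/* True if 'addr' is reachable from 'base' with a 12-bit signed offset. */
static int in_reach(uint32_t base, uint32_t addr) {
  int32_t off = (int32_t)(addr - base);
  return off >= -2048 && off <= 2047;
}

/* Choose a base register for a one-instruction access to 'addr'.
   tp_is_global_base is nonzero when tp is not needed for thread-local data. */
base_reg_t pick_base(uint32_t addr, uint32_t gp, uint32_t tp,
                     int tp_is_global_base) {
  assert(tp_is_global_base == 0 || tp_is_global_base == 1);
  if (in_reach(0u, addr)) return BASE_ZERO;  /* x0 always reads as zero */
  if (in_reach(gp, addr)) return BASE_GP;
  if (tp_is_global_base && in_reach(tp, addr)) return BASE_TP;
  return BASE_NONE;                          /* keep the long sequence  */
}
```

Note that the zero register reaches both ends of the address space: the first 2 KiB and, through negative offsets, the last 2 KiB.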

What’s the catch?

The optimizations and transformations described here have a dual benefit: they produce code that is not just smaller, but also faster!

One optimization that slightly reduces performance (one branch penalty) is springboarding. The SEGGER Linker can transform groups of calls and jumps through common springboards, delivering great code reductions with a single-branch runtime penalty. This is one more tool at your disposal, to accept a few additional cycles in order to fit your application into flash.
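A rough way to see the trade-off is to count bytes. The model below is an assumption made for illustration, not a description of the SEGGER Linker’s internals: several call sites that would each need the 8-byte long call are instead given a short call (4 bytes, or 2 bytes for a compact one) to a shared springboard nearby, and the springboard holds a single 8-byte long jump to the real target. The function name springboard_saving is hypothetical.

```c
#include <assert.h>

/* Bytes saved by routing n_calls far calls through one shared springboard,
   under the simplified model described in the text. A negative result
   means the springboard costs more than it saves. */
long springboard_saving(unsigned n_calls, unsigned short_call_size) {
  assert(short_call_size == 2 || short_call_size == 4);
  const long far_seq = 8;                        /* auipc + jalr/jr pair    */
  long direct = (long)n_calls * far_seq;         /* every site goes long    */
  long shared = (long)n_calls * (long)short_call_size + far_seq;
  return direct - shared;
}
```

Under this model ten call sites with 32-bit short calls save 32 bytes, while a single call site loses 4 bytes, so the transformation only makes sense for targets that are called from several places.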

But the user retains total control: all transformations can be enabled or disabled individually, although they apply to the entire program. In the near future, you’ll be able to control optimization on a per-section basis, perfect for isolating time-critical or space-critical functions!

Conclusion

The SEGGER Linker is delivering substantial benefits for existing applications. Embracing SEGGER’s design philosophy, the SEGGER Linker simply works: it gets the best from your application, so you can drive on automatic. If you’re one of the more hands-on manual types, then the SEGGER Linker has that covered too, enabling precise control of where code and data go and where your base registers point, and how you want your code transformed.

Benchmark code

Here is the simple benchmark application for the table at the top of the article.

/*********************************************************************
*                   (c) SEGGER Microcontroller GmbH                  *
*                        The Embedded Experts                        *
*                           www.segger.com                           *
**********************************************************************
 
-------------------------- END-OF-HEADER -----------------------------
 
File        : bench-fp-math-size.c
Purpose     : Benchmark overall size of FP library.
 
*/
 
/*********************************************************************
*
*       #include section
*
**********************************************************************
*/
 
#include <math.h>
 
/*********************************************************************
*
*       Static data
*
**********************************************************************
*/
 
static volatile float  vf;
static volatile double vd;
 
/*********************************************************************
*
*       Public code
*
**********************************************************************
*/
 
int main(void) {
  float f;
  double d;
  //
  f = vf;
  f = sinf(f);
  f = cosf(f);
  f = tanf(f);
  f = asinf(f);
  f = acosf(f);
  f = atanf(f);
  f = sinhf(f);
  f = coshf(f);
  f = tanhf(f);
  f = asinhf(f);
  f = acoshf(f);
  f = atanhf(f);
  vf = f;
  //
  d = vd;
  d = sin(d);
  d = cos(d);
  d = tan(d);
  d = asin(d);
  d = acos(d);
  d = atan(d);
  d = sinh(d);
  d = cosh(d);
  d = tanh(d);
  d = asinh(d);
  d = acosh(d);
  d = atanh(d);
  vd = d;
  //
  return 0;
}
 
/*************************** End of file ****************************/

Code size: Closing the gap between RISC-V and Arm for embedded applications