FPU on F407 – how to

Pito
Wed Apr 19, 2017 1:55 pm
This post is about FPU on STM32F407 and friends.

UPDATE
A quick guide to enabling the FPU on STM32F407 boards under STM32duino.

More details will be added further on.

The FPU on the STM32F4xx is a single precision floating point unit; it will not accelerate double precision math.
The FPU operates on 32-bit floats. The expected precision is ~7 digits with the basic math primitives and typical algorithms. With transcendental math functions, expect 5-6 digits.

The theoretical single precision range is approximately +/- 1.175494351E-38 .. 3.402823466E+38.
Not every real number in that range has an exact single precision representation, though.

The FPU provides only a few basic primitive operations (add, sub, mul, div, sqrt) and several supporting instructions (loads, stores, conversions, compares). It does not compute results like sinf() for you; software math libraries must still be used for transcendental functions. With stm32duino the gcc built-in libraries are used (when the proper compile flags are set). You may also use the CMSIS DSP library instead.
(attachment: STM32F407 FPU INSTR.JPG)

Single precision is good for:
1. audio, video and sensor processing, control and regulation – anywhere you need speed and large dynamic range but can accept lower resolution
2. DSP with 12-18bit ADC values, FFT, IFFT, FIR/IIR filters, SI conversions
3. math on, and printing of, real values spanning large order ranges (pico, nano, .., mega, giga), but with only 6-7 digits
4. math calcs where the algorithm can accept the lower resolution.

The Single Precision is NOT good for:
1. navigation, GPS, cartography, land surveying
2. measuring and calcs with frequencies (DDS, frequency measurements, etc.)
3. numerical calculation in science and technology – in mathematics, physics, astronomy, etc.
4. pocket calculators
5. CAD systems and simulators
6. financial calculations.

HINT: I recommend this reading: https://ece.uwaterloo.ca/~dwharder/Nume … /paper.pdf
as using floating point math can sometimes be tricky..

In your platform.txt add
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant


ag123
Wed Apr 19, 2017 3:40 pm
i’m not too sure if there is a way to say that if -mfloat-abi=hard -mfpu=fpv4-sp-d16 is defined, we’d run the initialization code for the fpu.
for now i’ve not found any macros or defines related to -mfloat-abi=hard etc

it would seem that hardware float needs to be handled with somewhat of a ‘library’ approach. i’d think the missing CMSIS libmath.a libraries are probably the reason my sketch ‘hangs’. the surprising thing is that with gcc 6 i didn’t seem to hit any compile or link errors about missing functions


edogaldo
Wed Apr 19, 2017 4:27 pm
A nice tip I just found: running the command below dumps every macro predefined for the target, so you can check which float-related defines are set.
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -E -dM - < /dev/null

Pito
Wed Apr 19, 2017 4:41 pm
VFP refers to the FPU’s fp registers. When you compile with -mfloat-abi=softfp, floating point arguments are passed in the standard r registers rather than the VFP registers. I get errors with -mfloat-abi=hard because it wants to use the VFP regs, but something essential is missing..


Pito
Wed Apr 19, 2017 7:01 pm
With -mfloat-abi=hard -mfpu=fpv4-sp-d16 it generates instructions like
vpadd.u<illegal width 64>

ag123
Wed Apr 19, 2017 7:04 pm
fpu enable codes
http://infocenter.arm.com/help/topic/co … BJHIG.html
//enable the fpu (cortex-m4 - stm32f4* and above)
void enablefpu()
{
  __asm volatile
  (
    " ldr.w r0, =0xE000ED88    \n" /* The FPU enable bits are in the CPACR. */
    " ldr r1, [r0]             \n" /* read CPACR */
    " orr r1, r1, #(0xf << 20) \n" /* set bits 20-23 to enable CP10 and CP11 coprocessors */
    " str r1, [r0]             \n" /* write back the modified value to the CPACR */
    " dsb                      \n" /* wait for the store to complete */
    " isb"                         /* reset the pipeline, now the FPU is enabled */
  );
}


Pito
Wed Apr 19, 2017 7:33 pm
Now with the flags below (per Soebler) added to the C and C++ flags:
-mfloat-abi=softfp -mfpu=fpv4-sp-d16

ag123
Wed Apr 19, 2017 8:16 pm
actually, this isn’t incorrect; rather, gcc is quite well known for ‘removing code’ at those (higher) optimization levels
it is discussed somewhere in the boinc (mass distributed community computing projects) forums

if you have a statement c = a + b, and c is not referenced anywhere, gcc can *remove* it, and that happens with the whetstone code (perhaps with the -O3 optimization). if gcc gets even ‘smarter’ and removes all the whetstone code because it is ‘doing nothing’, performance tends to infinity – faster than the fastest petaflops supercomputer in the world, in 0 nanoseconds :lol:

p.s. what the boinc people did is assign the whetstone results to global variables, so that gcc won’t remove them, but we’d leave that ‘enhancement’ for another time


edogaldo
Wed Apr 19, 2017 8:20 pm
Probably the best way to test fpu improvements is skipping optimizations..

Pito
Wed Apr 19, 2017 8:22 pm
Ok, so the -O3 optimized out all the calcs. We did 65 MIPS with -softfp, and with -O3 and -hard we get 340 MIPS – that is 5x, which would be the number I would expect from the F407 FPU. But people say -softfp and -hard perform similarly.. That is confusing..
I will try with -O0 as the last exercise today..
EDIT: The 65 MIPS is most probably a fizzle..

Pito
Wed Apr 19, 2017 8:38 pm
-O0, -g, 4.8.3-2014q1, -mfloat-abi=softfp, 168MHz
Loops: 1000, Iterations: 10, Duration: 23948 millisec.
C Converted Single Precision Whetstones: 41.76 MIPS

ag123
Wed Apr 19, 2017 9:18 pm
uploading whetstone.zip here as well

usage:
connect to usb-serial console, press ‘g’ to start

– v0.5 fixes for single precision: call single precision math lib functions instead of the double precision ones;
added code to detect whether software or hardware floating point is selected
– v0.4 added fpu enabling code
– v0.3 fixed a dangling double that made single precision work longer than expected
– v0.2 fixed some syntax errors
– v0.1 initial release


ag123
Wed Apr 19, 2017 9:25 pm
hi pito,

stm32f407vet
compiled optimization -O0 no optimization, no debug
hardware floating point, single precision
Beginning Whetstone benchmark at 168 MHz ...

Loops:1000, Iterations:1, Duration:2486.45 millisec
C Converted Single Precision Whetstones:40.22 mflops


Pito
Wed Apr 19, 2017 9:29 pm
Thanks! At least we know we use a similar chip :)

Here is a benchmark I used to use in the past. It computes integrals of various crazy functions. It comes from Rosetta Code, but I changed the functions (it compares numerical integration methods against the analytical integrals of the respective functions).
We may use it as a benchmark for single and double precision experiments. I can hardly imagine it optimizes anything away :)
Just add time measurement and that's it. It is a much better benchmark than the Whetstone dinosaur :)
// A Benchmark - Integrals - Single Precision (with FPU on/off)
// Based on http://rosettacode.org/wiki/Numerical_integration
// Pito 4/2014
// Mod for STM32duino 4/2017
// Functions F: f1a = INTEGRAL(f1), etc.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdarg.h>

#define WITH_FPU 1

void prntf(char *fmt, ...){
  char buf[128]; // resulting string limited to 128 chars
  va_list args;
  va_start(args, fmt);
  vsnprintf(buf, 128, fmt, args);
  va_end(args);
  Serial.print(buf);
}

// Integration methods
//
float int_leftrect(float from, float to, int n, float (*func)(float))
{
  float h = (to-from)/n;
  float sum = 0.0, x;
  for(x=from; x <= (to-h); x += h)
    sum += func(x);
  return h*sum;
}

float int_rightrect(float from, float to, int n, float (*func)(float))
{
  float h = (to-from)/n;
  float sum = 0.0, x;
  for(x=from; x <= (to-h); x += h)
    sum += func(x+h);
  return h*sum;
}

float int_midrect(float from, float to, int n, float (*func)(float))
{
  float h = (to-from)/n;
  float sum = 0.0, x;
  for(x=from; x <= (to-h); x += h)
    sum += func(x+h/2.0);
  return h*sum;
}

float int_trapezium(float from, float to, int n, float (*func)(float))
{
  float h = (to - from) / n;
  float sum = func(from) + func(to);
  int i;
  for(i = 1; i < n; i++)
    sum += 2.0*func(from + i * h);
  return h * sum / 2.0;
}

float int_simpson(float from, float to, int n, float (*func)(float))
{
  float h = (to - from) / n;
  float sum1 = 0.0;
  float sum2 = 0.0;
  int i;

  for(i = 0; i < n; i++)
    sum1 += func(from + h * i + h / 2.0);

  for(i = 1; i < n; i++)
    sum2 += func(from + h * i);

  return h / 6.0 * (func(from) + func(to) + 4.0 * sum1 + 2.0 * sum2);
}

/* test functions */
// F1
float f1(float x)
{
  return ( x*x /( (-1.0 + powf(x,4.0)) * (-1.0 + powf(x,4.0)) ) );
}

float f1a(float x)
{
  return ( ( (-4.0 * x*x*x)/(-1.0 + x*x*x*x) - 2.0*atanf(x) - logf(-1.0 + x) + logf(1.0 + x))/16.0 );
}

// F2
float f2(float x)
{
  return ( sinf(x + 1.0)*cosf(x - 1.0)/tanf(x - 1.0) );
}

float f2a(float x)
{
  return ( ( 2.0*x*cosf(2.0) + 4.0*logf(sinf(1.0 - x))*sinf(2.0) + sinf(2.0*x) )/4.0 );
}

// F3
float f3(float x)
{
  return ( (5.0 + 3.0*x)/(1.0 - x - x*x + x*x*x) );
}

float f3a(float x)
{
  return ( -4.0/(-1.0 + x) - logf(-1.0 + x)/2.0 + logf(1.0 + x)/2.0 );
}

// F4
float f4(float x)
{
  return ( 1.0 /(2.0 + x*x*x*x) );
}

float f4a(float x)
{
  float k = 2.0;
  float m = 0.25;
  return ( (-k*atanf(1.0 - powf(k,m)*x) + k*atanf(1.0 +
           powf(k,m)*x) - logf(k - k*powf(k,m)*x + sqrtf(k)*x*x) +
           logf(k + k*powf(k,m)*x + sqrtf(k)*x*x))/(8.0*powf(k,m)) );
}

typedef float (*pfunc)(float, float, int , float (*)(float));
typedef float (*rfunc)(float);

#define INTG(F,A,B) (F((B))-F((A)))

void setup()
{
  int i, j;
  float ic, tmp;

  if(WITH_FPU) {
    __asm volatile
    (
      " ldr.w r0, =0xE000ED88    \n" /* The FPU enable bits are in the CPACR. */
      " ldr r1, [r0]             \n" /* read CPACR */
      " orr r1, r1, #(0xf << 20) \n" /* enable CP10 and CP11 coprocessors, then save back */
      " str r1, [r0]             \n"
      " dsb                      \n"
      " isb"                         /* reset the pipeline, now the FPU is enabled */
    );
  }

  Serial.begin(115200);
  // while(!Serial.isConnected());

  pfunc f[5] = {
    int_leftrect, int_rightrect,
    int_midrect, int_trapezium,
    int_simpson
  };

  const char *names[5] = {
    "leftrect", "rightrect", "midrect",
    "trapezium", "simpson"
  };

  rfunc rf[] = { f1, f2, f3, f4 };
  rfunc If[] = { f1a, f2a, f3a, f4a };

  float ivals[] = {
    2.0, 10.0,
    -1.0, -0.7,
    10.0, 50.0,
    0.0, 10.0
  };

  int approx[] = { 10000, 10000, 10000, 10000 };

  prntf("\nBenchmark starts..\n");

  uint32 elapsed = millis();

  for(j=0; j < (sizeof(rf) / sizeof(rfunc)); j++)
  {
    for(i=0; i < 5; i++)
    {
      ic = (*f[i])(ivals[2*j], ivals[2*j+1], approx[j], rf[j]);
      tmp = INTG((*If[j]), ivals[2*j], ivals[2*j+1]);
      prntf("%10s [%1.1e %1.1e] num: %1.6e, ana: %1.6e\n",
            names[i], ivals[2*j], ivals[2*j+1], ic, tmp);
    }
    prntf("\n");
  }

  elapsed = millis() - elapsed;
  prntf("Benchmark ends, elapsed %d msecs\n", elapsed);
}

void loop() {}


ag123
Wed Apr 19, 2017 9:43 pm
@pito,

for whetstone, to prevent gcc from optimizing code away, i’ve seen examples using global variables or arrays to store the results of the whetstone computation

an example of that is in the boinc project, where the whetstone benchmark is used to compute ‘credits’ for the participants as an estimate of the ‘work’ done
https://github.com/BOINC/boinc/blob/mas … tstone.cpp
the boinc code patches the results of the calculations into global variables and arrays during the calculations, so that the gcc -O options will not remove them

in the mean time i’d guess we’ll make do and let gcc ‘cheat’ a little & get higher m/gflops :lol:
ok signing off for the day/night




ag123
Thu Apr 20, 2017 5:21 am
@edogaldo thanks,

stm32f407vet
compiled optimization -O0 no optimization, no debug, -fsingle-precision-constant
hardware floating point, single precision
Loops:1000, Iterations:1, Duration:2030.42 millisec
C Converted Single Precision Whetstones:49.25 mflops


ag123
Thu Apr 20, 2017 6:51 am
@pito let's post your benchmark results, and perhaps the code as well, in the dhrystone and whetstone thread

i’m thinking that the -O2 and -O3 optimizations likely removed some code, but nevertheless it is just fun to watch those ‘impossible’ mflops
Have you heard about the new Cray super computer? It’s so fast, it executes an infinite loop in 6 seconds.
:lol:


Pito
Thu Apr 20, 2017 7:28 am
-O0, -g, 4.8.3-2014q1, -mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant, 168MHz
Loops: 1000, Iterations: 10, Duration: 19638 millisec.
C Converted Single Precision Whetstones: 50.92 MIPS

Pito
Thu Apr 20, 2017 8:11 am
@ag123: I’ve updated the Integrals Benchmark above with the single precision FPU on/off variant.
I can build it, but it does not run here. No idea why.. Maybe math libraries are missing?? But it builds w/o error.
Can you try to run it in your environment, plz?

ag123
Thu Apr 20, 2017 9:01 am
@pito
stm32f407vet, WITH_FPU 0, optimization -Os (optimise size), debug off, v6.3.1-2017-q1-update
software floating point, single precision
Benchmark starts..
leftrect [2.0e+00 1.0e+01] num: 6.720000e-03, ana: 6.712139e-03
rightrect [2.0e+00 1.0e+01] num: 6.705779e-03, ana: 6.712139e-03
midrect [2.0e+00 1.0e+01] num: 6.712879e-03, ana: 6.712139e-03
trapezium [2.0e+00 1.0e+01] num: 6.712058e-03, ana: 6.712139e-03
simpson [2.0e+00 1.0e+01] num: 6.712050e-03, ana: 6.712139e-03

leftrect [-1.0e+00 -7.0e-01] num: -2.613795e-03, ana: -2.612248e-03
rightrect [-1.0e+00 -7.0e-01] num: -2.613944e-03, ana: -2.612248e-03
midrect [-1.0e+00 -7.0e-01] num: -2.613870e-03, ana: -2.612248e-03
trapezium [-1.0e+00 -7.0e-01] num: -2.612268e-03, ana: -2.612248e-03
simpson [-1.0e+00 -7.0e-01] num: -2.612271e-03, ana: -2.612248e-03

leftrect [1.0e+01 5.0e+01] num: 2.825554e-01, ana: 2.824790e-01
rightrect [1.0e+01 5.0e+01] num: 2.824033e-01, ana: 2.824790e-01
midrect [1.0e+01 5.0e+01] num: 2.824792e-01, ana: 2.824790e-01
trapezium [1.0e+01 5.0e+01] num: 2.824792e-01, ana: 2.824790e-01
simpson [1.0e+01 5.0e+01] num: 2.824791e-01, ana: 2.824790e-01

leftrect [0.0e+00 1.0e+01] num: 6.603576e-01, ana: 6.601052e-01
rightrect [0.0e+00 1.0e+01] num: 6.598577e-01, ana: 6.601052e-01
midrect [0.0e+00 1.0e+01] num: 6.601063e-01, ana: 6.601052e-01
trapezium [0.0e+00 1.0e+01] num: 6.601086e-01, ana: 6.601052e-01
simpson [0.0e+00 1.0e+01] num: 6.601073e-01, ana: 6.601052e-01

Benchmark ends, elapsed 8161 msecs


Pito
Thu Apr 20, 2017 11:25 am
What is interesting:
1. the .bin for MMini, -Os, DP, noFPU is 47.8kB
2. the .bin for 407, -O0, FPU -hard, SP is 33kB
That’s weird.. And why does Whetstone compile and run? It uses the same math.h and almost the same math funcs (sin/cos…).

edogaldo
Thu Apr 20, 2017 11:36 am
Pito wrote:What is interesting:
1. the .bin for MMini, -Os, DP, noFPU is 47.8kB
2. the .bin for 407, -O0, SP is 33kB
That’s weird.. Why Whetstone compiles and runs. The same math.h, almost the same math funcs (sin/cos…).

Pito
Thu Apr 20, 2017 11:48 am
@edogaldo: updated.

The problem seems to be different:
1. I commented out all the math in the functions and returned 1.0 – no go with FPU on; then
2. I replaced the prntf results with a simple Serial.println("Hello") – no go with FPU on; then
3. I replaced "float" with "double" with the FPU on – it runs and prints Hello.

The Whetstone runs with "float" and FPU on, and with the same config (includes, asm) as here (and with the same math functions, except powf(), which I commented out here).
So my suspicion points to these "float" pointer exercises while the FPU is on, like:
typedef float (*pfunc)(float, float, int , float (*)(float));
typedef float (*rfunc)(float);


ag123
Thu Apr 20, 2017 12:31 pm
@pito for the single precision compile, WITH_FPU = 1, i tried making them static global variables e.g.
typedef float (*pfunc)(float, float, int , float (*)(float));
typedef float (*rfunc)(float);

#define INTG(F,A,B) (F((B))-F((A)))

static pfunc f[5] = {
int_leftrect, int_rightrect,
int_midrect, int_trapezium,
int_simpson
};

static rfunc rf[4] = { f1, f2, f3, f4 };
static rfunc If[4] = { f1a, f2a, f3a, f4a };

static const char *names[5] = {
"leftrect", "rightrect", "midrect",
"trapezium", "simpson"
};


Pito
Thu Apr 20, 2017 12:45 pm
When I poked at it with the debugger, it stopped here:
/* STM32 ISR weak declarations */

.thumb

/* Default handler for all non-overridden interrupts and exceptions */
.globl __default_handler
.type __default_handler, %function

__default_handler:
b . <<<<<< cursor

.weak __exc_nmi
.globl __exc_nmi
.set __exc_nmi, __default_handler
.weak __exc_hardfault
.globl __exc_hardfault
.set __exc_hardfault, __default_handler
.weak __exc_memmanage
.globl __exc_memmanage
.set __exc_memmanage, __default_handler
.weak __exc_busfault
.globl __exc_busfault
.set __exc_busfault, __default_handler
.weak __exc_usagefault
.globl __exc_usagefault
.set __exc_usagefault, __default_handler
.weak __stm32reservedexception7
.globl __stm32reservedexception7
.set __stm32reservedexception7, __default_handler
.weak __stm32reservedexception8
.globl __stm32reservedexception8
.set __stm32reservedexception8, __default_handler
.weak __stm32reservedexception9
.globl __stm32reservedexception9
.set __stm32reservedexception9, __default_handler


ag123
Thu Apr 20, 2017 12:50 pm
i think for c++ compiles, in our embedded compile setups, we have a set of flags that removes various c++ features:
-fno-exceptions
-fno-rtti
-fno-use-cxa-atexit
-fno-threadsafe-statics

Pito
Thu Apr 20, 2017 1:07 pm
Do you think FPU=1 starts to mess with integer calculations within init() somewhere? Do we provide floating point in init()?
PS: the closest instruction before the exception was
vpush (d8,d9)

ag123
Thu Apr 20, 2017 1:10 pm
@pito,
i think it probably has nothing to do with those fpu_enable codes, as the enable_fpu part of my code runs only after my key press, but in the first place it did not even get into setup(), as my led did not light up. hence, my guess is we’d need to patch that ISR (interrupt service routine), or perhaps put in something like
void exception_handler(void) {}
or similar – that is probably very early, in the ‘pre main’ part of the code

it sounds like an interrupt fired and there is no handler for it, and i’m not sure which interrupt it is (division by zero?). yup, i think it has something to do with init() :lol:




Pito
Thu Apr 20, 2017 2:48 pm
So with Whetstone no interrupts fire? If it were the same with Whetstone, it could be the interrupts. But Whetstone works (even though we have not seen any results..).

ag123
Thu Apr 20, 2017 2:59 pm
i think one test is to have float a = 1.0 / 0.0, maybe that’d confirm it :lol:

while looking at
– STM32F4/variants/black_f407vet6/stm32_vector_table.S
.long __irq_FPU_IRQHandler /* FPU */


edogaldo
Thu Apr 20, 2017 3:28 pm
@Pito/ag123
Just a question: which core are you using?
Now there are so many..

Pito
Thu Apr 20, 2017 3:42 pm
I use the one from July 2016.

ag123
Thu Apr 20, 2017 4:09 pm
i’m actually working on the development branch on steve’s repository
https://github.com/stevstrong/Arduino_S … F4_variant

to use code from the branch, the easiest way is simply to use git to check out that branch
http://www.stm32duino.com/viewtopic.php … =80#p26483

note that if you do a clone of the repository, you will likely get ‘everything’, including the branch.
before you start work on that branch (after a fetch/pull or clone), it is necessary to issue
git checkout Black_F4_variant


Pito
Thu Apr 20, 2017 5:48 pm
Ok, again with debugger.
It happens when the code enters setup() at address aa4. There is something weird at this address (first picture).
It comes to reservedexception9, which is handled at address dd0 in the isrs file (second picture).
So the jump from main() to setup() at the weird aa4 is causing the exception, imho :?:
Around aa4 there are 2 identical setup() labels with the same crazy instructions at even and odd addresses.
From aad the asm FPU enable starts, then the delay(100) and then Serial.begin.
Before setup() there are the math functions, consisting mostly of fpu instruction calls.

(attachment: Crash1.JPG)

ag123
Thu Apr 20, 2017 6:05 pm
this is complicated, i can’t see anything wrong in the (c++) code :lol:

i did something tedious: moved all the subroutines below setup() and loop() and declared function prototypes at the top. it didn’t work, the led is still dark.
the typedefs remain there though


Pito
Thu Apr 20, 2017 6:18 pm
And this is how the fine-working Whetstone setup() looks (the same FPU settings).
The math code before setup() finishes fine, then a nop, then setup() etc.
The instructions around setup() look clean. At the crash scene above there is a big mess within +/-2 instructions around setup().

(attachment: Whet OK setup.JPG)

ag123
Thu Apr 20, 2017 6:24 pm
trying something even more tedious: move the integral into a different cpp file and declare the subroutines in the header :lol:

yes, it works! in 1 minute i’ll upload it in a zip file here. note i’m using purely c/c++ file naming; the file integral_main.cpp contains setup() and loop() and would be renamed with an .ino suffix/file type.
the floating point integral benchmark is in integral_float.cpp.
uploaded the source code; replace SerialUSB with Serial if you are using Serial.
press ‘g’ to start

WITH_FPU 1, optimization -Os (optimise size), debug off, v6.3.1-2017-q1-update
hardware floating point, single precision
Benchmark starts..
leftrect [2.0e+00 1.0e+01] num: 6.720000e-03, ana: 6.712139e-03
rightrect [2.0e+00 1.0e+01] num: 6.705779e-03, ana: 6.712139e-03
midrect [2.0e+00 1.0e+01] num: 6.712879e-03, ana: 6.712139e-03
trapezium [2.0e+00 1.0e+01] num: 6.712059e-03, ana: 6.712139e-03
simpson [2.0e+00 1.0e+01] num: 6.712050e-03, ana: 6.712139e-03

leftrect [-1.0e+00 -7.0e-01] num: -2.613795e-03, ana: -2.612248e-03
rightrect [-1.0e+00 -7.0e-01] num: -2.613944e-03, ana: -2.612248e-03
midrect [-1.0e+00 -7.0e-01] num: -2.613870e-03, ana: -2.612248e-03
trapezium [-1.0e+00 -7.0e-01] num: -2.612268e-03, ana: -2.612248e-03
simpson [-1.0e+00 -7.0e-01] num: -2.612271e-03, ana: -2.612248e-03

leftrect [1.0e+01 5.0e+01] num: 2.825554e-01, ana: 2.824790e-01
rightrect [1.0e+01 5.0e+01] num: 2.824033e-01, ana: 2.824790e-01
midrect [1.0e+01 5.0e+01] num: 2.824792e-01, ana: 2.824790e-01
trapezium [1.0e+01 5.0e+01] num: 2.824792e-01, ana: 2.824790e-01
simpson [1.0e+01 5.0e+01] num: 2.824791e-01, ana: 2.824790e-01

leftrect [0.0e+00 1.0e+01] num: 6.603576e-01, ana: 6.601052e-01
rightrect [0.0e+00 1.0e+01] num: 6.598577e-01, ana: 6.601052e-01
midrect [0.0e+00 1.0e+01] num: 6.601063e-01, ana: 6.601052e-01
trapezium [0.0e+00 1.0e+01] num: 6.601086e-01, ana: 6.601052e-01
simpson [0.0e+00 1.0e+01] num: 6.601073e-01, ana: 6.601052e-01

Benchmark ends, elapsed 2383 msecs


Pito
Thu Apr 20, 2017 6:42 pm
With WITH_FPU 1 I guess :)

ag123
Thu Apr 20, 2017 6:57 pm
yup, because if WITH_FPU is not defined it will ‘hang’ – fpu not enabled :lol:
you can take a look at the source code, uploaded in the last post

Pito
Thu Apr 20, 2017 7:04 pm
This is with AG version
There is nothing before setup(): probably vectors, then __do_global_dtors_aux: and frame_dummy.

(attachment: IntegrAG.JPG)

Rick Kimball
Thu Apr 20, 2017 7:14 pm
maybe this problem?
http://stackoverflow.com/questions/2711 … ting-point

ag123
Thu Apr 20, 2017 7:23 pm
i’d guess one of the next things may be to try to catch the normal floating point exceptions (e.g. division by zero); that’d need messing with interrupts :D

Pito
Thu Apr 20, 2017 7:26 pm
I cannot get your version working :) :) :)
Benchmark starts..

ag123
Thu Apr 20, 2017 7:31 pm
void integral(void) is declared in integral.h;
the source is in integral_float.cpp

in either setup() or loop()
you’d need to call integral(), which is the benchmark code

do upload a version that’s probably more stm32duino ‘compliant’, as mine is pretty much a scratch

oops, strange – doesn’t seem relevant; it runs here, i mean integral()
;)

the defines used, as extracted from a makefile:
-mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -Os
-fmessage-length=0 -fsigned-char -ffunction-sections -fdata-sections -ffreestanding -fno-move-loop-invariants
-Wall -Wextra
-DSTM32_HIGH_DENSITY
-DSTM32F4
-DBOARD_black_f4
-DCONFIG_MAPLE_MINI_NO_DISABLE_DEBUG
-DDEBUG_LEVEL=DEBUG_NONE
-DVECT_TAB_ADDR=0x8000000
-DVECT_TAB_BASE
-DF_CPU=168000000L
-DSERIAL_USB
-DUSB_VID=0x1EAF
-DUSB_PID=0x0004
-DUSB_MANUFACTURER="Unknown"
-I"/opt3/src/workspace/STM32F4duinoblinky/STM32F4/cores/maple"
-I"/opt3/src/workspace/STM32F4duinoblinky/STM32F4/cores/maple/libmaple"
-I"/opt3/src/workspace/STM32F4duinoblinky/STM32F4/cores/maple/libmaple/usbF4/VCP"
-I"/opt3/src/workspace/STM32F4duinoblinky/STM32F4/cores/maple/libmaple/usbF4/STM32_USB_OTG_Driver/inc"
-I"/opt3/src/workspace/STM32F4duinoblinky/STM32F4/cores/maple/libmaple/usbF4/STM32_USB_Device_Library/Core/inc"
-I"/opt3/src/workspace/STM32F4duinoblinky/STM32F4/cores/maple/libmaple/usbF4/STM32_USB_Device_Library/Class/cdc/inc"
-I"/opt3/src/workspace/STM32F4duinoblinky/STM32F4/cores/maple/libmaple/usbF4"
-I"/opt3/src/workspace/STM32F4duinoblinky/STM32F4/variants/black_f407vet6"
-std=gnu++11
-fabi-version=0 -fno-exceptions -fno-rtti -fno-use-cxa-atexit -fno-threadsafe-statics
-MMD -MP -MF"$(@:%.o=%.d)" -MT"$(@)" -c -o "$@" "$<"


Pito
Thu Apr 20, 2017 7:36 pm
In AG.ino I call integral() in loop(), as in your source (I am using your source).
I put integral.h, integral_float.cpp and AG.ino into the same project dir.
It compiles to 50k without error or warning.
It prints, so the Serial works.
Enclosed is my code, without USB and LEDs.

(attachment: AGsetup.JPG)

ag123
Thu Apr 20, 2017 7:56 pm
ok uploaded integral_bench.zip
press ‘g’ to start :D

i compiled this in arduino ide 1.8.2, the board selected is “stm32 discovery f407”
note that SerialUSB exists and is defined, i do not need to substitute SerialUSB with Serial

4.8.3-2014q1, hardware floating point, parameters based on platform.txt and boards.txt for the above board
Benchmark starts..
leftrect [2.0e+00 1.0e+01] num: 6.720000e-03, ana: 6.712139e-03
rightrect [2.0e+00 1.0e+01] num: 6.705779e-03, ana: 6.712139e-03
midrect [2.0e+00 1.0e+01] num: 6.712879e-03, ana: 6.712139e-03
trapezium [2.0e+00 1.0e+01] num: 6.712058e-03, ana: 6.712139e-03
simpson [2.0e+00 1.0e+01] num: 6.712050e-03, ana: 6.712139e-03

leftrect [-1.0e+00 -7.0e-01] num: -2.613795e-03, ana: -2.612248e-03
rightrect [-1.0e+00 -7.0e-01] num: -2.613944e-03, ana: -2.612248e-03
midrect [-1.0e+00 -7.0e-01] num: -2.613870e-03, ana: -2.612248e-03
trapezium [-1.0e+00 -7.0e-01] num: -2.612268e-03, ana: -2.612248e-03
simpson [-1.0e+00 -7.0e-01] num: -2.612271e-03, ana: -2.612248e-03

leftrect [1.0e+01 5.0e+01] num: 2.825554e-01, ana: 2.824790e-01
rightrect [1.0e+01 5.0e+01] num: 2.824033e-01, ana: 2.824790e-01
midrect [1.0e+01 5.0e+01] num: 2.824792e-01, ana: 2.824790e-01
trapezium [1.0e+01 5.0e+01] num: 2.824792e-01, ana: 2.824790e-01
simpson [1.0e+01 5.0e+01] num: 2.824791e-01, ana: 2.824790e-01

leftrect [0.0e+00 1.0e+01] num: 6.603576e-01, ana: 6.601052e-01
rightrect [0.0e+00 1.0e+01] num: 6.598577e-01, ana: 6.601052e-01
midrect [0.0e+00 1.0e+01] num: 6.601063e-01, ana: 6.601052e-01
trapezium [0.0e+00 1.0e+01] num: 6.601086e-01, ana: 6.601052e-01
simpson [0.0e+00 1.0e+01] num: 6.601073e-01, ana: 6.601052e-01

Benchmark ends, elapsed 8136 msecs


Pito
Thu Apr 20, 2017 8:12 pm
That's the same source as I use. Mine stops in the printing, see above. It prints 3 Hello's and stops.
I am getting __stm32reservedexception9 in
prntf("%10s [%1.1e %1.1e] num: %1.6e, ana: %1.6e\n",
names[i], ivals[2*j], ivals[2*j+1], ic, tmp);

ag123
Thu Apr 20, 2017 8:29 pm
oops, it seems like the default compile in arduino ide 1.8.2 is for software float.
i’m hitting some compile errors when i add the hardware float compile flags in boards.txt
discovery_f407.build.extra_flags=-DMCU_STM32F406VG -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -Os -DSTM32_HIGH_DENSITY -DSTM32F2 -DSTM32F4 -DBOARD_discovery_f4

Pito
Thu Apr 20, 2017 8:39 pm
Compiled your arduino sketch under arduino ide. Serial used. compiled with
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant

ag123
Thu Apr 20, 2017 8:42 pm
the main part which i edited for printf is
void prntf(char *fmt, ...){
  char buf[128]; // resulting string limited to 128 chars
  va_list args;
  va_start(args, fmt);
  vsnprintf(buf, 128, fmt, args);
  va_end(args);
  for(int i = 0; i < 128; i++) {
    if(buf[i] == '\0') break;
    if(buf[i] == '\n')
      SerialUSB.println();
    else
      SerialUSB.write(buf[i]);
  }
}

Pito
Thu Apr 20, 2017 8:45 pm
Yes, it crashes with your prntf version too.
Yes, it crashes with my version too.
I'm not using USBSerial.
So what is the trick with your version which works?? :?

ag123
Thu Apr 20, 2017 8:48 pm
it is somewhat surprising, as these are the same files i used in eclipse; just that instead of .ino, i renamed it to .cpp to make it compile :?

Pito
Thu Apr 20, 2017 8:52 pm
So back to eclipse. My setup is the same as yours, isn't it?
It compiles. I use Serial instead of USBSerial, so the .bins are not the same size.
It prints, so it compiles and links all components, it seems.
It crashes while printing the results.
Both prntf versions tried.

ag123
Thu Apr 20, 2017 8:53 pm
the other thing which may be different is that i’m using gcc 6 in eclipse, while in the arduino ide it is 4.8.3; i’m not too sure if it matters.
currently only the software float part of it works in the arduino ide.
signing off for now

Pito
Thu Apr 20, 2017 9:01 pm
I’ve replaced the prntf with a custom Serial.println(float2s(ic,7)) to see the results at least. It works:
-Os -g, 4.8.3-2014q1, -mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant, FPUon, 168MHz,
Benchmark starts..
left num: 6.7200002e-3 ana: 6.7121386e-3
righ num: 6.7057795e-3 ana: 6.7121386e-3
midr num: 6.7128791e-3 ana: 6.7121386e-3
trap num: 6.7120594e-3 ana: 6.7121386e-3
simp num: 6.7120509e-3 ana: 6.7121386e-3

left num: -2.6137950e-3 ana: -2.6122405e-3
righ num: -2.6139435e-3 ana: -2.6122405e-3
midr num: -2.6138706e-3 ana: -2.6122405e-3
trap num: -2.6122679e-3 ana: -2.6122405e-3
simp num: -2.6122713e-3 ana: -2.6122405e-3

left num: 2.8255537e-1 ana: 2.8247904e-1
righ num: 2.8240325e-1 ana: 2.8247904e-1
midr num: 2.8247917e-1 ana: 2.8247904e-1
trap num: 2.8247917e-1 ana: 2.8247904e-1
simp num: 2.8247909e-1 ana: 2.8247904e-1

left num: 6.6035757e-1 ana: 6.6010518e-1
righ num: 6.5985774e-1 ana: 6.6010518e-1
midr num: 6.6010627e-1 ana: 6.6010518e-1
trap num: 6.6010856e-1 ana: 6.6010518e-1
simp num: 6.6010732e-1 ana: 6.6010518e-1

Benchmark ends, elapsed 924 msecs


Pito
Thu Apr 20, 2017 9:30 pm
Now the issue is with vsnprintf. It loops in memset() (lines 4688-4694), where it increments the r2 register from ~zero towards r6 (r6 set to 0x8000808) in steps of 16, and when it reaches 1048584 it jumps to reservedexception9 (wdt??).
This vsnprintf is dangerous. We must get rid of it.
See the loop below.

(attachment: vsnprtf issue.JPG)

Pito
Thu Apr 20, 2017 11:00 pm
@ag123: enclosed the modded file with printing floats which works. You may try. I got 924msecs, -O2, FPU on. See above the results.
I will modify the stuff later on to print results after the computation. But it is ok for now.
Interested to see your results.. :)
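For reference, a printf-free float formatter can be sketched roughly like this. This is an illustrative reimplementation in the spirit of the float2s() helper mentioned above, not the actual modded file; the function name, signature, and digit handling are assumptions:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Hypothetical sketch: format a float as d.ddddde±x without letting
   printf touch any floating point arguments. Not the real float2s(). */
static char *float_to_str(float f, int digits, char *buf)
{
    char *p = buf;
    if (f < 0.0f) { *p++ = '-'; f = -f; }
    int exp10 = 0;
    if (f > 0.0f) {                        /* normalize mantissa to [1,10) */
        while (f >= 10.0f) { f /= 10.0f; exp10++; }
        while (f <  1.0f)  { f *= 10.0f; exp10--; }
    }
    *p++ = (char)('0' + (int)f);           /* leading digit */
    *p++ = '.';
    for (int i = 1; i < digits; i++) {     /* remaining mantissa digits */
        f = (f - (float)(int)f) * 10.0f;
        *p++ = (char)('0' + (int)f);
    }
    sprintf(p, "e%d", exp10);              /* exponent is an int, so this is safe */
    return buf;
}
```

On a sketch you would replace sprintf with a small integer itoa-style loop as well, so no printf machinery is pulled in at all.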

ag123
Fri Apr 21, 2017 3:21 am
@pito thanks
yup it works
v6.3.1-2017-q1-update, optimization -O2 (optimise more), debug off, note -fsingle-precision-constant is not specified
hardware floating point, single precision, WITH_FPU 1,
Hello 1, The Benchmark starts..
left num: 6.7200002e-3 ana: 6.7121386e-3
righ num: 6.7057795e-3 ana: 6.7121386e-3
midr num: 6.7128791e-3 ana: 6.7121386e-3
trap num: 6.7120594e-3 ana: 6.7121386e-3
simp num: 6.7120504e-3 ana: 6.7121386e-3

left num: -2.6137950e-3 ana: -2.6122479e-3
righ num: -2.6139435e-3 ana: -2.6122479e-3
midr num: -2.6138706e-3 ana: -2.6122479e-3
trap num: -2.6122679e-3 ana: -2.6122479e-3
simp num: -2.6122713e-3 ana: -2.6122479e-3

left num: 2.8255537e-1 ana: 2.8247904e-1
righ num: 2.8240327e-1 ana: 2.8247904e-1
midr num: 2.8247923e-1 ana: 2.8247904e-1
trap num: 2.8247917e-1 ana: 2.8247904e-1
simp num: 2.8247907e-1 ana: 2.8247904e-1

left num: 6.6035757e-1 ana: 6.6010518e-1
righ num: 6.5985774e-1 ana: 6.6010518e-1
midr num: 6.6010627e-1 ana: 6.6010518e-1
trap num: 6.6010856e-1 ana: 6.6010518e-1
simp num: 6.6010732e-1 ana: 6.6010518e-1

Benchmark ends, elapsed 2305 msecs


ag123
Fri Apr 21, 2017 3:48 am
when the completed version is done lets post the source and results in the dhrystone and whetstone thread. this is a better benchmark as it is less prone to gcc optimizing the code away, e.g. -Os, -O2, -O3 apparently made little difference with this benchmark :D

Pito
Fri Apr 21, 2017 7:08 am
Ok, this is the version with Serial.print() only. And the new results:
-Os -g, 4.8.3-2014q1, -mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant, FPUon, 240MHz
Benchmark starts..

left num: 6.7200002e-3 ana: 6.7121386e-3
righ num: 6.7057795e-3 ana: 6.7121386e-3
midr num: 6.7128791e-3 ana: 6.7121386e-3
trap num: 6.7120594e-3 ana: 6.7121386e-3
simp num: 6.7120509e-3 ana: 6.7121386e-3

left num: -2.6137950e-3 ana: -2.6122405e-3
righ num: -2.6139435e-3 ana: -2.6122405e-3
midr num: -2.6138706e-3 ana: -2.6122405e-3
trap num: -2.6122679e-3 ana: -2.6122405e-3
simp num: -2.6122713e-3 ana: -2.6122405e-3

left num: 2.8255537e-1 ana: 2.8247904e-1
righ num: 2.8240325e-1 ana: 2.8247904e-1
midr num: 2.8247917e-1 ana: 2.8247904e-1
trap num: 2.8247917e-1 ana: 2.8247904e-1
simp num: 2.8247909e-1 ana: 2.8247904e-1

left num: 6.6035757e-1 ana: 6.6010518e-1
righ num: 6.5985774e-1 ana: 6.6010518e-1
midr num: 6.6010627e-1 ana: 6.6010518e-1
trap num: 6.6010856e-1 ana: 6.6010518e-1
simp num: 6.6010732e-1 ana: 6.6010518e-1

Benchmark ends, elapsed 633 msecs..


stevestrong
Fri Apr 21, 2017 7:19 am
Please also describe step-by-step what is necessary to get FPU running (maybe edit the very first post?).

Pito
Fri Apr 21, 2017 7:40 am
Updated the front page..

The quick guide on setting the FPU ON on STM32F407 boards under STM32duino
Will be precised further on

In your platform.txt add
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant


ag123
Fri Apr 21, 2017 9:38 am
@pito put that asm code in a function enable_fpu(); this may help in cases where someone would otherwise copy the asm inline into the same section as the floating point code, resulting in hard faults etc

i’d think at some ‘future’ time enable_fpu() can perhaps go into libmaple as a ‘convenience function’, i.e. only sketches that needs them calls it
i’m not sure if calling it in init() may cause more power consumption etc, or more importantly if it gets called on the f103 it may lead to a hard fault/exception. hence, that’d be left to the user/sketch who needs it to call it

i’ve a version that’s sort of ‘pretty printed’
//enable the fpu (cortex-m4 - stm32f4* and above)
//http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/BEHBJHIG.html
void enablefpu()
{
__asm volatile
(
" ldr.w r0, =0xE000ED88 \n" /* The FPU enable bits are in the CPACR. */
" ldr r1, [r0] \n" /* read CPACR */
" orr r1, r1, #( 0xf << 20 )\n" /* Set bits 20-23 to enable CP10 and CP11 coprocessors */
" str r1, [r0] \n" /* Write back the modified value to the CPACR */
" dsb \n" /* wait for store to complete */
" isb" /* reset pipeline now the FPU is enabled */

);
}


Pito
Fri Apr 21, 2017 9:57 am
This is the final version 1.0 of the Numerical Integration Benchmark.
The Benchmark shows the speed and the accuracy of the math lib.
The Benchmark is not easy to optimize out.
Changes:
1. Added precise results from Wolfram Alpha
2. Printing the results is now off the calculation loop.
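Two of the five quadrature rules the benchmark compares (trapezium and Simpson) can be sketched like this. This is an illustrative version, not the benchmark's actual source; the names, signatures, and sample integrand are assumptions:

```c
#include <math.h>
#include <assert.h>

typedef float (*func_t)(float);

/* trapezium rule: average the endpoints, sum the interior points */
static float trapezium(func_t f, float a, float b, int n)
{
    float h = (b - a) / (float)n;
    float s = 0.5f * (f(a) + f(b));
    for (int i = 1; i < n; i++)
        s += f(a + (float)i * h);
    return s * h;
}

/* Simpson's rule: 1,4,2,4,...,4,1 weights; n must be even */
static float simpson(func_t f, float a, float b, int n)
{
    float h = (b - a) / (float)n;
    float s = f(a) + f(b);
    for (int i = 1; i < n; i++)
        s += ((i & 1) ? 4.0f : 2.0f) * f(a + (float)i * h);
    return s * h / 3.0f;
}

/* sample integrand: f(x) = x^2, whose integral over [0,1] is 1/3 */
static float sq(float x) { return x * x; }
```

With single precision the Simpson result lands within ~1e-7 of the analytic value on a smooth integrand like this, which matches the 6-7 digit behaviour seen in the benchmark output.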

-O3 -g, 4.8.3-2014q1, -mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant, FPU on, Black F407ZET @240MHz
Numerical Integration Benchmark starts..
Single Precision with FPU, v1.0

F1 Result: 0.00671211394423041159587
leftrect num: 6.720000e-3 ana: 6.712139e-3
righrect num: 6.705780e-3 ana: 6.712139e-3
midrect num: 6.712879e-3 ana: 6.712139e-3
trapezium num: 6.712059e-3 ana: 6.712139e-3
simpson num: 6.712051e-3 ana: 6.712139e-3

F2 Result: -0.00261227114510560413684
leftrect num: -2.613795e-3 ana: -2.612241e-3
righrect num: -2.613944e-3 ana: -2.612241e-3
midrect num: -2.613871e-3 ana: -2.612241e-3
trapezium num: -2.612268e-3 ana: -2.612241e-3
simpson num: -2.612271e-3 ana: -2.612241e-3

F3 Result: 0.28247911095899395473
leftrect num: 2.825554e-1 ana: 2.824790e-1
righrect num: 2.824032e-1 ana: 2.824790e-1
midrect num: 2.824792e-1 ana: 2.824790e-1
trapezium num: 2.824792e-1 ana: 2.824790e-1
simpson num: 2.824791e-1 ana: 2.824790e-1

F4 Result: 0.660105195382224847535
leftrect num: 6.603576e-1 ana: 6.601052e-1
righrect num: 6.598577e-1 ana: 6.601052e-1
midrect num: 6.601063e-1 ana: 6.601052e-1
trapezium num: 6.601086e-1 ana: 6.601052e-1
simpson num: 6.601073e-1 ana: 6.601052e-1

Benchmark ends, elapsed 616621 microsecs..


ag123
Fri Apr 21, 2017 10:05 am
@pito
the trouble is on an f1 a user who runs it using soft float may need to wait > 10x that amount of time for the benchmark to complete :lol:
i’d suggest we can use micros() instead which is the time in microseconds and we can make do with less loops

in addition i found that __ARM_PCS_VFP is defined by gcc when compiling with the option -mfloat-abi=hard -mfpu=fpv4-sp-d16 hence, this could be used to determine if the user is compiling with hardware float specified in the compile options
#ifdef __ARM_PCS_VFP
#warning enabling hardware fpu
enablefpu();
#else
#warning using software floating point
#endif


Pito
Fri Apr 21, 2017 10:17 am
Let us make it KISS.
We may enhance it in the near future. The Benchmark is for an FPU, so an F1 user has to modify it for F1.
The Single Precision is not intended for numerical math, so the results will be off for several reasons.
With newer processors with DP FPU we may create Double version.
The Q is whether somebody will use it, so do not spend too much time with it.. ;)
P.
PS: updated the pretty printed asm :)

edogaldo
Fri Apr 21, 2017 11:40 am
ag123 wrote:@pito
the trouble is on an f1 a user who runs it using soft float may need to wait > 10x that amount of time for the benchmark to complete :lol:
[...]


Pito
Fri Apr 21, 2017 12:05 pm
Updated the first post.
@407 experts and users: do start using the FPU and do publish your findings here in this thread. After gathering practical experience we will update.
Happy FPUing :)

Rick Kimball
Fri Apr 21, 2017 12:18 pm
edogaldo wrote:I’d also suggest to put this in function “init()” of “boards.c”.

edogaldo
Fri Apr 21, 2017 12:30 pm
Rick Kimball wrote:edogaldo wrote:I’d also suggest to put this in function “init()” of “boards.c”.

Rick Kimball
Fri Apr 21, 2017 12:53 pm
edogaldo wrote:It can be fine as well but why in variants?!

edogaldo
Fri Apr 21, 2017 1:47 pm
Rick Kimball wrote:Unfortunately, calling it in init() doesn’t guarantee that it will be called before any of the _pre_init constructors.

Pito
Fri Apr 21, 2017 3:25 pm
FYI
-Os, -g, 4.8.3-2014q1, Maple Mini, No FPU, 72MHz
Numerical Integration Benchmark starts..
Single Precision with No FPU, v1.0

F1 Result: 0.00671211394423041159587
leftrect num: 6.720000e-3 ana: 6.712139e-3
righrect num: 6.705780e-3 ana: 6.712139e-3
midrect num: 6.712879e-3 ana: 6.712139e-3
trapezium num: 6.712058e-3 ana: 6.712139e-3
simpson num: 6.712050e-3 ana: 6.712139e-3

F2 Result: -0.00261227114510560413684
leftrect num: -2.613795e-3 ana: -2.612248e-3
righrect num: -2.613944e-3 ana: -2.612248e-3
midrect num: -2.613871e-3 ana: -2.612248e-3
trapezium num: -2.612268e-3 ana: -2.612248e-3
simpson num: -2.612271e-3 ana: -2.612248e-3

F3 Result: 0.28247911095899395473
leftrect num: 2.825554e-1 ana: 2.824790e-1
righrect num: 2.824033e-1 ana: 2.824790e-1
midrect num: 2.824792e-1 ana: 2.824790e-1
trapezium num: 2.824792e-1 ana: 2.824790e-1
simpson num: 2.824791e-1 ana: 2.824790e-1

F4 Result: 0.660105195382224847535
leftrect num: 6.603576e-1 ana: 6.601052e-1
righrect num: 6.598577e-1 ana: 6.601052e-1
midrect num: 6.601063e-1 ana: 6.601052e-1
trapezium num: 6.601086e-1 ana: 6.601052e-1
simpson num: 6.601073e-1 ana: 6.601052e-1

Benchmark ends, elapsed 23992225 microsecs..


ag123
Fri Apr 21, 2017 4:44 pm
actually it just shows that stm32f4 + ART accelerator + fpu + -fsingle-precision-constant + float specific math lib functions + -O2 optimization + overclock is real rocket fuel (and that’s not yet adding hardware fpu compiled math lib functions) :lol:
but to leave something for the stm32 f1 users i’d suggest we aim for about 15-20secs on the f1
and we’d need to use micros() for timing; the ultimate would be when the loops complete in under 1usec, we can then try to compete with the supercomputers :D

Pito
Fri Apr 21, 2017 5:23 pm
Ok I will change back to 10000 steps, and the measurement will be in usecs.
I only wanted to show the F1 users their system is really slow :lol:

ag123
Fri Apr 21, 2017 5:28 pm
Rick Kimball wrote:

Unfortunately, calling it in init() doesn’t guarantee that it will be called before any of the _pre_init constructors.

edogaldo
Fri Apr 21, 2017 5:29 pm
Pito wrote:Ok I will change back to 10000 steps, and the measurement will be in usecs.
I only wanted to show the F1 users their system is really slow :lol:

ag123
Fri Apr 21, 2017 5:30 pm
stm32 f1 is the benchmark :D
for those that are slower too bad :P :lol:

Pito
Fri Apr 21, 2017 5:41 pm
Ok replaced with the politically more correct benchmark with ~20secs for F1 users, and a little bit faster with 407.
Measurement in microsecs.
Updated on the first page (and elsewhere).
Q: hopefully the “float” in MapleMini profile does not mean the “double”. That would be a disaster..

ag123
Fri Apr 21, 2017 6:02 pm
perhaps at some point in time, we may have an f4 solving the travelling salesman problem
https://en.wikipedia.org/wiki/Travellin … an_problem
in real time and win the races :lol:
http://wiki.evilmadscientist.com/TSP_art
https://www.google.com.sg/search?q=tsp+ … DUcQsAQIIQ

Rick Kimball
Fri Apr 21, 2017 6:09 pm
Pito wrote:Q: hopefully the “float” in MapleMini profile does not mean the “double”. That would be a disaster..

Pito
Fri Apr 21, 2017 6:58 pm
:) I stated that because if the MapleMini used double calcs (called float in the benchmark) the results of F1 with float would be better and F1 users may think we cheat :)

What use case do you see that using a double vs a float would be beneficial?
That is a good Q. See my short description on the front page of this thread.
SP is good for “standard” MCU usage as a process controller where the sensors give you inputs with a few digits (say 4-5). Or when you want to do DSP with audio or video (even today’s GPUs go DP).
The moment you want something more demanding, i.e. to navigate from London to Sydney, you need DP.
If you want to predict the position of celestial bodies, or a HAM satellite flyby, you need DP.
Or when implementing a GPS something, you need DP.

You can land safely on Mars under SP control, or control the attitude/engine/flaps/maneuvers of an F-16 with SP, but you cannot navigate to Mars with SP, or navigate from London to Sydney while sitting in an F-16 (when talking dead reckoning for example). Simply, 6-7 “valid digits” with complex math is not enough for a lot of apps, especially when messing with transcendental functions (they usually return 5-6 valid digits).

For example, you may try these quite simple calculations http://www.movable-type.co.uk/scripts/latlong.html with SP against DP and see what the difference would be.


ag123
Fri Apr 21, 2017 8:30 pm
i’d think it may matter if we do some iterative root finding stuff; using floats may lead to the algorithms going in loops and never converging as errors keep accumulating with each iteration :lol:
but i’d guess the f4 ‘platform’ is a niche of sorts, at least the performance is decent/good for these tests :D

for gps kind of things, i’d think double precision is needed for accurate positioning. however if only a relative position is needed and in small metropolitan towns or cities, i’ve tried using the ‘the earth is flat’ approximation i.e. pythagoras theorem, and i’m able to pretty much ‘fake’ the distances between 2 points in a neighbourhood (say based on guess-timated cell tower locations); it would certainly not work if we are on opposite sides of the sphere :lol:


Pito
Fri Apr 21, 2017 8:41 pm
Yes, the “numerical stability” of SP is poor, and in order to get some stability you must mess a lot with the algorithms to make them work; it is less effort to go DP. So you certainly do not want to do your PhD on such a platform :)
Even a compare like if( q == r ) may not work reliably; instead you have to use if ( (q < (r + epsilon)) && (q > (r - epsilon)) ).. to be sure it works.. Many $30 calculators use 25+ digits internally in order to find proper roots :) So forget this kind of math on the 407’s FPU.
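The tolerant compare above is equivalent to testing |q - r| < epsilon; a minimal sketch (the epsilon value to use is application-specific, not a fixed constant):

```c
#include <math.h>
#include <assert.h>

/* approximate equality for floats: true when |q - r| < epsilon */
static int approx_equal(float q, float r, float epsilon)
{
    return fabsf(q - r) < epsilon;
}
```

In practice epsilon is often chosen relative to the magnitudes of q and r (e.g. a few ulps) rather than as an absolute constant, since a fixed epsilon that works near 1.0 is far too loose near 1e-6 and far too tight near 1e6.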
With the CMSIS DSP libs you may try FFT and such stuff. That will work nicely, maybe even faster than with integers.. No problems with scaling, saturation, q15 conversions, etc. Just feed floats in and you get nice float spectra out..
PS:
FFT 1024 points on F429:
q15 457us
q31 855us
SP float 547us
http://www.st.com/content/ccc/resource/ … 273990.pdf

ag123
Fri Apr 21, 2017 8:51 pm
of course the *hard core* can probably do highly optimised double precision floating point using hand-tuned assembly, making sure the loops are unrolled, overclock to 240mhz and let the ART accelerator drive it there :lol:

Pito
Fri Apr 21, 2017 8:57 pm
Look above in the STM doc: the SP float 1024 point FFT is 1.56x faster than the q31 (integer) one. And you will see much nicer spectra..

Pito
Sat Apr 22, 2017 10:12 am
FYI – I’ve built the CMSIS FFT Float32 “arm_fft_bin_example_f32” (from sources, as I have not managed to get the dsplib linked in Sloeber), 1024 points float32 FFT, FPU on, Black 407ZET, Sloeber.
Input/output is float (test signal from the example file).

-O3, -g, 4.8.3-2014q1, 168MHz
FFT.. elapsed 700 microsecs
FFT Magn.. elapsed 211 microsecs
FFT Bins.. elapsed 61 microsecs


Pito
Sat Apr 22, 2017 6:02 pm
And finally 3 tones, amplitudes 100000.0 at 50, 100.0 at 120 and 0.1 at 230. That means 60dB+60dB difference in amplitudes. Settings as above.
You may see the tone at 230 is still visible (luckily not hidden in the noise background) despite the large numerical noise of float.

FFT 3 tones.JPG

Pito
Sat Apr 22, 2017 9:55 pm
I’ve put the asm enable code before the __libc_init_array(); in start_c.c and it works.
..
#if defined (__GNUC__) && defined (__VFP_FP__)
#warning enabling hardware fpu
__asm volatile
(
" ldr.w r0, =0xE000ED88 \n" /* The FPU enable bits are in the CPACR. */
" ldr r1, [r0] \n" /* read CPACR */
" orr r1, r1, #( 0xf << 20 )\n" /* Set bits 20-23 to enable CP10 and CP11 coprocessors */
" str r1, [r0] \n" /* Write back the modified value to the CPACR */
" dsb \n" /* wait for store to complete */
" isb" /* reset pipeline now the FPU is enabled */
);
#else
#warning using software floating point
#endif

/* Run initializers. */
__libc_init_array();
..


ag123
Sun Apr 23, 2017 7:06 am
i found that the particular define __ARM_PCS_VFP only shows up if -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 is defined
apparently this particular define won’t show up with -mthumb -mcpu=cortex-m4 -mfloat-abi=softfp -mfpu=fpv4-sp-d16 (i.e. software floats)
or if it is simply left out; i’d assume eclipse, or rather gcc, simply uses software floating point if nothing is specified

Pito
Sun Apr 23, 2017 11:05 am
“softfp” does not mean software floats.
It means the FPU is used, but parameters are passed to it in the standard core registers. It is for compatibility reasons.
Both -mfloat-abi=hard and -mfloat-abi=softfp use the FPU.
It may happen that -softfp with the FPU off falls back to software math routines; not verified yet.

victor_pv
Sun Apr 23, 2017 2:09 pm
Not sure if I have missed this in one of the posts, but I could not find it. Using the command from Edogaldo, and confirmed checking some threads in google, it seems like the defines for when using FPU or not go like this:

No hardware fpu used:
__SOFTFP__ 1
__ARM_PCS 1


ag123
Sun Apr 23, 2017 3:09 pm
i did a little google as well and found the definitions here and quoted some definitions
http://infocenter.arm.com/help/topic/co … 21827.html
__ARM_PCS 1 Set for 32-bit targets only. Set to 1 if the default procedure calling standard for the translation unit conforms to the base PCS.
__ARM_PCS_VFP 1 Set for 32-bit targets only. Set to 1 if the default procedure calling standard for the translation unit conforms to the VFP PCS. That is, -mfloat-abi=hard.
__SOFTFP__ 1 Set to 1 when compiling with software floating-point on 32-bit targets. Set to 0 otherwise.
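As a sketch, the macros quoted above can drive a compile-time ABI check much like the enablefpu() guard discussed earlier. The function name and the returned strings here are illustrative assumptions; on a desktop host this reports the base/non-ARM path:

```c
#include <assert.h>

/* report which float ABI the translation unit was compiled for,
   based on the ACLE-defined macros quoted above */
static const char *float_abi(void)
{
#if defined(__ARM_PCS_VFP)
    return "hard (VFP PCS, -mfloat-abi=hard)";
#elif defined(__SOFTFP__) && __SOFTFP__
    return "software floating point";
#else
    return "base PCS or non-ARM target";
#endif
}
```

Note that per the definitions above, __ARM_PCS alone does not imply soft float, so only __ARM_PCS_VFP is a safe trigger for enabling the FPU.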

ag123
Sun Apr 23, 2017 3:16 pm
and here are some stories about fpu-ing, or rather floating point errors that caused some bigger problems; one of them basically says that if you are building a mil $ flying machine and accuracy matters, it may be safer to use doubles and get a cpu/mcu with an fpu that does it :lol:
http://ta.twi.tudelft.nl/users/vuik/wi2 … sters.html

Pito
Sun Apr 23, 2017 6:04 pm
FYI – I’ve replaced the code for printing the floats (we use it in the Integration Benchmark – fixed), as the code produced an error in another calculation I did recently: instead of getting 100.0 I got only 10.0..

victor_pv
Sun Apr 23, 2017 6:33 pm
ag123 wrote:
#if defined(__GNUC__) && (__ARM_PCS==1 || __ARM_PCS_VFP==1)
#warning enabling hardware fpu
enable_fpu();
#endif
...

ag123
Sun Apr 23, 2017 6:43 pm
well according to this
http://support.raisonance.com/content/h … -cortex-m4
The GCC toolchain offers 3 methods to handle floating-point on STM32F4xx. The correct method depends on your application and its requirements:
Emulated float (-mfloat-abi=soft option) which disables the FPU
Hardware FPU (-mfloat-abi=softfp option) with standard EABI parameter passing
Hardware FPU (-mfloat-abi=hard option) with specific parameter passing

the thing i’m not sure about is: if none of those flags are specified and there are float (single precision) calculations in the code, which option does gcc use as the default? if the default is emulated float, i.e. skipping the fpu altogether, we are probably safe (i.e. we won’t have the __ARM_PCS==1 || __ARM_PCS_VFP==1 flags), as either of them being there means hooking up the fpu, or a hard fault would occur since the call would simply fall flat (fpu disabled)

if the default is -mfloat-abi=soft (i.e. software float emulation), adding hardware float would then simply be selecting a ‘variant’, in which case we can put those -mfloat-abi=softfp or -mfloat-abi=hard options in the particular boards.txt variant section

signing off for now ;)

