I’m trying to make my code as fast as possible on my newly purchased Blue Pill.
I was wondering . . .
Since the CPU is a 32-bit device, is there any particular speed advantage (or other possible advantage) to using the ‘short’ and ‘uint16_t’ data types, which use two bytes instead of four, please? Would maths with these smaller data types be any faster on these 32-bit devices?
Thank you
Ken
P.S. I did a Google search and also searched this forum, but couldn’t find an answer.
My guess would be that you would not speed things up, but may save some RAM.
There is an optimise option menu in my Arduino-STM32 core, which would have more effect.
The default optimisation is for “size” (-Os), but this is probably the slowest setting.
Try the -O2 or -O3 settings, with or without LTO (link time optimisation), and you will find that -O3 is probably the fastest.
I find LTO can sometimes make things worse.
Note we have warnings when using -O3, but the core normally still works OK.
I just ran a few tests, changing some ‘shorts’ to ‘ints’.
I found no speed advantage to using the smaller data types.
In fact, my code ran at pretty much identical speed.
So no advantage at all (in my case).
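A rough sketch of why the result above is plausible: on a 32-bit ARM core the arithmetic is done in 32-bit registers, so a 16-bit result typically needs an extra narrowing step to wrap at 65536, while the RAM footprint of a buffer does halve. The function and buffer names below are made up for illustration and are not from anyone's actual code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sample buffers, only here to compare RAM footprint.
uint16_t samples16[256];  // 512 bytes
uint32_t samples32[256];  // 1024 bytes

// 16-bit accumulator: the sum must wrap modulo 65536, so the compiler
// has to narrow the 32-bit register result back to 16 bits each time.
uint16_t sum16(const uint16_t *data, size_t n) {
    uint16_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = static_cast<uint16_t>(acc + data[i]);
    }
    return acc;
}

// 32-bit accumulator: matches the native register width, no narrowing.
uint32_t sum32(const uint16_t *data, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}
```

So the smaller types mainly buy RAM, not speed. If you want "at least 16 bits, whatever is fastest", the standard header also offers uint_fast16_t.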
I did, however, shave a good ten milliseconds off my code by changing floats to ints and rejigging the maths to work in integers with, essentially, two decimal places of accuracy.
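One way to read the float-to-int change above is fixed-point arithmetic with two decimal places. A minimal sketch of that idea follows, assuming values are stored as integer hundredths. The names (kScale, to_fixed, fixed_mul) are hypothetical, and the 32-bit multiply assumes the inputs are small enough that the product does not overflow.

```cpp
#include <cstdint>

// Assumption: values are held as integer hundredths, so 12.34 is 1234.
constexpr int32_t kScale = 100;

// Build a fixed-point value from whole and hundredths parts
// (non-negative inputs, hundredths in the range 0..99).
int32_t to_fixed(int32_t whole, int32_t hundredths) {
    return whole * kScale + hundredths;
}

// Multiply two fixed-point values. The raw product carries the scale
// factor twice, so divide once to get back to hundredths.
// Assumption: a * b fits in 32 bits. The result truncates toward zero.
int32_t fixed_mul(int32_t a, int32_t b) {
    return (a * b) / kScale;
}
```

For example, 12.34 * 2.50 becomes fixed_mul(1234, 250), which gives 3085, i.e. 30.85. Addition and subtraction need no rescaling at all.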
Where would I be able to apply these optimisations that you speak of, please?
Are these in the Arduino IDE somewhere? (I assume not, since I looked and didn’t see anything applicable)
Ken
The F4 has an FPU, so the speed would be almost identical for both int and float.
Try the -O3 optimisation setting; you will probably notice a decent increase in speed.
Personally, I would prefer if the default was -O2 but the view of the community is that they prefer the optimise for size option, which is the slowest.
Apparently, I wasn’t using your core (or your latest core).
I downloaded the files and placed them in the right place, restarted the Arduino IDE and now I see ‘Optimise’ in the tools drop-down menu.
Great!
I’ll quickly have a tinker (and time it), and report back.
Ken
Smallest (standard): 30.2 microseconds
-O1 setting: 31.4 microseconds (slower!)
-O2 setting: 29.5 microseconds
-O3 setting: 29.6 microseconds
-O3 with LTO: 25.8 microseconds (Sweet!!!!)
Thanks for this – It may not seem much of a difference but I may be adding some more maths in there, and like Tesco’s motto ‘every little helps’.
Ken
Some things are a lot faster with higher optimisation settings, but your code could potentially already be quite well optimised.
In terms of using 16 bit data types to speed things up, the easiest thing to do is look for where you copy data – moving 32 bits at a time is faster than 2×16 bits. Look at the peripheral documentation to see if the peripherals have modes that can help (e.g. dual mode for ADC transfers).
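As a sketch of the copying point, here is one way to move two 16-bit samples per 32-bit transfer. copy_samples is a hypothetical helper, not something from the core. The fixed-size memcpy calls are used to stay within the language rules on alignment and aliasing, and an optimising compiler will usually turn each into a single 32-bit load or store, though that is worth checking in the generated assembly.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy `count` 16-bit samples, moving two samples per 32-bit transfer.
void copy_samples(uint16_t *dst, const uint16_t *src, size_t count) {
    size_t i = 0;
    for (; i + 1 < count; i += 2) {
        uint32_t pair;
        std::memcpy(&pair, &src[i], sizeof pair);  // load two samples
        std::memcpy(&dst[i], &pair, sizeof pair);  // store two samples
    }
    if (i < count) {
        dst[i] = src[i];  // odd sample left over at the end
    }
}
```

In practice a single memcpy over the whole buffer may do just as well, since the library version is normally word-optimised already.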
Beyond that, look to see if you can process two things at once: <32bit value> & 0xfff0fff0 is faster than two lots of <16bit value> & 0xfff0.
If you know the range of values you will be using won’t cause problems, you can sometimes get away with treating a 32 bit value as two 16 bit ones, e.g. if your input data is guaranteed to be 14 bits unsigned, you can multiply two values by 2 at a time using <32bit value> << 1.
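A minimal sketch of the two-at-once idea, with two 16-bit values packed into one 32-bit word. The helper names (pack2, high_half, low_half, mask_pair, double_pair) are invented for illustration. The shift version is only safe under the range assumption stated in the comment.

```cpp
#include <cstdint>

// Pack two 16-bit values into one 32-bit word, `hi` in the top half.
uint32_t pack2(uint16_t hi, uint16_t lo) {
    return (static_cast<uint32_t>(hi) << 16) | lo;
}

uint16_t high_half(uint32_t w) { return static_cast<uint16_t>(w >> 16); }
uint16_t low_half(uint32_t w)  { return static_cast<uint16_t>(w & 0xFFFFu); }

// Apply the 0xfff0 mask to both halves with a single AND.
uint32_t mask_pair(uint32_t w) {
    return w & 0xFFF0FFF0u;
}

// Double both halves with a single shift.
// Assumption: each half is at most 14 bits unsigned (below 0x4000), so
// the low half cannot carry into the high half and nothing overflows.
uint32_t double_pair(uint32_t w) {
    return w << 1;
}
```

For instance, packing 0x1234 and 0x3FFF and calling double_pair gives halves of 0x2468 and 0x7FFE. If a value could reach 0x8000 or more, the low half's top bit would spill into the high half and corrupt it.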
You could post your code and ask people for suggestions to speed it up.
BTW, I presume you have already inlined the functions and looked at unrolling loops, branch optimisation, use of lookup tables, etc.?
It’s pretty much optimised but it could do with expert eyes.
We’ll see what this weekend brings.
Ken
Someone may spot an optimisation…
