Code Size compared to Teensy

sompost
Wed Apr 25, 2018 10:04 am
Hi all.

I’ve got a sketch that when compiled for the Teensy 3.2 (Cortex M4) uses 38020 bytes (of a total of 256k) of program store.
The same on a STM32 “blue pill” uses 52676 (max is 64k). I’m using the “faster” option (-O2) for both.

There appears to be a lot of unused (?) stuff in the binary on both devices, but the STM32 has at least 100 more entries in the symbol table.
Also, at least one unused constant array that is gc-ed away on the Teensy appears to be present on the blue pill.

I’m aware that the run-time systems may not be the same for both devices, but still that’s quite a difference.

Have you had similar experiences porting Teensy stuff onto blue pills?

Thanks, Ralph


fpiSTM
Wed Apr 25, 2018 12:15 pm
Which cores did you compare?

I guess the main differences are the compiler options.
For example, Teensy does not enable “-g” for O2, while in Roger’s core the -g option is enabled for all optimization levels. Try removing -g in platform.txt.


sompost
Thu Apr 26, 2018 8:07 am
Thanks for your suggestion.

Removing the -g option didn’t help, though, and updating to the latest Arduino_STM32 added another 102 bytes to the binary :lol:

I saw a discussion on one of the gcc fora about a similar situation: an unused function being removed (as it should) but the data it used to use (now unused) was left behind. They were talking about version 4.9 or something of the compiler. I see that Arduino for STM32 uses version 4.8.3 of the compiler whereas Teensy appears to use version 5.4.1. Perhaps that explains (some of) the discrepancy.

In any case, I’ll have to trim the code manually then. Preprocessor to the rescue!

Thanks again, Ralph

Afterthought: I still think that there is a lot of stuff in the runtime that shouldn’t be there (malloc/free,HardwareSerial, …) unless the programmer explicitly uses/instantiates any of it. We don’t really have any memory to spare for things we don’t use (Teensy is not much better in that regard, mind you. I compared the symbol table line by line).


stevestrong
Thu Apr 26, 2018 9:23 am
There were several discussions regarding the issue of large bin size.
A simple solution would be to add
-specs=nosys.specs -specs=nano.specs -u _printf_float

sompost
Thu Apr 26, 2018 9:51 am
[stevestrong – Thu Apr 26, 2018 9:23 am] –
There were several discussions regarding the issue of large bin size.

I know. The ones I’ve found, I read ;)

In one of those threads somebody mentioned (and it appears to be the consensus here) that most of the blue pills have actually 128k of flash mem.
So I decided to stop worrying –for the time being– and see if my blue pill explodes.

Thanks, Ralph


stevestrong
Thu Apr 26, 2018 11:38 am
The question is whether this extra compiler switch reduced the code size, so that it became comparable with that of the Teensy.

sompost
Thu Apr 26, 2018 12:48 pm
[stevestrong – Thu Apr 26, 2018 11:38 am] –
The question is whether this extra compiler switch reduced the code size, so that it became comparable with that of the Teensy.

…and that is a very good question! 8-)

It didn’t. It reduced RAM but increased flash usage:

compiler.c.elf.extra_flags="-L{build.variant.path}/ld" -specs=nosys.specs -specs=nano.specs -u _printf_float


stevestrong
Thu Apr 26, 2018 12:52 pm
You can try one more thing: if you don’t need to sprintf float variables, you can leave out the switch
-u _printf_float

Slammer
Thu Apr 26, 2018 12:54 pm
Did you try with the STM32 core? It uses a much newer toolchain. The difference is probably caused by the compiler/linker; I can’t see another reason for such a big difference (note that the USB serial code, if enabled, is about 6-7KB of code on BluePill).

victor_pv
Thu Apr 26, 2018 1:14 pm
And libmaple may compile with the same GCC version. I would test that in case it works; that way it will be a closer comparison.
Also a good idea is to use the map file analyzer from Danieleff. It’s linked on a post in this thread:
viewtopic.php?f=28&t=1596&start=10

That will help you see where the biggest chunks of flash are going.


Rick Kimball
Thu Apr 26, 2018 3:51 pm
[sompost – Thu Apr 26, 2018 12:48 pm] –
It was really just an experiment to see if the same code running on a $20 Teensy would also run on a $2 blue pill. Being that close to the 64k “limit”, was a bummer, since I haven’t even started adding code for MIDI, SD card, display, external DAC….

Stop thinking you only have 64K. I’ve yet to find a bluepill that doesn’t have the full 128K that is probably there. Just flip the menu to use 128 and stop worrying about the code size.


sompost
Sun Apr 29, 2018 9:20 am
[Rick Kimball – Thu Apr 26, 2018 3:51 pm] –
Stop thinking you only have 64K. I’ve yet to find a bluepill that doesn’t have the full 128K that is probably there. Just flip the menu to use 128 and stop worrying about the code size.

I know, but I can’t stop worrying. I was, after all, a disciple of Prof. Wirth (of Pascal fame) and as such I learnt counting bytes. :mrgreen:

In any case, it appears that the “Teensy-compiler” is able to deduce that an overridden method is not called anymore, while the “STM32-compiler” apparently can’t and therefore leaves it around (together with everything it uses).

I added a new base class with the corresponding function being pure virtual (so an empty function is overridden). And for good measure I also added the -specs=nano.specs flag.

Here’s what I got. Take that, Teensy!

Sketch uses 39012 bytes (59%) of program storage space. Maximum is 65536 bytes.
Global variables use 12888 bytes (62%) of dynamic memory, leaving 7592 bytes for local variables. Maximum is 20480 bytes.

Is that cool or is that cool?

Thanks to everybody, Ralph


edogaldo
Sun Apr 29, 2018 10:28 am
Just want to highlight that I had problems using the automatic serial usb restart using the nano.specs: http://stm32duino.com/viewtopic.php?f=3 … 885#p27885

Rick Kimball
Sun Apr 29, 2018 3:44 pm
[sompost – Sun Apr 29, 2018 9:20 am] –
I added a new base class with the corresponding function being pure virtual (so an empty function is overridden). And for good measure I also added the -specs=nano.specs flag.

If you post the code I’m sure we could figure out what you are doing wrong. Without seeing the code, we can only guess what is upsetting the compiler. BTW, the compiler for Teensy (assuming it is an arm mcu) and the compiler for an STM32 are the same. The difference is in the c/c++ code and compiler flags that are used for each core. Also, I’m assuming you are using the Arduino IDE and not some other IDE like Sloeber, PlatformIO, or VSCode that affect the code size.

As we have discovered before, there can be any number of reasons why the code size explodes. Class static variables are a common problem. I proposed a solution in this thread viewtopic.php?f=3&t=1904#p25257 by using -fno-threadsafe-statics, however that solution was rejected by those who use FreeRTOS. https://github.com/rogerclarkmelbourne/ … -319511181 . Without seeing your code, I can only guess if this is something you are using.

There is always a reason why your code gains bloat. Deciding how to solve it and make the rest of the stm32duino community happy is another story.
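
To illustrate the static-variable point above, here is a minimal sketch. It assumes the statics in question are function-local statics with a non-trivial constructor, which is the case -fno-threadsafe-statics affects. The Envelope class and defaultEnvelope() are hypothetical names, not taken from anyone's code in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical class with a non-trivial constructor, so the static object
// below cannot be initialized at compile time.
class Envelope {
public:
    explicit Envelope(uint32_t attackMs) : attack_(attackMs) { ++constructed; }
    uint32_t attack() const { return attack_; }
    static int constructed;  // counts constructor calls, for illustration only
private:
    uint32_t attack_;
};

int Envelope::constructed = 0;

// Function-local static: constructed on the first call only.
// By default GCC protects that first call with a guard variable and calls to
// __cxa_guard_acquire / __cxa_guard_release. With -fno-threadsafe-statics the
// locking calls are omitted and only a simple "already initialized" check remains.
Envelope& defaultEnvelope() {
    static Envelope env(10);
    return env;
}
```

Whether this pattern accounts for any of the extra flash in a given sketch can only be confirmed from the map file or the symbol table.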


RogerClark
Sun Apr 29, 2018 9:17 pm
Unfortunately we have to default to maximum compatibility, and this does not result in either the fastest code or the smallest code.

This is due both to the code in the Core and to the compiler switches.

But like Rick says, it’s basically impossible to have an optimal solution for everyone’s individual requirements, and since it’s open source, everyone is free to make whatever individual changes they feel appropriate.


ahull
Mon Apr 30, 2018 5:19 am
I guess what this all boils down to is that if you want optimal code size, then it becomes necessary to investigate and understand some of the compiler switches and their likely effects.

One other point, smaller code may come at the expense of other things, for example speed, timing accuracy or even reliability. Of course the opposite is also true, smaller code might actually turn out to be faster, tighter and more reliable. Your mileage may vary.

If you want extremely optimised code, then you enter the rabbit hole of reading the assembler listings, unrolling loops, and coding critical parts by hand in assembler. That way, as I can testify from personal experience, madness lies ;)


bootsector
Mon Apr 30, 2018 8:57 am
If you are looking for smallest code, take a look at the following GCC flags:

-ffunction-sections
-fdata-sections
-Os

and this LD flag:

-Wl,--gc-sections


sompost
Mon Apr 30, 2018 2:55 pm
Actually, I need fast code more than I need small code, at least for the core of the algorithm. :mrgreen:

It is a VSTi-software synthesizer that I ported to the Teensy, because why not? After I found out about the “blue pill” (that cost one tenth of a Teensy) I wanted to see if it would run the same code. After all, it was the same-ish core.

I saw that the synth had large wave tables that used a lot of RAM despite being essentially constant (computed at load time). So I pre-computed those tables and #included them as const int to be put in flash by the compiler. But then I noticed that flash usage didn’t increase by the same amount that the ram usage decreased. When analyzing the disassembly I found that the compiler apparently discarded the table altogether because nothing used it anymore (computing the table was its only use). The only other “uses” were (1) a class member function that was overridden and (2) a function that I didn’t call.

Just to be clear: I neither expected nor required the table to disappear. The Teensy has enough flash/ram. Saving those 8kBytes was just a welcome side effect — in addition to being a diabolically clever optimization. However, on a blue pill, saving those 8k would be very welcome, indeed.

So to summarize, the only thing that bugged me, was that the compiler for the STM32 didn’t also discard the table. I don’t think that there are differences in compiler flags that could account for that different behaviour. I didn’t check each and every flag, but the ones that would eliminate unused code/data are there in both. However, Teensy and STM use different compiler versions that might explain it (The Teensy uses version 5.4.1 of the compiler).

In any case, despite my reluctance to changing code that I didn’t write, I think that my modification (adding an abstract base type to an object) was acceptable. Separating an object’s type from its implementation is always good. So they say. Somewhere. I think.
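
A rough sketch of the kind of change described above, with made-up names (Voice, TableVoice, SquareVoice, kWaveTable), since the actual synth code was not posted. The idea is that the abstract type has no function body that references the table, so only a concrete implementation that is actually instantiated pulls the table in.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the pre-computed wave table (const, so it can live in flash).
static const int16_t kWaveTable[4] = {0, 1000, 0, -1000};

// Abstract type: declares the interface only, with no body that touches the table.
class Voice {
public:
    virtual ~Voice() = default;
    virtual int16_t sample(uint32_t phase) const = 0;
};

// Implementation that reads the table. If nothing ever instantiates it,
// it and the table become candidates for removal by the linker.
class TableVoice : public Voice {
public:
    int16_t sample(uint32_t phase) const override { return kWaveTable[phase & 3u]; }
};

// Implementation that never references the table.
class SquareVoice : public Voice {
public:
    int16_t sample(uint32_t phase) const override { return (phase & 2u) ? -1000 : 1000; }
};

// Callers depend only on the abstract type.
int16_t render(const Voice& v, uint32_t phase) { return v.sample(phase); }
```

Whether a particular GCC version then discards the unused implementation still depends on the flags and the toolchain, so checking the map file afterwards is worthwhile.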

Thanks to everybody, and I hope you don’t hate me 8-)

Cheers, Ralph


Rick Kimball
Mon Apr 30, 2018 4:20 pm
[sompost – Mon Apr 30, 2018 2:55 pm] –
Actually, I need fast code more than I need small code, at least for the core of the algorithm. :mrgreen:

Wrap this around the functions you think need to be faster:

#pragma GCC push_options
#pragma GCC optimize ("O3")

// the code in need of speed optimization

#pragma GCC pop_options


RogerClark
Mon Apr 30, 2018 9:46 pm
There are a load of optimisations available from the menu, including -O3 and linker (but I didn’t find the linker options very useful, or even that they always resulted in code that would run).

The default is -Os , for smallest binary, however when I read the Gcc docs, they seemed to imply that -Os is like -O2 but with a few preferences biased towards small code size.

So I don’t think -O2 is much faster than -Os, but it will depend a lot on what your code does.

I’ve used -O3 quite a lot and not had any problems with it.

Btw. Putting const arrays in flash will be slower than having them in RAM because of the wait states required when the ARM core reads from flash.
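
If flash wait states turn out to matter for a hot lookup table, one possible workaround is to keep the const table in flash and copy it into RAM once at startup. The sketch below is only an illustration: the 8-entry table and the names kSineFlash, sineRam, loadTables and sineLookup are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 8-entry table; a const array at namespace scope is normally
// placed in flash (.rodata) on these parts.
static const int16_t kSineFlash[8] = {0, 707, 1000, 707, 0, -707, -1000, -707};

// Working copy in RAM, filled once at startup.
static int16_t sineRam[8];

// Call once, e.g. from setup(), before the audio loop starts.
void loadTables() {
    std::memcpy(sineRam, kSineFlash, sizeof(kSineFlash));
}

// Hot-path lookup reads RAM only; the mask wraps the phase to the table length.
inline int16_t sineLookup(uint32_t phase) {
    return sineRam[phase & 7u];
}
```

The trade-off is that the table then occupies both flash and RAM, and RAM is the scarcer resource here (20480 bytes according to the build output earlier in the thread), so this only makes sense for small tables that are read very often.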


fpiSTM
Mon Apr 30, 2018 10:13 pm
The ARM GCC version is also important.
If you really want to compare, use the same version, or at least the same major version (5.x.x).
