F4 – Update boards.txt

RogerClark
Mon Jul 03, 2017 6:58 am
https://github.com/rogerclarkmelbourne/ … 2/pull/238

Changes to max sizes etc. in boards.txt on F4 boards. Probably OK, but I have not had time to check.


stevestrong
Mon Jul 03, 2017 7:19 pm
It seems to work fine:
Sketch uses 21,548 bytes (4%) of program storage space. Maximum is 514,288 bytes.
Global variables use 12,064 bytes (6%) of dynamic memory, leaving 184,544 bytes for local variables. Maximum is 196,608 bytes.

Pito
Mon Jul 03, 2017 7:25 pm
Is that just info taken from boards.txt, or can you really allocate 192kB of RAM?

stevestrong
Mon Jul 03, 2017 8:00 pm
Just info taken from boards.txt.
Usable (theoretically) 128kB normal RAM + 64kB CCMRAM.

zmemw16
Mon Jul 03, 2017 8:05 pm
I’m pretty sure it’s split up: main RAM is 128k and CCM RAM is 64k. At least one chip has a 4k chunk out of 196k reserved as well.

From my collection of assorted F4 linker files, grepping with grep -i ccmram | grep -i origin:

Debug_STM32F407VG_FLASH.ld: CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
Debug_STM32F407VG_RAM.ld: CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
STM32F407VETx_FLASH.ld:CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
STM32F407VG_FLASH.ld:CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
STM32F407VGTx_FLASH.ld:CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
STM32F407ZGTx_FLASH.ld:CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
STM32F417ZGTx_FLASH.ld:CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
STM32F429ZI_FLASH.ld:CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
STM32F429ZITx_FLASH.ld:CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
stm32f4_flash.ld: CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
stm32f4zgt.ld:CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K


victor_pv
Mon Jul 03, 2017 9:08 pm
Most of our linker scripts do not have the CCM declared. I have done it for a couple of boards just for testing, but you cannot use it all as a single block since the addresses don’t line up, and on top of that CCM in the F4 can’t be used for code or DMA, while it can be used for code in the F3…
But the RAM is there. Since those changes only affect how the IDE reports sizes, and do not cause any change to the linker scripts, I think it’s OK to keep them like that.
If we decide to add the CCM to the linker scripts in the future, then we can actually use that RAM.

Another option is to change those values to what we currently have in the linker scripts (like 112KB for some F4 and so on) and leave the CCM amount out of the total, since we currently don’t use it.

BTW if anyone is interested in using the CCM I modified the libmaple linker scripts for both the F3 and the F4 for testing and can pass them on.
I think I once placed the F4 USB buffers in CCM, since the USB doesn’t use DMA, it’s a good way to get a large buffer with 0 cost in RAM, and faster speed due to not being shared.


Pito
Mon Jul 03, 2017 9:09 pm
If there is no relevant change in the linker file such that it really allocates 192kB, I would stay with 128kB in the info taken from boards.txt.

zmemw16
Mon Jul 03, 2017 11:17 pm
steve’s post shows way more than 128k available.
do you mean set it to an appropriate ‘memory directly accessible’ value in boards.txt?
stephen

victor_pv
Mon Jul 03, 2017 11:55 pm
The memory set in boards.txt is the one the IDE uses to show how much RAM you have total and left, but is not used by the linker script to allocate memory. What Steve’s post shows is what the IDE reads from boards.txt.
You could set boards to 512MB of RAM, and the linker would still not care about that and use what’s in the linker scripts.

Currently the CCM memory is not used by the linker scripts in any way, so although it is present, it is not usable. There are several changes needed to the linker script and the startup code to make it usable, but even then there are several caveats, that’s why most people won’t use it.

I kind of agree with Pito that unless we decide to go ahead with changes to the scripts and startup code to use it, it would be better to report only the main RAM in boards.txt.
The side effect of reporting 192KB while the linker allocates only 128KB is that the IDE may tell you that you allocated 128KB of RAM and still have 64KB available for local variables (and stack), while in reality your code will crash because there is no free space for even the stack.
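
A minimal sketch of that mismatch, using the sizes quoted in this thread (128KB main RAM, 64KB CCM). The function names are made up for illustration and are not part of the core or the IDE; the only point is that the IDE subtracts from the boards.txt figure while the linker works from its own script.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes discussed in the thread for F407-class parts. */
#define MAIN_RAM_BYTES (128u * 1024u)  /* normal SRAM, known to the linker script */
#define CCM_RAM_BYTES  (64u * 1024u)   /* CCM, not declared in the linker scripts here */

/* What the IDE prints: the boards.txt maximum minus static usage.
   (Hypothetical helper, named for this example only.) */
static uint32_t ide_reported_free(uint32_t boards_txt_max, uint32_t globals)
{
    assert(globals <= boards_txt_max);
    return boards_txt_max - globals;
}

/* What is really left for heap and stack: the RAM length in the
   linker script minus static usage. (Hypothetical helper.) */
static uint32_t linker_actual_free(uint32_t linker_ram, uint32_t globals)
{
    assert(globals <= linker_ram);
    return linker_ram - globals;
}
```

With 12,064 bytes of globals the first helper gives the 184,544 bytes shown in Steve's output, while the second gives 119,008. With 128KB of globals the IDE figure is still 64KB but the real figure is zero, which is the crash scenario described above.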


stevestrong
Tue Jul 04, 2017 5:09 am
[Pito – Mon Jul 03, 2017 9:09 pm] –
If there is no relevant change in the linker file such that it really allocates 192kB, I would stay with 128kB in the info taken from boards.txt.

Me too.
But let’s include this feature, it is nice to have it.


Pito
Tue Jul 04, 2017 8:48 am
Long time back I messed with eLua, and I succeeded (with help of the community experts) to allocate full 192kB for the eLua on F4. It was done by an “allocator=multiple” directive and maybe 2 lines in a script defining the memory setup. And it worked. Being not a talented programmer I have to dig into old topics to find out how it worked..

PS: http://elua-development.2368040.n2.nabb … 82063.html
PPS: eLua uses scons instead of make..

allocator = newlib | multiple | simple: choose between the default newlib allocator (newlib) which is an older version of dlmalloc, the multiple memory spaces allocator (multiple) which is a newer version of dlmalloc that can handle multiple memory spaces, and a very simple memory allocator (simple) that is slow and doesn’t handle fragmentation very well, but it requires very few resources (Flash/RAM). You should use the multiple allocator only if you need to support multiple memory spaces (for example boards that have external RAM). You should use simple only on very resource-constrained systems.

So mastering the multiple fragmented SRAM spaces would be a great achievement here. Consider that the upcoming STM32H7 has maybe 6 (update: 128+64+512+288+64+4) internal scattered SRAM spaces, plus an external SRAM/SDRAM space as well :)
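
To illustrate what "multiple memory spaces" means for an allocator, here is a very small sketch. It is not dlmalloc and not what eLua does internally; it is a plain bump allocator (no free) over a table of regions, and `region_t` and `multi_alloc` are names invented for this example. It assumes each region base is suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous memory space, e.g. main SRAM or CCM. */
typedef struct {
    uint8_t *base;   /* first byte of the region */
    size_t   size;   /* region length in bytes */
    size_t   used;   /* bytes handed out so far */
} region_t;

/* Bump-allocate from the first region with enough room left.
   Returns NULL when no single region can hold the request:
   a block never spans two regions, because their addresses
   are not contiguous. */
static void *multi_alloc(region_t *regions, size_t count, size_t bytes)
{
    assert(regions != NULL);

    /* Round up to 8 bytes so later blocks keep their alignment. */
    size_t need = (bytes + 7u) & ~(size_t)7u;
    if (need < bytes) {
        return NULL;  /* size overflowed while rounding */
    }

    for (size_t i = 0; i < count; i++) {
        if (regions[i].size - regions[i].used >= need) {
            void *p = regions[i].base + regions[i].used;
            regions[i].used += need;
            return p;
        }
    }
    return NULL;
}
```

On an F4 the table would hold two entries (main RAM and CCM). The total may add up to 192kB, but the largest single allocation is still limited by the largest region.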


victor_pv
Tue Jul 04, 2017 3:45 pm
I know how to place stack, heap, or anything else we want in CCM RAM. I do not know how to make the linker place anything in one or the other on its own. I checked that thread, but seems that allocator is a function, not something in the linker script, is that right?

Forcing the stack and heap there is not difficult. The problematic part is that if user code creates a buffer at runtime that ends up on the stack or heap and tries to use it for DMA, it will crash.
Other than that, it is great for those usages and leaves the main chunk of RAM for user global variables, buffers etc. But do we want that risk?
Perhaps we can use a compile option to select whether we want stack and heap, or whatever else, in CCM, so if we know we need to do DMA on a local variable or one allocated with malloc, we select the board option that avoids it?


Pito
Tue Jul 04, 2017 4:26 pm
but seems that allocator is a function, not something in the linker script, is that right?
eLua builds with scons; allocator=multiple is a scons parameter, and based on that it includes dlmalloc.c (a 130kB source file) in the build. The compiler itself was CodeSourcery. Frankly, the details are not known to me, as the eLua build is a fairly complex exercise.
I think the decision on DMA-accessible variables has to be left to the user (via some attributes), as creating a build system that considers all the MCU-related nuances would be quite an effort.

testato
Tue Jul 04, 2017 4:49 pm
I think that for now it is good to exclude the CCM RAM from the info shown at the end of compilation, so the user knows how much RAM the current version of the core can really manage.

If use of the CCM RAM is implemented in the future, I think it would be better to report it separately via boards.txt and not simply increase the value displayed, for example:
Sketch uses 21,548 bytes (4%) of program storage space. Maximum is 514,288 bytes.
Global variables use 12,064 bytes (9%) of dynamic memory
CCM Ram use xxx bytes. Maximum is xxx bytes


victor_pv
Sat Jul 29, 2017 12:26 pm
Adding to the CCM discussion in this thread:
In F4 MCUs we have 64KB of CCM memory.
There are 2 restrictions using CCM:

  • The CCM memory can be used for data only; no code can be run from it.
  • DMA controllers cannot access it.

The advantage is that CCM is accessed exclusively by the CPU on a separate bus, so it doesn’t share bandwidth with any other memory or peripheral.
In theory you could have fast DMA going on in the normal RAM, while the CPU runs from flash using CCM data with no penalty in CPU or DMA performance.

With that in mind, there are 3 possible uses for CCM:

  • Heap
  • Stack
  • Normal user variables.

As long as we don’t use those variables allocated in those blocks for DMA, all is good.
In the past as proof of concept I modified Steve’s F4 USB code to allocate its buffers in CCM. Was not much trouble, and gives the option to use large buffers without taking from normal RAM.
I also allocated the Heap and Stack to CCM, and that provided a small speed gain in one of the CPU benchmarks.
I have not tested like racemaniac did to push the CPU+DMA, but should allow more concurrent operations with no penalty.

Now, allocating all the normal data to CCM is very risky, because if a user allocates a buffer for DMA use there, the code will crash.
Heap and stack can be used for variables too, so it’s somewhat risky, but most people using DMA will use a globally allocated buffer (not always though).

With all those conditions in mind, I have been thinking that a good compromise on using CCM without causing much pain would be to allow it as a board option. Similar to how selecting between stlink upload and bootloader upload makes the linker use a different linker script with different addresses, we could add an option that uses a script allocating Stack and Heap to CCM.
We can also add a #define in the core, similar to how __FLASH__ is defined to allow the user to force a variable to flash (that’s not used often since the linker will place RO data in flash anyway, but it’s in the core):

#define __attr_flash __attribute__((section (".USER_FLASH")))
#define __FLASH__ __attr_flash


stevestrong
Sat Jul 29, 2017 3:02 pm
I would welcome heap and stack in CCM.
I would make it default.

Using large buffers for DMA on the stack does not make much sense to me. I don’t know any application doing this, and it seems to me a sub-optimal practice.

One special case would be writing the same data repeatedly with DMA (with no increment of the source pointer), but for that we could adapt the SPI DMA functions to copy the input data at pointer[0] of the passed buffer into a global buffer (in normal RAM).


victor_pv
Sat Jul 29, 2017 4:10 pm
That’s a good option, hopefully it will not cost many cycles. Something that compares the pointer address to the CCM range and, if it matches, does the copy.
Or perhaps add an assertion that fails at compile time if a CCM address is used?
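
A rough sketch of that runtime check, assuming the CCM window from the linker files quoted earlier (ORIGIN 0x10000000, LENGTH 64K). `addr_in_ccm`, `dma_safe_fixed_source` and `dma_bounce_byte` are hypothetical names, not functions from the core or its SPI library, and the sketch assumes ordinary globals stay in normal RAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CCM_BASE 0x10000000u     /* ORIGIN from the linker files above */
#define CCM_SIZE (64u * 1024u)   /* LENGTH = 64K */

/* True when addr lies inside the CCM window, which DMA cannot reach. */
static bool addr_in_ccm(uintptr_t addr)
{
    return addr >= CCM_BASE && addr < (CCM_BASE + CCM_SIZE);
}

/* Bounce variable. As a plain global it lives in normal RAM,
   provided .data/.bss are not themselves relocated to CCM. */
static volatile uint8_t dma_bounce_byte;

/* Return a DMA-safe source address for a fixed-source transfer
   (source pointer not incremented, so only src[0] matters). */
static uintptr_t dma_safe_fixed_source(const uint8_t *src)
{
    assert(src != NULL);
    if (!addr_in_ccm((uintptr_t)src)) {
        return (uintptr_t)src;       /* already reachable by DMA */
    }
    dma_bounce_byte = src[0];        /* one copy, a few cycles */
    return (uintptr_t)&dma_bounce_byte;
}
```

The cost is two compares per transfer. A compile-time assertion could only cover objects whose placement is known at link time; pointers into the stack or heap would most likely still need a runtime test like this one.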


RogerClark
Sat Jul 29, 2017 9:59 pm
Slightly off topic, but from @racemaniac’s DMA speed tests, I don’t think that this RAM being on a separate bus would make much difference to performance unless you are doing specific things, e.g. memory-to-memory DMA at the same time as DMA to SPI.

victor_pv
Sun Jul 30, 2017 3:03 am
It all depends: for some uses there may be a performance improvement, for others it may just be nice to use the extra 64KB.
What do you think about Steve’s suggestion to move heap and stack to CCM?


RogerClark
Sun Jul 30, 2017 3:39 am
What benefit is there to moving both Stack and Heap to CCM ?

Surely it would be better to only move either Stack or Heap, so that both get more space.


victor_pv
Tue Aug 08, 2017 3:56 am
I believe that since we normally don’t use “new” the heap doesn’t grow much, but my understanding of what the heap is used for is limited.
The stack grows down, but 64KB is plenty for both.
In my tests I moved the BSS area, PINMAP and the USB buffers at different times.

I just submitted a PR to Steve’s fork with the changes to relocate Heap, Stack and PinMap to CCMRAM, and add a new flag __CCMRAM__ so any variable can be moved to it.
https://github.com/stevstrong/Arduino_STM32/pull/5

I think the changes are mostly self explanatory, so it’s easy to modify to leave any of those out of CCMRAM, but provides an example covering those uses

BSS can be relocated successfully in my tests, but then any variable declared with default attributes (not forced to normal RAM) would cause a crash if used for DMA, since the DMA controllers can’t access CCMRAM.

I think the best use case is manually adding __CCMRAM__ to the declaration of any variable that the user knows will not be used for DMA and that needs frequent access, since CCMRAM can be faster than normal RAM due to only being accessed by the CPU.
PINMAP is just a good example due to being large and not used as a pointer for DMA.
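
A sketch of what such a flag and its use might look like, modelled on the __FLASH__ define quoted earlier in the thread. The actual define in the PR may differ: the section name `.ccmdata` is taken from the renaming described later in the thread, and `ramp_lut`, `lut_fill` and `lut_read` are invented for this example. Off-target the macro expands to nothing so the sketch still compiles on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed definition, by analogy with __FLASH__. Drop these lines
   in a core that already provides __CCMRAM__. */
#if defined(__arm__)
#define __attr_ccmram __attribute__((section(".ccmdata")))
#else
#define __attr_ccmram   /* no-op when not building for the ARM target */
#endif
#define __CCMRAM__ __attr_ccmram

#define LUT_LEN 256u

/* A table that is read often and never handed to a DMA controller. */
static uint16_t ramp_lut[LUT_LEN] __CCMRAM__;

/* Fill at run time: whether the startup code zeroes or initialises
   the CCM section depends on the linker script and startup changes. */
static void lut_fill(void)
{
    for (size_t i = 0; i < LUT_LEN; i++) {
        ramp_lut[i] = (uint16_t)(i * 16u);
    }
}

static uint16_t lut_read(size_t idx)
{
    assert(idx < LUT_LEN);
    return ramp_lut[idx];
}
```

Anything that might be passed to SPI or another DMA-driven peripheral should be left without the attribute so it stays in normal RAM.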


stevestrong
Tue Aug 08, 2017 11:21 am
I thought PIN_MAP is in flash and will be kept there.
I personally don’t see any gain in placing it in CCMRAM, except saving some CPU cycles, which is not so critical for the F4 as it runs at 168MHz anyway…

victor_pv
Tue Aug 08, 2017 1:45 pm
In the version of the core where I used CCMRAM, PIN_MAP was still in RAM, that’s why I tested moving it to CCMRAM. Agreed that flash is a better place for it. But it gives an example of what’s needed for moving something to CCM.
With the same attribute I moved the USB buffers, and pretty much anything that will not be used for DMA and needs high performance can be moved there.


stevestrong
Tue Aug 08, 2017 3:44 pm
Yeah, maybe you should update your local copy of the repo, some major improvements were committed lately.

This is just for the record: heap and stack, in general


victor_pv
Tue Aug 08, 2017 4:27 pm
Steve, I just commented in the PR.
The missing /* is corrected now. As for the __ccm_end__ label, perhaps we should change it to something else. It was used the right way, but the meaning is not what you thought: it indicates the end of the variables allocated to CCMRAM, so the heap can start after the last variable allocated there.

I have a good understanding of how the stack works, but as for when the heap is used, I have seen conflicting information on how it’s used in embedded systems, other than when using malloc and new. I’ll read through your link to see if it adds some clarity.


stevestrong
Tue Aug 08, 2017 4:48 pm
It seems that I misunderstood the functionality due to the naming used.
Here is what I think I understood:

CCM starts at 0x10000000. This should be called __ccmram_start__.
CCM ends 64kB after __ccmram_start__. This should be called __ccmram_end__. This is the value the stack pointer should be initialized to, right? In this case the __msp_init calculation does not make much sense to me, and it should be fixed to __ccmram_end__.

Going back to CCM start, some variables are allocated, let’s say till 0x10001234. This is then the start of the heap, right? If so then it should be called __ccmram__heap_start__. Then the heap can go up to __ccmram_end__?

Stack: starts at __ccmram_end__ and grows down, theoretically as far as __ccmram__heap_start__, right?

I know, the heap and stack may overlap under these conditions, but 64kB is a lot of space for heap and stack together.
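
The layout described in this post can be checked with a little arithmetic. This is only a sketch using the example numbers above (start 0x10000000, 64kB long, variables up to 0x10001234); the helper names are invented, and the real linker script may compute its stack pointer value differently.

```c
#include <assert.h>
#include <stdint.h>

#define CCM_ORIGIN 0x10000000u     /* start of CCM */
#define CCM_LENGTH (64u * 1024u)   /* 64kB */

/* End of CCM = ORIGIN + LENGTH. Since the stack grows down, this is
   the value the stack pointer would start from in this layout. */
static uint32_t ccm_end(void)
{
    return CCM_ORIGIN + CCM_LENGTH;
}

/* Bytes shared between the heap (growing up from heap_start)
   and the stack (growing down from the end of CCM). */
static uint32_t ccm_heap_stack_room(uint32_t heap_start)
{
    assert(heap_start >= CCM_ORIGIN && heap_start <= ccm_end());
    return ccm_end() - heap_start;
}
```

With variables ending at 0x10001234, the end of CCM is 0x10010000 and 60,876 bytes remain for heap and stack together.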


victor_pv
Tue Aug 08, 2017 7:08 pm
That’s the way it works.
Now, __ccmram_end__ still needs to be calculated, as the way we declare the memory blocks in jtag.ld (and the rest of them) is by ORIGIN and LENGTH, so the linker needs to calculate the end address. No big deal, that’s part of all the linker scripts we use for the different boards. I’ll keep that one called msp_init since that same name is used in the other linker scripts for all the other boards, so I am trying to keep as much as possible common.

What I will do is change the name of __ccmram_end__ to __ccmram_heap_start__.

If we don’t get stack overflows on a Mini with 20KB, 64KB should definitely not cause any issue. Everyone should avoid using malloc and new as much as possible anyway, so the heap should normally not grow. But I remember reading somewhere that some embedded compilers place some large local variables on the heap instead of the stack, and that’s what has me a bit confused about how likely the heap is to grow.

I was reading more on it, and one way to find out if a collision between heap and stack caused a crash is to first fill the memory in between with a pattern, then if the system crashes, the memory can be read to find out if the pattern is gone.
FreeRTOS uses a similar system (method 2):
http://www.freertos.org/Stacks-and-stac … cking.html

I used it once when I wasn’t sure if a task was causing an overflow, and found out it was not.
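
A minimal sketch of the pattern-fill idea. This is not FreeRTOS code, and the pattern value and function names are arbitrary choices for this example. On the target, `gap` would point at the unused area between the top of the heap and the current stack pointer; here it is just an array passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_PATTERN 0xA5A5A5A5u   /* arbitrary, easy to spot in a memory dump */

/* Fill the free gap between heap and stack with a known pattern. */
static void guard_paint(uint32_t *gap, size_t words)
{
    assert(gap != NULL);
    for (size_t i = 0; i < words; i++) {
        gap[i] = GUARD_PATTERN;
    }
}

/* Count the words that still hold the pattern. If this drops to
   zero, heap and stack have very likely run into each other. */
static size_t guard_intact_words(const uint32_t *gap, size_t words)
{
    assert(gap != NULL);
    size_t intact = 0;
    for (size_t i = 0; i < words; i++) {
        if (gap[i] == GUARD_PATTERN) {
            intact++;
        }
    }
    return intact;
}
```

Painting once at startup and reading the count later (or after a crash, with a debugger) also gives a rough high-water mark for how much of the 64KB was ever touched.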

EDIT:
To be consistent with how the ld script labels were so far, I am making the following changes:
1.- .ccmram section renamed to .ccmdata. That section is to hold data (variables) allocated to CCMRam.
2.- The label for the start of the section is renamed to __ccmdata_start__ to be consistent with the normal data section label (__data_start__).
3.- The end of the section is labelled __ccmdata_end__ to be consistent with the normal ram section (__data_end__)
4.- The heap start is at __ccmdata_end__, so clearly indicating it starts where data ends.

