STM32F103 performance when using non blocking DMA

racemaniac
Fri Jun 16, 2017 7:52 pm
I see DMA being used more and more on this forum, and one topic that comes up from time to time is that when DMA is used in non-blocking mode, it shares the memory bus with the CPU and will degrade performance (as mentioned in the reference manual).

However, we never knew how much it would slow things down, so I decided to do some measurements.

The setup: I took the dhrystone code you can find on the forum (I took the zip from the first post) and ran it on a Maple Mini I had near me. It gave me a rating of 35.87 VAX MIPS. (That is quite a bit lower than the 48 mentioned in the first post of that thread, but it's just a baseline and initial test, and seems good enough.)

I then ran various tests with SPI1 in circular DMA mode, transmitting and receiving data to two buffers of 256 bytes each, and repeated each test 5 times (the numbers were always consistent, no deviation at all).

Here are the results:
No DMA: 35.87 VAX MIPS
SPI1 @ DIV16: 36.08 VAX MIPS (for some reason slightly faster than without DMA)
SPI1 @ DIV8: 34.80 VAX MIPS
SPI1 @ DIV4: 34.32 VAX MIPS
SPI1 @ DIV2: 33.93 VAX MIPS
SPI1 & SPI2 both at DIV2: 33.61 VAX MIPS

This is just an initial test, but it seems very encouraging: there is hardly any performance penalty at all, even when using both SPI ports at full speed. To make sure the DMA was actually working, I checked with my logic analyzer that it was doing something; I didn't verify the receiving part, but the transmission was working for sure :).
I used my own DMA library, as I had the code readily available and know how it works :).
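For context, the kind of setup under test can be sketched at register level. This is a hypothetical minimal version, not the library code actually used in the test; the channel mapping (DMA1 channel 2 = SPI1_RX, channel 3 = SPI1_TX) comes from RM0008, and the `DMA_CCR_*` names follow the current CMSIS F1 headers (older headers name the same bits per channel, e.g. `DMA_CCR2_MINC`):

```c
#include "stm32f1xx.h"   // CMSIS device header (assumed available)

static uint8_t txBuffer[256];
static uint8_t rxBuffer[256];

// Circular full-duplex DMA on SPI1. Assumes SPI1 itself (and its GPIOs)
// is already configured and enabled elsewhere.
static void spi1_dma_circular_start(void)
{
    RCC->AHBENR |= RCC_AHBENR_DMA1EN;               // clock the DMA controller

    // SPI1_RX = DMA1 channel 2: peripheral -> memory, circular
    DMA1_Channel2->CPAR  = (uint32_t)&SPI1->DR;
    DMA1_Channel2->CMAR  = (uint32_t)rxBuffer;
    DMA1_Channel2->CNDTR = sizeof rxBuffer;
    DMA1_Channel2->CCR   = DMA_CCR_MINC | DMA_CCR_CIRC | DMA_CCR_EN;

    // SPI1_TX = DMA1 channel 3: memory -> peripheral, circular
    DMA1_Channel3->CPAR  = (uint32_t)&SPI1->DR;
    DMA1_Channel3->CMAR  = (uint32_t)txBuffer;
    DMA1_Channel3->CNDTR = sizeof txBuffer;
    DMA1_Channel3->CCR   = DMA_CCR_MINC | DMA_CCR_CIRC | DMA_CCR_DIR | DMA_CCR_EN;

    SPI1->CR2 |= SPI_CR2_RXDMAEN | SPI_CR2_TXDMAEN; // SPI raises the DMA requests
}
```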


RogerClark
Fri Jun 16, 2017 10:01 pm
Very interesting…

I wonder if there are some other speed tests you can run in the foreground. E.g. RAM access is supposed to be slowed by DMA, so a simple test would be to run memcpy to copy a 4 KB block from RAM to RAM, multiple times.

You could do the same test doing memcpy from a const array in flash to RAM.

But supposedly the RAM performance is the thing that should be impacted by DMA.

BTW, you may need to write your own very simple memcpy, to be sure of what it's doing.


danieleff
Sat Jun 17, 2017 4:11 am
Yes, test with memcpy.

After that, test with DMA mem2mem; that will surely cripple the performance.


racemaniac
Sat Jun 17, 2017 5:37 am
OK, first observation in trying to tank the performance using DMA: the Maple Mini only has one DMA controller, and it has finite bandwidth. The two SPIs together (in 8-bit SPI mode) work pretty well: I see slight degradation in their performance when I enable both (a slight pause in the clock between each byte), but nothing dramatic.
As soon as I enable a mem2mem transfer, however, the DMA controller is overloaded and one of the two SPIs has to give… the other still functions fine, but one of them can hardly transmit data anymore. Even when I trigger that, I see no big difference in the dhrystone benchmarks. The STM32 docs also mention that whatever you do, the CPU and DMA share the memory bus equally, so even if I swamp the DMA, the CPU still has 50% of the memory bus all to itself.

Next tests to do: the memcpy.


danieleff
Sat Jun 17, 2017 5:46 am
racemaniac wrote:As soon as i enable a mem2mem transfer however, the DMA controller is overloaded, and one of both spi’s will have to give…

RogerClark
Sat Jun 17, 2017 6:00 am
racemaniac wrote:

Next tests to do: the memcpy.

racemaniac
Sat Jun 17, 2017 6:22 am
danieleff wrote:racemaniac wrote:As soon as i enable a mem2mem transfer however, the DMA controller is overloaded, and one of both spi’s will have to give…

RogerClark
Sat Jun 17, 2017 6:45 am
Excellent

racemaniac
Sat Jun 17, 2017 7:08 am
Okay, did some timings on the following piece of code (my ‘memcpy’):
Begin_Time = micros();
for (int j = 0; j < 10000; j++)
{
    for (int i = 0; i < 256; i++)
    {
        txBuffer2[i] = rxBuffer2[i];
    }
}
End_Time = micros();

stevestrong
Sat Jun 17, 2017 7:22 am
Maybe a small modification of the memcopy would show a different result:
uint8_t *rxBufPtr;
uint8_t *txBufPtr;
for (int j = 0; j < 10000; j++)
{
    rxBufPtr = rxBuffer2;  // reset each pass, or the pointers run off the end of the buffers
    txBufPtr = txBuffer2;
    for (int i = 0; i < 32; i++)  // split the 256-byte copy into 32 x 8-byte bursts
    {
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
    }
}

RogerClark
Sat Jun 17, 2017 7:55 am
try

void loop() {
    uint8_t *tPtr;
    uint8_t *rPtr;
    uint8_t *tEndPtr = &txBuffer2[256];
    unsigned long m = millis();
    for (int j = 0; j < 10000; j++)
    {
        tPtr = txBuffer2;
        rPtr = rxBuffer2;
        while (tPtr < tEndPtr)
        {
            *rPtr++ = *tPtr++;
        }
    }
    Serial.println(millis() - m);
}
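One general caveat with copy-loop benchmarks like the ones above (an aside from me, not something reported in the thread): if the optimizer can prove the copied data is never read afterwards, it may delete the loop entirely, and GCC will sometimes replace such a loop with a call to memcpy. A volatile destination pointer is a cheap way to force every store to actually hit RAM:

```c
#include <stdint.h>

uint8_t txBuffer2[256];
uint8_t rxBuffer2[256];

/* Byte-wise copy that the optimizer cannot elide: the volatile
 * qualifier on the destination forces all 256 stores to happen. */
void copy_once(void)
{
    volatile uint8_t *dst = txBuffer2;
    const uint8_t *src = rxBuffer2;
    for (int i = 0; i < 256; i++)
        dst[i] = src[i];
}
```

The same trick applies to the unrolled version; otherwise the timings may measure an empty loop rather than memory traffic.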


RogerClark
Sat Jun 17, 2017 7:56 am
Steve – we cross-posted with virtually identical code.

I’d also try different pointer sizes
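The "different pointer sizes" idea in numbers: copying the same 256 bytes through 32-bit pointers issues a quarter of the bus transactions of a byte-wise loop, so it should be noticeably less sensitive to DMA contention. A sketch of mine (assuming both buffers are 4-byte aligned, which is true for ordinary uint32_t arrays):

```c
#include <stdint.h>

/* Copy 256 bytes as 64 x 32-bit words: one bus transaction moves
 * 4 bytes, so the loop issues 64 reads and 64 writes instead of
 * the 256 + 256 a byte-wise copy generates. */
void copy_words(uint32_t *dst, const uint32_t *src)
{
    for (int i = 0; i < 256 / 4; i++)
        dst[i] = src[i];
}
```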


stevestrong
Sat Jun 17, 2017 7:58 am
My version uses the memory bus more intensively, making 8 consecutive accesses.

RogerClark
Sat Jun 17, 2017 8:00 am
stevestrong wrote:My version uses the memory bus more intensively, making 8 consecutive accesses.

racemaniac
Sat Jun 17, 2017 8:17 am
stevestrong wrote:Maybe a small modification of the memcopy would show different result: […]

stevestrong
Sat Jun 17, 2017 8:42 am
I think it would be interesting to see the test software from Victor (http://www.stm32duino.com/viewtopic.php … 811#p24989) combined with your measurements.

racemaniac
Sat Jun 17, 2017 9:05 am
What could also be an interesting test is a bigger STM32. The CBT6 only has one DMA controller, but the bigger ones have two. If I put work on both, each gets a third of the memory bus, so that will probably have a bigger impact on the CPU (unless those models also have a higher-capacity memory bus).

Before doing these tests myself, I did a quick Google search to see if I could find existing benchmarks. I didn't find much, except for this very nice document by ST: http://www.st.com/content/ccc/resource/ … 160362.pdf


RogerClark
Sat Jun 17, 2017 11:02 am
With the speed tests, I think @stevestrong's code is the worst possible case.

E.g. you would not normally write code like that unless you had a small buffer of known length and wanted absolutely the best speed.

I think in normal operation something like memcpy, or perhaps my code, is more likely to be used.
And it's possible that the DMA can interleave while the processor is doing the pointer address comparison and branching. So it's possible that DMA transfers from RAM would have no impact on memcpy etc., if they only take one cycle to fetch from RAM.


racemaniac
Sat Jun 17, 2017 11:25 am
RogerClark wrote:With the speed tests, I think @stevestrongs code is the worst possible case. […]


victor_pv
Mon Jun 26, 2017 3:08 pm
I don't have much time, but I do have a host of other F103 MCUs, up to the RFT6.
If you post the full sketch on GitHub along with what connections I need to make (like a loop between the SPI MISO and MOSI pins, or whatever else, short of a logic analyzer at 36 MHz, as I don't have one), I can run it on different MCUs in the series with more RAM and more DMA controllers.

So far I think the takeaway is that DMA does not starve the CPU, but the CPU can starve the DMA, and the DMA can starve itself. So if you want parallelism on more than one peripheral, it may be good to keep the DMA devices at less than their full speed, to make sure nothing starves for access.

One more test I haven't seen in the thread is whether using 16-bit mode on the SPI port helps, since in theory it would halve the RAM accesses per byte sent.
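The 16-bit-mode point can be put in numbers with a trivial helper (a sketch of the reasoning, not a measurement):

```c
/* Number of DMA bus accesses needed per direction to move one SPI
 * frame buffer: with the DMA transfer size set to 16 bits, a
 * 256-byte frame costs 128 SRAM accesses instead of 256, halving
 * the DMA's load on the memory bus for the same SPI throughput. */
int dma_accesses_per_frame(int frame_bytes, int transfer_bits)
{
    return frame_bytes / (transfer_bits / 8);
}
```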


racemaniac
Mon Jun 26, 2017 3:50 pm
[victor_pv – Mon Jun 26, 2017 3:08 pm] –
So far I think the take out is that DMA does not starve the cpu, but the cpu can starve the DMA, and the DMA can starve itself. So if you want parallelism on more than 1 peripheral, may be good to keep the DMA devices at less than their full speed to make sure nothing starves for access. […]

Hmm, also an interesting test: how much could the CPU starve the DMA? They share the memory bus equally, so I don't expect the CPU to have much more influence on the DMA than vice versa (although the DMA is probably even more memory-dependent than the CPU, so it would be a nice test). But indeed, the DMA can at least completely starve itself ^^'.

I’ll have a look at sharing the code.


racemaniac
Mon Jun 26, 2017 5:08 pm
Here is my code:

Dma test.zip
(12.78 KiB) Downloaded 18 times

danieleff
Mon Jun 26, 2017 5:21 pm
I see “DMA_MEM_2_MEM | DMA_CIRC_MODE” in the code. According to RM0008 13.3.3, “Memory to Memory mode may not be used at the same time as Circular mode”

racemaniac
Mon Jun 26, 2017 8:27 pm
[danieleff – Mon Jun 26, 2017 5:21 pm] –
I see “DMA_MEM_2_MEM | DMA_CIRC_MODE” in the code. According to RM0008 13.3.3, “Memory to Memory mode may not be used at the same time as Circular mode”

Well, it seemed to work at least ^^'
But it's pretty pointless, I think, and as observed it completely starves the DMA for other peripherals.
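Since RM0008 forbids combining MEM2MEM with CIRC, a legal mem2mem "stream" has to be restarted in software each time it completes. A hypothetical register-level sketch using current CMSIS names (channel 1 picked arbitrarily; mem2mem works on any free channel):

```c
#include "stm32f1xx.h"   // CMSIS device header (assumed available)

/* Kick off (or re-kick) one non-circular 32-bit mem2mem transfer.
 * The channel must be disabled before CNDTR can be reloaded. */
void mem2mem_restart(uint32_t *dst, const uint32_t *src, uint16_t words)
{
    DMA1_Channel1->CCR &= ~DMA_CCR_EN;        /* disable to allow reprogramming  */
    DMA1_Channel1->CPAR  = (uint32_t)src;     /* "peripheral" side acts as source */
    DMA1_Channel1->CMAR  = (uint32_t)dst;     /* memory side is the destination   */
    DMA1_Channel1->CNDTR = words;
    DMA1_Channel1->CCR   = DMA_CCR_MEM2MEM | DMA_CCR_PINC | DMA_CCR_MINC
                         | DMA_CCR_PSIZE_1 | DMA_CCR_MSIZE_1   /* 32-bit accesses */
                         | DMA_CCR_EN;        /* go */
}
```

Completion can be polled via the TCIF1 flag in DMA1->ISR or handled in the channel interrupt before restarting.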

