However, we never knew how much it would slow things down, so I decided to do some measurements.
The setup: I took the Dhrystone code you can find on the forum (I took the zip from the first post) and ran it on a Maple Mini I had near me. It gave me a rating of 35.87 VAX MIPS, which is quite a bit lower than the 48 mentioned in the first post of the thread, but as a baseline and initial test it seems good enough.
I then ran various tests with SPI1 in circular DMA mode, transmitting and receiving data to two buffers of 256 bytes each, and repeated each test 5 times (the numbers were always consistent, with no deviation at all).
Here are the results:
No DMA: 35.87 VAX MIPS
SPI1 @ DIV16: 36.08 VAX MIPS (for some reason slightly faster than without DMA)
SPI1 @ DIV8: 34.8 VAX MIPS
SPI1 @ DIV4: 34.32 VAX MIPS
SPI1 @ DIV2: 33.93 VAX MIPS
SPI1 & SPI2 both @ DIV2: 33.61 VAX MIPS
This is just an initial test, but it seems very encouraging: there is hardly any performance penalty at all, even when using both SPI ports at full speed. To make sure the DMA was actually working, I checked with my logic analyzer that the ports were actually doing something. I didn't check the receiving part, but the transmission was definitely working.
I used my own DMA library, as I had the code readily available and know how it works.
I wonder if there are some other speed tests you can run in the foreground. For example, RAM access is supposed to be slowed by DMA, so a simple test would be to run memcpy to copy a 4 KB block from RAM to RAM, multiple times.
You could do the same test doing memcpy from a const array in flash to RAM,
but supposedly RAM performance is the thing that should be impacted by DMA.
BTW, you may need to write your own very simple memcpy, to be sure of what it's doing.
After that, test with DMA mem2mem; that will surely cripple the performance.
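A minimal sketch of the "very simple memcpy" suggested above (the name simple_memcpy is mine, not from the thread; on the board you would call it in a loop between two micros() reads, as in the timing snippets further down):

```cpp
#include <cstdint>
#include <cstddef>

// Deliberately naive byte-by-byte copy, so there is no doubt about what it
// does on each iteration: one load and one store per byte. A library memcpy
// may use word-sized or unrolled copies that hide the per-access cost we
// want to measure here.
static void simple_memcpy(uint8_t *dst, const uint8_t *src, size_t n) {
    while (n--) {
        *dst++ = *src++;
    }
}
```

Timing, say, 1000 copies of a 4 KB block with and without the SPI DMA running should then show how much the DMA eats into RAM bandwidth.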
As soon as I enable a mem2mem transfer, however, the DMA controller is overloaded and one of the two SPIs has to give; the other still functions fine, but one of the two is hardly able to transmit data anymore. Even when I trigger that, I see no big difference in the Dhrystone benchmarks. The STM32 docs also mention that whatever you do, the CPU and DMA share the memory controller equally, so even if I swamp the DMA, the CPU still has 50% of the memory controller all to itself.
Next tests to do: the memcpy.
Begin_Time = micros();
for (int j = 0; j < 10000; j++)
{
    for (int i = 0; i < 256; i++)
    {
        txBuffer2[i] = rxBuffer2[i];
    }
}
End_Time = micros();
uint8_t * rxBufPtr;
uint8_t * txBufPtr;
for (int j = 0; j < 10000; j++)
{
    rxBufPtr = rxBuffer2; // reset every pass, otherwise the pointers run past the 256-byte buffers
    txBufPtr = txBuffer2;
    for (int i = 0; i < 32; i++) // split the 256-byte copy into 32 * 8 single-byte writes
    {
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
        *txBufPtr++ = *rxBufPtr++;
    }
}

void loop() {
// put your main code here, to run repeatedly:
  uint8_t *tPtr;
  uint8_t *rPtr;
  uint8_t *tEndPtr = &txBuffer2[256];
  unsigned long m = millis();
  for (int j = 0; j < 10000; j++)
  {
    tPtr = txBuffer2;
    rPtr = rxBuffer2;
    while (tPtr < tEndPtr)
    {
      *rPtr++ = *tPtr++;
    }
  }
  Serial.println(millis() - m);
}
I'd also try different pointer sizes (16- and 32-bit instead of 8-bit).
Before doing these tests myself, I did a quick Google search to see if I could find existing benchmarks. I didn't find much, except for this very nice document by ST: http://www.st.com/content/ccc/resource/ … 160362.pdf
E.g. you would not normally write code like that unless you had a small buffer of known length and wanted absolutely the best speed.
I think in normal operation something like memcpy, or perhaps my code, is more likely to be used.
And it's possible that the DMA can interleave while the processor is doing the pointer address comparison and branching. So it's possible that DMA transfers from RAM would have no impact on memcpy etc., if they only take one cycle to fetch from RAM.
One more test I haven't seen in the thread is whether using 16-bit mode on the SPI port helps, since in theory it would halve the number of RAM accesses per byte sent.
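The expected saving is easy to put numbers on. A back-of-the-envelope helper (the function name is mine; the example rate assumes a 72 MHz F103 with SPI1 at DIV2, i.e. 36 Mbit/s):

```cpp
// DMA memory accesses per second needed to feed an SPI port, as a function
// of the bit rate on the wire and the frame size (8- or 16-bit on the F103).
double dma_accesses_per_s(double sck_hz, int bits_per_frame) {
    double payload_bytes_per_s = sck_hz / 8.0;       // wire throughput in bytes
    double bytes_per_access = bits_per_frame / 8.0;  // bytes moved per DMA access
    return payload_bytes_per_s / bytes_per_access;
}
// At 36 Mbit/s: 4.5 M accesses/s with 8-bit frames, 2.25 M with 16-bit frames.
```

Halving the access rate should roughly halve the DMA's share of the bus traffic for the same SPI throughput.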
[victor_pv – Mon Jun 26, 2017 3:08 pm] –
I don’t have much time, but I do have a host of other F103 MCUs, up to the RFT6.
If you post the full sketch on GitHub, along with what connections I need to make (like a loop between the SPI MISO and MOSI, or whatever else, short of a logic analyzer at 36 MHz, as I don't have one), I can run it on different MCUs in the series with more RAM and DMA controllers.
So far I think the takeaway is that DMA does not starve the CPU, but the CPU can starve the DMA, and the DMA can starve itself. So if you want parallelism on more than one peripheral, it may be good to keep the DMA devices at less than their full speed to make sure nothing starves for access.
Hmm, also an interesting test: how much could the CPU starve the DMA? They share the memory bus equally, so I don't expect the CPU to have much more influence on the DMA than vice versa (although the DMA is probably even more memory-dependent than the CPU, so it would be a nice test). But indeed, the DMA can at least completely starve itself ^^'.
I’ll have a look at sharing the code.
- Dma test.zip
[danieleff – Mon Jun 26, 2017 5:21 pm] –
I see “DMA_MEM_2_MEM | DMA_CIRC_MODE” in the code. According to RM0008 13.3.3, “Memory to Memory mode may not be used at the same time as Circular mode”
Well, it seemed to work at least ^^'
But it's pretty pointless, I think, and as observed it completely starves the DMA for the other peripherals.
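Since the library apparently accepts the combination silently, a sanity check like the one below could catch it at setup time. The flag names come from the code quoted above, but the bit values here are placeholders, not the library's real definitions:

```cpp
#include <cstdint>

// Placeholder bit values -- only the flag names come from the quoted code.
constexpr uint32_t DMA_MEM_2_MEM = 1u << 0;
constexpr uint32_t DMA_CIRC_MODE = 1u << 1;

// RM0008 13.3.3: Memory-to-Memory mode may not be used at the same time as
// Circular mode, so reject any mode word that sets both flags.
constexpr bool dma_mode_valid(uint32_t mode) {
    return !((mode & DMA_MEM_2_MEM) && (mode & DMA_CIRC_MODE));
}
```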




