SPI DMA

victor_pv
Sat Apr 29, 2017 7:52 pm
I am adding SPI.dmaSend and SPI.dmaTransfer as discussed in another thread.
Now, for doing so I need to add a number of defines, and those defines are going to be different depending on the series (things like the SPI1 DMA controller and channel to use, IRQ line, same for SPI2, SPI3…SPI6).

Should I put all those defines in the SPI.h file, or rather add another .h file and include it from SPI.h?

EDIT:
I got dmaTransfer running fine in polling (blocking) mode, similar to the libmaple F1 core, at least for SPI1.
Tested with SPI1 and SPI2.
Tested by Pito with sdFat for F1 (F1 file needs modification of an #ifdef to compile for this core.)
It needs work to make it compatible with more MCUs and add non-blocking mode with callbacks.

EDIT2:
I have added defines for almost all the series, and tested compilation, but have not had the chance to test functionality except for the F4.
I have F1s, so I can test that. If anyone has an L0, L1, L4 or F3 series board and can test it, please let me know if it works.

EDIT3:
New corrections for L0 and L4 series, which needed some extra settings.
Latest working version in this branch:
https://github.com/victorpv/STM32GENERIC/tree/SPI-DMA


Pito
Sun Apr 30, 2017 12:06 am
This is a F407 SDbench measurement (168MHz, BlackZE) for record – SPI1 freq vs. rdwr speed (Sammy EVO 8GB, CL10), SdFatEX, buff size 512b:

[Image: SPI no DMA rdwr speed.JPG]

victor_pv
Sun Apr 30, 2017 3:21 am
First working version in this branch:
https://github.com/victorpv/STM32GENERIC/tree/SPI-DMA

Sent a PR to Daniel in case he wants to add it so more people can start testing/using it.

I tested it very basically, just sending and receiving data on the same port and checking that I get what I sent. So far I have tested only SPI1. I need to find the pins for SPI2 and SPI3 and repeat the tests.
Would appreciate any feedback.

The main changes are:
Adding hdma structures for the TX and RX channels.
Enabling DMA1 and 2 peripheral clocks during .begin
A number of defines in the .h file listing what DMA Controller/Stream combination applies to each port.
In the F4 series there are multiple possible streams for each peripheral; I just picked the same ones the HAL MX chooses by default.

Needed:
Support for other series in the defines
In the dmaSend and dmaTransfer functions, FIFO and other settings need to be optional since they don’t apply to F1 and F3, for example. Not sure of the best way to implement this. I really don’t like having a bunch of defines thrown in the middle of the code, so I may just move the hdma settings to a separate function and keep all the conditional compiling in that single function.
Callback support. I need to add ISRs first that can manage any peripheral and figure out which one needs servicing (spi1 to 6 in some series).
Decide whether all DMA defines should be in a separate file or keep them in SPI.h


victor_pv
Sun Apr 30, 2017 3:24 am
Pito wrote:This is a F407 measurement (168MHz, BlackZE) for record – SPI1 freq vs. rdwr speed (Sammy EVO 8GB, CL10), SdFatEX, buff size 512b:
SPI no DMA rdwr speed.JPG
Double-checked the SPI clocks with LA.

PS: I think this in SPI.cpp
if (settings.clock > apb_freq / 2) {
spiHandle.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_2;
} else if (settings.clock > apb_freq / 4) {
spiHandle.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_4;


danieleff
Sun Apr 30, 2017 4:26 am
Thanks, I will fix the settings.clocks.

HAL_DMA_IRQHandler() is meant to be called from the IRQ handler, check the Cube-generated stm32f4xx_it.c: `extern "C" void DMA2_Stream3_IRQHandler(void) {HAL_DMA_IRQHandler(&hdma_spi1_tx);}` (This will also call the hdma_spi_rx->XferCpltCallback (which is set to the weak SPI_DMATransmitReceiveCplt() by HAL_SPI_TransmitReceive_DMA (which is not ideal BTW, I want custom callbacks)))

I was wondering how you tackle the requests – streams – channels for every chip problem :)


victor_pv
Sun Apr 30, 2017 5:56 am
danieleff wrote:Thanks, I will fix the settings.clocks.

HAL_DMA_IRQHandler() is meant to be called from the IRQ handler, check the Cube-generated stm32f4xx_it.c: `extern "C" void DMA2_Stream3_IRQHandler(void) {HAL_DMA_IRQHandler(&hdma_spi1_tx);}` (This will also call the hdma_spi_rx->XferCpltCallback (which is set to the weak SPI_DMATransmitReceiveCplt() by HAL_SPI_TransmitReceive_DMA (which is not ideal BTW, I want custom callbacks)))

I was wondering how you tackle the requests – streams – channels for every chip problem :)


victor_pv
Sun Apr 30, 2017 6:43 am
I ran some speed tests with SPI2 to compare DMA vs non-DMA speeds.
I only ran the tests with a big buffer of 8KB for both.

8KB buffer
Speed / No DMA / DMA
21Mb / 8977.136uS / 3127.777uS
5.25Mb / 17560.336uS / 12490.117uS

EDIT: The table feature in phpBB is awful…
Used the code Pito posted in another thread for the uS measure:

elapsed = CpuGetTicks(); // Measure the 1ms delay
spi2.dmaTransfer(array_out, array_in, 8192);
//spi2.transfer(array_out, 8192);
elapsed = CpuGetTicks()- elapsed; // How many CPUTicks?
nanos = 5.9524 * elapsed; // Convert to nanoseconds
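
The 5.9524 constant above is one CPU cycle in nanoseconds at 168 MHz (1e9 / 168e6). A minimal sketch of the same conversion for any core clock follows. The helper names `ticksToNanos` and `elapsedTicks` are hypothetical, and it assumes the tick source is a free-running 32-bit cycle counter, which the thread does not confirm for CpuGetTicks().

```cpp
#include <cassert>
#include <cstdint>

// Convert a CPU cycle count to nanoseconds for a given core clock.
// At 168 MHz one cycle is 1e9 / 168e6 = 5.9524 ns, the constant used above.
double ticksToNanos(uint32_t ticks, double cpuHz) {
    assert(cpuHz > 0.0);
    return static_cast<double>(ticks) * (1e9 / cpuHz);
}

// Elapsed ticks between two readings of a free-running 32-bit counter
// (assumption). Unsigned subtraction stays correct across one counter wrap.
uint32_t elapsedTicks(uint32_t start, uint32_t end) {
    return end - start;
}
```

On a 72 MHz F1 the same call would be `ticksToNanos(elapsed, 72e6)`, so the hard-coded constant does not need to change per board.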


danieleff
Sun Apr 30, 2017 6:56 am
OK on interrupts.

FYI your code works on my F7.

I think that F2/F4/F7 is the same (stream+channel), F0/F1/F3/L1 is the same (channel), L0/L4 is the same (channel+request).
Also if the STM32 engineers were good we might get lucky, and they are always on the same (SPI1_TX is always on DMA2_Stream3 or DMA2_Stream5 on F2/F4/F7…)


victor_pv
Sun Apr 30, 2017 7:04 am
I get a feeling they pretty much copy/paste hardware blocks between series, so hopefully they are the same, but if not, I don’t think it is too much trouble adding them, since the reference manuals have a nice table for the DMA channels.
Glad to know it works on the F7. I have one I got last year for free gathering dust on a shelf; now I can put it to some use :)

Pito
Sun Apr 30, 2017 11:05 am
21Mb / 8977.136uS / 3127.777uS
The theoretical DMA transfer time for an 8192 bytes large buffer at 21MHz SPI speed is 3120.762uS :)
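
The theoretical figure is just bits divided by clock rate: 8192 bytes × 8 bits / 21 MHz = 3120.762 µs. A small sketch of that arithmetic, with hypothetical helper names and assuming 8 clock cycles per byte with no gaps between bytes:

```cpp
#include <cassert>
#include <cstddef>

// Ideal time, in microseconds, to clock `bytes` bytes over SPI at `sckHz`,
// assuming 8 bits per byte and no idle gaps between bytes.
double idealTransferMicros(std::size_t bytes, double sckHz) {
    assert(sckHz > 0.0);
    return static_cast<double>(bytes) * 8.0 * 1e6 / sckHz;
}

// Fraction of the measured time spent actually clocking bits (1.0 = ideal).
double busEfficiency(double idealMicros, double measuredMicros) {
    assert(measuredMicros > 0.0);
    return idealMicros / measuredMicros;
}
```

With the numbers measured above, the DMA transfer (3127.777 µs) comes out at about 99.8% of the ideal rate, while the non-DMA transfer (8977.136 µs) is at about 35%.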

Hmm, as a test I want to replace the receive in SdFat’s SdSpiDriver.h with dmaTransfer
uint8_t receive(uint8_t* buf, size_t n) {

SPI.dmaTransfer( buf, buf, n);
// SPI.dmaTransfer( 0, buf, n);

// for (size_t i = 0; i < n; i++) {
// buf[i] = SPI.transfer(0XFF);
// }

return 0;
}


michael_l
Sun Apr 30, 2017 12:45 pm
Victor_pv: nice work, I can test when I got some more free time. I’ll try to get spiffs working first with the on-board flash.

Is there any sense in creating a small test set for SPI and SPI DMA between SPI1 and SPI2 to verify how it works ?


victor_pv
Sun Apr 30, 2017 1:29 pm
Pito wrote:
SPI.dmaTransfer( buf, buf, n);
// SPI.dmaTransfer( 0, buf, n);


Pito
Sun Apr 30, 2017 1:31 pm
I updated the results.. About 2x faster wr/rd.
Yes, we need the NULL buffer :)

This is a F407 SDbench measurement (168MHz, BlackZE) for record – SPI1 freq vs. rdwr speed (Sammy EVO 8GB, CL10), SdFatEX, buffer size 512bytes:

[Image: SPI vs rdwr speed.JPG]

victor_pv
Sun Apr 30, 2017 1:32 pm
michael_l wrote:Victor_pv: nice work, I can test when I got some more free time. I’ll try to get spiffs working first with the on-board flash.

Is there any sense in creating a small test set for SPI and SPI DMA between SPI1 and SPI2 to verify how it works ?


Pito
Sun Apr 30, 2017 2:13 pm
@victor, OK, changed here, and it works. Updated above chart as the read speed is much higher now :) +20%
BTW what is the MINC=1 in
SPI.dmaSend(const_cast<uint8_t*>(buf), n, 1);
Memory Increment??

Pito
Sun Apr 30, 2017 2:40 pm
If the sdfat library was modified to switch thread for a ms when waiting due to latency I would guess that could give another good chunk of cpu time back to sketches that write. My wav player only reads, I don’t think there is much latency in reads, right?
Do you need to test SDBench on F1? I can do..

victor_pv
Sun Apr 30, 2017 2:54 pm
Pito wrote:@victor, OK, changed here, and it works. Updated above chart as the read speed is much higher now :) +20%
BTW what is the MINC=1 in
SPI.dmaSend(const_cast<uint8_t*>(buf), n, 1);
Memory Increment??

Pito
Sun Apr 30, 2017 3:54 pm
My wav player only reads, I don’t think there is much latency in reads, right?
Write latencies with SDcards are huge, 5-250ms, typically 3-50ms all the time, random occurrence..

With my above experiments I did not touch the SdFat’s SdSpiSTM32F1 driver file (even I’ve spent few hours with it..).
I simply replaced the receive and send functions (see above) in the SdSpiDriver.
That is the big Q whether we actually need the F1 driver file or not. When you set EX mode and DMA mode hardwired, then you do not need that SdSpiSTM32F1 stuff, I think.
So simply in the SpiSdDriver make from line 231+ 73+ ifdefs around the receive() and send()..
For compatibility reasons I tried to rename the SdSpiSTM32F1 to SdSpiSTM32F4, set __STM32F4__ inside (we need that flag in flags), and messed with internals, but it did not work (because of the NULL pointer issue above), so I stopped and went directly into the SdSpiDriver instead.
When there will be the SdSpiSTM32F4 with __STM32F4__ inside, it may work now. You have also to play with __STM32F4__ in the SdFatConfig.h too (maybe 2-3x). There is a lot of mess inside because of the AVR and various modi.
But there is only 1 mode which works with F1/4 – it is the EXTENDED mode with DMA.


racemaniac
Sun Apr 30, 2017 4:26 pm
Pito wrote:My wav player only reads, I don’t think there is much latency in reads, right?
Write latencies with SDcards are huge, 5-250ms, typically 3-50ms all the time, random occurrence..

victor_pv
Sun Apr 30, 2017 4:53 pm
racemaniac wrote:Pito wrote:My wav player only reads, I don’t think there is much latency in reads, right?
Write latencies with SDcards are huge, 5-250ms, typically 3-50ms all the time, random occurrence..

racemaniac
Sun Apr 30, 2017 5:01 pm
victor_pv wrote:If your output driver is not using DMA too, you should look into it, especially if you use HAL drivers for it; they have a fair amount of overhead.
What fat library are you using with the f411?

ChrisMicro
Sun Apr 30, 2017 5:34 pm
Could someone of you add a simple working example sketch?
I did some for I2S without DMA.

Pito
Sun Apr 30, 2017 5:40 pm
SdCard latencies – you can check the latencies with SDbench. As micros() works well today, you will see the numbers – maxLatency, min, average – while you write/read say a 5MB file. The bigger the file you write, the longer the latencies you may encounter.
The write latencies are huge, as the SD card does its housekeeping (i.e. wear leveling, etc.) almost randomly. Read latencies are minimal, about 1.5ms usually. So writing is the critical part – i.e. for 1MB/s sustained write you would typically need a ~100kB FIFO buffer (100ms outage covered). There is the “SDCard Logger with FiFO” topic which elaborates on that in detail.
Here for example:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
4381.84,16924,108,115
4508.28,8079,108,112

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3816.55,2083,133,133
3819.47,1322,133,133
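
The ~100kB FIFO estimate mentioned in this post is the data rate multiplied by the longest stall to be covered: 1 MB/s × 100 ms = 100,000 bytes. A minimal sketch of that sizing rule, with a hypothetical function name and assuming the data source produces bytes at a constant rate while the card is busy:

```cpp
#include <cassert>
#include <cstddef>

// Bytes that pile up while the card is busy: data rate * stall duration.
// The result is rounded to the nearest byte.
std::size_t fifoBytesNeeded(double bytesPerSecond, double stallMillis) {
    assert(bytesPerSecond >= 0.0 && stallMillis >= 0.0);
    return static_cast<std::size_t>(bytesPerSecond * stallMillis / 1000.0 + 0.5);
}
```

For the worst write latency in the run above (16924 µs), the same 1 MB/s stream would need about 17 kB of buffering; sizing for 100 ms leaves headroom for the rarer, longer stalls.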


danieleff
Sun Apr 30, 2017 6:20 pm
victor_pv wrote:@Daniel, in your defines for MOSI, MISO, etc, you are using the PB4 and PB5 ports, one of them is the same used for SWO. I found out because I have to use SWO due to lack of USB in this board.
Is there a reason to default to that rather than the other set of pins? If there is no reason or benefit, I would suggest changing it so SWO can be used at the same time as SPI1. Not a big deal for most people, but a big annoyance for my 407 board.

victor_pv
Sun Apr 30, 2017 6:28 pm
Yeah, it’s on an F407VET board, not the black one but the blue one, but I don’t think it’s worth making a variant for it alone, and on mine perhaps the SPI flash is connected to those pins too.
I’ll just work around it then.

victor_pv
Sun Apr 30, 2017 6:29 pm
racemaniac wrote:victor_pv wrote:If your output driver is not using DMA too, you should look into it, especially if you use HAL drivers for it; they have a fair amount of overhead.
What fat library are you using with the f411?

racemaniac
Sun Apr 30, 2017 6:31 pm
victor_pv wrote:racemaniac wrote:victor_pv wrote:If your output driver is not using DMA too, you should look into it, especially if you use HAL drivers for it; they have a fair amount of overhead.
What fat library are you using with the f411?

danieleff
Sun Apr 30, 2017 6:43 pm
re: NULL send buffer in dmaTransfer(): wouldn’t it basically be just receiving? As in (a currently nonexistent) `uint8_t SPIClass::dmaReceive(uint8_t *receiveBuf, uint16_t length)`?, and use HAL_SPI_Receive_DMA

Pito
Sun Apr 30, 2017 10:37 pm
SPI always sends and receives in parallel.

victor_pv
Sun Apr 30, 2017 10:50 pm
danieleff wrote:re: NULL send buffer in dmaTransfer(): wouldn’t it basically be just receiving? As in (a currently nonexistent) `uint8_t SPIClass::dmaReceive(uint8_t *receiveBuf, uint16_t length)`?, and use HAL_SPI_Receive_DMA

Pito
Mon May 01, 2017 11:31 am
While trying to build SdBench for MapleMini, I get
SPI\src\SPI.cpp:15:2: error: 'DMA_Stream_TypeDef' was not declared in this scope
DMA_Stream_TypeDef *_StreamTX;

victor_pv
Mon May 01, 2017 2:02 pm
Pito wrote:While trying to build SdBench for MapleMini, I get
SPI\src\SPI.cpp:15:2: error: 'DMA_Stream_TypeDef' was not declared in this scope
DMA_Stream_TypeDef *_StreamTX;

victor_pv
Mon May 01, 2017 4:36 pm
danieleff wrote:OK on interrupts.

FYI your code works on my F7.

I think that F2/F4/F7 is the same (stream+channel), F0/F1/F3/L1 is the same (channel), L0/L4 is the same (channel+request).
Also if the STM32 engineers were good we might get lucky, and they are always on the same (SPI1_TX is always on DMA2_Stream3 or DMA2_Stream5 on F2/F4/F7…)


danieleff
Tue May 02, 2017 3:36 pm
I think the #define SPI1_StreamTX 1_Channel3… stuff should be in a central stm32_dma.h.
That way it will be easier to see what SPI/SDIO/I2S use, and check for conflicts.

victor_pv
Tue May 02, 2017 4:44 pm
danieleff wrote:I think the #define SPI1_StreamTX 1_Channel3… stuff should be in a central stm32_dma.h.
That way it will be easier to see what SPI/SDIO/I2S use, and check for conflicts.

Pito
Tue May 02, 2017 5:45 pm
Testing the F1 SPI with SdFat on Maple Mini.
This is with STM32F1 (SdFatEX, DMA, 36MHz SPI) libmaple core
File size 6 MB
Buffer size 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3533.34,25083,136,143
3594.74,15522,136,141

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3257.12,1548,155,156
3258.89,1359,155,156

Done
Type any character to start


Pito
Tue May 02, 2017 6:20 pm
With slower clock
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1946.03,26753,252,261
1976.16,12442,252,257

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
774.14,2547,658,659
774.04,2546,658,659

Done


victor_pv
Tue May 02, 2017 7:06 pm
Pito wrote:With slower clock
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1946.03,26753,252,261
1976.16,12442,252,257

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
774.14,2547,658,659
774.04,2546,658,659

Done


Pito
Tue May 02, 2017 7:20 pm
The min latency is basically the time to transfer a block (512bytes). The ratio is 2.6, that multiplied by 774 is 2000.

victor_pv
Tue May 02, 2017 8:03 pm
Pito wrote:The min latency is basically the time to transfer a block (512bytes). The ratio is 2.6, that multiplied by 774 is 2000.

Pito
Tue May 02, 2017 11:06 pm
:evil:
The read values above were with buf[i] = SPI.transfer(0XFF);

victor_pv
Wed May 03, 2017 12:05 am
Pito wrote::evil:
The read values above were with buf[i] = SPI.transfer(0XFF);

Pito
Wed May 03, 2017 12:18 am
For Maple Mini. Eclipse’s editor shows it.. So maybe a false positive :)

victor_pv
Wed May 03, 2017 12:25 am
Pito wrote:For Maple Mini. Eclipse’s editor shows it.. So maybe a false positive :)

Pito
Wed May 03, 2017 8:06 am
No change with newest SPI and F1 on MM..
Read does not work here with F1..
With F4 it works.

ag123
Wed May 03, 2017 8:28 am
OT: it seemed i’ve had some ‘wire issues’ playing with SDFatEX on the original F1 libmaple core, i’m not sure if sometimes the wires itself could after all be creating all that havoc :lol:
http://www.stm32duino.com/viewtopic.php … 420#p27420

victor_pv
Wed May 03, 2017 1:49 pm
Pito wrote:No change with newest SPI and F1 on MM..
Read does not work here with F1..
With F4 it works.

Pito
Wed May 03, 2017 4:59 pm
Correct. I cannot get Reading the file working in the SdBench with F1 with DMA (MapleMini).

victor_pv
Wed May 03, 2017 6:24 pm
Pito wrote:Correct. I cannot get Reading the file working in the SdBench with F1 with DMA (MapleMini).

victor_pv
Thu May 04, 2017 2:29 am
I can’t see anything wrong with the F1 code.
I’ll have to wire a board to an sdcard and the jlink and see what I can find.

Pito
Thu May 04, 2017 12:56 pm
There must be something wrong with F1 SPI DMA version.
This is the SdBench with F1 DMA libmaple:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1972.26,21385,251,258
1996.68,8447,251,255
1987.15,11002,251,256

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2044.86,1767,248,249
2046.53,1766,248,249
2046.53,1765,248,249


victor_pv
Thu May 04, 2017 1:48 pm
Pito wrote:There must be something wrong with F1 SPI DMA version.
This is the SdBench with F1 DMA libmaple:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1972.26,21385,251,258
1996.68,8447,251,255
1987.15,11002,251,256

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2044.86,1767,248,249
2046.53,1766,248,249
2046.53,1765,248,249


Pito
Thu May 04, 2017 4:18 pm
This is F1 Daniel’s core and no DMA:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
368.38,90734,1302,1365
375.24,90747,1302,1352
377.19,366597,1302,1354

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
error: data check


danieleff
Thu May 04, 2017 5:39 pm
In here: http://www.stm32duino.com/viewtopic.php … 038#p27339

This does not have a return: `uint8_t SdSpiAltDriver::receive() {`

And change this:
SPI.transfer(const_cast<uint8_t*>(buf), n);


Pito
Thu May 04, 2017 5:57 pm
With changes (all below with Sandisk) – Daniel’s F1 no DMA, Sandisk fresh formatted
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
781.69,22685,640,651
790.46,12851,640,645
792.59,11424,640,643

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
774.18,1916,658,659
774.18,1917,658,659
774.18,1917,658,659


danieleff
Thu May 04, 2017 7:26 pm
with DMA it seems to work if at the end of `uint8_t SPIClass::dmaSend(uint8_t *transmitBuf, uint16_t length, bool minc) {` I put `HAL_DMA_DeInit(&hdma_spi_tx);`.
Strange.

Pito
Thu May 04, 2017 7:51 pm
I’ve tried but negative..

victor_pv
Fri May 05, 2017 1:28 pm
Pito wrote:I’ve tried but negative..

Pito
Fri May 05, 2017 2:25 pm
Be careful with Daniel’s F1 DMA off – it writes but the filesystem went corrupt – the Sdbench output file contains “ABCD..” strings so it is easy to check. Always try to open the file on the sdcard after the Sdbench under win/lin.

victor_pv
Fri May 05, 2017 6:43 pm
Pito wrote:Be careful with Daniel’s F1 DMA off – it writes but the filesystem went corrupt – the Sdbench output file contains “ABCD..” strings so it is easy to check. Always try to open the file on the sdcard after the Sdbench under win/lin.

victor_pv
Sat May 06, 2017 6:06 pm
Still have not figured out why the F1 corrupts the sdcard, but so far I have found the F1 HAL MX has what looks to me like a bug on this function:
HAL_SPI_TransmitReceive_DMA

In all the other series, there is this comment followed by two lines which do what the comment says, but in the F1 there is the comment, but not the code to actually do it:
/* Set the SPI Tx DMA transfer complete callback as NULL because the communication closing
is performed in DMA reception complete callback */
hspi->hdmatx->XferHalfCpltCallback = NULL;
hspi->hdmatx->XferCpltCallback = NULL;


Pito
Sat May 06, 2017 9:27 pm
I would start with DMA off as it corrupts as well. Maybe SPI parameters are set wrong?

victor_pv
Sat May 06, 2017 9:50 pm
Pito wrote:I would start with DMA off as it corrupts as well. Maybe SPI parameters are set wrong?

danieleff
Sun May 07, 2017 1:58 pm
Added a tests code to examples/board testing/SPISelfTest

Tested on maple mini. Connect SPI1 pins to SPI2 (they are set as master and slave). It sends using SPI.transfer, SPI.dmaTransfer (with and without a NULL send buffer), and SPI.dmaSend. Then it writes the bytes sent and received on the master/slave SPI to Serial.

Observations: communication works fine until dmaSend; then it is corrupted.
If dmaSend is commented out but dmaTransfer is used, everything works fine (almost…).
Actually implementing the DMA IRQ (with low priority!!! else the SPI slave gets borked up) seems to fix the problem. But then HAL_DMA_PollForTransfer does not work and you need to set custom flags.


victor_pv
Sun May 07, 2017 2:43 pm
danieleff wrote:Added a tests code to examples/board testing/SPISelfTest

Tested on maple mini. Connect SPI1 pins to SPI2 (they are set as master and slave). It sends using SPI.transfer, SPI.dmaTransfer (with and without a NULL send buffer), and SPI.dmaSend. Then it writes the bytes sent and received on the master/slave SPI to Serial.

Observations: communication works fine until dmaSend; then it is corrupted.
If dmaSend is commented out but dmaTransfer is used, everything works fine (almost…).
Actually implementing the DMA IRQ (with low priority!!! else the SPI slave gets borked up) seems to fix the problem. But then HAL_DMA_PollForTransfer does not work and you need to set custom flags.


victor_pv
Sun May 07, 2017 4:38 pm
victor_pv wrote:danieleff wrote:Added a tests code to examples/board testing/SPISelfTest

Tested on maple mini. Connect SPI1 pins to SPI2 (they are set as master and slave). It sends using SPI.transfer, SPI.dmaTransfer (with and without a NULL send buffer), and SPI.dmaSend. Then it writes the bytes sent and received on the master/slave SPI to Serial.

Observations: communication works fine until dmaSend; then it is corrupted.
If dmaSend is commented out but dmaTransfer is used, everything works fine (almost…).
Actually implementing the DMA IRQ (with low priority!!! else the SPI slave gets borked up) seems to fix the problem. But then HAL_DMA_PollForTransfer does not work and you need to set custom flags.


Pito
Sun May 07, 2017 5:42 pm
Those numbers are quite low.. Also max latency is high..
At 36MHz SPI and SdFatEX and DMA you should get ~3.5MB/s rd/wr..

Pito
Sun May 07, 2017 9:14 pm
This is with your newest SPI.h and SPI.cpp, DMA on, BLUE F103ZE @72MHz, 36MHz SPI, Sandisk 16GB CL10:
File size 5 MB
Buffer size 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3576.31,7794,139,141
3566.11,8057,139,141
3563.56,8054,139,141

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3162.35,930,159,160
3164.35,903,159,160
3162.35,903,159,160

Done


victor_pv
Sun May 07, 2017 9:17 pm
Pito wrote:This is with your newest SPI.h and SPI.cpp (BLUE F103ZE @72MHz, 36MHz SPI, Sandisk CL10):
File size 5 MB
Buffer size 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3576.31,7794,139,141
3566.11,8057,139,141
3563.56,8054,139,141

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3162.35,930,159,160
3164.35,903,159,160
3162.35,903,159,160

Done


Pito
Sun May 07, 2017 9:21 pm
So the current status is:
1. F1 DMA on – the bug in F1 DMA found and fixed,
2. F1 none DMA – needs debug/fix by Daniel?
3. F4 DMA is ok (?),
4. F4 none DMA – needs debug/fix by Daniel?
Is that correct?

victor_pv
Sun May 07, 2017 10:02 pm
Pito wrote:So the current status is:
1. F1 DMA on – the bug in F1 DMA found and fixed,
2. F1 none DMA – needs debug/fix by Daniel?
3. F4 DMA is ok (?),
4. F4 none DMA – needs debug/fix by Daniel?
Is that correct?

danieleff
Mon May 08, 2017 5:05 am
victor_pv wrote:victor_pv wrote:
UPDATE:
Seems like I finally found the problem. The DMA channel CCR register was not taking the value because the DMA channel was still enabled.
I thought the channel would go to the disabled state by itself when the transfer completed, but it doesn’t, and as long as it is enabled the CCR won’t take a change to the MINC bit. If I first clear the EN bit in the CCR, then it will take changes. Not sure if it works on the F4 because the HAL takes care of disabling the channel when DMA_Init is called, or because the peripheral takes changes even if enabled; will have to check more on that.
Every other series I checked has these 2 lines in HAL_DMA_Init to disable the channel before making changes to CCR, except the F1 HAL:
/* Disable the peripheral */
__HAL_DMA_DISABLE(hdma);
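
The disable-before-reconfigure sequence described in this update can be illustrated with a host-side model. This is a sketch, not real register access: `FakeDmaChannel` and `setMinc` are hypothetical names, the bit positions follow the F1 DMA_CCRx layout (EN = bit 0, MINC = bit 7), and the "writes are ignored while EN is set" rule simply encodes the behavior reported in the post.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model only. Bit positions follow the F1 DMA_CCRx layout.
constexpr uint32_t kEn   = 1u << 0;  // channel enable
constexpr uint32_t kMinc = 1u << 7;  // memory increment mode

struct FakeDmaChannel {
    uint32_t ccr = 0;
    // Models the behavior reported above: while EN is set, writes cannot
    // change configuration bits; only EN itself can be changed.
    void write(uint32_t value) {
        if (ccr & kEn) {
            ccr = (ccr & ~kEn) | (value & kEn);
        } else {
            ccr = value;
        }
    }
};

// Disable the channel first, then update MINC, then re-enable it.
void setMinc(FakeDmaChannel& ch, bool minc) {
    ch.write(ch.ccr & ~kEn);  // clear EN so the config bits become writable
    uint32_t cfg = minc ? (ch.ccr | kMinc) : (ch.ccr & ~kMinc);
    ch.write(cfg);            // configuration change is now accepted
    ch.write(ch.ccr | kEn);   // re-enable for the next transfer
}
```

In the model, a plain write that tries to clear MINC while the channel is still enabled has no effect, which matches the symptom seen with dmaSend; going through `setMinc` changes the bit as expected.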

Pito
Mon May 08, 2017 9:49 am
While compiling for F407 I get
C:\Users\pito\MyCode\Arduino\hardware\Arduino_STM32SerBuff\STM32DE\libraries\SPI\src\SPI.cpp:234:25: error: 'struct DMA_Stream_TypeDef' has no member named 'CCR'
hdma_spi_tx.Instance->CCR &= ~DMA_CCR_EN;

victor_pv
Mon May 08, 2017 1:29 pm
Pito wrote:While compiling for F407 I get
C:\Users\pito\MyCode\Arduino\hardware\Arduino_STM32SerBuff\STM32DE\libraries\SPI\src\SPI.cpp:234:25: error: 'struct DMA_Stream_TypeDef' has no member named 'CCR'
hdma_spi_tx.Instance->CCR &= ~DMA_CCR_EN;

Pito
Mon May 08, 2017 3:08 pm
I commented out the stuff and compiled for F4, tried at 42MHz, but my Samsung refused to cooperate. Maybe the card is dead after all the experiments.. Have you done any changes with the F4 part which may affect the max speed?
I have not tried with F1 none DMA yet, I have to buy new cards tomorrow :)

victor_pv
Mon May 08, 2017 5:26 pm
Pito wrote:I commented out the stuff and compiled for F4, tried at 42MHz, but my Samsung refused to cooperate. Maybe the card is dead after all the experiments.. Have you done any changes with the F4 part which may affect the max speed?
I have not tried with F1 none DMA yet, I have to buy new cards tomorrow :)

victor_pv
Tue May 09, 2017 4:13 am
I have been running some tests comparing with libmaple, and 512bytes buffer and 16KB buffer. The speeds are very similar.
At 16KB buffer libmaple seems to have a slight edge on reads, and this core seems to have it in writes, but nothing really significant.
Also flash and ram utilization is very similar, with smaller flash used in libmaple and smaller RAM used in this core.

This is all with the same class 6 sd card.
Memory usage:
libmaple:
section size addr
.text 28048 134217728
.ARM.exidx 8 134245776
.data 2632 536870912
.rodata 3056 134248416
.bss 18952 536873544


stevestrong
Tue May 09, 2017 5:12 am
Victor, I assume this test was done for F4, right? Otherwise the RAM (.data and .bss) would go over the 20k limit of F1 chip.
And you used the “standard” SdFat lib, not the SdFatEX, right? Can you also test with the EX version?

Btw, the FLASH/RAM ratio for libmaple F1/F4 could be improved if you would place the PIN_MAP into FLASH, as I did it in my black F4 branch.


ag123
Tue May 09, 2017 7:47 am
Pito wrote:I commented out the stuff and compiled for F4, tried at 42MHz, but my Samsung refused to cooperate. Maybe the card is dead after all the experiments.. Have you done any changes with the F4 part which may affect the max speed?
I have not tried with F1 none DMA yet, I have to buy new cards tomorrow :)

Pito
Tue May 09, 2017 8:23 am
It is recommended to format SD cards with “SD Card Formatter” –
https://www.sdcard.org/downloads/formatter_4/
instead of formatting them with “standard OS” formatters..

ag123
Tue May 09, 2017 9:05 am
OT: strictly speaking, if we present the SD card as USB mass storage to the host, that proprietary sd card formatter s/w may be able to deal with it; unfortunately that’d be another ‘project’, hence the fastest way for now is to just get 1 of those dongles & run the s/w

& for the adventurous it may be possible to brute force it in linux
https://forum.xda-developers.com/showth … p?t=502095

some other fun stuff about sd card, sd cards are mcus
https://www.bunniestudios.com/blog/?p=3554
The crux is that a firmware loading and update mechanism is virtually mandatory, especially for third-party controllers. End users are rarely exposed to this process, since it all happens in the factory, but this doesn’t make the mechanism any less real. In my explorations of the electronics markets in China, I’ve seen shop keepers burning firmware on cards that “expand” the capacity of the card — in other words, they load a firmware that reports the capacity of a card is much larger than the actual available storage. The fact that this is possible at the point of sale means that most likely, the update mechanism is not secured.

In our talk at 30C3, we report our findings exploring a particular microcontroller brand, namely, Appotech and its AX211 and AX215 offerings. We discover a simple “knock” sequence transmitted over manufacturer-reserved commands (namely, CMD63 followed by ‘A’,’P’,’P’,’O’) that drop the controller into a firmware loading mode. At this point, the card will accept the next 512 bytes and run it as code.

http://s2.q4cdn.com/000096926/files/doc … ENERAL.pdf
11.3. CSD REGISTER
The Card-Specific Data register provides information regarding access to the card contents. The CSD defines the data format, error correction type, maximum data access time, whether the DSR register can be used, etc.

C_SIZE
This field is expanded to 22 bits and can indicate up to 2 T Bytes (It is the same as the maximum memory space specified by a 32-bit block address.)
This parameter is used to calculate the user data area capacity in the SD memory card (not included the protected area). The user data area capacity is calculated from C_SIZE as follows Memory capacity = (C_SIZE+1) * 512K byte. As the maximum capacity of the Physical Layer Specification Version 2.00 is 32 GB, the upper 6 bits of this field shall be set to 0.
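
The capacity formula quoted above can be checked with a few lines. This is a minimal sketch with a hypothetical function name; it takes "512K byte" to mean 512 × 1024 bytes, which is what makes the 22-bit maximum come out at 2 TB.

```cpp
#include <cassert>
#include <cstdint>

// CSD version 2.0: user capacity = (C_SIZE + 1) * 512 KiB.
// C_SIZE is a 22-bit field, so larger values are rejected.
uint64_t csdV2CapacityBytes(uint32_t cSize) {
    assert(cSize < (1u << 22));
    return (static_cast<uint64_t>(cSize) + 1u) * 512u * 1024u;
}
```

With all 22 bits set the result is 2^41 bytes (2 TB). With the upper 6 bits zero, as required for cards up to 32 GB, the largest value is 0xFFFF, which gives 2^35 bytes (32 GB).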

C_SIZE seem to imply that the super secret “protected area” is a *hidden partition* i.e. not visible even through normal register hardware access.
if true we could possibly just *format* that card (with any tools) and try again, since it won’t touch that “protected area” anyway
:lol:

Don’t Format SD cards with OS utilities – on arduino forum:
https://forum.arduino.cc/index.php?topic=228201.0


michael_l
Tue May 09, 2017 10:32 am
Would it be possible to create a test program for one board using SPI1 (master) and SPI2 (slave) to verify SPI and DMA functionality. Test cases could be simply sending bytes and reading them from SPI2.

danieleff
Tue May 09, 2017 10:45 am
michael_l wrote:Would it be possible to create a test program for one board using SPI1 (master) and SPI2 (slave) to verify SPI and DMA functionality. Test cases could be simply sending bytes and reading them from SPI2.

ag123
Tue May 09, 2017 10:49 am
i’m not too sure if a rather ‘fun’ way to test it is to connect MISO to MOSI, if i understand SPI correctly, SPI.transfer(0xaa) would be echoed as the return value, this probably isn’t a very good test :lol:

victor_pv
Tue May 09, 2017 12:38 pm
stevestrong wrote:Victor, I assume this test was done for F4, right? Otherwise the RAM (.data and .bss) would go over the 20k limit of F1 chip.
And you used the “standard” SdFat lib, not the SdFatEX, right? Can you also test with the EX version?

Btw, the FLASH/RAM ratio for libmaple F1/F4 could be improved if you would place the PIN_MAP into FLASH, as I did it in my black F4 branch.


stevestrong
Tue May 09, 2017 12:43 pm
victor_pv wrote:On libmaple core for F1 we place the PIN_MAP in flash already.

michael_l
Tue May 09, 2017 12:45 pm
danieleff wrote:michael_l wrote:Would it be possible to create a test program for one board using SPI1 (master) and SPI2 (slave) to verify SPI and DMA functionality. Test cases could be simply sending bytes and reading them from SPI2.

victor_pv
Tue May 09, 2017 1:09 pm
stevestrong wrote:victor_pv wrote:On libmaple core for F1 we place the PIN_MAP in flash already.

stevestrong
Tue May 09, 2017 2:36 pm
Victor, I cannot identify the corresponding commit to put PIN_MAP into flash from the list for master branch: https://github.com/rogerclarkmelbourne/ … its/master
You are most probably using either the “PIN_MAP_in_flash” or the “PIN_MAP_in_flash_take_2” branch.

Yeah, as you probably noticed, I am not giving up the libmaple core, my “first love” :P
I have already adapted the DMA channels from F1 to streams for F4, it was not very difficult, 2 hours of work.
The DMA is not yet properly working with SPI, I have to debug it. But I am confident that it will work.
Meanwhile I am optimizing the multiple (non-DMA) read routine for SPI to get a (mostly) continuous clock while reading the bytes, similar to the write routine. I made a similar attempt a couple of months ago and it was not working, but now I found a way to make it work correctly.
The WR performance of the non-DMA version, from what I checked, seems to be very close to the DMA version of the generic core @21 MHz, according to Pito’s measurements.


victor_pv
Tue May 09, 2017 4:58 pm
stevestrong wrote:Victor, I cannot identify the corresponding commit to put PIN_MAP into flash from the list for master branch: https://github.com/rogerclarkmelbourne/ … its/master
You are most probably using either the “PIN_MAP_in_flash” or the “PIN_MAP_in_flash_take_2” branch.

Yeah, as you probably noticed, I am not giving up the libmaple core, my “first love” :P
I have already adapted the DMA channels from F1 to streams for F4, it was not very difficult, 2 hours of work.
The DMA is not yet properly working with SPI, I have to debug it. But I am confident that it will work.
Meanwhile I am optimizing the multiple (non-DMA) read routine for SPI to get a (mostly) continuous clock while reading the bytes, similar to the write routine. I made a similar attempt a couple of months ago and it was not working, but now I found a way to make it work correctly.
The WR performance of the non-DMA version, from what I checked, seems to be very close to the DMA version of the generic core @21 MHz, according to Pito’s measurements.


stevestrong
Tue May 09, 2017 5:24 pm
Victor, your analysis was correct, at least identical with mine.
The most critical part is to read the Rx(n) data before Tx(n+1) is written.
And the solution is indeed to disable the interrupts between writing byte(n) and reading byte(n-1). Then interrupts can again be enabled.
The maximum reserved time is up to 8 SPI clock periods.
Here is how it looks in C code:
void SPIClass::read(uint8 *buf, uint32 len)
{
    if ( len == 0 ) return;
    spi_rx_reg(_currentSetting->spi_d); // clear the RX buffer in case a byte is waiting on it.
    spi_reg_map * regs = _currentSetting->spi_d->regs;
    // start sequence: write byte 0
    regs->DR = 0x00FF; // write the first byte
    // main loop
    while ( (--len) ) {
        while( !(regs->SR & SPI_SR_TXE) ); // wait for TXE flag
        noInterrupts(); // go atomic level - avoid interrupts to surely get the previously received data
        regs->DR = 0x00FF; // write the next data item to be transmitted into the SPI_DR register. This clears the TXE flag.
        while ( !(regs->SR & SPI_SR_RXNE) ); // wait till data is available in the DR register
        *buf++ = (uint8)(regs->DR); // read and store the received byte. This clears the RXNE flag.
        interrupts(); // let systick do its job
    }
    // read remaining last byte
    while ( !(regs->SR & SPI_SR_RXNE) ); // wait till data is available in the Rx register
    *buf = (uint8)(regs->DR); // read and store the last received byte
}

victor_pv
Tue May 09, 2017 5:33 pm
stevestrong wrote:Victor, I cannot identify the corresponding commit to put PIN_MAP into flash from the list for master branch: https://github.com/rogerclarkmelbourne/ … its/master
You are most probably using either the “PIN_MAP_in_flash” or the “PIN_MAP_in_flash_take_2” branch.


stevestrong
Tue May 09, 2017 5:38 pm
Yes, indeed, you’re right, now I realize that I was checking the old F4 MAP file instead of the F1 one.
Anyway, I added the same feature to my black F4 branch, too.

victor_pv
Tue May 09, 2017 5:50 pm
Steve, great minds think alike :D
But I suspect it may not be enough like that.
What happens if an interrupt is triggered between this line
regs->DR = 0x00FF; // write the first byte
and
noInterrupts(); // go atomic level – avoid interrupts to surely get the previously received data

The ISR will get called, count 13 cycles minimum, and meanwhile the first byte is completely shifted out and the incoming byte goes to DR. Then the ISR returns, interrupts are disabled, and you write a new byte to DR, which starts shifting out quickly since the previous one has completed (I think it takes 1 cycle to move from the DR fifo to the SPI shift register). Then you have to execute all of this before it finishes shifting out that second byte:
while ( !(regs->SR & SPI_SR_RXNE) ); // wait till data is available in the DR register
*buf++ = (uint8)(regs->DR);

If those 2 lines can’t complete in under 8 cycles, we lose a byte (but wouldn’t it rather be closer to 16 cycles max, since the SPI runs at 36 MHz max?).
If it’s 16 cycles, then it may be enough. If I’m wrong and it’s 8, then you may need to check the generated code to make sure it doesn’t turn into too many instructions, and that may even be out of our control if people change the optimization flags…

BTW I am talking cpu cycles, not SPI cycles.

EDIT: I see you were talking about 8 SPI cycles, so we are talking about the same time at max speed: 8 SPI cycles or 16 CPU cycles on an F1. With that I can’t see any problem unless the compiler does something crazy with those 2 lines.
The F1 should benefit from the same changes. Was your test above on F1 or F4?
In the transfer function you should be able to do the same to optimize speed, as long as we are sending more than 1 byte. Did you test optimizing that one?
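
To make the cycle arithmetic above explicit, here is a small sketch of the budget calculation. It assumes 8-bit frames and that the time to shift out one frame is the only deadline; `cpuCyclesPerFrame` is a hypothetical helper name, not something from libmaple or the HAL.

```cpp
#include <cstdint>

// Number of CPU cycles that elapse while one SPI frame is shifted out.
// This is the window in which the previously received byte must be read
// from DR before the next one overwrites it.
constexpr uint32_t cpuCyclesPerFrame(uint32_t cpuHz, uint32_t spiHz,
                                     uint32_t bitsPerFrame = 8) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(cpuHz) * bitsPerFrame) / spiHz);
}

// F1 at 72 MHz with SPI1 at 36 MHz: 8 SPI clocks correspond to 16 CPU cycles.
static_assert(cpuCyclesPerFrame(72000000u, 36000000u) == 16,
              "F1 at max SPI speed leaves 16 CPU cycles per byte");
```

By the same formula an F4 at 168 MHz with SPI at 42 MHz would get 32 CPU cycles per byte, so the F1 at DIV/2 is the tightest case discussed here.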


stevestrong
Tue May 09, 2017 6:02 pm
Victor, I had the same concern.
But here I assumed that the CPU is able to execute those couple of instructions, being that CPU clock is always a multiple of SPI clock.
I am investigating the number of needed instructions with the debugger, will come back soon to this.

victor_pv
Tue May 09, 2017 6:13 pm
stevestrong wrote:Victor, I had the same concern.
But here I assumed that the CPU is able to execute those couple of instructions, being that CPU clock is always a multiple of SPI clock.
I am investigating the number of needed instructions with the debugger, will come back soon to this.

Pito
Tue May 09, 2017 6:16 pm
You can do 16bit transfers with SPI as well (not sure with 32bit ones, pic32mx can do).
BTW the best check on SPI with Sdfat will be under FreeRtos :)

stevestrong
Tue May 09, 2017 6:52 pm
This is the assembly code of the new block read function.
SPIClass::read(unsigned char*, unsigned int):
080068be: 0x000030b5 push {r4, r5, lr}
336 if ( len == 0 ) return;
080068c0: 0x0000f2b1 cbz r2, 0x8006900 <SPIClass::read(unsigned char*, unsigned int)+66>
337 spi_rx_reg(_currentSetting->spi_d); // clear the RX buffer in case a byte is waiting on it.
080068c2: 0x0000036e ldr r3, [r0, #96] ; 0x60
080068c4: 0x0000db68 ldr r3, [r3, #12]
080068c6: 0x00001b68 ldr r3, [r3, #0]
080068c8: 0x0000d868 ldr r0, [r3, #12]
341 regs->DR = 0x00FF; // write the first byte
080068ca: 0x0000ff20 movs r0, #255 ; 0xff
080068cc: 0x0000d860 str r0, [r3, #12]
080068ce: 0x00008c18 adds r4, r1, r2
080068d0: 0x0000481c adds r0, r1, #1
343 while ( (--len) ) {
080068d2: 0x0000a042 cmp r0, r4
080068d4: 0x00000ed0 beq.n 0x80068f4 <SPIClass::read(unsigned char*, unsigned int)+54>
344 while( !(regs->SR & SPI_SR_TXE) ); // wait for TXE flag
080068d6: 0x00009d68 ldr r5, [r3, #8]
080068d8: 0x0000ad07 lsls r5, r5, #30
080068da: 0x0000fcd5 bpl.n 0x80068d6 <SPIClass::read(unsigned char*, unsigned int)+24>
080068dc: 0x000072b6 cpsid i
346 regs->DR = 0x00FF; // write the next data item to be transmitted into the SPI_DR register. This clears the TXE flag.
080068de: 0x0000ff25 movs r5, #255 ; 0xff
080068e0: 0x0000dd60 str r5, [r3, #12]
347 while ( !(regs->SR & SPI_SR_RXNE) ); // wait till data is available in the DR register
080068e2: 0x00009d68 ldr r5, [r3, #8]
080068e4: 0x0000ed07 lsls r5, r5, #31
080068e6: 0x0000fcd5 bpl.n 0x80068e2 <SPIClass::read(unsigned char*, unsigned int)+36>
348 *buf++ = (uint8)(regs->DR); // read and store the received byte. This clears the RXNE flag.
080068e8: 0x0000dd68 ldr r5, [r3, #12]
080068ea: 0x00f8015c strb.w r5, [r0, #-1]
080068ee: 0x000062b6 cpsie i
080068f0: 0x00000130 adds r0, #1
080068f2: 0x0000eee7 b.n 0x80068d2 <SPIClass::read(unsigned char*, unsigned int)+20>
080068f4: 0x0000013a subs r2, #1
352 while ( !(regs->SR & SPI_SR_RXNE) ); // wait till data is available in the Rx register
080068f6: 0x00009868 ldr r0, [r3, #8]
080068f8: 0x0000c007 lsls r0, r0, #31
080068fa: 0x0000fcd5 bpl.n 0x80068f6 <SPIClass::read(unsigned char*, unsigned int)+56>
353 *buf++ = (uint8)(regs->DR); // read and store the received byte
080068fc: 0x0000db68 ldr r3, [r3, #12]
080068fe: 0x00008b54 strb r3, [r1, r2]
08006900: 0x000030bd pop {r4, r5, pc}

victor_pv
Tue May 09, 2017 7:48 pm
stevestrong wrote:This is the assembly code of the new block read function.

ag123
Tue May 09, 2017 8:44 pm
wow, these things are pretty complicated while i try to figure them out against the clock diagrams in the rm, but i’ve figured out 1 little thing: interrupts are disabled while waiting for the next incoming byte; once that arrives, read it, enable interrupts, check write availability, ok to write, disable interrupts, write byte, wait for the received byte again. the fun part is that there would seem to be hardly any chance for an interrupt to occur while the write, wait, read loop is running

i’d guess it isn’t quite possible to ‘have the cake & eat it’, but it shows meddling with bare metal isn’t easy, with lots of compromises, e.g. while the cpu is busy, keystrokes and usb serial nuisances won’t be attended to until it is done :lol:

either way 4 MB (32 mbps) per sec is excellent for SPI transfers on a single pin. multi tasking can still be achieved at a higher level by yielding say every 4k bytes so that other tasks have a chance to run; at these speeds reading/writing 4k bytes takes a mere 1 ms! either way on f1, it would need to yield at 512 bytes due to lack of ram :mrgreen:

it would seem to me that to work without interrupt suppression, the protocol above this spi hardware layer would need to have some error correction mechanism, e.g. if a byte is missed during reads due to an interrupt, a checksum needs to catch that so that the master/host can request the same packet to be re-transmitted. if that is the case it may be possible to do without masking interrupts, but it won’t be easy to figure those complicated interactions out as well
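
As an illustration of the checksum idea, here is a minimal sketch of a bitwise CRC-16 with the CCITT polynomial 0x1021 and initial value 0 (the variant often called CRC-16/XMODEM). As far as I know this is the same polynomial SD cards use for data blocks, but treat that as an assumption and check the spec; `crc16Ccitt` is a hypothetical name, not a function from SdFat or any core.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-16, polynomial x^16 + x^12 + x^5 + 1 (0x1021), initial value 0,
// no bit reflection, no final XOR. Slow but small; a table-driven version
// would be the usual choice when speed matters.
uint16_t crc16Ccitt(const uint8_t* data, std::size_t len) {
    uint16_t crc = 0x0000;
    for (std::size_t i = 0; i < len; ++i) {
        // Fold the next byte into the top of the CRC register.
        crc = static_cast<uint16_t>(crc ^ (data[i] << 8));
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 0x8000) {
                crc = static_cast<uint16_t>((crc << 1) ^ 0x1021);
            } else {
                crc = static_cast<uint16_t>(crc << 1);
            }
        }
    }
    return crc;
}
```

The receiver would compute this over the block it read and compare it with the CRC sent by the other side; on a mismatch it asks for the block again instead of silently keeping corrupted data.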


victor_pv
Tue May 09, 2017 9:35 pm
ag123 wrote:wow, these things are pretty complicated while i try to figure them out against the clock diagrams in the rm, but i’ve figured out 1 little thing: interrupts are disabled while waiting for the next incoming byte; once that arrives, read it, enable interrupts, check write availability, ok to write, disable interrupts, write byte, wait for the received byte again. the fun part is that there would seem to be hardly any chance for an interrupt to occur while the write, wait, read loop is running

i’d guess it isn’t quite possible to ‘have the cake & eat it’, but it shows meddling with bare metal isn’t easy, with lots of compromises, e.g. while the cpu is busy, keystrokes and usb serial nuisances won’t be attended to until it is done :lol:

either way 4 MB (32 mbps) per sec is excellent for SPI transfers on a single pin. multi tasking can still be achieved at a higher level by yielding say every 4k bytes so that other tasks have a chance to run; at these speeds reading/writing 4k bytes takes a mere 1 ms! either way on f1, it would need to yield at 512 bytes due to lack of ram :mrgreen:


victor_pv
Wed May 10, 2017 2:40 am
Wow, can’t believe the night and day difference when using SdFatEX!!
Same class 6 card: with a 512-byte buffer it gets 3 MB/s, and using a 16 KB buffer barely improves it any further, so it looks like that’s the limit for that card.
STM32Generic core, F103RFT MCU, 72 MHz, DMA, DIV/2, SdFatEX, 512-byte buffer:
FreeStack: 92664
Type is FAT32
Card size: 4.08 GB (GB = 1E9 bytes)

Manufacturer ID: 0X3
OEM ID: SD
Product: SD04G
Version: 8.0
Serial number: 0X5700E101
Manufacturing date: 8/2008

File size 5 MB
Buffer size 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3071.06,154124,141,164
3391.91,24856,141,148
3056.04,153749,141,165

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3054.17,1200,165,166
3054.17,1201,165,166
3054.17,1202,165,166

Done
Type any character to start


ag123
Wed May 10, 2017 7:22 am
this is probably *much* faster than those cheapo usb card readers, & btw usb full speed is a mere 12 mbps. in our case on spi itself we are doing 24 mbps (3 MB/s), so usb full speed would be the bottleneck :mrgreen:
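
for anyone who wants to redo that conversion, a small sketch follows. it assumes the bench’s “KB” means 1000 bytes (an assumption, not checked against the bench source), and `kbPerSecToMbps` / `busUtilization` are hypothetical helper names.

```cpp
#include <cstdint>

// Convert a bench figure in KB/s to Mbit/s, assuming 1 KB = 1000 bytes.
constexpr double kbPerSecToMbps(double kbPerSec) {
    return kbPerSec * 1000.0 * 8.0 / 1e6;
}

// Fraction of the raw SPI clock that ends up as payload throughput.
// One payload bit per SPI clock would give 1.0.
constexpr double busUtilization(double kbPerSec, double spiClockHz) {
    return kbPerSec * 1000.0 * 8.0 / spiClockHz;
}
```

with the read speed of about 3054 KB/s from the bench above and SPI at 36 MHz, this works out to a bit over 24 mbps, or roughly two thirds of the raw clock; the rest goes to command overhead, card latency and gaps between blocks.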

victor_pv
Wed May 10, 2017 8:56 pm
After the short discussion here with Steve on how to optimize the transfer loop to keep the SPI fed without losing data, I went to check how optimal the HAL is.
Besides a bunch of overhead when starting the transfer, the rest of the loop is exactly the same: send 2 bytes, then read and send one by one, and read a last one.
But with 1 major difference: it doesn’t disable interrupts at any time! At least not in the F1 and F4 HALs.

I think that may explain the corruption Pito was getting when running tests without DMA, since interrupts are enabled and ongoing for systick, usb…


stevestrong
Wed May 10, 2017 9:47 pm
I think, too, that the interrupt disable/enable sequence is crucial, otherwise data can be corrupted.

EDIT
For comparison, the libmaple core SD bench on blue pill for my Sandisk ultra CL10 (red/grey) card, F103 @ 72MHz, SPI1 @ 36MHz, with DMA:
File size 5 MB
Buffer size 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3700.73,9752,131,136
3703.47,16335,131,136

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3382.73,989,149,150
3385.02,847,149,150


danieleff
Thu May 11, 2017 6:35 am
I was looking at the SPI APIs.
Here is a compilation of implementations in various cores: https://gist.github.com/danieleff/64e51 … 6ab1167ad7
(The current discussion on Arduino dev forums: https://groups.google.com/a/bcmi-labs.c … 12I7sB_Elk )

The libmaple looks just… weird?


victor_pv
Thu May 11, 2017 1:35 pm
danieleff wrote:I was looking at the SPI APIs.
Here is a compilation of implementations in various cores: https://gist.github.com/danieleff/64e51 … 6ab1167ad7
(The current discussion on Arduino dev forums: https://groups.google.com/a/bcmi-labs.c … 12I7sB_Elk )

The libmaple looks just… weird?


ag123
Thu May 11, 2017 2:19 pm
oh i’ve ignorantly just used SPI.transfer(0xaa) and SPI.transfer16(0xaaaa), they just seemed to work ‘out of the box’ :mrgreen:
didn’t figure out there is quite a bit of SPI ‘history’ with various interfaces

Pito
Sat May 13, 2017 10:17 am
With the latest Victor’s stm32generic SPI I cannot get more than 800kB/s with any speed settings in bench with F407..

victor_pv
Sat May 13, 2017 2:26 pm
Pito wrote:With the latest Victor’s stm32generic SPI I cannot get more than 800kB/s with any speed settings in bench with F407..

Pito
Sat May 13, 2017 2:39 pm
Yes, I’ve done my best today with F407 but the max was ~800kB/s for SPI speeds from 10-51 in bench, with 3 cards, SPI, SdFatEX, DMA :?
I will double check with the same SPI lib (latest) with F1 now..
Blue F103ZET, new Samsung EVO 16GB CL10, 36MHz SPI speed, SdFatEX, DMA:
File size 5 MB
Buffer size 512 bytes
Starting write test, please wait.
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3431.49,16295,140,146
3472.00,8603,140,145
3438.57,20800,140,147
Starting read test, please wait.
read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3176.42,1878,158,159
3180.46,1414,158,159
3180.46,1417,158,159

ag123
Sat May 13, 2017 3:39 pm
maybe the aggressive *no_interrupts()* prevented data corruption, but leaves a big queue by sd fat waiting to ‘do its work’ :lol:

victor_pv
Sat May 13, 2017 6:29 pm
We don’t disable interrupts in the stm32generic version, only in the libmaple cores, and only Steve has implemented it so far (not yet in the core), so that would not be related.

I have several times caught it with the debugger hanging while waiting for TXE to be set, with the flag staying at 0. I still don’t have an explanation for that. And I have seen that on the F1 only, as far as I remember.
As long as the SPI peripheral is enabled, after a few cycles TXE has to become 1 when the current data is completely shifted out, but a few times that doesn’t happen. It may have to do with having the debugger connected and poking around data and registers though. I’m not sure, but I just can’t understand why that would happen.

Other than that, the 20cm wires may be just at the limit of working or not: you get a bit more noise in the line at 42 MHz and the SPI port or the SD card misses the edges…
I have done a few cosmetic corrections, but nothing major, and just pushed those updates to my repo. I haven’t sent a new PR to Daniel until we confirm the library itself is reliable. I have to wire an sdcard socket to the F4 board and run a few tests.

I think it would be good to test the library with displays too, as they will make it very apparent if something is corrupted on the screen. Probably more so than the bench test, since I think it doesn’t check every byte on reads, but only the last 2 from each block.
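
One way to keep a stuck TXE from hanging the whole sketch while this is being debugged would be a bounded wait instead of an endless while loop. This is only a sketch of the idea: `waitForFlag` and `kTxeMask` are hypothetical names, the spin count is arbitrary, and none of this is in the library.

```cpp
#include <cstdint>

// Hypothetical mask for the TXE bit (bit 1 of SPI_SR on these parts).
constexpr uint32_t kTxeMask = 1u << 1;

// Spin until any bit in `mask` is set in `reg`, or give up after `maxSpins`
// polls. Returns true if the flag was seen, false on timeout.
inline bool waitForFlag(const volatile uint32_t& reg, uint32_t mask,
                        uint32_t maxSpins) {
    while (maxSpins--) {
        if (reg & mask) {
            return true;
        }
    }
    return false;
}
```

On hardware the first argument would be the SPI status register, and a false return could be logged or turned into an error code, which would at least tell us how often TXE really fails to come up without a debugger attached.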


Pito
Sat May 13, 2017 7:14 pm
I have the Jlink permanently connected to the F407 – I do flash via the jlink.. Maybe it interfered somehow this morning.. I have to shorten the wires too..
Anyhow, we need more testers for SPI and SDIO stuff to be involved :)

danieleff
Sun May 14, 2017 4:24 am
victor_pv wrote:danieleff wrote:I was looking at the SPI APIs.
Here is a compilation of implementations in various cores: https://gist.github.com/danieleff/64e51 … 6ab1167ad7
(The current discussion on Arduino dev forums: https://groups.google.com/a/bcmi-labs.c … 12I7sB_Elk )

The libmaple looks just… weird?


stevestrong
Sun May 14, 2017 8:44 am
danieleff wrote:Voila, now you can send the same byte X times (fill ili9341 screen).

danieleff
Sun May 14, 2017 10:50 am
Yes I included it in the comparison page. I was mainly talking about that Thomas L4 SPI API can be extended for it just by calling with NULL, NULL to send the same byte, without introducing new methods. Which is not true because it needs new methods so lets leave that for now.

victor_pv
Sun May 14, 2017 4:18 pm
danieleff wrote:Yes I included it in the comparison page. I was mainly talking about that Thomas L4 SPI API can be extended for it just by calling with NULL, NULL to send the same byte, without introducing new methods. Which is not true because it needs new methods so lets leave that for now.

danieleff
Fri Jun 16, 2017 4:58 am
Added transfer[16](OUT, IN, count[, callback]). OUT can be uint[8|16]_t / *buffer / NULL; IN can be *buffer / NULL; callback can be nothing/NULL/funcptr. This should cover all use cases.

Added an example video+audio player. Plays at 27 FPS on F407, ILI9341, SPI+I2S nonblocking DMA, SDIO blocking DMA. Example output without sound cause my crappy phone cant record sound: http://danieleff.com/stm32/test/output_no_sound.mp4

(Victor I integrated most of your changes, but DMA is now centralized, please use master for further development)


Pito
Fri Jun 16, 2017 6:40 am
There is a library called Adafruit_ILI9341_STM_SPI2 I used to use with MM/BP.
This library was modified to support DMA in Maple Mini by Victor in 2015.
Would this lib work with Generic? (Not tested yet).
