Non blocking SPI DMA - Added callback to the SPI DMA functions (dmaSend, dmaTransfer...)

victor_pv

Tue Feb 21, 2017 8:05 pm

UPDATE:
Working code is in this post, and the post after it explains how to modify sdFat to use it:
http://www.stm32duino.com/viewtopic.php … =60#p30306

======================================================

This is something Roger has suggested several times, and it seems to only make sense.

Currently the dma TX/RX functions we added to the SPI library block until the DMA transfer is completed or has timed out.
This change would add 2 functions to the SPI library, similar to the Arduino official I2S library:

onTransmit(handler); onReceive(handler);

RogerClark

Tue Feb 21, 2017 8:34 pm

Victor

It’s probably worth PM’ing @stevestrong if he does not see this thread, as he has done a lot of work on SPI recently.

I think I have one pending PR from stev, but it seems to slow the SPI down, hence I have not actioned it yet. But it may be necessary as it contains bug fixes.

I think there is a general problem with callbacks into C++ classes, as only the addresses of static functions can be used.
So it would need to be a shared ISR for all instances.

Other APIs I work with let you pass a pointer to the callback function into the transfer function, which seems logical to me.
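To make the static-function point concrete, here is a minimal sketch of one common workaround: a static entry point that looks up the instance by port index and then calls the handler stored in it. `SpiPort`, `dmaTxIsr`, `countTxDone` and `txDoneCount` are hypothetical names for illustration only, not the real SPIClass; only `onTransmit` follows the name proposed in the first post.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for an SPI class; not the real SPIClass.
class SpiPort {
public:
    using Callback = void (*)();  // plain function pointer, usable from an ISR

    explicit SpiPort(std::size_t index) : index_(index) {
        if (index_ < instances_.size()) instances_[index_] = this;
    }
    ~SpiPort() {
        if (index_ < instances_.size() && instances_[index_] == this) {
            instances_[index_] = nullptr;
        }
    }

    // Register the user's transmit-complete handler.
    void onTransmit(Callback handler) { txHandler_ = handler; }

    // Static entry point a DMA ISR could call. It has no `this`,
    // so it finds the instance through the port index.
    static void dmaTxIsr(std::size_t index) {
        if (index >= instances_.size()) return;
        SpiPort* port = instances_[index];
        if (port != nullptr && port->txHandler_ != nullptr) port->txHandler_();
    }

private:
    std::size_t index_;
    Callback txHandler_ = nullptr;
    static std::array<SpiPort*, 2> instances_;  // one slot per SPI port
};

std::array<SpiPort*, 2> SpiPort::instances_ = {{nullptr, nullptr}};

// Example handler, only here to show the call path.
int txDoneCount = 0;
void countTxDone() { ++txDoneCount; }
```

With one slot per port this gives one ISR entry per SPI channel rather than one shared by every instance.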

victor_pv

Tue Feb 21, 2017 8:50 pm


RogerClark

Tue Feb 21, 2017 8:56 pm

Hi victor

I have not tested @stevestrong’s latest PR, but the previous one was slower than what we had before.

Re: one ISR per SPI channel

Sounds OK to me.

thanks

roger

stevestrong

Tue Feb 21, 2017 9:14 pm

While trying to optimize the ILI9486 lib I already implemented non-blocking DMA with a callback at job end. This included a short ISR at job end which, from within the ISR, called the callback function if the user had set one.
Unfortunately the DMA slows down the CPU very strongly, so in the end, including the overhead to init the DMA, there was no time saving compared to the blocking non-DMA version. I tested it with the Adafruit graphics test. You can follow the results in that thread.
My version was designed to reserve the respective DMA channel (I think channel 3) only for SPI. Still, I was not happy with the result.

victor_pv

Tue Feb 21, 2017 9:32 pm


stevestrong

Tue Feb 21, 2017 10:00 pm

Sure, it depends on the application.
The display lib was also partially writing larger blocks, but no overall speed gain was achieved.
Once again, the CPU is slowed down strongly by the DMA running in the background.
Don’t get me wrong, I see the theoretical benefit, which is why I tested it. Still, the results did not convince me.

As I don’t have any other application where saved time would play a larger role than in the display lib, I have given up on pushing it into the repo.
But feel free to get it running. If you think it could help, I could share my local version.

victor_pv

Wed Feb 22, 2017 1:02 am


stevestrong

Wed Feb 22, 2017 9:50 am

Please check my comments posted here: https://github.com/rogerclarkmelbourne/ … -277516813

racemaniac

Wed Feb 22, 2017 10:20 am

stevestrong

Wed Feb 22, 2017 11:39 am

racemaniac wrote: and how can DMA not give a speedup even if the CPU slows down? Your transfer will take equally long as with a blocking transfer, and even if the CPU slows down, at least it can do some work during the transfer.
You forgot the overhead of setting up the DMA before each transaction. And if you implement the callback at job end, this will also take time and completely block the CPU from doing other tasks.
Thus, depending on the SPI clock speed, the overhead together with the post-processing can take as long as transferring, let’s say, 25 bytes.
So transferring 20 bytes without DMA is faster than transferring them with DMA.

Hence, again, the appropriate strategy strongly depends on the application.
If you always write blocks of 256 bytes or more and have a lot of tasks to do between consecutive block writes (not only waiting for the previous SPI job to finish), then using DMA is clearly a good approach. Otherwise it can be slower than the non-DMA version.
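The break-even argument above can be put into numbers with a small sketch. It converts a fixed overhead (DMA setup plus completion handling) into the number of bytes that could have gone out on the wire in the same time. The function names are hypothetical, and any overhead figure fed in is an assumption to be replaced with a measured value; nothing here was measured on hardware.

```cpp
#include <cstdint>

// Time on the wire for `bytes` bytes at `spiClockHz`, in nanoseconds.
// Assumes 8 clocks per byte and no gaps between bytes.
constexpr std::uint64_t wireTimeNs(std::uint32_t bytes, std::uint32_t spiClockHz) {
    return (static_cast<std::uint64_t>(bytes) * 8u * 1000000000ull) / spiClockHz;
}

// Number of whole bytes that could have been clocked out during
// `overheadNs` nanoseconds of setup and post-processing.
constexpr std::uint32_t overheadInBytes(std::uint32_t overheadNs, std::uint32_t spiClockHz) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(overheadNs) * spiClockHz) / (8ull * 1000000000ull));
}
```

For example, an assumed 12 µs of overhead corresponds to 27 bytes at an 18 MHz SPI clock and 54 bytes at 36 MHz, so the faster the clock, the larger a block has to be before DMA pays off.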

racemaniac

Wed Feb 22, 2017 12:06 pm


stevestrong

Wed Feb 22, 2017 3:06 pm

Extract from RM0008:
13.3 DMA functional description
The DMA controller performs direct memory transfer by sharing the system bus with the Cortex®-M3 core. The DMA request may stop the CPU access to the system bus for some bus cycles, when the CPU and DMA are targeting the same destination (memory or peripheral). The bus matrix implements round-robin scheduling, thus ensuring at least half of the system bus bandwidth (both to memory and peripheral) for the CPU.

So the CPU may be slowed down to half of its speed capacity. The higher the SPI clock, the worse the situation for the CPU if it performs a lot of memory accesses.
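As a rough way to see why a higher SPI clock makes this worse, the sketch below computes how many CPU clock cycles pass per SPI byte, which is also the interval between DMA requests for one direction. The function name is hypothetical, and the calculation ignores how many bus cycles each DMA access actually occupies, so it only shows the trend.

```cpp
#include <cstdint>

// CPU clock cycles that elapse while one byte (8 bits) is shifted out.
// One DMA request per direction is raised in this interval.
constexpr std::uint32_t cpuCyclesPerSpiByte(std::uint32_t cpuHz, std::uint32_t spiHz) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(cpuHz) * 8u) / spiHz);
}
```

At 72 MHz CPU and 36 MHz SPI this gives 16 cycles per byte, and 32 cycles at 18 MHz. A full-duplex transfer raises two requests (Tx and Rx) in each of those intervals.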

racemaniac

Wed Feb 22, 2017 3:28 pm


victor_pv

Wed Feb 22, 2017 4:17 pm


victor_pv

Wed Feb 22, 2017 4:31 pm


racemaniac

Wed Feb 22, 2017 7:28 pm


victor_pv

Wed Feb 22, 2017 7:51 pm

racemaniac wrote: indeed. It would still be nice to be able to configure that you won’t reuse the DMA channel for something else and get the cheaper DMA setup (likely the most common use case, since so far DMA is hardly used anyway ^^)

And I’m still wondering about the performance hit for whatever is running simultaneously with it. I’m really going to give that a try when I have some time, it’s very good to know.

Any suggestions on what to benchmark it with? Just a Dhrystone benchmark or so?

victor_pv

Thu Feb 23, 2017 2:51 am


racemaniac

Thu Feb 23, 2017 8:31 am


stevestrong

Thu Feb 23, 2017 8:36 am

In my code I initialized the DMA once (dmaSendInit()), then I just updated the number of data to transfer and the pointer in dmaSendBuffer() and enabled the DMA, so that the overhead could be kept minimal.
I will post my version here in the evening.

racemaniac

Thu Feb 23, 2017 8:51 am


victor_pv

Thu Feb 23, 2017 3:47 pm


racemaniac

Thu Feb 23, 2017 3:55 pm

ugh, that eternal trade off between performance and usability of a framework ^^’
there indeed is no silver bullet or “best” solution >_<. i’d still like the cheap setup functions for when using DMA directly, but for usage in frameworks, the full setup is indeed the safer choice.

stevestrong

Thu Feb 23, 2017 6:09 pm

I think we could reserve DMA channels 2 to 5 for SPI1 and SPI2 (both Rx and Tx), user configurable. The other channels would then always be free for other purposes.
I am attaching my local version of SPI.cpp, where I started to implement a dual-buffered DMA transfer: while one buffer is being sent, the other buffer gets filled. When the first has been sent, the ISR checks whether the other buffer contains data to send. If it does, the DMA is set up again automatically. That is the plan so far, but the ISR part is not yet coded.
The interesting functions start at line 479, conditioned by the SPI_USE_DMA define.
Hopefully you can get some ideas from it.
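The dual-buffer plan described above could look roughly like the sketch below. It is not Steve's attached SPI.cpp; `DoubleBufferSender` and all its members are hypothetical, and the place where real code would program the DMA channel is only marked by a comment. Shared state between the main code and the ISR is not protected here, which real code would need to handle.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical ping-pong sender: one buffer is "on the wire"
// while the other one is being filled.
class DoubleBufferSender {
public:
    static constexpr std::size_t kCapacity = 64;

    // Append one byte to the buffer currently being filled.
    bool write(std::uint8_t value) {
        Buffer& fill = buffers_[fillIndex_];
        if (fill.length == kCapacity) return false;  // fill buffer is full
        fill.data[fill.length++] = value;
        return true;
    }

    // Start sending the filled buffer if no transfer is running.
    bool flush() {
        if (busy_ || buffers_[fillIndex_].length == 0) return false;
        startTransfer();
        return true;
    }

    // Would be called from the DMA transfer-complete ISR.
    void onTransferComplete() {
        buffers_[1 - fillIndex_].length = 0;  // the sent buffer is free again
        busy_ = false;
        if (buffers_[fillIndex_].length > 0) startTransfer();  // chain the next one
    }

    bool busy() const { return busy_; }
    std::size_t lastStartedLength() const { return lastStartedLength_; }

private:
    struct Buffer {
        std::array<std::uint8_t, kCapacity> data{};
        std::size_t length = 0;
    };

    void startTransfer() {
        // Real code would load the DMA memory address and count here
        // and enable the channel.
        lastStartedLength_ = buffers_[fillIndex_].length;
        fillIndex_ = 1 - fillIndex_;  // from now on, fill the other buffer
        busy_ = true;
    }

    Buffer buffers_[2];
    std::size_t fillIndex_ = 0;
    std::size_t lastStartedLength_ = 0;
    bool busy_ = false;
};
```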

Ollie

Fri Feb 24, 2017 4:15 pm

An additional reference can be found at http://www.emblocks.org/wiki/tutorials: … _disco:spi. I have included the Logic Analyzer screenshots to show how 3 SPI devices are communicating at the same time using DMA for the bulk processing.

This is based on the principle that the SPI transfer requests are non-blocking. That means that the requests are put into a queue. When a request is taken from the queue, the following actions are done:
– the device address is sent using polling
– the register number is sent using polling
– one or two DMAs are setup depending if it is read, write, or simultaneous read and write
– interrupt is set for DMA completion
– once the interrupt is received, the “callback” is executed and the next transfer is started
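The queue-driven flow in the list above can be sketched as follows. This is not the code from the linked tutorial; `SpiRequest`, `SpiRequestQueue`, `countCompletion` and `completedRequests` are hypothetical names, and the polling and DMA setup steps are reduced to a comment so the sketch runs without hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of one queued SPI transfer.
struct SpiRequest {
    std::uint8_t deviceAddress;
    std::uint8_t registerNumber;
    const std::uint8_t* txData;  // nullptr for a pure read
    std::uint8_t* rxData;        // nullptr for a pure write
    std::size_t length;
    void (*callback)(const SpiRequest&);
};

class SpiRequestQueue {
public:
    // Non-blocking: just enqueue, and start the transfer if the bus is idle.
    bool submit(const SpiRequest& request) {
        if (count_ == kDepth) return false;  // queue full
        slots_[(head_ + count_) % kDepth] = request;
        ++count_;
        if (!active_) startNext();
        return true;
    }

    // Would be called from the DMA completion interrupt.
    void onDmaComplete() {
        if (!active_) return;
        const SpiRequest done = slots_[head_];
        head_ = (head_ + 1) % kDepth;
        --count_;
        active_ = false;
        if (done.callback != nullptr) done.callback(done);  // run the "callback"
        if (!active_ && count_ > 0) startNext();             // then start the next transfer
    }

    std::size_t pending() const { return count_; }
    std::size_t started() const { return started_; }

private:
    static constexpr std::size_t kDepth = 4;

    void startNext() {
        // Real code would send the device address and register number by
        // polling, set up one or two DMA channels, and enable the interrupt.
        active_ = true;
        ++started_;
    }

    SpiRequest slots_[kDepth] = {};
    std::size_t head_ = 0;
    std::size_t count_ = 0;
    std::size_t started_ = 0;
    bool active_ = false;
};

// Example completion handler, only here to show the call path.
int completedRequests = 0;
void countCompletion(const SpiRequest&) { ++completedRequests; }
```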

This code demonstrates how the speed and other SPI settings can be changed for different devices and how the SPI configuration registers are modified only when required.

This code demonstrates how the LCD driver command/data line is driven synchronized with the SPI transfers.

Cheers, Ollie

stevestrong

Fri Feb 24, 2017 4:48 pm


mrburnette

Sat Feb 25, 2017 3:56 pm

…. thinking out loud…. no response expected ….

Anyone looked into what Paul Stoffregen is doing around Teensy 3.1/3.2 ?

RogerClark

Sat Feb 25, 2017 8:24 pm


victor_pv

Sat Feb 25, 2017 10:51 pm

From what I have found, they have not integrated any DMA functions in the normal SPI library.
There is another library, DmaSpi, which is the one that has DMA capabilities:

https://github.com/crteensy/DmaSpi

That one works similarly to what Steve was working on: it queues transfers with a series of properties (buffer pointer, size, and a pin object to control CS), then services those transfers in order.

I’m not sure all that is worth the effort: as Steve’s tests showed, for small transfers it doesn’t improve performance because of all the overhead, and the libraries using it have to be heavily modified.

RogerClark

Sun Feb 26, 2017 12:40 am

I totally agree.

It’s not worth adding that level of complexity, as hardly anyone will use it.

victor_pv

Sun Feb 26, 2017 1:23 am

First shot at separating the setup of the DMA transfer from the firing of the transfer.
Basically I just split the code that was previously in dmaSend into two parts. dmaSendSet configures the transfer, except for setting the length and enabling the channel. Next you call dmaSendRepeat (I don’t really like that name, but can’t think of a more appropriate one) with the length of the transfer; that sets the DMA_CNDTRx register, which needs to be reloaded for each new transfer, and enables the channel. If a callback function has been set previously, it doesn’t block and returns 0 for success. If no callback function was set, it blocks.
SPI.dmaSend has exactly the same functionality as before, so any code using it should not break in any way, but now it uses the other 2 functions to do the work.

This way, if you want to send a number of bytes from the same buffer pointer, you only call SPI.dmaSendRepeat with 1 parameter, the number of bytes to send. Everything else stays the same as in the last transfer, so you only need to fill the buffer and fire the transfer.

What do you guys think? This still doesn’t implement the internal buffer Steven had been testing, but code that repeats transmissions from the same buffer should have less overhead.

EDIT: I realize it may be good to add checks to confirm whether there is a transmission already in progress before changing settings. It depends on whether we prefer that little overhead for safety.

uint8 SPIClass::dmaSendRepeat(uint16 length)
{
    if (length == 0) return 1;

    // Reload the transfer count and enable the channel to start transmitting.
    dma_set_num_transfers(_currentSetting->spiDmaDev, _currentSetting->spiTxDmaChannel, length);
    dma_enable(_currentSetting->spiDmaDev, _currentSetting->spiTxDmaChannel); // enable transmit

    // Non-blocking when a callback has been set.
    if (_currentSetting->TXcallback) {
        return 0;
    }

    uint32_t m = millis();
    uint8 b = 0;
    // Avoid interrupts and just loop waiting for the flag to be set.
    while ((dma_get_isr_bits(_currentSetting->spiDmaDev, _currentSetting->spiTxDmaChannel) & DMA_ISR_TCIF1) == 0) {
        if ((millis() - m) > DMA_TIMEOUT) { b = 2; break; }
    }
    dma_clear_isr_bits(_currentSetting->spiDmaDev, _currentSetting->spiTxDmaChannel);
    while (spi_is_tx_empty(_currentSetting->spi_d) == 0); // "5. Wait until TXE=1 ..."
    while (spi_is_busy(_currentSetting->spi_d) != 0); // "... and then wait until BSY=0 before disabling the SPI."
    dma_disable(_currentSetting->spiDmaDev, _currentSetting->spiTxDmaChannel);
    spi_tx_dma_disable(_currentSetting->spi_d);
    //uint16 x = spi_rx_reg(_currentSetting->spi_d); // dummy read, needed, don't remove!
    return b;
}

void SPIClass::dmaSendSet(void * transmitBuf, bool minc)
{
    uint32 flags = ( (DMA_MINC_MODE*minc) | DMA_FROM_MEM | DMA_TRNS_CMPLT);
    dma_init(_currentSetting->spiDmaDev);
    // TX
    spi_tx_dma_enable(_currentSetting->spi_d);
    dma_xfer_size dma_bit_size = (_currentSetting->dataSize==SPI_DATA_SIZE_16BIT) ? DMA_SIZE_16BITS : DMA_SIZE_8BITS;
    dma_setup_transfer(_currentSetting->spiDmaDev, _currentSetting->spiTxDmaChannel, &_currentSetting->spi_d->regs->DR, dma_bit_size,
                       transmitBuf, dma_bit_size, flags); // Transmit buffer DMA
}

uint8 SPIClass::dmaSend(void * transmitBuf, uint16 length, bool minc)
{
    dmaSendSet(transmitBuf, minc);
    return dmaSendRepeat(length); // was dmaRepeat(length), which is not defined above
}

RogerClark

Sun Feb 26, 2017 2:34 am

Thanks Victor

The blocking / non-blocking selection via the callback argument being non-null is what Nordic Semi do in their SDK / API, so if it’s good enough for them, I think it’s good enough for us.

Also

dmaSendRepeat() seems a perfectly good name to me, as it’s descriptive and concise.
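The "callback set means non-blocking" rule can be tried out on a PC with a small fake. `FakeSpi` below is a hypothetical stand-in, not SPIClass: it only mirrors the dmaSendSet / dmaSendRepeat split and the return codes described above (1 for a zero length, 0 for success), and `tick()` simulates the DMA moving one item. `noteTxDone` and `txDoneFlag` are example names.

```cpp
#include <cstdint>

using DmaCallback = void (*)();

// Hypothetical fake of the proposed API, for illustration only.
class FakeSpi {
public:
    void onTransmit(DmaCallback handler) { txCallback_ = handler; }

    // Configure once: buffer and memory-increment mode.
    void dmaSendSet(const void* buffer, bool minc) {
        buffer_ = buffer;
        minc_ = minc;
    }

    // Reload the count and "start" the transfer.
    // Returns 1 if there is nothing to send, 0 on success.
    std::uint8_t dmaSendRepeat(std::uint16_t length) {
        if (length == 0) return 1;
        remaining_ = length;
        if (txCallback_ != nullptr) return 0;  // callback set: do not block
        while (remaining_ > 0) tick();         // no callback: block until done
        return 0;
    }

    // Simulates the DMA moving one item; fires the callback on the last one.
    void tick() {
        if (remaining_ == 0) return;
        if (--remaining_ == 0 && txCallback_ != nullptr) txCallback_();
    }

    std::uint16_t remaining() const { return remaining_; }
    const void* buffer() const { return buffer_; }
    bool memoryIncrement() const { return minc_; }

private:
    DmaCallback txCallback_ = nullptr;
    const void* buffer_ = nullptr;
    bool minc_ = false;
    std::uint16_t remaining_ = 0;
};

// Example handler, only here to show the call path.
bool txDoneFlag = false;
void noteTxDone() { txDoneFlag = true; }
```

Without a handler, `dmaSendRepeat(8)` returns only after the count reaches zero. After `onTransmit(noteTxDone)`, the same call returns immediately and the handler runs when the simulated transfer finishes.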

stevestrong

Sun Feb 26, 2017 7:04 am

That’s nice so far.
It should be non-blocking if a non-null callback function is passed as a parameter.
I would really tend to reserve the DMA channels for SPI, if configured by the user, in order to remove the overhead of calling dmaSend/Set each time when only the buffer pointer and its length have changed. This way one could quickly switch between two buffers, one being filled while the previous one is being sent.
I think it is very unlikely that the same DMA channel would be used for other purposes if one uses DMA for SPI, since that means the SPI is constantly working, as in display-related applications.

Ollie

Sun Feb 26, 2017 4:07 pm

If/when you are preparing to port the F1xx architecture to F4xx/F7xx/H7xx, the DMA allocation management has to be addressed. With multiple alternative DMA streams available for the peripherals and more peripherals requiring DMA, the visibility of this mapping matrix is important. In the long run, it will not work if the peripheral libraries assume that certain DMA streams are available to them.

We either need to have on demand DMA stream request/release or static allocations with conflict detection.

stevestrong

Sun Feb 26, 2017 4:25 pm


victor_pv

Sun Feb 26, 2017 4:30 pm

How do you guys suggest we manage DMA channel allocation?
As far as I know, the core currently doesn’t keep any kind of data on which channels are enabled and for which peripheral. I guess that would require an additional library that keeps that data and manages allocating and releasing channels.

About the F4 porting, I think the first thing we should probably try is changing to the “tubes” API that LeafLabs added to manage the F4 DMA streams. I haven’t used it on the F1, but I have seen code using it, so it probably works.

Now, I am of the opinion that things that already work should be changed one at a time, so as not to break anything.
Personally I will finish what I started, adding the part to manage a callback and repeated transfers. I will add another function to change the source address and length at once, as Steve suggested, which will help manage double buffering.

If someone can start working on code that manages DMA channel reservations, we can work in parallel. Once I’m finished with the extra functions and they work, I’ll see if I can get it working with tubes so it works on the F4.

EDIT: One way could be to build a table like the PIN_MAP one that maps the DMA channels to the peripherals that can use them, and add a bool indicating whether that peripheral’s DMA requests have been enabled on that channel.
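A reservation table along those lines could be as small as one byte per channel. The sketch below covers the seven channels of DMA1 on the F1 and records which client, if any, holds each channel. `Dma1Reservations` and `DmaClient` are hypothetical names, and the sketch does not check that a client is actually wired to the channel it asks for; a real table would add that mapping.

```cpp
#include <cstdint>

// Hypothetical list of DMA clients; a real table would cover more peripherals.
enum class DmaClient : std::uint8_t { None, Spi1Rx, Spi1Tx, Spi2Rx, Spi2Tx };

// One owner entry per DMA1 channel (channels are numbered 1..7).
class Dma1Reservations {
public:
    // Claim a channel; fails if it is out of range or already taken.
    bool reserve(unsigned channel, DmaClient client) {
        if (channel < 1 || channel > kChannels || client == DmaClient::None) return false;
        if (owner_[channel - 1] != DmaClient::None) return false;  // conflict detected
        owner_[channel - 1] = client;
        return true;
    }

    void release(unsigned channel) {
        if (channel >= 1 && channel <= kChannels) owner_[channel - 1] = DmaClient::None;
    }

    DmaClient owner(unsigned channel) const {
        if (channel < 1 || channel > kChannels) return DmaClient::None;
        return owner_[channel - 1];
    }

private:
    static constexpr unsigned kChannels = 7;
    DmaClient owner_[kChannels] = {};  // all entries start as None
};
```

This is the static-allocation-with-conflict-detection variant; an SPI driver would call `reserve(3, DmaClient::Spi1Tx)` before touching the channel and back off if it returns false.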

RogerClark

Sun Feb 26, 2017 9:00 pm

Is it going to take much space to store the reservations? E.g. do we just need to store something for each DMA channel, such as whether it’s reserved, or do we need to store a struct per channel?

Is there a way to just read the information back from the hardware? From what I recall, it was possible to effectively parse the DMA registers to figure out quite a lot of information about their setup.

victor_pv

Mon Feb 27, 2017 1:36 am


Ollie

Mon Feb 27, 2017 4:41 am

I agree that it will be many moons before we use the Arduino IDE for serious or semi-serious F4/F7/H7 programming. On that frontier, developers need to be aware of the capabilities and restrictions of the different peripheral buses and the DMA channels in them.

I propose that we use STM terminology for these concepts. In that sense we need to recognize that there are only a few DMA channels and each of them has multiple streams. The streams in different channels can be active at the same time, but of the streams in one channel only one can be active at any point in time.

victor_pv

Mon Feb 27, 2017 5:14 pm


Ollie

Mon Feb 27, 2017 5:50 pm

Sorry, now I have foot in my mouth. The F1 and other STM MCUs share the same name for the controllers in the peripheral buses. So here is the summary for F4/F7/H7
1) DMA Controllers have multiple streams
2) Streams have multiple channels
3) The stream channels are hard-wired to peripheral devices
– stream configuration selects one of the devices

victor_pv

Mon Feb 27, 2017 7:21 pm


racemaniac

Tue Feb 28, 2017 8:09 am


victor_pv

Tue Feb 28, 2017 1:37 pm


racemaniac

Tue Feb 28, 2017 1:48 pm


victor_pv

Mon Mar 06, 2017 7:50 pm

It seems I have a working version now that allows you to:
-Set callback functions that will be called when a DMA transfer has completed. If the callbacks are set, dmaSend and dmaTransfer are non-blocking.
-Set all the DMA-related settings with one function (enable DMA controller, set transfer address, destination, data size, etc.), and then use a second function to reload the DMA transfer size, which needs to be reloaded before enabling the channel again, since the value is not kept at the end of a transmission. So if the buffer address and data size are reused, only the second function needs to be called repeatedly.

I have tested the callback with the sdfat library and with an ILI SPI display. Now the weird thing:
When using sdfat at SPI DIV/2 speed with callbacks, sometimes the DMA RX never completes, and leaves either 1 or 2 bytes pending.
So let’s say I want to receive 512 bytes. For that, the RX DMA is set to 512 bytes, DR is read if RXNE is set, the RX DMA is enabled, and next the TX DMA is set up and enabled for 512 bytes.
After each byte goes out, one comes in, and the DMA controller reads it from DR, stores it in the RX buffer, and decrements the count of pending RX DMA requests.

All works fine if I do not use callbacks and just block until RX is completed. It also works fine if I set the port to 18 Mbit/s (DIV/4) while using callbacks.
But if I use callbacks and set the port to DIV/2 speed, then sometimes the RX never completes. The TX buffer has all been sent, and 1 or 2 bytes are still pending in RX. Since TX is completed, it’s not producing a clock any more, and RX will never get the last bytes in.

I have run it through the debugger, and sometimes it completes several transmissions correctly before one fails, but it is a different number of transmissions each time. Sometimes it goes for longer, sometimes for shorter.
I have tried setting the RX DMA priority to very high and the TX to medium, in case the DMA controller was servicing a TX while an RX was pending, which would overwrite the DR register and lose the RX byte, but that did not help.

Other than setting a callback for the transfer-complete event, the DMA setup is exactly the same whether blocking or not, so I can’t figure out what is happening. Perhaps even when blocking the transfers are not always completing, but since there is a timeout check, sometimes the transfer is just terminated on timeout and not because RX actually completed.

I need to test that theory by removing the timeout, but has anyone experienced any issue when receiving data with the sdfat library at the max SPI port speed when using DMA, or noticed any corruption in the data read?

stevestrong

Mon Mar 06, 2017 8:30 pm

I think it is normal that Rx bytes are still to be received when Tx is ready, since TXE comes earlier than RXNE, meaning that TXE is set before the last byte is received.
Do you wait in the Tx-end callback function for TXE to be set and BSY to be cleared?

victor_pv

Mon Mar 06, 2017 10:11 pm


stevestrong

Tue Mar 07, 2017 1:03 am

OK, so you run both SPI ports in DMA mode…
You could try leaving only the Rx part in DMA mode and the Tx part in “normal” mode to see whether Rx bytes are still lost.

victor_pv

Tue Mar 07, 2017 1:52 am


stevestrong

Tue Mar 07, 2017 9:23 am

I think the callback should not have any influence on the test result if you wait for the TXE and BSY flags within the Tx-end ISR.
Looking at figures 246 and 247 overlapped, the one-bit (2 APB1 clocks) gap between RXNE and TXE should give priority to the Rx channel if it is set to the highest priority (11). Please double-check that you have set the priorities correctly. The DMA for Tx can be set to the lowest priority (00).

Furthermore, instead of a timeout, you could check the DMA_CNDTRx register value to determine whether there are still bytes to be received when Tx has finished.
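A check based on the remaining counts rather than a timeout might be structured like the sketch below. The two counts stand for the values that would be read from the Tx and Rx channels' DMA_CNDTRx registers and `spiBusy` for the SPI BSY flag; the function and enum names are hypothetical, and the register reads themselves are left out so the logic can be tried without hardware.

```cpp
#include <cstdint>

enum class TransferState { InProgress, Complete, RxStalled };

// Classify a full-duplex DMA transfer from the two remaining counts
// and the SPI busy flag.
inline TransferState classifyTransfer(std::uint16_t txRemaining,
                                      std::uint16_t rxRemaining,
                                      bool spiBusy) {
    if (txRemaining > 0 || spiBusy) {
        return TransferState::InProgress;  // clocks are still being generated
    }
    if (rxRemaining == 0) {
        return TransferState::Complete;    // every byte sent was also received
    }
    // Tx is done and the bus is idle, so no more clocks will come:
    // the Rx bytes still pending can never arrive.
    return TransferState::RxStalled;
}
```

The `RxStalled` case corresponds to the symptom reported earlier in the thread, where Tx had finished and 1 or 2 bytes were still pending in Rx.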

Does it fail when using only SPI_1 with both Tx and Rx DMA at a 36MHz clock (SPI2 inactive)?

If not, then it is clearly a race condition/limitation of the hardware (bus matrix, AHB system bus and the two bridges to the APB1 and APB2 peripheral buses, as indicated in figure 2), which cannot handle so many data (DMA and CPU<->RAM) transfer requests within the short time period of one byte transfer at 36MHz, as SPI_1 is served on APB2 and SPI_2 on APB1.

Alternatively you could check what happens when setting the flash wait states down to 1 (CPU at 72MHz) or up to 3.

In addition, you could monitor the MODF and OVR error flags of the SPI (enable these IRQs?) and/or TEIF of the DMA.

EDIT
Can you please give more details about how exactly you test? Do you read blocks of 512 bytes on SPI 1 from the SD card and read blocks (how many bytes?) from the ILI display repeatedly? Do you use the older SdFat lib or the newer one, SdFat beta? I could maybe test in parallel if you share the testing code.

victor_pv

Tue Mar 07, 2017 4:58 pm


stevestrong

Tue Mar 07, 2017 8:44 pm

I don’t think that the SPI_DR register would mix up Tx and Rx; the hardware should handle the correct flow of Tx and Rx data to and from SPI_DR. I think it is rather a matter of limitations on the internal data buses.

Still, you did not try running Tx and Rx DMA on SPI 1 at 36MHz with SPI 2 disabled/inactive. This would give us more information.

victor_pv

Tue Mar 07, 2017 10:13 pm


stevestrong

Wed Mar 08, 2017 10:10 am

Victor, how did you adapt the SdFat lib to cope with the callback functionality?
As far as I know, SdFat currently uses SPI transfers in a blocking way.

victor_pv

Wed Mar 08, 2017 3:18 pm

stevestrong wrote:Victor, how did you adapt the SdFat lib to cope with the callback functionality?
As far as I know, SdFat currently uses SPI transfers in a blocking way.

victor_pv

Sun Mar 12, 2017 6:03 pm

I have done a few tests today with a different board, with an F103RCT6 chip.
I connected MISO to MOSI on SPI1, and repeatedly sent and received with DMA using 2 buffers, then compared the contents.
All was going well when I was only using SPI1, with or without callback.
Then I started using SPI2 as well (SPI2 without DMA), and the problems started. Some bits are not received correctly: a 0 changes to a 1 in the last bit of some transferred bytes.

The sketch runs transfers in a loop and compares the result at the end. If there is an error, it stops and waits for user input, then can repeat. If there is no error it repeats the loop without stopping. When errors happen they happen every few passes, but not on every single one. The errors also happen on different bytes, not always the same ones.
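The comparison step of a loopback test like this could be sketched as follows. This is a host-side illustration, not the test sketch used in this thread: `compareLoopback` and `LoopbackResult` are hypothetical names, the two buffers stand in for the DMA TX and RX buffers, and "last bit" is assumed to mean bit 0 (the last bit on the wire with MSB-first transfers).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical summary of a TX/RX buffer comparison after a loopback pass.
struct LoopbackResult {
    size_t badBytes;      // bytes that differ in any bit
    size_t badBits;       // total number of flipped bits
    size_t lastBitFlips;  // bytes whose bit 0 differs (assumed "last bit")
};

LoopbackResult compareLoopback(const uint8_t *tx, const uint8_t *rx, size_t len) {
    assert(len == 0 || (tx != nullptr && rx != nullptr));
    LoopbackResult r = {0, 0, 0};
    for (size_t i = 0; i < len; i++) {
        // XOR leaves a 1 in every bit position that was received wrongly.
        uint8_t diff = static_cast<uint8_t>(tx[i] ^ rx[i]);
        if (diff == 0) continue;
        r.badBytes++;
        if (diff & 0x01) r.lastBitFlips++;
        while (diff) {              // count the set bits in diff
            r.badBits += diff & 1u;
            diff >>= 1;
        }
    }
    return r;
}
```

Reporting the byte index and bit position separately makes it easier to see whether the errors cluster on one bit, as described above, or are spread across the byte.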

EDIT: I have repeated the same test using SPI2 with DMA, and also setting SPI1 to Div4 speed. These are the results:

DIV2 Speed on spi1:
SPI1 Alone Without callback: OK
SPI1 alone with callback: OK
SPI1 with callback, in parallel SPI2 without DMA: errors
SPI1 with callback, in parallel SPI2 with DMA: errors

DIV4 speed on spi1:
SPI1 with callback, in parallel SPI2 without DMA: OK
SPI1 with callback, in parallel SPI2 with DMA: OK

It seems so far that errors only happen when SPI1 and SPI2 are working at the same time and SPI1 is operating at 36MHz (over spec).
I’ll try to post all my code to GitHub. I’ll stop with these tests here. The issues started when I began writing the callback code, and I didn’t know whether the problem was in it. I’m now convinced the callback code doesn’t have any problem; the only problem came from using both SPI ports completely in parallel, which only became possible by using callbacks to signal the end of transfer.

Conclusion for me: SPI1 is not reliable when operating at 36MHz, especially when using another SPI port at the same time. It is probably OK for only sending data, when used alone, or for non-critical reception.

Rick Kimball

Sun Mar 12, 2017 7:46 pm

maybe you could try adjusting the relative priority of the interrupts?

victor_pv

Sun Mar 12, 2017 10:42 pm

Rick Kimball wrote:maybe you could try adjusting the relative priority of the interrupts?

victor_pv

Fri Jun 23, 2017 7:21 pm

Resurrecting this thread to add the code.

I tested the functions extensively and the only problem I ever found was when using SPI1 at 36Mbit with DMA, as posted above. I verified the problem happens whether using callbacks or not.
Now, the version I am posting is not the one I tested. I had to redo it to add the latest changes from Roger and still need to retest it, but I am posting it so more people can test it.
This version does not use the DMA tubes. I have another version that does, which I wrote for F4 support, but Steve is the one working the most on the F4 and his core doesn’t use the DMA tubes, so I see no point in using that.
I’ll try to test it as soon as I can and post back. If someone finds any problem, let me know.

SPI.zip: (10.42 KiB) Downloaded 14 times

stevestrong

Fri Jun 23, 2017 8:34 pm

Hi Victor,
first try to build:

C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp:433:26: error: 'class SPISettings' has no member named 'receiveCallback'
if (_currentSetting->receiveCallback){
^
C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp: At global scope:
C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp:463:7: error: prototype for 'uint8 SPIClass::dmaTransfer(uint8*, uint8*, uint16)' does not match any in class 'SPIClass'
uint8 SPIClass::dmaTransfer(uint8 *transmitBuf, uint8 *receiveBuf, uint16 length) {
^
In file included from C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp:32:0:
C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.h:306:8: error: candidate is: uint8 SPIClass::dmaTransfer(void*, void*, uint16)
uint8 dmaTransfer(void * transmitBuf, void * receiveBuf, uint16 length);
^
C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp: In member function 'uint8 SPIClass::dmaSendRepeat(uint16)':
C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp:499:26: error: 'class SPISettings' has no member named 'transmitCallback'
if (_currentSetting->transmitCallback)
^
C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp: At global scope:
C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp:563:2: error: expected unqualified-id before '/' token
*/
^

C:\Users\S\Documents\Arduino\hardware\Arduino_STM32\STM32F1\libraries\SPI\src\SPI.cpp:563:2: error: expected constructor, destructor, or type conversion before '/' token

stevestrong

Fri Jun 23, 2017 8:50 pm

Made some necessary changes to be able to build it.
I did not test the new functions, only the old (original) DMA ones, they seem to work.

victor_pv

Sat Jun 24, 2017 9:02 pm

[stevestrong – Fri Jun 23, 2017 8:50 pm] –
Made some necessary changes to be able to build it.
I did not test the new functions, only the old (original) DMA ones, they seem to work.

Thanks Steve, I just finished compiling it, found similar errors, and corrected them too.
I’m uploading the new version. It’s pretty much the same changes you had to make, except that in TransferSet and SendSet I changed the buffer pointers to void * rather than the uint8 * I was using before (my original code was based on an older version of the core).

I also did not use a typedef for the function pointers, but I see you did. I imagine you added it to make the code clearer to read, but do you think it is necessary, and doesn’t it actually make the code harder to read, since you have to check what that type is? Or did you use it for some other reason?

Here is the new version, which compiles correctly. Like I said, it is pretty close to Steve’s except for those differences. I still need to run my previous test sketch, which uses the callbacks, against it. I do not have any code to test the new async function; I hope I didn’t break it.

SPI.zip: (10.44 KiB) Downloaded 19 times

victor_pv

Tue Jun 27, 2017 3:35 am

I just tested with my test sketch and can use the callbacks successfully with both SPI1 and SPI2 using the code in the post above, which is based on the latest from Roger’s repo.

For my test I used my old WAV player test code, which uses FreeRTOS 900. DMA is serving 3 peripherals: SPI1, SPI2, and a timer to produce the PWM output.
I modified sdFat so it sets a callback function, and after dmaSend or dmaTransfer it sets the task to block until released by the ISR, so the RTOS switches to the next task and keeps the CPU busy until the SPI DMA transfer is over. The ISR causes a new context switch upon exit and the RTOS returns to the task that was reading from the SD card.
I have tested it with the display too, but it just introduces jitter, since the display task only writes a few bytes at a time and all the context switching wastes any CPU time that could be used for anything else. So with the display it works but doesn’t provide any performance gain.

These are my changes to the SdSpiSTM32F1.cpp file in the sdfat library in case anyone is interested in testing:
First include the RTOS (9.0 in my case, should work with 8.2.1 too):
#include <MapleFreeRTOS900.h>

danieleff

Tue Jun 27, 2017 4:42 am

Instead of all of this, wouldn’t it be enough to add `yield()` to the while loop that waits for the DMA to finish? That will do a context switch for you while waiting.

victor_pv

Wed Jun 28, 2017 3:11 am

[danieleff – Tue Jun 27, 2017 4:42 am] –
Instead of all of this, wouldn’t it be enough to add `yield()` to the while loop that waits for the DMA to finish? That will do a context switch for you while waiting.

Daniel, I’m not sure I understand your suggestion, so correct me if I’m wrong, but you suggest using FreeRTOS taskYIELD(), is that right?
That wouldn’t work for 2 reasons.
1.- It yields only if a task of equal or higher priority is ready to execute. If the task doing the sdfat access at the moment is the highest one ready to execute, it will not yield and will continue running, so you effectively don’t let another task run while DMA is ongoing.
2.- If we just yielded, the task would resume at a moment that is not synchronized with the DMA transfer, and we still need to know whether the DMA is complete before continuing with the sdfat code. So you either use some sort of semaphore and check it, or you poll the DMA controller to check if it’s done. The first case is exactly what’s implemented, only with task notifications, which according to the FreeRTOS docs cost fewer cycles and less RAM than a semaphore. In the second case, polling the DMA controller, we waste cycles just polling. We already have that in the library when we run in blocking mode.

By using the task notification we achieve 2 things. First, the task will yield to the next one to run, even if the ones available are lower priority. Second, as soon as the callback function is called because the DMA transfer is over, it requests a yield from the RTOS. If the task executing at that moment has the same or lower priority, the RTOS switches to the sdfat task that was blocked. If the task currently running has higher priority, the sdfat task is marked as ready and runs the next time it can according to priorities.

If you meant something else by yield(), please let me know. I know there is a yield() function in the sdfat library, but I understood that it puts the CPU in sleep mode until the next interrupt, in which case it will not execute another task.

Finally, this is just a usage example because I had this FreeRTOS sketch; there is no need to use any RTOS. Since the callback is declared by the user code, it could be used to set a variable, change a pin level, etc.
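The RTOS-free case, where the callback only sets a variable, could look roughly like the sketch below. It runs off-target: `MockSpi`, `simulateDmaComplete` and `runNonBlockingSend` are hypothetical stand-ins, and only the names onTransmit and dmaSend come from this thread (their signatures here are assumptions, not the library’s).

```cpp
#include <cstdint>

typedef void (*SpiCallback)(void);

// Hypothetical stand-in for the SPI class, so the flow can be followed on a PC.
class MockSpi {
public:
    MockSpi() : txCallback(nullptr), pending(false) {}

    void onTransmit(SpiCallback cb) { txCallback = cb; }

    // Starts a "transfer" and returns immediately instead of blocking.
    void dmaSend(const uint8_t *buf, uint16_t len) {
        (void)buf;
        (void)len;
        pending = true;
    }

    // Plays the role of the DMA transfer-complete interrupt.
    void simulateDmaComplete() {
        if (!pending) return;
        pending = false;
        if (txCallback) txCallback();
    }

private:
    SpiCallback txCallback;
    bool pending;
};

// Written by the callback, read by the main loop, hence volatile.
static volatile bool transferDone = false;

// User callback: on real hardware this runs in interrupt context, so keep it short.
void onSpiTxDone() { transferDone = true; }

// Returns how many loop passes were available for other work during the transfer.
unsigned runNonBlockingSend(MockSpi &spi, unsigned passesUntilIrq) {
    static const uint8_t data[4] = {1, 2, 3, 4};
    unsigned otherWork = 0;
    transferDone = false;
    spi.onTransmit(onSpiTxDone);
    spi.dmaSend(data, sizeof(data));
    while (!transferDone) {
        otherWork++;  // useful work would go here instead of a busy-wait
        if (otherWork >= passesUntilIrq) spi.simulateDmaComplete();
    }
    return otherWork;
}
```

With an RTOS the flag would be replaced by a task notification, which also lets lower priority tasks run while the transfer is in flight.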

RogerClark

Mon Jul 03, 2017 5:53 am

Victor

The SPI zip seems to work OK for me, but I’m not using any callbacks.

I used the code with my OV7670 camera testbed and it was fine.

But it doesn’t include dmaSendAsync

I can manually merge that new function into your code and then check my local copy back into GitHub, but it would not be as clean as you pulling the latest repo into your GitHub repo and then submitting a PR based on the changes.

So if you want to do that, I’ll merge the PR asap e.g. tomorrow

Thanks

Roger

PS. As you can see from my PM, I’ve merged your other PRs

victor_pv

Mon Jul 10, 2017 3:33 pm

[RogerClark – Mon Jul 03, 2017 5:53 am] –
Victor

The SPI zip seems to work OK for me, but I’m not using any callbacks.

I used the code with my OV7670 camera testbed and it was fine.

But it doesn’t include dmaSendAsync

I can manually merge that new function into your code and then check my local copy back into GitHub, but it would not be as clean as you pulling the latest repo into your GitHub repo and then submitting a PR based on the changes.

So if you want to do that, I’ll merge the PR asap e.g. tomorrow

Thanks

Roger

PS. As you can see from my PM, I’ve merged your other PRs

My version was redone taking your latest as the base, so it has sendAsync too.
I’ll send a PR with those 2 files.

RogerClark

Mon Jul 10, 2017 10:00 pm

Thanks

stevestrong

Mon Jul 31, 2017 11:37 am

I just noticed that the newly added SPI code seems to contain 3 dma_setup_transfer() calls, although normally there should be only one, with the flags adapted depending on the MINC bit:
https://github.com/rogerclarkmelbourne/ … I.cpp#L399

I would change it like this:
uint32_t flags = (DMA_MINC_MODE | DMA_FROM_MEM);
if (!transmitBuf) {
    transmitBuf = &ff;
    flags &= ~DMA_MINC_MODE;
}
dma_setup_transfer(_currentSetting->spiDmaDev, _currentSetting->spiTxDmaChannel,
                   &_currentSetting->spi_d->regs->DR, dma_bit_size,
                   transmitBuf, dma_bit_size, flags);
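To check the single-call idea off-target, the flag selection can be pulled into a small helper. This is only a sketch: `chooseTxSetup` and `TxSetup` are hypothetical names, and the two constants are placeholders for the core’s DMA_MINC_MODE and DMA_FROM_MEM, whose real values come from the core’s DMA header.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder bit values for this sketch only; not taken from the core.
const uint32_t SKETCH_MINC_MODE = 1u << 7;
const uint32_t SKETCH_FROM_MEM  = 1u << 4;

// Dummy byte that is clocked out when the caller passes no TX buffer.
static const uint8_t ff = 0xFF;

struct TxSetup {
    const void *source;  // address the TX channel reads from
    uint32_t flags;      // mode flags for the one setup call
};

TxSetup chooseTxSetup(const void *transmitBuf) {
    TxSetup s = { transmitBuf, SKETCH_MINC_MODE | SKETCH_FROM_MEM };
    if (!transmitBuf) {
        s.source = &ff;                // send the same dummy byte every time
        s.flags &= ~SKETCH_MINC_MODE;  // so the source address must not advance
    }
    assert(s.source != nullptr);       // the channel always has a valid source
    return s;
}
```

The result would then feed one setup call, instead of a separate call per case.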

victor_pv

Mon Jul 31, 2017 12:31 pm

[stevestrong – Mon Jul 31, 2017 11:37 am] –
https://github.com/rogerclarkmelbourne/ … I.cpp#L399

I would change it like this:
uint32_t flags = (DMA_MINC_MODE | DMA_FROM_MEM);
if (!transmitBuf) {
    transmitBuf = &ff;
    flags &= ~DMA_MINC_MODE;
}
dma_setup_transfer(_currentSetting->spiDmaDev, _currentSetting->spiTxDmaChannel,
                   &_currentSetting->spi_d->regs->DR, dma_bit_size,
                   transmitBuf, dma_bit_size, flags);

stevestrong

Mon Jul 31, 2017 12:36 pm

[victor_pv – Mon Jul 31, 2017 12:31 pm] –
Could you point to the lines you are referring to?

These two lines move to line 420.
It makes sense to clear them before any further request is launched.
And it would be nice to have them readable after a transfer to check on an upper software level whether any error flags have been set or not.

victor_pv

Mon Jul 31, 2017 12:49 pm

[stevestrong – Mon Jul 31, 2017 12:36 pm] –

[victor_pv – Mon Jul 31, 2017 12:31 pm] –
Could you point to the lines you are referring to?

These two lines move to line 420.
It makes sense to clear them before any further request is launched.
And it would be nice to have them readable after a transfer to check on an upper software level whether any error flags have been set or not.

The problem would be in a situation like this:
You run a dmaTransfer (the ISR bits are not cleared).
You then go on to do something else with that DMA channel and enable IRQ requests. Since the interrupt flag is still set, I think an interrupt will trigger right away.

I believe that’s the way they work from what I remember from the reference manual.

stevestrong

Mon Jul 31, 2017 12:55 pm

Victor, this is an extract from the F4 reference manual regarding DMA status bits, page 326, chapter 10.4:

Note: Before setting an Enable control bit to ‘1’, the corresponding event flag should be cleared, otherwise an interrupt is immediately generated.
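The effect of a stale flag can be shown with a toy model of one DMA channel. This is a sketch under simplifying assumptions, not the real register layout: `DmaChannelModel` and both helper functions are hypothetical, and the "ISR" is just a counter.

```cpp
#include <cstdint>

// Toy model: one event flag, one interrupt enable bit, one ISR counter.
struct DmaChannelModel {
    bool tcFlag;        // transfer-complete flag, set when a transfer ends
    bool tcIrqEnabled;  // interrupt enable bit for that event
    unsigned irqCount;  // how many times the modelled ISR ran

    DmaChannelModel() : tcFlag(false), tcIrqEnabled(false), irqCount(0) {}

    void finishTransfer() { tcFlag = true; service(); }
    void clearFlags()     { tcFlag = false; }
    void enableIrq()      { tcIrqEnabled = true; service(); }

private:
    // An interrupt is raised whenever the flag and the enable bit are both set.
    void service() {
        if (tcFlag && tcIrqEnabled) {
            irqCount++;
            tcFlag = false;  // the handler clears the flag before returning
        }
    }
};

// A polled transfer leaves the flag set, so enabling the IRQ later fires at once.
unsigned irqsWithoutClear() {
    DmaChannelModel ch;
    ch.finishTransfer();  // blocking-style transfer, flag left set
    ch.enableIrq();       // the next user of the channel enables the interrupt
    return ch.irqCount;
}

// Clearing the flag before enabling avoids the immediate interrupt.
unsigned irqsWithClear() {
    DmaChannelModel ch;
    ch.finishTransfer();
    ch.clearFlags();
    ch.enableIrq();
    return ch.irqCount;
}
```

In this model the first sequence produces one interrupt with no new transfer in progress, and the second produces none.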

victor_pv

Mon Jul 31, 2017 3:52 pm

Ok, let’s move them to before enabling DMA, then, and retest.
I’m not sure why I moved them there; I can’t remember a specific reason, so we can test and confirm all is good, in case I faced a problem that made me move them.

If you are making the change as part of the code clean-up, let me know once you have the PR and I will run a test with my sketch that uses them with callbacks.

EDIT:
Just a note: when using ISRs with DMA, the core DMA IRQ handler clears them right after calling the user DMA handler:
https://github.com/rogerclarkmelbourne/ … vate.h#L45
So when using callbacks they will be cleared at the end of the transfer, but if the user handler wants to check them, it can, since they are only cleared on return.

RogerClark

Mon Jul 31, 2017 10:30 pm

Guys

While you are looking at the SPI DMA stuff: the OV7670 camera sketch, which will use the ILI9341 as high-speed RAM to store an image, would benefit from a function that does a DMA read but does not care what is transmitted.

At the moment the best way to do this would be to use dmaTransfer, and perhaps point the TX buffer at some arbitrary piece of flash (in this case I don’t know if it matters what is sent, though I’d need to double check).

But ideally there would be something like dmaSend that only receives, and perhaps sends a specific value (passed to the function).

I’m not sure how complex it would be to write something like this, but I presume it would be a modified copy of dmaSend.

victor_pv

Tue Aug 01, 2017 2:40 am

I see 3 possible ways of doing it:
1.- We add a dmaRead function that just sends FF repeatedly and reads to a buffer.
2.- We modify dmaTransfer to detect a null value in the send buffer and, if so, send FF repeatedly while it reads to the receive buffer.
3.- We add a MINC variable like we did for dmaSend. If MINC is 0, it doesn’t increment the buffer address for the TX channel, and so sends the first byte repeatedly.

The advantage of option 3 is that it allows sending an arbitrary value. Option 2 just adds a check to dmaTransfer, so it should keep the code a bit shorter if both the normal transfer and the transfer with a null buffer are used in the same sketch, but it has a small performance impact from the check. That should not be much for a dmaTransfer, though.

Option 1 has the advantage of more closely resembling the current SPI.read(&buf, n) function, taking the same parameters and working in the same way except for using DMA, but it is not as flexible as option 3.

We could also write a completely new function that works differently from the current DMA ones and read(), and instead takes both a value to send repeatedly and a receive buffer to read the data into. Something like:
dmaRead (uint8 tx_val, uint8 *rx_buf, uint8 n){}
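What the TX channel puts on the wire with and without memory increment can be modelled on a PC like this. It is a sketch of the MINC behaviour only, with no SPI or DMA hardware involved: `bytesOnWire` and `fillerForRead` are hypothetical helpers, not functions of the SPI library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models the TX side of a DMA transfer of n bytes.
// minc == true:  the source address advances, so the buffer is sent as is.
// minc == false: the source address stays put, so the first byte repeats.
std::vector<uint8_t> bytesOnWire(const uint8_t *txBuf, size_t n, bool minc) {
    assert(n == 0 || txBuf != nullptr);
    std::vector<uint8_t> wire;
    wire.reserve(n);
    for (size_t i = 0; i < n; i++) {
        wire.push_back(minc ? txBuf[i] : txBuf[0]);
    }
    return wire;
}

// Read-style helper in the spirit of option 3: clock out tx_val n times
// while the RX channel (not modelled here) fills the receive buffer.
std::vector<uint8_t> fillerForRead(uint8_t tx_val, size_t n) {
    return bytesOnWire(&tx_val, n, false);
}
```

Because only the increment flag changes, the same transfer path can serve both a normal transfer and a read that sends a caller-chosen filler byte.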

RogerClark

Tue Aug 01, 2017 5:19 am

Thanks Victor

I think we would need a way to repeatedly send arbitrary data, as I’m not sure whether devices like the ILI9341 need 0x00 or 0xff when data is being read.
I suspect the ILI9341 does not care what it’s being sent when data is being read, but I’d need to confirm that, and it’s safer to be able to specify the byte that is sent when reading.

So the option which uses MINC looks like it may be the best solution.

stevestrong

Tue Aug 01, 2017 7:31 am

I would opt for a combination of 2) and 3), having both advantages, with really minimal overhead, increasing the code size only minimally.

victor_pv

Tue Aug 01, 2017 3:54 pm

Checking the code, it looks like we forgot that the dmaTransfer function can already send FF repeatedly if transmitBuf == NULL.
Current dmaTransfer:

uint8 dmaTransfer(void * transmitBuf, void * receiveBuf, uint16 length);

victor_pv

Tue Aug 01, 2017 3:58 pm

[RogerClark – Mon Jul 31, 2017 10:30 pm] –
Would benefit from a function that does a dma Read, but does not care what is transmitted.

Roger see my post above. The dmaTransfer function can already do that if the transmitBuf parameter is passed as 0. In that case it sends FF while receiving to the rx buffer.
We will add the MINC feature so the value sent can be selected by the user with a variable.

RogerClark

Tue Aug 01, 2017 9:26 pm

Thanks victor

I will need to test that, when I have time, on an ILI9341 display
