Async(ish) SPI.write?

heisan
Sat Aug 04, 2018 7:25 pm
(Using Roger’s core)

Playing around today, I noticed something quite interesting:

In libmaple’s spi_tx:
while ( (regs->SR & SPI_SR_TXE)==0 ) ; //while ( spi_is_tx_empty(dev)==0 ); // wait Tx to be empty
regs->DR = *dp8++;


stevestrong
Sun Aug 05, 2018 6:36 am
Can you provide a simple sketch to test?
Do u use single writes or buffered (multiple) writes?
Because using buffered write spi.write(buf, nr_bytes) will be almost as fast as with dma, for less then ~70 bytes even faster than with dma.

EDIT
If you refer to these lines: https://github.com/rogerclarkmelbourne/ … #L107-L108
this is for buffered write, and they are optimized so that it will not wait for the end of the last transferred byte.
But the wait for the last byte must be there somewhere, and therefore it is transferred to SPI.cpp: https://github.com/rogerclarkmelbourne/ … #L357-L358.
I really don’t see any problem with that.

EDIT2
I think you mixed the single byte writes with the buffered write.
Your extract from SPI.c is for single write: https://github.com/rogerclarkmelbourne/ … #L327-l328, this calls the (inline) function spi_tx_reg, not to mix with spi_tx !!!


heisan
Sun Aug 05, 2018 10:37 am
I will be in and out throughout the day. If I manage to sit down long enough, I will prepare a stand-alone sketch.

This is where the code is intended to be used (added to Adafruit_ILI9341_STM.h):

#ifdef _ADAFRUIT_FB_H_
void drawFB(int x, int y, Adafruit_FB1BPP& fb, uint16_t pal[]) {
if(x < 0 || y < 0 || x + fb.fbwidth() > WIDTH || y + fb.fbheight() > HEIGHT) return;
setAddrWindow(x, y, x + fb.fbwidth() - 1, y + fb.fbheight() - 1);

int iy, ix;
uint16_t p;
int bp = 0;
uint8_t m = 1;
int ms = 0;
for(iy = 0; iy < fb.fbheight(); iy++) {
for(ix = 0; ix < fb.fbwidth(); ix++) {
mSPI.writeAsync(pal[(fb.buffer[bp++] & m)>>ms]);
}
m <<= 1;
ms++;
if(m) {
bp -= fb.fbwidth();
} else {
m = 1;
ms = 0;
}
}
mSPI.writeAsyncEnd();
cs_set();
}
#endif


stevestrong
Sun Aug 05, 2018 11:08 am
Your code seems reasonable.
I assume the array pal[] contains uint16 values, so you are using the SPI in 16bit mode, right?

You could speed up the code a bit if you declare the writeAsync() function as “inline” in a header file.


heisan
Sun Aug 05, 2018 11:25 am
[stevestrong – Sun Aug 05, 2018 11:08 am] –
Your code seems reasonable.
I assume the array pal[] contains uint16 values, so you are using the SPI in 16bit mode, right?

You could speed up the code a bit if you declare the writeAsync() function as “inline” in a header file.

Thanks – I haven’t started optimising that yet. This was a quick hack to try find out why SPI.write() was so slow.

I was wondering if there is a particular reason that the SPIClass wrapper was written in such a way, rather than using the libmaple primitives (which are already asynchronous)? Should I rather spend some time prepping a change to SPIClass, or just add a separate class of async functions?


heisan
Sun Aug 05, 2018 11:27 am
As a side note, rendering 240×240 dial type instruments is now 2x faster than drawing directly to the ILI9341 library, and flicker free ;) .

stevestrong
Sun Aug 05, 2018 5:40 pm
[heisan – Sun Aug 05, 2018 11:25 am] –
I was wondering if there is a particular reason that the SPIClass wrapper was written in such a way, rather than using the libmaple primitives (which are already asynchronous)? Should I rather spend some time prepping a change to SPIClass, or just add a separate class of async functions?

The SPI lib functions are Arduino compatible, and I would not suggest to change them.
Your solution is a non-standard tweak, which helps in a particular situation.
As you said yourself, your application would be actually a good candidate for a dual buffered DMA transfer.


heisan
Sun Aug 05, 2018 8:56 pm
[stevestrong – Sun Aug 05, 2018 5:40 pm] –
The SPI lib functions are Arduino compatible, and I would not suggest to change them.
Your solution is a non-standard tweak, which helps in a particular situation.
As you said yourself, your application would be actually a good candidate for a dual buffered DMA transfer.

SPIClass.write() is not a standard Arduino API function – it only has transfer(), which by definition must be synchronous. But I can see that modifying the standard write() function will not actually be practical – there is not enough discipline in the various libraries in calling endTransaction() – so data output would not be guaranteed in all circumstances.

As an alternative, what would be your thoughts on moving _currentSetting from private to protected, so that derived classes can be made with async functions? I would like to avoid the additional memory penalty of two DMA buffers (especially when the possible performance gain is so small), but I also don’t want users to need a custom core to build the application…

Thanks,
Justin


heisan
Wed Aug 08, 2018 9:21 pm
Finally got to play with this some more, and I am making some progress.

First a note. The SPI clock is 36MHz – i.e. 2 CPU clocks per bit. SPI writes are 16 bits wide. So there are 32 CPU clocks available to keep the SPI buffer full. If you can do that then the final result will be faster and use less memory than a DMA implementation.

The problem is, 32 clocks is not a lot ;) .

As a performance reference I am using:
tfto.fillRect(0,0,240,240,0);


stevestrong
Thu Aug 09, 2018 5:05 am
Use “-O3” compiler directive.

heisan
Thu Aug 09, 2018 8:11 am
Thanks “-O3” gets it to DMA speed for all cases – at the cost of 10% of the flash.

Strangely, any attempts at local optimisations, eg:
void __attribute__((optimize("O3"))) drawFB(int x, int y, Adafruit_FB& fb, uint16_t pal[]) {


heisan
Fri Aug 17, 2018 7:24 pm
I have been playing around with this some more, and I am still wondering…

A lot of work has been done to add the DMA layer, and port libraries over to using DMA. But as far as I can see, in every case the result is actually bigger, slower code.

The only way to get a real advantage out of DMA is to use dmaSendAsync, but the current API is not workable from libraries – especially if the bus is shared.

At the very least, state needs to be polled in beginTransaction and endTransaction to ensure that it has gone back to SPI_STATE_READY (async transfer complete) before releasing/grabbing the SPI device. Or it could be made more transparent to application developers by checking state on all the API entry points.

Opinions?


Leave a Reply

Your email address will not be published. Required fields are marked *