[dannyf – Sun Nov 26, 2017 3:32 am] –
I made some fast bitband-based gpio preprocessor macros that generate fast and small I/O code.
value = DIGITAL_READ(PB13);
DIGITAL_WRITE(PB14,1);
Just curious: how much faster are they than the standard approaches?
Basically, reading and writing becomes as fast as reading/writing a uint32 to a location pointed by a global uint32* pointer. I haven’t timed it yet.
Updated: Not quite. The processor seems to have an extra overhead on accessing bitbanded memory locations, especially when writing.
I have looked at the digitalWrite function and it's pretty well optimised, despite not looking like it.
The compiler is a strange beast.
I'd try timing your code, taking the call overhead into consideration, and see if it is much faster.
digitalWrite: 1517 ms (591 ns/write)
DIGITAL_WRITE: 570 ms (118 ns/write)

const uint8_t inPin = PB0;
const uint8_t outPin = PB10;
volatile uint32_t * inPort = portInputRegister(digitalPinToPort(inPin));
uint16_t inMask = BIT(digitalPinToBit(inPin));
volatile uint32_t * outPort = portSetRegister(outPin);
uint16_t outMask = BIT(digitalPinToBit(outPin));
...
bool rd = ( (*inPort) & inMask ) ? 1 : 0; // read
...
*outPort = outMask; // set the pin
*outPort = (uint32_t)outMask << 16; // reset the pin
I happened to be running IAR (7.10 I think) when I asked the question.
So I decided to benchmark my GPIO macros:
//port/gpio oriented macros (CMSIS-style: port is a GPIO_TypeDef*)
#define IO_SET(port, pins) port->ODR |= (pins) //set bits on port
#define IO_CLR(port, pins) port->ODR &=~(pins) //clear bits on port
//fast routines through BRR/BSRR registers
#define FIO_SET(port, pins) port->BSRR = (pins)
#define FIO_CLR(port, pins) port->BRR = (pins)
//the same macros adapted to libmaple (port is a gpio_dev*)
#define IO_SET(port, pins) port->regs->ODR |= (pins) //set bits on port
#define IO_CLR(port, pins) port->regs->ODR &=~(pins) //clear bits on port
//fast routines through BRR/BSRR registers
#define FIO_SET(port, pins) port->regs->BSRR = (pins)
#define FIO_CLR(port, pins) port->regs->BRR = (pins)
You could have done it with other macros. And it is actually a good idea NOT to define your own and to rely on the device header file instead – that way your macros are portable across platforms.
Summary: For reading, bitbanded is fastest and smallest. For constant value writing, gpio_write_bit() is by far fastest, while bitbanded is slightly smaller.
- digitalRead: 52.5 cycles (40568 bytes)
- gpio_read_bit: 9.5 cycles (39744 bytes)
- reading from register with premade mask: 9 cycles (39360 bytes)
- bitbanded read: 7 cycles (38128 bytes)
- digitalWrite: 54 cycles (39760 bytes)
- gpio_write_bit: 2 cycles (37344 bytes)
- writing to ODR register with premade mask: 12.5 cycles (41072 bytes)
- writing to BSRR register with premade mask: 4 cycles (38144 bytes)
- bitbanded write: 7 cycles (37328 bytes)
However, gpio_write_bit() becomes significantly less space and time efficient when the value being written is not known at compile-time (e.g., I had it write a volatile uint8 which was flipped each write). It still beats bitbanded writing by one clock cycle, at the expense of a lot of space lost.
The new DIGITAL_WRITE() trades space for speed. If you want to trade speed for space, use DIGITAL_WRITE_BITBAND() instead, which will always be faster and smaller than digitalWrite().
Note that my DIGITAL_READ() has an advantage over gpio_read_bit(), because DIGITAL_READ() always returns 0 or 1, while gpio_read_bit() returns 0 or a 32-bit mask. Thus, one can do things like:
uint8 value = DIGITAL_READ(PA8);
DIGITAL_WRITE(PA8,value);
[dave j – Sun Nov 26, 2017 3:20 pm] –
The advantages from bit-banding really come from pre-calculating the address – that way you just need to do a read or write when you use it. Calculating it each time as you are doing loses the main advantage of the technique.
The macros only work when the port is explicitly specified using the PXxx format, and then all the calculations are done by the compiler at compile time. I checked the assembly output both with -O3 and with no optimization. Here is a snippet without optimization (-g):
aa=DIGITAL_READ(PB12);
8002466: 6808 ldr r0, [r1, #0]
8002468: 6018 str r0, [r3, #0]
aa=DIGITAL_READ(PA7);
800246a: 6810 ldr r0, [r2, #0]
800246c: 6018 str r0, [r3, #0]
But I usually (I have done this many times before) post that such practices are anti-Arduino, conceptually. For someone coming over to STM32duino from AVR, it is paradigm quicksand, because not only are we STM32-centric, we may even be writing code for a specific uC within the STM32 product family.
This is a good read and a good refresher for why Arduino cores are inherently inefficient at pin addressing. This all plays into the concerns of library writers over whether to write generic code or use #ifdef to broaden a library's appeal (useful scope.)
Essentially, the choice made will both delight and dismay prospective users; which is just another way of saying ‘you cannot please everyone.’
Ray
I used bitbanding for:
- Atomic operations, which eliminate conflicts with interrupts
- Self-documentation, using variable names instead of port letters and pin numbers
- Accessing pins that were not known at compile time

The reason why I have abandoned bitbanding is:
- It is not supported on the F7 and H7
[arpruss – Sun Nov 26, 2017 3:08 pm] –
The test code uses 400 unrolled operations. For read operations, this is a sequence of 400 reads from PB12 and PA7, alternating.
I think that is a ridiculous test. Who is going to be doing that? The typical use case of digital write is some conditional code then a pin change, then some more conditional logic and another digitalWrite. This is going to push your bitband registers out of r1,r2,r3 and it will have to reload them.
Can you explain why this test matters?
Not that big of a deal to most C programmers. The Arduino crowd, however, seems to be more challenged by that.
One thing I like about my DIGITAL_*() macros is that I don't need two #defines per pin in my sketch header. If I was using gpio_*_bit(), I would need to do something like
You are paying a (small) price for that, however.
[stevestrong – Sun Nov 26, 2017 4:13 pm] –
You have to add to your benchmark the instructions which load the constant values into registers r1, r2, r3.
Good point. If I access more pins in a function, the compiler eventually runs out of registers and does things like:
aa=DIGITAL_READ(PA4);
80028aa: 491e ldr r1, [pc, #120] ; (8002924 <_Z4loopv+0x700>)
80028ac: 6809 ldr r1, [r1, #0]
80028ae: 6019 str r1, [r3, #0]
[Pito – Sun Nov 26, 2017 4:16 pm] –
..trades space for speed. If you want to trade speed for space,..
How could you trade speed for size and vice versa while manipulating the pins?
The size means clocks here..
As far as I can tell, writing to (and probably reading from, but I haven’t tested this) the bitbanded memory area takes several more clock cycles than writing to ordinary memory. So you can have fewer instructions in the code, but with one of the instructions taking rather longer.