[dannyf – Sun Nov 26, 2017 3:32 am] –
I made some fast bitband-based gpio preprocessor macros that generate fast and small I/O code.
value = DIGITAL_READ(PB13);
DIGITAL_WRITE(PB14,1);
Just curious: how much faster are they than the standard approaches?
Basically, reading and writing becomes as fast as reading/writing a uint32 to a location pointed by a global uint32* pointer. I haven’t timed it yet.
Updated: Not quite. The processor seems to have an extra overhead on accessing bitbanded memory locations, especially when writing.
I have looked at the digitalWrite function and it's pretty well optimised, despite not looking like it.
The compiler is a strange beast.
I'd try timing your code, taking the call overhead into consideration, and see if it is much faster.
digitalWrite: 1517 ms (591 ns/write)
DIGITAL_WRITE: 570 ms (118 ns/write)

const uint8_t inPin = PB0;
const uint8_t outPin = PB10;
volatile uint32_t * inPort = portInputRegister(digitalPinToPort(inPin));
uint16_t inMask = BIT(digitalPinToBit(inPin));
volatile uint32_t * outPort = portSetRegister(outPin);
uint16_t outMask = BIT(digitalPinToBit(outPin));
...
bool rd = ( (*inPort) & inMask ) ? 1 : 0; // read
...
*outPort = outMask; // set the pin
*outPort = (uint32_t)outMask << 16; // reset the pin
I happened to be running IAR (7.10 I think) when I asked the question.
So I decided to benchmark my GPIO macros:
//port/gpio oriented macros (CMSIS-style: port is a GPIO_TypeDef*)
#define IO_SET(port, pins) port->ODR |= (pins) //set bits on port
#define IO_CLR(port, pins) port->ODR &=~(pins) //clear bits on port
//fast routines through BRR/BSRR registers
#define FIO_SET(port, pins) port->BSRR = (pins)
#define FIO_CLR(port, pins) port->BRR = (pins)
//the same macros adapted to libmaple (port is a gpio_dev*)
#define IO_SET(port, pins) port->regs->ODR |= (pins) //set bits on port
#define IO_CLR(port, pins) port->regs->ODR &=~(pins) //clear bits on port
//fast routines through BRR/BSRR registers
#define FIO_SET(port, pins) port->regs->BSRR = (pins)
#define FIO_CLR(port, pins) port->regs->BRR = (pins)
You could have done it with other macros. And it is actually a good idea NOT to define your own and to rely on the device header file instead – that way your macros are portable across platforms.
Summary: For reading, bitbanded is fastest and smallest. For constant value writing, gpio_write_bit() is by far fastest, while bitbanded is slightly smaller.
- digitalRead: 52.5 cycles (40568 bytes)
- gpio_read_bit: 9.5 cycles (39744 bytes)
- reading from register with premade mask: 9 cycles (39360 bytes)
- bitbanded read: 7 cycles (38128 bytes)
- digitalWrite: 54 cycles (39760 bytes)
- gpio_write_bit: 2 cycles (37344 bytes)
- writing to ODR register with premade mask: 12.5 cycles (41072 bytes)
- writing to BSRR register with premade mask: 4 cycles (38144 bytes)
- bitbanded write: 7 cycles (37328 bytes)
However, gpio_write_bit() becomes significantly less space and time efficient when the value being written is not known at compile-time (e.g., I had it write a volatile uint8 which was flipped each write). It still beats bitbanded writing by one clock cycle, at the expense of a lot of space lost.
The new DIGITAL_WRITE() trades space for speed. If you want to trade speed for space, use DIGITAL_WRITE_BITBAND() instead, which will always be faster and smaller than digitalWrite().
Note that my DIGITAL_READ() has an advantage over gpio_read_bit(), because DIGITAL_READ() always returns 0 or 1, while gpio_read_bit() returns 0 or a 32-bit mask. Thus, one can do things like:
uint8 value = DIGITAL_READ(PA8);
DIGITAL_WRITE(PA8,value);
[dave j – Sun Nov 26, 2017 3:20 pm] –
The advantages from bit-banding really come from pre-calculating the address – that way you just need to do a read or write when you use it. Calculating it each time as you are doing loses the main advantage of the technique.
The macros only work when the port is explicitly specified using the PXxx format, and then all the calculations are done by the compiler at compile time. I checked the assembly output both with -O3 and with no optimization. Here is a snippet without optimization (-g):
aa=DIGITAL_READ(PB12);
8002466: 6808 ldr r0, [r1, #0]
8002468: 6018 str r0, [r3, #0]
aa=DIGITAL_READ(PA7);
800246a: 6810 ldr r0, [r2, #0]
800246c: 6018 str r0, [r3, #0]
But I usually (I have done this many times before) post that such practices are anti-Arduino, conceptually. For someone coming over to STM32duino from AVR, it is paradigm quicksand, because not only are we STM32-centric, we may even be writing code for a specific uC within the STM32 product family.
This is a good read and a good refresher for why Arduino cores are inherently inefficient at pin addressing. This all plays into the concerns of library writers over whether to write generic code or use #ifdef to broaden a library's appeal (useful scope.)
Essentially, the choice made will both delight and dismay prospective users; which is just another way of saying ‘you cannot please everyone.’
Ray
I used bitbanding for:
- Atomic operations, which eliminate conflicts with interrupts
- Self-documentation, using variable names instead of port letters and pin numbers
- Accessing pins that were not known at compile time

The reason why I have abandoned bitbanding is:
- It is not supported on the F7 and H7
[arpruss – Sun Nov 26, 2017 3:08 pm] –
The test code uses 400 unrolled operations. For read operations, this is a sequence of 400 reads from PB12 and PA7, alternating.
I think that is a ridiculous test. Who is going to be doing that? The typical use case of digital write is some conditional code then a pin change, then some more conditional logic and another digitalWrite. This is going to push your bitband registers out of r1,r2,r3 and it will have to reload them.
Can you explain why this test matters?
Not that big of a deal to most C programmers. The Arduino crowd, however, seems to be more challenged by that.
One thing I like about my DIGITAL_*() macros is that I don't need two #defines per pin in my sketch header. If I was using gpio_*_bit(), I would need to do something like
You are paying a (small) price for that, however.
[stevestrong – Sun Nov 26, 2017 4:13 pm] –
You have to add to your benchmark the instructions which load the constant values into registers r1, r2, r3.
Good point. If I access more pins in a function, the compiler eventually runs out of registers and does things like:
aa=DIGITAL_READ(PA4);
80028aa: 491e ldr r1, [pc, #120] ; (8002924 <_Z4loopv+0x700>)
80028ac: 6809 ldr r1, [r1, #0]
80028ae: 6019 str r1, [r3, #0]
[Pito – Sun Nov 26, 2017 4:16 pm] –
..trades space for speed. If you want to trade speed for space,..
How could you trade speed for size and vice versa while manipulating the pins?
The size means clocks here..
As far as I can tell, writing to (and probably reading from, but I haven’t tested this) the bitbanded memory area takes several more clock cycles than writing to ordinary memory. So you can have fewer instructions in the code, but with one of the instructions taking rather longer.