A faster ringbuffer implementation

Rick Kimball

Thu Jul 14, 2016 5:21 pm

A while back I had created a template based ring buffer (circular buffer) using C++ templates for the msp430 mcu. It was faster than the c based one used in Arduino/Energia. I gave it a try on stm32duino and compared the speed to the libmaple/ring_buffer.h version. The ringbuffer_t below seems to take 36% less time.

For me the time to insert an item is most important as I was using it to buffer incoming serial bytes. I wanted it to be fast as possible. The maple ring_buffer insert takes about 500 nano seconds. The insert function in the code below (push_bach) only takes about 320 nano seconds.

(BTW: this also highlights how kick ass it is to have a 72MHz ARM cpu, on the msp430 an insert takes about ~1680 ns)

ringbuffer.h
/* ringbuffer.h - yet another version of ringbuffer_t c++ template


Desc: This template class creates a ringbuffer with arbitrary types. Provides push

and pop methods for adding and removing items.  Access function for checking

the number of items in the buffer, capacity and full or empty methods

provide safe access to the info.
This version has been optimized for the msp430-gcc and stm32. It doesn't disable

or enable any interrupts. It is safe nonetheless for use when there is a single

writer and single reader. This is common when using an interrupt and the main

line thread working together to move data in and out of peripherals on demand.
Version: 1.0.3

Created: Jul-24-2012

Author: [email protected]

Date: 02-28-2013
=========================================================================

Copyright © 2012-2016 Rick Kimball
This program is free software: you can redistribute it and/or modify

it under the terms of the GNU General Public License as published by

the Free Software Foundation, either version 3 of the License, or

(at your option) any later version.
This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

GNU General Public License for more details.
You should have received a copy of the GNU General Public License

along with this program.  If not, see <http://www.gnu.org/licenses/>.
Jul-14-2016: [email protected] removed extra stuff not needed for stm32

formatted using auto format in arduino ide
*/
#ifndef RINGBUFFER_T_H_

#define RINGBUFFER_T_H_

#include <stdint.h>
/**

struct_is_power_of_two - power of 2 checker

enforce at compile time that SIZE is power of 2 and >= 2

*/
template<unsigned SIZE>

struct is_power_of_two {

enum {val = (SIZE >= 2) & (SIZE > 1) & !(SIZE & (SIZE - 1))};

static const unsigned badSIZE[(val == 1) ? 1 : -1]; // SIZE is not a power of 2 if you an error here.

};
/**

uint16x2_t - a union containing 16 bit head and tail offsets into the ring buffer. The union

allows the c code to grab both values with one assembler instruction access.
*/

union uint16x2_t {

// access as 32 bit

unsigned long both;

// -- or as 2 16 bit values --

struct {

unsigned head: 16;

unsigned tail: 16;

};

};
/**

ringbuffer_t - provide a circular_buffer without disabling interrupts

expects a power of 2 container, and only one reader and one writer

container capacity SIZE-1

*/
template <

typename T,                     /* works with any type */

uint16_t SIZE,                  /* how many elements-1 must be power of 2 */

typename POP_T = int16_t,       /* return type of pop_front */

POP_T EMPTY_ELEM = (POP_T) - 1  /* default return value when empty */

>

struct ringbuffer_t {
// --- private structure data ---

// although variables are accessible because this is a struct
volatile uint16x2_t offsets; // comes first so we can use 0 offset to variables

// for fastest access

T elements[SIZE];
enum { CAPACITY = SIZE - 1 }; // leave one slot open
is_power_of_two<SIZE> check_buffer_size; // your SIZE is not a power of 2, if you get an error here
// --- public methods ---
// write access zeros out head and tail

inline void clear(void ) {

offsets.both = 0;

}
// return the count of used slots

size_t available() {

register uint16x2_t temp = { offsets.both };
temp.both = (temp.head - temp.tail) & CAPACITY;
return temp.both;

}
// return maximum number of slots available

size_t capacity() {

return CAPACITY;

}
// returns true when there is no used slots

bool inline empty() {

return !available();

}
// returns true when all slots used

bool inline full() {

return available() == capacity();

}
/*

push_back() - adds an element to the end of the queue
Note: affects head, reads tail, element ignored if overflow ~300 ns @72MHz

*/

void push_back(const T element) {

register uint16_t temp_head = offsets.head;
elements[temp_head++] = element;

temp_head &= CAPACITY;

if ( !(temp_head == offsets.tail) ) { // !full

offsets.head = temp_head;

}
return;

}
// no bounds check version, affects head ~250 ns @72MHz

void push_back_nc(const T element) {

register uint16_t temp_head = offsets.head;
elements[temp_head++] = element;

offsets.head = temp_head & CAPACITY;

return;

}
// affects tail, reads head

POP_T pop_front(void) {

register uint16x2_t temp = { offsets.both };
if ( (temp.head - temp.tail) & CAPACITY ) { // !empty

POP_T elem = elements[temp.tail++];

offsets.tail = temp.tail & CAPACITY;

return elem;

}
return EMPTY_ELEM; // on underflow return default element

}
// no bounds check, affects tail

POP_T pop_front_nc(void) {

register uint16_t temp_tail = offsets.tail;
POP_T elem = elements[temp_tail++];

offsets.tail = temp_tail & CAPACITY;

return elem;

}
};

#endif /* RINGBUFFER_T_H_ */

Slammer

Thu Jul 14, 2016 8:45 pm

Thanks Rick! Very Nice!

edogaldo

Thu Jul 14, 2016 8:52 pm

Hei Rick, interesting solution.
One request: would it be possible for you to benchmark also my version of ring_buffer?

Anyway one question: I didn’t understand if you said your solution is race conditions safe even without the need to disable interrupts and, if yes, why, could you please explain?

Thanks, E.

Rick Kimball

Thu Jul 14, 2016 9:11 pm

edogaldo wrote:Anyway one question: I didn’t understand if you said your solution is race conditions safe even without the need to disable interrupts and, if yes, why, could you please explain?

Rick Kimball

Thu Jul 14, 2016 9:21 pm

edogaldo wrote:One request: would it be possible for you to benchmark also my version of ring_buffer?

Rick Kimball

Sun Mar 12, 2017 7:27 pm

I’ve been moving over to arm-none-eabi-gcc 6.2.1. I was looking at my asm output and noticed the compiler is getting even smarter.
In the original code I had posted, the available() function used 3 registers, r0,r2,r3. In 6.2.1 generated code, it is accomplished with one register and a single asm call:

arm-none-eabi-gcc 4.8.3
// return the count of used slots size_t available() { register uint16x2_t temp = { offsets.both }; 800015a: 6803 ldr r3, [r0, #0]

temp.both = (temp.head - temp.tail) & CAPACITY; 800015c: b29a uxth r2, r3 800015e: eba2 4013 sub.w r0, r2, r3, lsr #16

return temp.both; } 8000162: f000 000f and.w r0, r0, #15 8000166: 4770 bx lr

C_D

Wed Jul 05, 2017 9:20 pm

I’m really interested in using something like this in one of my projects, I currently have a number of ring buffers storing sensor values and calculating moving averages. A template class like this is going to be great for tidying up my sketch.

I have a quick question though, what happens if you try to use it with an array? From doing some reading I see that template classes do work with arrays as the datatype, but I assume I would have to write special functions for copying in and returning data?

EDIT:
In hindsight passing whole arrays in and out of a ringbuffer is kind of an odd thing to want to do. My solution in the end was to add a discard_front() method which could remove invalid messages from the front of queue.

C_D

Wed Oct 04, 2017 2:26 am

Just came back to my project using these neat ringbuffers, and found theres a bit of bug in the posted code.

typename POP_T = int16_t, /* return type of pop_front */


...
// affects tail, reads head

POP_T pop_front(void) {

register uint16x2_t temp = { offsets.both };
if ( (temp.head - temp.tail) & CAPACITY ) { // !empty

POP_T elem = elements[temp.tail++];

offsets.tail = temp.tail & CAPACITY;

return elem;

}

return EMPTY_ELEM; // on underflow return default element }

Rick Kimball

Wed Oct 04, 2017 3:54 am

You just need to provide the POP_T data type and make it the same as the T data type in your template definition.

ringbuffer_t<T, buffer_size, T, -1> values; /* <<<< note this is different than yours */

For the POP_T, you can use the same type as used for the element T or a different one.

I used this “feature” to my advantage using uint8_t as the element data type but a int for the POP_T with an empty value of -1.

This allows me to loop on the return value:
ringbuffer_t<uint8_t, 64, int, -1> buffer; ... int c; while((c=buffer.pop_front()) != -1) { Serial.print((char)c); }

mrburnette

Wed Oct 04, 2017 1:44 pm

[Rick Kimball – Thu Jul 14, 2016 5:21 pm] –
…
Note: as this stuff is template based, you could change the element buffer from an array of uint8_t data to something completely different. It could be an ring buffer of custom structure, a ring buffer of int16_t, or even a ring buffer of function pointers.
…

Rick,
Wow … seriously cool. We mortals are unworthy, but we will use it anyway!

Ray

C_D

Wed Oct 04, 2017 11:54 pm

[Rick Kimball – Wed Oct 04, 2017 3:54 am] –
You just need to provide the POP_T data type and make it the same as the T data type.

I used this “feature” to my advantage using uint8_t as the data type but returning a int16_t with an empty value of -1.

Aha, I knew there had to be a reason for it, I just couldn’t think of any reason why the return type would be different to the stored type. That makes more sense now!

[Rick Kimball – Wed Oct 04, 2017 3:54 am] –
BTW: You should probably call values.clear() in your constructor so it will work as a local variable on the stack.

Is that just ensuring that the structure is initialised to 0 when it is created?

Also, why is the capacity one less than the actual size of the buffer? It really slows down my moving mean code if the number of actual elements is not a power of two because you have to do actual division not just a bit shift.

Rick Kimball

Thu Oct 05, 2017 3:45 pm

[C_D – Wed Oct 04, 2017 11:54 pm] –
Is that just ensuring that the structure is initialised to 0 when it is created?

Yes. However this is only a problem if it instanced on the stack. If you make it a global variable the .bss section gets zeroed out at startup.

[C_D – Wed Oct 04, 2017 11:54 pm] –
Also, why is the capacity one less than the actual size of the buffer?

I leave one slot open so I can detect when the element buffer is full without having to keep a count. There are different ways to implement a circular buffer. I used the approach that uses 2 indices and an element buffer. This minimizes code size and locking issues.

This code make a few assumptions:

1. There will be a single producer and a single consumer. Typically the producer is an interrupt routine getting data asynchronously and a main thread acting as the consumer. I initially created it for dealing with UART data.

2. I used a datatype of uint16_t for the head and tail indices. That limits the size of the buffer elements to (1<<16)-1 (65535). This isn’t a problem on the chips I’m targetting. The stm32f103c8 only has 20k of ram total. However, it also allows me to do an atomic read of the both indices without locking as I stuff the two uint16_t values into a uint32_t union. The producer is the only one who is going to set the head index. The consumer is the only one who is going to set the tail index. This makes the code pretty safe if you ignore the case where it overflows. I leave that as an exercise to the reader. They will know best about how they are producing and consuming. They can decide to block or ignore on an overflow condition. I don’t put that in the code.

3. I assume a buffer element size that is a power of 2 so I can use a mask to deal with buffer index wrap around instead of using a modulo operator. See my earlier post that shows how arm-none-eabi-gcc reduces the computation of the number of available elements to a single thumb2 instruction.

[C_D – Wed Oct 04, 2017 11:54 pm] –
It really slows down my moving mean code if the number of actual elements is not a power of two because you have to do actual division not just a bit shift.

You are already checking to see if the buffer is full to decide if you need to throw out an element. Instead, you could always use a size of 16 for the ringbuffer and then check for an available() count of 8 instead of the the buffer.capacity(). That would allow you to use a shift divide. However, you still have to deal with the initial conditions when you have less than 8 samples using a normal divide instruction.

C_D

Thu Oct 05, 2017 8:39 pm

[Rick Kimball – Thu Oct 05, 2017 3:45 pm] – You are already checking to see if the buffer is full to decide if you need to throw out an element. Instead, you could always use a size of 16 for the ringbuffer and then check for an available() count of 8 instead of the the buffer.capacity(). That would allow you to use a shift divide. However, you still have to deal with the initial conditions when you have less than 8 samples using a normal divide instruction.

That’s actually a not a bad idea, trading off speed for a small amount of wasted space. For my current application I am sampling relatively slowly so the integer division isn’t a problem, but that’s probably the best approach if I do need to speed things up further down the track, thanks

A faster ringbuffer implementation

Making examples easier to find

Custom STM32F103C8T6

Leave a Reply Cancel reply

A faster ringbuffer implementation

New Posts

Related Posts

Leave a Reply Cancel reply