For me the time to insert an item is most important as I was using it to buffer incoming serial bytes. I wanted it to be fast as possible. The maple ring_buffer insert takes about 500 nano seconds. The insert function in the code below (push_bach) only takes about 320 nano seconds.
(BTW: this also highlights how kick ass it is to have a 72MHz ARM cpu, on the msp430 an insert takes about ~1680 ns)
ringbuffer.h
/*
ringbuffer.h - yet another version of ringbuffer_t c++ template
Desc: This template class creates a ringbuffer with arbitrary types. Provides push
and pop methods for adding and removing items. Access function for checking
the number of items in the buffer, capacity and full or empty methods
provide safe access to the info.
This version has been optimized for the msp430-gcc and stm32. It doesn't disable
or enable any interrupts. It is safe nonetheless for use when there is a single
writer and single reader. This is common when using an interrupt and the main
line thread working together to move data in and out of peripherals on demand.
Version: 1.0.3
Created: Jul-24-2012
Author: [email protected]
Date: 02-28-2013
=========================================================================
Copyright © 2012-2016 Rick Kimball
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Jul-14-2016: [email protected] removed extra stuff not needed for stm32
formatted using auto format in arduino ide
*/
#ifndef RINGBUFFER_T_H_
#define RINGBUFFER_T_H_
#include <stdint.h>
/**
struct_is_power_of_two - power of 2 checker
enforce at compile time that SIZE is power of 2 and >= 2
*/
template<unsigned SIZE>
struct is_power_of_two {
enum {val = (SIZE >= 2) & (SIZE > 1) & !(SIZE & (SIZE - 1))};
static const unsigned badSIZE[(val == 1) ? 1 : -1]; // SIZE is not a power of 2 if you an error here.
};
/**
uint16x2_t - a union containing 16 bit head and tail offsets into the ring buffer. The union
allows the c code to grab both values with one assembler instruction access.
*/
union uint16x2_t {
// access as 32 bit
unsigned long both;
// -- or as 2 16 bit values --
struct {
unsigned head: 16;
unsigned tail: 16;
};
};
/**
ringbuffer_t - provide a circular_buffer without disabling interrupts
expects a power of 2 container, and only one reader and one writer
container capacity SIZE-1
*/
template <
typename T, /* works with any type */
uint16_t SIZE, /* how many elements-1 must be power of 2 */
typename POP_T = int16_t, /* return type of pop_front */
POP_T EMPTY_ELEM = (POP_T) - 1 /* default return value when empty */
>
struct ringbuffer_t {
// --- private structure data ---
// although variables are accessible because this is a struct
volatile uint16x2_t offsets; // comes first so we can use 0 offset to variables
// for fastest access
T elements[SIZE];
enum { CAPACITY = SIZE - 1 }; // leave one slot open
is_power_of_two<SIZE> check_buffer_size; // your SIZE is not a power of 2, if you get an error here
// --- public methods ---
// write access zeros out head and tail
inline void clear(void ) {
offsets.both = 0;
}
// return the count of used slots
size_t available() {
register uint16x2_t temp = { offsets.both };
temp.both = (temp.head - temp.tail) & CAPACITY;
return temp.both;
}
// return maximum number of slots available
size_t capacity() {
return CAPACITY;
}
// returns true when there is no used slots
bool inline empty() {
return !available();
}
// returns true when all slots used
bool inline full() {
return available() == capacity();
}
/*
push_back() - adds an element to the end of the queue
Note: affects head, reads tail, element ignored if overflow ~300 ns @72MHz
*/
void push_back(const T element) {
register uint16_t temp_head = offsets.head;
elements[temp_head++] = element;
temp_head &= CAPACITY;
if ( !(temp_head == offsets.tail) ) { // !full
offsets.head = temp_head;
}
return;
}
// no bounds check version, affects head ~250 ns @72MHz
void push_back_nc(const T element) {
register uint16_t temp_head = offsets.head;
elements[temp_head++] = element;
offsets.head = temp_head & CAPACITY;
return;
}
// affects tail, reads head
POP_T pop_front(void) {
register uint16x2_t temp = { offsets.both };
if ( (temp.head - temp.tail) & CAPACITY ) { // !empty
POP_T elem = elements[temp.tail++];
offsets.tail = temp.tail & CAPACITY;
return elem;
}
return EMPTY_ELEM; // on underflow return default element
}
// no bounds check, affects tail
POP_T pop_front_nc(void) {
register uint16_t temp_tail = offsets.tail;
POP_T elem = elements[temp_tail++];
offsets.tail = temp_tail & CAPACITY;
return elem;
}
};
#endif /* RINGBUFFER_T_H_ */
One request: would it be possible for you to benchmark also my version of ring_buffer?
Anyway one question: I didn’t understand if you said your solution is race conditions safe even without the need to disable interrupts and, if yes, why, could you please explain?
Thanks, E.
In the original code I had posted, the available() function used 3 registers, r0,r2,r3. In 6.2.1 generated code, it is accomplished with one register and a single asm call:
arm-none-eabi-gcc 4.8.3 temp.both = (temp.head - temp.tail) & CAPACITY; return temp.both;
// return the count of used slots
size_t available() {
register uint16x2_t temp = { offsets.both };
800015a: 6803 ldr r3, [r0, #0]
800015c: b29a uxth r2, r3
800015e: eba2 4013 sub.w r0, r2, r3, lsr #16
}
8000162: f000 000f and.w r0, r0, #15
8000166: 4770 bx lr
I have a quick question though, what happens if you try to use it with an array? From doing some reading I see that template classes do work with arrays as the datatype, but I assume I would have to write special functions for copying in and returning data?
EDIT:
In hindsight passing whole arrays in and out of a ringbuffer is kind of an odd thing to want to do. My solution in the end was to add a discard_front() method which could remove invalid messages from the front of queue.
typename POP_T = int16_t, /* return type of pop_front */
...
// affects tail, reads head
POP_T pop_front(void) {
register uint16x2_t temp = { offsets.both };
if ( (temp.head - temp.tail) & CAPACITY ) { // !empty
POP_T elem = elements[temp.tail++];
offsets.tail = temp.tail & CAPACITY;
return elem;
}
return EMPTY_ELEM; // on underflow return default element
}
ringbuffer_t<T, buffer_size, T, -1> values; /* <<<< note this is different than yours */
For the POP_T, you can use the same type as used for the element T or a different one.
I used this “feature” to my advantage using uint8_t as the element data type but a int for the POP_T with an empty value of -1.
This allows me to loop on the return value:
ringbuffer_t<uint8_t, 64, int, -1> buffer;
...
int c;
while((c=buffer.pop_front()) != -1) {
Serial.print((char)c);
}
[Rick Kimball – Thu Jul 14, 2016 5:21 pm] –
…
Note: as this stuff is template based, you could change the element buffer from an array of uint8_t data to something completely different. It could be an ring buffer of custom structure, a ring buffer of int16_t, or even a ring buffer of function pointers.
…
Rick,
Wow … seriously cool. We mortals are unworthy, but we will use it anyway!
Ray
[Rick Kimball – Wed Oct 04, 2017 3:54 am] –
You just need to provide the POP_T data type and make it the same as the T data type.I used this “feature” to my advantage using uint8_t as the data type but returning a int16_t with an empty value of -1.
Aha, I knew there had to be a reason for it, I just couldn’t think of any reason why the return type would be different to the stored type. That makes more sense now!
[Rick Kimball – Wed Oct 04, 2017 3:54 am] –
BTW: You should probably call values.clear() in your constructor so it will work as a local variable on the stack.
Is that just ensuring that the structure is initialised to 0 when it is created?
Also, why is the capacity one less than the actual size of the buffer? It really slows down my moving mean code if the number of actual elements is not a power of two because you have to do actual division not just a bit shift.
[C_D – Wed Oct 04, 2017 11:54 pm] –
Is that just ensuring that the structure is initialised to 0 when it is created?
Yes. However this is only a problem if it instanced on the stack. If you make it a global variable the .bss section gets zeroed out at startup.
[C_D – Wed Oct 04, 2017 11:54 pm] –
Also, why is the capacity one less than the actual size of the buffer?
I leave one slot open so I can detect when the element buffer is full without having to keep a count. There are different ways to implement a circular buffer. I used the approach that uses 2 indices and an element buffer. This minimizes code size and locking issues.
This code make a few assumptions:
1. There will be a single producer and a single consumer. Typically the producer is an interrupt routine getting data asynchronously and a main thread acting as the consumer. I initially created it for dealing with UART data.
2. I used a datatype of uint16_t for the head and tail indices. That limits the size of the buffer elements to (1<<16)-1 (65535). This isn’t a problem on the chips I’m targetting. The stm32f103c8 only has 20k of ram total. However, it also allows me to do an atomic read of the both indices without locking as I stuff the two uint16_t values into a uint32_t union. The producer is the only one who is going to set the head index. The consumer is the only one who is going to set the tail index. This makes the code pretty safe if you ignore the case where it overflows. I leave that as an exercise to the reader. They will know best about how they are producing and consuming. They can decide to block or ignore on an overflow condition. I don’t put that in the code.
3. I assume a buffer element size that is a power of 2 so I can use a mask to deal with buffer index wrap around instead of using a modulo operator. See my earlier post that shows how arm-none-eabi-gcc reduces the computation of the number of available elements to a single thumb2 instruction.
[C_D – Wed Oct 04, 2017 11:54 pm] –
It really slows down my moving mean code if the number of actual elements is not a power of two because you have to do actual division not just a bit shift.
You are already checking to see if the buffer is full to decide if you need to throw out an element. Instead, you could always use a size of 16 for the ringbuffer and then check for an available() count of 8 instead of the the buffer.capacity(). That would allow you to use a shift divide. However, you still have to deal with the initial conditions when you have less than 8 samples using a normal divide instruction.
[Rick Kimball – Thu Oct 05, 2017 3:45 pm] – You are already checking to see if the buffer is full to decide if you need to throw out an element. Instead, you could always use a size of 16 for the ringbuffer and then check for an available() count of 8 instead of the the buffer.capacity(). That would allow you to use a shift divide. However, you still have to deal with the initial conditions when you have less than 8 samples using a normal divide instruction.
That’s actually a not a bad idea, trading off speed for a small amount of wasted space. For my current application I am sampling relatively slowly so the integer division isn’t a problem, but that’s probably the best approach if I do need to speed things up further down the track, thanks ![]()


