Here is a proof of concept showing that it actually works. The cool thing is that it takes (almost) no CPU cycles as long as the duty cycles don’t change.
I’m using a double buffer of 32-bit words covering all 16 pins of a port, so the buffer needs 2 * resolution * 4 bytes.
I’ve successfully tested all 16 pins of PortC on an F1 down to a slot time of 4 us, where the slot time is the period divided by the resolution, i.e. 1 / (frequency * resolution). For instance, 980 Hz works at 8-bit resolution.
I’m not sure whether it makes much sense, so I’m eager to hear what you think. It could of course be extended into a real library and support other ports. The open question is where the frequency limits are.
BTW, I wonder whether it is still software PWM, since it’s done via hardware DMA…
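The sizing arithmetic in this post can be checked with a few lines of plain C++. This is only a sketch and not part of the original sketch below; `slotMicros` and `bufferBytes` are hypothetical helper names, and the integer division mirrors the preprocessor guard used in the code.

```cpp
#include <cstdint>

// Duration of one DMA slot in microseconds: one PWM period split into
// `resolution` steps. Integer division, so the result is rounded down.
constexpr uint32_t slotMicros(uint32_t frequencyHz, uint32_t resolution) {
    return 1000000UL / resolution / frequencyHz;
}

// RAM needed for the double buffer: two halves with one 32-bit BSRR word
// per step in each half.
constexpr uint32_t bufferBytes(uint32_t resolution) {
    return 2 * resolution * sizeof(uint32_t);
}
```

With these definitions, `slotMicros(980, 255)` is 4 and `bufferBytes(255)` is 2040, which matches the 4 us and 2 * resolution * 4 bytes figures quoted above.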
#include <Arduino.h>
#include <libmaple/dma.h>
#include <dma_private.h>
#define RESOLUTION 255 // PWM resolution
#define FREQUENCY 500 // PWM frequency
#if 1000000 / RESOLUTION / FREQUENCY < 4
#error did not work for me
#endif
class DMASoftPWM
{
public:
DMASoftPWM();
void begin(gpio_dev *port);
void setPinMode(uint8_t pin, bool enable);
void writePWM(uint8_t pin, uint16_t val);
uint32_t buffer[RESOLUTION * 2];
private:
static DMASoftPWM *anchor;
static void marshall() { anchor->DMAEvent(); }
inline void fillBuffer(uint16_t ptr);
uint16_t pinVal[16];
uint16_t pinmask;
uint8_t refresh;
void DMAEvent();
dma_tube_config tube_config;
};
DMASoftPWM::DMASoftPWM()
{
anchor = this;
}
void DMASoftPWM::begin(gpio_dev *port)
{
refresh = 2;
dma_init(DMA1);
tube_config.tube_src = buffer;
tube_config.tube_src_size = DMA_SIZE_32BITS;
tube_config.tube_dst = (uint32_t *)&port->regs->BSRR; // BSRR of the requested port: low 16 bits set pins, high 16 bits reset them
tube_config.tube_dst_size = DMA_SIZE_32BITS;
tube_config.tube_nr_xfers = RESOLUTION * 2;
tube_config.tube_flags = DMA_CFG_SRC_INC | DMA_CFG_CIRC | DMA_CFG_CMPLT_IE | DMA_CFG_HALF_CMPLT_IE; // Source pointer increment,circular mode
tube_config.target_data = 0;
tube_config.tube_req_src = DMA_REQ_SRC_TIM2_CH3; // DMA request source.
dma_set_priority(DMA1, DMA_CH1, DMA_PRIORITY_VERY_HIGH);
dma_tube_cfg(DMA1, DMA_CH1, &tube_config); // Attach the tube to channel 1 (timer2 ch3)
dma_attach_interrupt(DMA1, DMA_CH1, DMASoftPWM::marshall);
dma_enable(DMA1, DMA_CH1);
//TIMER setup
Timer2.pause();
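// NOTE: setPeriod() takes microseconds. The guard at the top divides 1000000,
// so the 10000000UL here looks like one zero too many and makes the real PWM
// rate 10x lower than FREQUENCY.
Timer2.setPeriod(10000000UL / FREQUENCY / RESOLUTION);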
Timer2.setChannel3Mode(TIMER_OUTPUT_COMPARE);
Timer2.setCompare(TIMER_CH3, 1);
Timer2.refresh();
TIMER2_BASE->DIER = TIMER_DIER_CC3DE;
Timer2.resume();
}
void DMASoftPWM::fillBuffer(uint16_t ptr)
{
for (uint16_t step = 1; step <= RESOLUTION; step++)
{
buffer[ptr] = (uint32_t)pinmask << 16; // cast first: shifting a promoted 16-bit value by 16 can overflow int
for (uint8_t p = 0; p < 16; p++)
{
if (pinmask & (BIT(p)) && pinVal[p] >= step)
buffer[ptr] |= BIT(p);
}
ptr++;
}
refresh--;
}
void DMASoftPWM::DMAEvent()
{
dma_irq_cause event = dma_get_irq_cause(DMA1, DMA_CH1);
if (refresh == 0) // no update so just keep the mem
return;
switch (event)
{
case DMA_TRANSFER_COMPLETE: // now setting the upper half
fillBuffer(RESOLUTION);
break;
case DMA_TRANSFER_HALF_COMPLETE: //now setting the lower half
fillBuffer((uint16_t)0);
break;
case DMA_TRANSFER_ERROR:
ASSERT(0);
break;
case DMA_TRANSFER_DME_ERROR:
ASSERT(0);
break;
case DMA_TRANSFER_FIFO_ERROR:
ASSERT(0);
break;
}
}
void DMASoftPWM::setPinMode(uint8_t pin, bool enable)
{
pinMode(pin, OUTPUT);
if (enable)
pinmask |= digitalPinToBitMask(pin);
else
pinmask &= ~digitalPinToBitMask(pin);
}
void DMASoftPWM::writePWM(uint8_t pin, uint16_t val)
{
pinVal[pin] = val;
refresh = 2;
}
DMASoftPWM *DMASoftPWM::anchor = NULL;
DMASoftPWM softPWMPortC;
void setup()
{
Serial.begin(115200);
Serial.println("starting usb serial");
softPWMPortC.begin(GPIOC);
softPWMPortC.setPinMode(PC13, true);
}
void loop()
{
if (Serial.available())
{
int pin = Serial.parseInt();
int val = Serial.parseInt();
while (Serial.available())
Serial.read();
softPWMPortC.writePWM(pin, val);
Serial.print(pin);
Serial.print(':');
Serial.println(val);
#ifdef DEBUGBUFFER
delay(100);
for (int u = 0; u < RESOLUTION * 2; u++)
Serial.println(softPWMPortC.buffer[u], BIN);
#endif
}
static uint32_t sweep;
static uint16_t t = 0;
if (millis() - sweep > 1000 / RESOLUTION)
{
sweep = millis();
t = (t + 1) % RESOLUTION; // avoids modifying t twice in one expression
softPWMPortC.writePWM(13, t);
}
}
There is no way to get 4ns pulse with F1..
Looks like Timer2.setPeriod() does a bit weird stuff if it comes below 4. With
Timer2.setPrescaleFactor(F_CPU / RESOLUTION / FREQUENCY);
Timer2.setOverflow(1);
Why do you use 2 buffers with the capacity of resolution?
Shouldn’t 1 be enough since you are using circular mode?
My guess is that it lets you update the PWM duty cycle in one buffer while the DMA is sending the other, which avoids the artifacts you could get from updating values the DMA is currently sending. But you could also update the buffer that is being sent: if you fill it from the top down, I think you shouldn’t get artifacts, and at most the duty cycle for one period would fall between the original and the updated value.
It’s a nice idea. I had been thinking about something similar, but just to send pulses, without a specific duty cycle.
I have also used DMA to do real hardware PWM in a timer and it works great, with each value in the buffer representing the duty cycle for 1 PWM pulse. That’s for audio, so each pulse needs a different one. But for something that needs a certain frequency generated on multiple pins with different duty cycles, I think your idea is great.
[victor_pv – Fri Jul 07, 2017 2:13 pm] –
Why do you use 2 buffers with the capacity of resolution?
Shouldn’t 1 be enough since you are using circular mode?
My guess is that it lets you update the PWM duty cycle in one buffer while the DMA is sending the other, which avoids the artifacts you could get from updating values the DMA is currently sending. But you could also update the buffer that is being sent: if you fill it from the top down, I think you shouldn’t get artifacts, and at most the duty cycle for one period would fall between the original and the updated value.
Oh, very interesting. I was plainly assuming that I would run into serious issues from a race condition on concurrent access to the same memory. Actually, I have no idea what might happen, do you?
If that’s no issue, could you explain a bit more why filling from the top is better here?
If my measurement is accurate, the fill takes about 50 us, so let’s say fewer than 20 steps. I’m not sure how long the jump into the ISR takes, but it means the buffer fill will probably overtake the DMA quite soon, assuming the above is a valid situation.
Cheers, Ollie
[universam10 – Fri Jul 07, 2017 2:33 pm] –
Oh, very interesting. I was plainly assuming that I would run into serious issues from a race condition on concurrent access to the same memory. Actually, I have no idea what might happen, do you?
If that’s no issue, could you explain a bit more why filling from the top is better here?
If my measurement is accurate, the fill takes about 50 us, so let’s say fewer than 20 steps. I’m not sure how long the jump into the ISR takes, but it means the buffer fill will probably overtake the DMA quite soon, assuming the above is a valid situation.
The bus will arbitrate access. Since neither the DMA at these frequencies nor the CPU writing to a buffer can exhaust the RAM bandwidth, there should be no issue. If the DMA and the CPU do try to access the memory at exactly the same time, the bus splits access 50% each, so a request from one of them may be held for a cycle or so while the other’s transaction completes. Racemaniac wrote a separate thread where he pushed the limits of the CPU and DMA by doing multiple transfers at the same time. He could clog it using 2 SPI ports at full speed + mem2mem DMA access + the CPU doing something else. I think in his results the mem2mem access (which goes as fast as the RAM can go) plus 1 SPI port at full speed would still run fine without affecting the CPU much, but check that thread for the details. The upshot is that unless you are running the DMA at the full 72 MHz, you are not likely to have any problem at all.
About filling from top to bottom: I was thinking of a situation in which the period is changing and the DMA can flip the same bit twice in the same period. But as I was writing an example, I realized that filling from top to bottom could have the same effect, only in a different situation, and still cause the same signal to flip twice in the same period. So it would not be a solution.
A double buffer would avoid that. The other option is to use a single buffer but update half of it at a time. That’s just the same as what you do, only with each buffer taking half the period. If there is enough RAM, there is no advantage in each buffer having half the period.
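To make the "refill the half the DMA has just left" idea concrete, here is a minimal host-side sketch in plain C++. It is an illustration under assumptions, not the posted code: `fillHalf`, `onDmaEvent` and the tiny `kSteps` value are hypothetical, and the strict `>` comparison is chosen so that a value of 0 keeps the pin low in every slot. It relies on the BSRR rule that a set bit wins when set and reset are written for the same pin.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSteps = 8;  // steps per PWM period (small, for illustration)

// Fill one half of a 2 * kSteps circular buffer with BSRR words.
// The upper 16 bits reset every enabled pin; the lower 16 bits set the pins
// that are still "on" at this step.
void fillHalf(uint32_t (&buffer)[2 * kSteps], std::size_t offset,
              uint16_t pinmask, const uint16_t (&pinVal)[16]) {
    for (std::size_t step = 0; step < kSteps; ++step) {
        uint32_t word = static_cast<uint32_t>(pinmask) << 16;
        for (unsigned p = 0; p < 16; ++p) {
            if ((pinmask & (1u << p)) && pinVal[p] > step)
                word |= 1u << p;
        }
        buffer[offset + step] = word;
    }
}

// Half-complete means the DMA is now reading the upper half, so the lower
// half is free to rewrite; transfer-complete is the opposite case.
void onDmaEvent(bool transferComplete, uint32_t (&buffer)[2 * kSteps],
                uint16_t pinmask, const uint16_t (&pinVal)[16]) {
    fillHalf(buffer, transferComplete ? kSteps : 0, pinmask, pinVal);
}
```

The CPU never writes the half that is currently being streamed, which is what rules out a pin toggling twice within one period.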
[Ollie – Fri Jul 07, 2017 3:02 pm] –
The classic PWM is analog and quite slow – it used to be 20 ms, but it is still limited by the servo signal definition of 1 – 2 ms.
What do you mean that the classic PWM is analog and slow? Are you referring to the STM32F1 or something else?
[universam10 – Fri Jul 07, 2017 12:29 pm] –
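Looks like Timer2.setPeriod() does a bit weird stuff if it comes below 4. With
Timer2.setPrescaleFactor(F_CPU / RESOLUTION / FREQUENCY);
Timer2.setOverflow(1);
Is the problem related to the timer, or is it perhaps due to fast rate of ISR and the time it takes to fill the buffer?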
[victor_pv – Fri Jul 07, 2017 3:47 pm] –
What do you mean that the classic PWM is analog and slow? Are you referring to the STM32F1 or something else?
He refers to a standard used in RC systems since the sixties: a channel “PWM” period of 20 ms (50 Hz), with an active pulse length of 1-2 ms, where 1500 us is the middle position of the servo on that channel. People call it analog even though it isn’t today; what is analog is the servo loop. An ESC (which converts the servo pulse length into the actual power of the BLDC motor) is fed by that signal as well. But the guys flying acro and similar stuff need something faster, as a 50 Hz control loop per channel is too slow for them. They need a period of something like 1 ms or less, since flying a quadcopter at 100 knots among the trees in a forest requires fast responses in the control chain.
[Pito – Fri Jul 07, 2017 4:08 pm] –
He refers to a standard used in RC systems since the sixties: a channel “PWM” period of 20 ms (50 Hz), with an active pulse length of 1-2 ms, where 1500 us is the middle position of the servo on that channel. People call it analog even though it isn’t today; what is analog is the servo loop. An ESC (which converts the servo pulse length into the actual power of the BLDC motor) is fed by that signal as well. But the guys flying acro and similar stuff need something faster, as a 50 Hz control loop per channel is too slow for them. They need a period of something like 1 ms or less, since flying a quadcopter at 100 knots among the trees in a forest requires fast responses in the control chain.
Ahhh that makes sense. HW PWM in the STMs should have no problem with 1ms periods, and looks like DMA/Software doesn’t either.
The new Dshot600 uses a fixed 26 us frame and the newest Dshot1200 a fixed 13 us frame. These two are called “digital” because the control unit sends a binary number, coded in pulse lengths, to the ESC controller within each frame (a frame is 16 bits: 11 bits for the actual power value, 1 telemetry-request bit and a 4-bit CRC). The ESC contains an MCU (e.g. an STM32) that decodes the number from the frame and sets the power accordingly. They can do ~30k updates per second. The STM32 inside the ESC must a) decode the frame and b) generate 3x 30 kHz (11-bit) PWM to drive the 3 phases of the brushless DC motor.
So a typical 4-8 copter setup would require an F4 in the control unit (IMU + RC signal fusion) and 4-8x F3(4) in the ESCs (one ESC per motor). A lot of silicon..
Q: are you able to generate, e.g., 8x Dshot600/1200 frames on 8 GPIO outputs via DMA (all 8 frames shot out in parallel)?
https://github.com/cleanflight/cleanfli … m_output.c
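As a rough sketch of what that question would involve, the code below packs a DShot frame and expands several frames into BSRR words that a timer-paced DMA could stream to one port. It is an illustration only: `dshotFrame` and `dshotToBsrr` are hypothetical names, and using 3 DMA slots per bit (high for 1 slot on a 0, 2 slots on a 1) is an assumed approximation of the DShot pulse widths, not something taken from this thread. Whether an F1 can sustain the required rate is exactly the open question; for Dshot600 it would mean 3 slots per 1.67 us bit, i.e. about 1.8 MHz of DMA requests.

```cpp
#include <cstddef>
#include <cstdint>

// Pack one DShot frame: 11-bit throttle, 1 telemetry-request bit, 4-bit checksum.
uint16_t dshotFrame(uint16_t throttle, bool telemetry) {
    const uint16_t value =
        static_cast<uint16_t>(((throttle & 0x07FFu) << 1) | (telemetry ? 1u : 0u));
    const uint16_t crc =
        static_cast<uint16_t>((value ^ (value >> 4) ^ (value >> 8)) & 0x0Fu);
    return static_cast<uint16_t>((value << 4) | crc);
}

constexpr std::size_t kFrameBits = 16;
constexpr std::size_t kSlotsPerBit = 3;  // assumption: 3 DMA slots per bit

// Expand `count` frames into BSRR words, MSB first. pins[i] (0..15) is the
// port bit that carries frames[i].
void dshotToBsrr(const uint16_t* frames, const uint8_t* pins, std::size_t count,
                 uint32_t (&out)[kFrameBits * kSlotsPerBit]) {
    uint32_t all = 0;
    for (std::size_t i = 0; i < count; ++i) all |= 1u << pins[i];

    for (std::size_t bit = 0; bit < kFrameBits; ++bit) {
        uint32_t zeros = 0;  // pins whose current bit is 0
        for (std::size_t i = 0; i < count; ++i)
            if (!(frames[i] & (0x8000u >> bit))) zeros |= 1u << pins[i];

        out[bit * kSlotsPerBit + 0] = all;          // every pin goes high
        out[bit * kSlotsPerBit + 1] = zeros << 16;  // 0-bits drop after one slot
        out[bit * kSlotsPerBit + 2] = all << 16;    // 1-bits drop after two slots
    }
}
```

For example, `dshotFrame(1046, false)` yields `0x82C6`. Because every pin shares the same slot clock, all frames leave the port in parallel.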
Thanks for the explanation about the race conditions, I will try with a single buffer.
[victor_pv – Fri Jul 07, 2017 4:02 pm] –
Is the problem related to the timer, or is it perhaps due to fast rate of ISR and the time it takes to fill the buffer?
It bothers me that with a timer trigger below 1 us the F1 just crashes from the start. Unfortunately I have no debugger right now, so I’m in the dark about why this happens. AFAIK, and according to what you quoted, the DMA and the port shouldn’t be limited to 1 MHz, so I wonder what the limit is here. Any ideas?
[universam10 – Fri Jul 07, 2017 10:01 pm] –
It bothers me that with a timer trigger below 1 us the F1 just crashes from the start. Unfortunately I have no debugger right now, so I’m in the dark about why this happens. AFAIK, and according to what you quoted, the DMA and the port shouldn’t be limited to 1 MHz, so I wonder what the limit is here. Any ideas?
They shouldn’t be limited, and it is strange that it crashes. If the DMA were going too fast for the RAM, it would just fail to keep up with the frequency, not crash; Racemaniac pushed the limits in his tests and didn’t get crashes.
If I have some time later today I’ll flash it to a board and check with the debugger.
Just tested: I can get up to 8 kHz with 255 steps of resolution.
That’s about 2M slots per second (8000 * 255) if I calculate it right.
At 10 kHz the DMA handler crashes. I think it’s tripping an ISR before the previous one has completed.
(That was actually 800 Hz with 255 steps of resolution; see my note about the setPeriod division.)
With 100 steps of resolution I can go up to 2.5 kHz.
From the debugger, it fails this ASSERT:
dma_irq_cause dma_get_irq_cause(dma_dev *dev, dma_channel channel) {
/* Grab and clear the ISR bits. */
uint8 status_bits = dma_get_isr_bits(dev, channel);
dma_clear_isr_bits(dev, channel);
/* If the channel global interrupt flag is cleared, then
* something's very wrong. */
ASSERT(status_bits & 0x1)
[victor_pv – Fri Jul 07, 2017 11:21 pm] –
I think the ISR is taking too long to get serviced. Right now the filling is done inside the ISR, so a possible solution is to not do that, and instead set a flag and have a loop waiting for the flag to fill the DMA.
That would prevent the problem with the nesting ISRs, but likely the DMA would complete a cycle before the fillBuffer function has been able to fill the buffer completely.
Great, thanks for debugging!
I took your advice and now update the buffer outside of the ISR, since the race condition will not occur… see below
[victor_pv – Fri Jul 07, 2017 11:21 pm] –
I think it would be better to use 2 independent buffers and refill each buffer outside of the ISR, so the refill doesn’t need to complete before the next DMA transfer. In fact you should start refilling a buffer as soon as the DMA starts using the other one. By the time the DMA has run X cycles, the function that refills the buffer has had X times the time to finish filling the second buffer; then the ISR just switches the buffer address and starts the DMA over.
That’s not clear to me, since I am using an F1, which has no ping-pong buffers for DMA. If I switch the buffer, I therefore need to stop and restart the DMA, which (I can’t measure it right now) will likely mean a longer interruption of the PWM. Are there methods to do this within a single cycle or a few cycles?
[victor_pv – Fri Jul 07, 2017 11:21 pm] –
So if you want to increase the speed considerably, offloading fillBuffer to be run outside of the ISR, and possibly using more buffers, so one can be filled up during a period longer than the DMA takes to run a cycle should be a good solution.
Great, so I put the suggestions together:
As was said, fillBuffer can run outside of the ISR and not necessarily in sync, and that obviously does work. There may be some glitches.
Therefore I could _completely_ remove the whole ISR part, as it became unnecessary.
Next I removed the double buffer, which was also unnecessary.
Thinking twice, another idea came to mind: with writePWM() I’m updating only one pin, but in fillBuffer() I’m rewriting the whole port, so all 16 pins. So I created a vertical fill that changes only the pin that was updated, which as a result runs 16x faster.
With these changes I can now take the DMA trigger all the way down to a prescaler of 1, which is a clock speed of 72 MHz!
Of course that rate is not technically achievable, but the F1 no longer crashes at a PWM frequency of 280 kHz with 8-bit resolution.
Now I would be super curious what the real speed on the pins is, if you have the equipment to measure that.
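The "vertical" update described above can be tried on a plain array before putting it on hardware. This is a host-side sketch under assumptions: `fillPinColumn` is a hypothetical name, and it uses a strict `duty > step` comparison so that a duty of 0 leaves the pin off in every slot.

```cpp
#include <cstddef>
#include <cstdint>

// Rewrite only one pin's bit in every BSRR word of the buffer. The reset bits
// in the upper half-word and all other pins' set bits are left untouched.
template <std::size_t N>
void fillPinColumn(uint32_t (&buffer)[N], unsigned pin, uint16_t duty) {
    const uint32_t setBit = 1u << pin;
    for (std::size_t step = 0; step < N; ++step) {
        if (duty > step)
            buffer[step] |= setBit;   // pin is high during this slot
        else
            buffer[step] &= ~setBit;  // only the pre-filled reset bit remains
    }
}
```

Each call touches one bit per slot instead of testing 16 pins per slot, which is where the roughly 16x saving comes from.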
#include <Arduino.h>
#include <libmaple/dma.h>
#include <dma_private.h>
#define RESOLUTION 255 // PWM resolution
#define FREQUENCY 200000 // PWM frequency
// #if F_CPU / RESOLUTION / FREQUENCY < 120
// #error did not work for me
// #endif
class DMASoftPWM
{
public:
DMASoftPWM();
void begin(gpio_dev *port);
void setPinMode(uint8_t pin, bool enable);
void writePWM(uint8_t pin, uint16_t val);
uint32_t buffer[RESOLUTION];
uint16_t pinmask;
private:
void fillBufferVert(uint8_t channel);
// static DMASoftPWM *anchor;
// static void marshall() { anchor->DMAEvent(); }
inline void fillBuffer(uint16_t ptr);
uint16_t pinVal[16];
// uint8_t refresh;
// void DMAEvent();
dma_tube_config tube_config;
};
DMASoftPWM::DMASoftPWM()
{
// anchor = this;
}
void DMASoftPWM::begin(gpio_dev *port)
{
dma_init(DMA1);
tube_config.tube_src = buffer;
tube_config.tube_src_size = DMA_SIZE_32BITS;
tube_config.tube_dst = (uint32_t *)&port->regs->BSRR; // BSRR of the requested port: low 16 bits set pins, high 16 bits reset them
tube_config.tube_dst_size = DMA_SIZE_32BITS;
tube_config.tube_nr_xfers = RESOLUTION;
tube_config.tube_flags = DMA_CFG_SRC_INC | DMA_CFG_CIRC; // | DMA_CFG_CMPLT_IE | DMA_CFG_HALF_CMPLT_IE;
tube_config.target_data = 0;
tube_config.tube_req_src = DMA_REQ_SRC_TIM2_CH3; // DMA request source.
dma_set_priority(DMA1, DMA_CH1, DMA_PRIORITY_VERY_HIGH);
dma_tube_cfg(DMA1, DMA_CH1, &tube_config); // Attach the tube to channel 1 (timer2 ch3)
// dma_attach_interrupt(DMA1, DMA_CH1, DMASoftPWM::marshall);
dma_enable(DMA1, DMA_CH1);
//TIMER setup
Timer2.pause();
Timer2.setPrescaleFactor(F_CPU / RESOLUTION / FREQUENCY);
Timer2.setOverflow(1);
Timer2.setChannel3Mode(TIMER_OUTPUT_COMPARE);
Timer2.setCompare(TIMER_CH3, 1);
Timer2.refresh();
TIMER2_BASE->DIER = TIMER_DIER_CC3DE;
Timer2.resume();
}
void DMASoftPWM::fillBufferVert(uint8_t channel)
{
for (uint16_t step = 0; step < RESOLUTION; step++)
{
if (pinVal[channel] > step) // strict compare, so a value of 0 keeps the pin off in every slot
buffer[step] |= BIT(channel);
else
buffer[step] &= ~ BIT(channel);
}
}
void DMASoftPWM::setPinMode(uint8_t pin, bool enable)
{
pinMode(pin, OUTPUT);
if (enable)
pinmask |= digitalPinToBitMask(pin);
else
pinmask &= ~digitalPinToBitMask(pin);
// pre fill the reset buffer
for (uint16_t step = 0; step < RESOLUTION; step++)
{
buffer[step] = (uint32_t)pinmask << 16;
}
}
void DMASoftPWM::writePWM(uint8_t pin, uint16_t val)
{
pinVal[pin] = val;
// refresh = 2;
fillBufferVert(pin);
// fillBuffer((uint16_t) 0);
}
// DMASoftPWM *DMASoftPWM::anchor = NULL;
DMASoftPWM softPWMPortC;
void setup()
{
Serial.begin(115200);
Serial.println("starting usb serial");
softPWMPortC.begin(GPIOC);
softPWMPortC.setPinMode(PC13, true);
}
// #define DEBUGBUFFER 1
void loop()
{
if (Serial.available())
{
int pin = Serial.parseInt();
int val = Serial.parseInt();
while (Serial.available())
Serial.read();
softPWMPortC.writePWM(pin, val);
Serial.print(pin);
Serial.print(':');
Serial.println(val);
#ifdef DEBUGBUFFER
delay(100);
for (int u = 0; u < RESOLUTION; u++)
Serial.println(softPWMPortC.buffer[u], BIN);
#endif
}
static uint32_t sweep;
static uint16_t t = 0;
if (millis() - sweep > 1000 / RESOLUTION)
{
sweep = millis();
t = (t + 1) % RESOLUTION; // avoids modifying t twice in one expression
softPWMPortC.writePWM(13, t);
}
}
The ISR fires on DMA complete?? So why can’t you alternate the DMA start point in the ISR and re-trigger the DMA??
stephen
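How much time / how many cycles will it take to stop, reconfigure and start the DMA for every duty cycle change? I’m not sure, but either way the PWM will be out of sync. On the other hand, if I alter the buffer there might be one inaccurate sample if the PWM duty really “jumps”, but I suspect the impact of the former change is way more dramatic!
Anyway, it works very well like this; someone would first have to prove that there is a glitch.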
[universam10 – Mon Jul 10, 2017 11:14 am] –
How much time / how many cycles will it take to stop, reconfigure and start the DMA for every duty cycle change? I’m not sure, but either way the PWM will be out of sync. On the other hand, if I alter the buffer there might be one inaccurate sample if the PWM duty really “jumps”, but I suspect the impact of the former change is way more dramatic!
Anyway, it works very well like this; someone would first have to prove that there is a glitch.
It should not be too much, depending on how you do it. If you do direct register manipulation (and I can’t see a reason not to), it should be just a few instructions.
You only need to disable the channel, change the source address, set the transfer size again (I believe it goes to 0; I would need to read the reference manual again to confirm), and enable the channel.
Since you are not changing the target address, interrupt settings, callback function… it should not take that much time.
But of course, depending on how fast the timer requests are coming, it could take longer than you want and cause that pulse to last longer than it should.
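The four steps listed above (disable, new source address, reload the count, enable) can be sketched against a mock register block so the sequence can be tried on a PC. `DmaChannelRegs` is a hypothetical stand-in that only mirrors the order of the per-channel registers on the F1 (CCR, CNDTR, CPAR, CMAR); on real hardware you would point at the actual channel registers instead, and the addresses in the usage note are just example values.

```cpp
#include <cstdint>

// Hypothetical stand-in for one STM32F1 DMA channel's register block.
struct DmaChannelRegs {
    volatile uint32_t CCR;    // configuration, bit 0 = EN
    volatile uint32_t CNDTR;  // number of transfers remaining
    volatile uint32_t CPAR;   // peripheral address (left unchanged here)
    volatile uint32_t CMAR;   // memory address (the source buffer)
};

constexpr uint32_t kCcrEnable = 1u << 0;

// Point the channel at another buffer and restart it.
void rearmChannel(DmaChannelRegs& ch, uint32_t bufferAddr, uint16_t transfers) {
    ch.CCR = ch.CCR & ~kCcrEnable;  // count and address may only change while disabled
    ch.CMAR = bufferAddr;           // new source buffer
    ch.CNDTR = transfers;           // reload the transfer count
    ch.CCR = ch.CCR | kCcrEnable;   // resume; the next timer request starts the new buffer
}
```

Calling `rearmChannel(ch, 0x20000100u, 255)` leaves the peripheral address and the other configuration bits as they were, so the pause is limited to a handful of register writes.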
The other possibility, using the single buffer and modifying it while the DMA is sending it, always carries the chance that the DMA catches up with the CPU writing the new table, so a pin goes down and up twice in the same cycle if the DMA is fast enough. I guess it depends on the application, but I think for most applications it is better to get a pulse that’s a few us longer than it should be than to get 2 fast pulses instead of 1.
The F4 double buffering DMA would be great for pushing this to the limit.
I have a small 8-channel analyzer; I think its max speed is 10 MHz or 24 MHz. I’ll see if I get a chance to measure the pulses, and whether I can force PWM “jumps” and see them. Did you update the code in the first post with fillBuffer decoupled from the DMA ISR?
TIMER2_BASE->DIER = TIMER_DIER_CC3DE;
Since timer_dma_enable_req() is inline, it shouldn’t take many instructions, and it makes the code more portable between timers:
/**
* @brief Enable a timer channel's DMA request.
* @param dev Timer device, must have type TIMER_ADVANCED or TIMER_GENERAL
* @param channel Channel whose DMA request to enable.
*/
static inline void timer_dma_enable_req(timer_dev *dev, uint8 channel) {
*bb_perip(&(dev->regs).gen->DIER, channel + 8) = 1;
}
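For anyone wondering what that single write does: as far as I can tell, bb_perip() returns the bit-band alias of one bit in a peripheral register, so storing 1 there sets just that bit without a read-modify-write. The address arithmetic can be sketched as follows; `bitBandAlias` is a hypothetical helper, and the constants are the Cortex-M3 peripheral bit-band region bases.

```cpp
#include <cstdint>

constexpr uint32_t kPeriphBase = 0x40000000u;         // start of the peripheral region
constexpr uint32_t kPeriphBitBandBase = 0x42000000u;  // start of its bit-band alias region

// Each bit of the peripheral region maps to one 32-bit word in the alias region.
constexpr uint32_t bitBandAlias(uint32_t regAddr, uint32_t bit) {
    return kPeriphBitBandBase + (regAddr - kPeriphBase) * 32u + bit * 4u;
}
```

For TIM2 on the F1 the DIER register sits at 0x4000000C, and channel 3 maps to bit 3 + 8 = 11 (CC3DE), so `bitBandAlias(0x4000000Cu, 11)` gives 0x420001AC.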


