[STM32GENERIC/HAL] SerialUSB TX/RX speed problem

Pito
Mon Jul 17, 2017 8:15 am
This is a short sketch to demonstrate the problem, fixed by steve’s patch in libmaple:
// USBSERIAL TX PROBLEM DEMONSTRATION
// Pito 7/2017

#include "Arduino.h"

void setup() {
  Serial.begin(115200);
  delay(3000);
}

#define TXCHARS 1000000

void loop() {
  uint32_t i;
  uint8_t x = 85;  // 'U' (0x55)
  uint32_t elapsed = micros();

  for (i = 0; i < TXCHARS; i++) {
    Serial.write(x);
  }

  elapsed = micros() - elapsed;
  Serial.println("***");
  Serial.print("USB TX speed = ");
  // bytes per microsecond * 1000 = KBytes/sec
  Serial.print((1000.0 * TXCHARS) / elapsed, 2);
  Serial.println(" KBytes/sec");
  delay(1000);
}


stevestrong
Mon Jul 17, 2017 8:31 am
[Pito – Mon Jul 17, 2017 8:15 am] –
Fixed libmaple:
***
USB TX speed = 213.34 KBytes/sec

danieleff
Mon Jul 17, 2017 3:08 pm
A few weeks back I also added buffered USB TX, https://github.com/danieleff/STM32GENER … 8154715833, https://github.com/danieleff/STM32GENER … b86394c84e

Your code gives me:
USB TX speed = 308.41 KBytes/sec


Pito
Wed Jul 19, 2017 11:10 am
Great! I will try it. I last updated my local copy on 24.6., so maybe I am missing your patch..

BTW, I’ve compiled the test for Black F407 @168MHz under my old libmaple (patched manually with steve’s patch) and I get 1013-1064kB/sec..


Pito
Wed Jul 19, 2017 11:25 am
I’ve replaced the cores and the system with your latest and I get (Black F407ZE @168MHz)
***
USB TX speed = 64.87 KBytes/sec

Pito
Fri Jul 21, 2017 7:47 am
@Daniel: what compiler version do you use? Even with your vanilla repo I cannot get more than 64kB..

danieleff
Fri Jul 21, 2017 8:12 am
I think it is 6-2017-q1-update from https://developer.arm.com/open-source/g … /downloads, newest is 6-2017-q2-update

Also CDC_SERIAL_BUFFER_SIZE is still 128 in STM32/cores/arduino/usb/cdc/usbd_cdc_if.h , upping that might help.


Pito
Fri Jul 21, 2017 9:25 am
Ok, with CDC_SERIAL_BUFFER_SIZE 512 I get now with F407
DELETED
The standard compiler.

With CDC_SERIAL_BUFFER_SIZE 2048 I get with F407 730-994KB/sec.
DELETED

Update: with maybe more realistic scenario – with 1mil chars sent to TeraTerm terminal (Win7)
#define TXCHARS 1000000


Pito
Sun Jul 23, 2017 12:13 pm
The bigger buffers are fast, but TX loses the data :(

While running the Tek demo against TeraTerm Tek emulator
http://www.stm32duino.com/viewtopic.php … =20#p31835

the buffer sizes larger than 256 bytes show corruptions in the picture..

The libmaple usb works, not sure on the buffer size there.
It could be the TeraTerm is causing that as well..


Pito
Sun Jul 23, 2017 6:31 pm
It seems my above results (the tables with speeds) were wrong.. :evil:
The Teraterm does not receive all 1mil chars with larger cdc buffer, but a fraction of it.
The larger the buffer in CDC the smaller amount of data I get.
Therefore the total time for TX was smaller and the TX speed was higher.

I’ve checked that by logging the incoming bytes into a file.
I get 1mil chars received ONLY with CDC_SERIAL_BUFFER_SIZE=128 (that is TX speed = 64kB/sec).
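
To put numbers on the effect, here is a minimal sketch. It assumes the sketch-side timer only knows how many bytes were handed to Serial.write(), while the host log shows how many actually arrived. The function names are made up for illustration and are not from any of the cores.

```cpp
#include <cassert>
#include <cstdint>

// KBytes/sec as computed in the test sketch: bytes per microsecond times 1000.
double kbytesPerSec(uint32_t bytes, uint32_t elapsedUs) {
    assert(elapsedUs > 0);
    return (1000.0 * bytes) / elapsedUs;
}

// Fraction of the bytes handed to write() that the host actually logged.
double deliveredFraction(uint32_t bytesWritten, uint32_t bytesLogged) {
    assert(bytesWritten > 0);
    return static_cast<double>(bytesLogged) / bytesWritten;
}
```

With purely hypothetical figures: a sketch that times 1,000,000 writes at 1 second reports 1000 KBytes/sec, but if the log holds only 300,000 bytes, the delivered rate over that second is 300 KBytes/sec.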

I’ve deleted the tables.

PS: with libmaple latest and its stock cdc settings I get 1mil chars with TX speed 120-170kB/sec.


Pito
Sat Aug 12, 2017 9:50 am
Any thoughts on this?
Would be great if the larger buffer sizes work..
PS: As Steve wrote his buffer is 2kB.

victor_pv
Sat Aug 12, 2017 2:34 pm
[Pito – Sat Aug 12, 2017 9:50 am] –

With which cores did you experience byte losses? I was testing the serial speed on the libmaple F4 and noticed that, from Host -> MCU, it loses bytes if the host sends faster than the sketch can receive.


Pito
Sat Aug 12, 2017 2:45 pm
With STM32GENERIC (this thread) I see loss of data when TX via USB from F4 to TeraTerm with CDC_SERIAL_BUFFER_SIZE (in STM32/cores/arduino/usb/cdc/usbd_cdc_if.h) > 128bytes. The loss is up to 50% of data with larger buffers.
You see nothing wrong unless you start to log the incoming data into a file (in TeraTerm) and count the bytes in that file.
The test used: http://www.stm32duino.com/viewtopic.php … 354#p31552
Why do we need a larger buffer? Because TX is slow: only 64kB/sec with 128 bytes.

The libmaple’s TX via USB from F4 to TeraTerm at the stock buffer size (2kB as per Steve’s info) does not show the loss (speed around 150-220kB/sec).

There is an RX USB speed test written by PaulS I tried with the same results (MapleM) as in the following link

https://www.pjrc.com/teensy/benchmark_u … ceive.html

BTW – if there was a talented programmer who can write similar benchmark (see PaulS DOS side source) for TX as well, it would be great!


victor_pv
Sat Aug 12, 2017 4:09 pm
[Pito – Sat Aug 12, 2017 2:45 pm] –

There is an RX USB speed test written by PaulS I tried with the same results (MapleM) as in the following link

That’s the same test I was doing when I noticed the libmaple F4 would dump incoming bytes if the sketch is not picking them up.
If you open the serial port, then don’t bother to read, and send with Paul’s command line utility, it will keep going and going even if the sketch doesn’t read at all.

A test for TX would be great. Perhaps there is some tool already available somewhere.

On the libmaple F4 I think I have it corrected, at least it is not dropping everything, but I need to confirm I’m not missing any bytes at all.
I’ll see if I can have a look at the generic TX. What buffer sizes did you test that would drop bytes for sure?
In the libmaple core, because of the way the code is written, the buffer needs to be an exact power of 2.
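
The power-of-2 requirement usually comes from wrapping the index with a bit mask instead of a modulo. Below is a minimal sketch of that technique. It is not the libmaple code itself, and `RingBuffer` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

template <size_t N>
struct RingBuffer {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "size must be a power of 2");
    uint8_t data[N];
    size_t head = 0;  // next slot to write
    size_t tail = 0;  // next slot to read

    // Number of bytes currently stored; the mask handles wrap-around.
    size_t count() const { return (head - tail) & (N - 1); }

    bool push(uint8_t b) {
        if (count() == N - 1) return false;  // full (one slot is kept free)
        data[head] = b;
        head = (head + 1) & (N - 1);         // wrap with a mask, no division
        return true;
    }

    bool pop(uint8_t &b) {
        if (head == tail) return false;      // empty
        b = data[tail];
        tail = (tail + 1) & (N - 1);
        return true;
    }
};
```

With a size that is not a power of 2 the mask would skip or alias slots, which is why the static_assert rejects it at compile time.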


Pito
Sat Aug 12, 2017 4:59 pm
I’ll see if I can have a look at the generic TX. What buffer sizes did you test that would drop bytes for sure?

The CDC_SERIAL_BUFFER_SIZE (in STM32/cores/arduino/usb/cdc/usbd_cdc_if.h) size which works fine is 128.

These I tested with missing bytes: 256, 512, 1024, 2048, 4096, 8k, 16k, 32k on F4, and the same till 16k on F103 (so it is not only about F4).
When talking missing bytes – it is not about a few bytes, but hundreds of kilobytes..

I published the results a few weeks back (see my previous posts in this thread) with speeds up to 1MB/sec for F4 and 450kB/sec for F103, until I started to analyze the amount of data transferred.
It turned out the great speeds had been achieved because the transfer delivered only a fraction of the data and so finished sooner, which is why the speeds were such fantastic figures :)

Therefore I deleted the results not to evoke false expectations (until fixed).
I discovered that while messing with TEK emulator, where larger pictures I streamed to TEK via USB started to show defects with larger buffer sizes.


victor_pv
Sat Aug 12, 2017 6:07 pm
[Pito – Sat Aug 12, 2017 4:59 pm] –

I bet it was nice to see 1MB/s until you found out they were being dropped somewhere :D

I just found this tool:
http://www.serialporttool.com/CommEcho.htm

I’m about to test it, I understand it will echo back all it gets, plus it counts, so if we send let’s say 1MB we should receive back 1MB, plus the program should show if Windows received 1MB. Let’s see…


Pito
Sat Aug 12, 2017 6:19 pm
Try the test in the first post here..

victor_pv
Sat Aug 12, 2017 11:45 pm
That program sends back down the port whatever it receives. Because of that, I had to make some modifications to the sketch so it would read RX bytes too.
That also allows me to count both the bytes sent up the pipe and the bytes received back.

I ran several tests, letting it wait longer at the end to see if it would receive any extra bytes from some miscommunication somewhere, but it did not.
Configured like this, it waits until the received amount is the same as the sent amount and displays the total time and speed. It reports 50KB/s, and that’s each way.
It is important to note that, either because of Windows or because of the test program in Windows, the last few bytes take a few seconds to be received back. I guess the program waits to see if it can fill a buffer or something. If that wait wasn’t there, the performance would be better. But at least I know that at that speed Windows is getting the right amount of bytes and sends them back, and the right amount is received again.
If I raise the TX speed by sending blocks instead of individual bytes (Serial.print(buf, XXXX)), at some point I start losing data. But I believe it has more to do with the TX timing out, since there is a max timeout for a transfer and it will drop bytes if the transfer takes longer than that.

#define TXCHARS 100000
#define bufsize 64  // assumed value; the definition was not included in the post

void loop() {
  char buf[bufsize];

  delay(10000);
  uint32_t n = 0;  // bytes echoed back by the host

  uint32_t i;
  uint8_t x = 85;
  uint32_t elapsed = micros();

  for (i = 0; i < TXCHARS; i++) {
    Serial.write(x);
    n += Serial.readBytes(buf, bufsize);
  }

  uint32_t endMillis = millis();  // only used by the commented-out variant below
  while (n < TXCHARS) {
  //while ((n < TXCHARS*2) && ((millis() - endMillis) < 10000)) {
    n += Serial.readBytes(buf, bufsize);
  }
  elapsed = micros() - elapsed;

  Serial.println("***");
  Serial.print("USB TX speed = ");
  Serial.print((1000.0 * TXCHARS) / elapsed, 2);
  Serial.println(" KBytes/sec");
  Serial.print("Elapse (us): ");
  Serial.println(elapsed, DEC);
  Serial.print("Sent: ");
  Serial.println(TXCHARS, DEC);
  Serial.print("Received: ");
  Serial.println(n, DEC);

  delay(1000);
  // Keep draining anything the host still sends
  while (1) {
    Serial.readBytes(buf, bufsize);
  }
}


vitor_boss
Sun Aug 13, 2017 4:36 am
I have changed the code to this:
#include "Arduino.h"

#define TXCHARS 100000
#define bufsize 100

void setup() {
  Serial.begin(115200);
  delay(2000);
}

void loop() {
  uint32_t i;
  uint8_t x = 85;
  uint32_t elapsed = micros();
  uint8_t buf[bufsize];

  for (i = 0; i < bufsize; i++) { buf[i] = x; }

  while (1)
  {
    elapsed = micros();
    for (i = 0; i < TXCHARS; i++) {
      Serial.write(x);
    }
    elapsed = micros() - elapsed;

    Serial.println("***");
    Serial.print("USB TX speed = ");
    Serial.print((1000.0 * i) / elapsed, 2);
    Serial.println(" KBytes/sec");
    delay(1000);

    elapsed = micros();
    for (i = 0; i < TXCHARS; i += bufsize) {
      Serial.write(buf, bufsize);
    }
    elapsed = micros() - elapsed;

    Serial.println("***");
    Serial.print("USB buffered TX speed = ");
    Serial.print((1000.0 * i) / elapsed, 2);
    Serial.println(" KBytes/sec");
    delay(1000);
  }
}


Pito
Sun Aug 13, 2017 7:19 am
Use 1mil chars as in my original test.
#define TXCHARS 1000000

danieleff
Sun Aug 13, 2017 11:16 am
There is a timeout in writes, like in libmaple, so that might contribute.
In SerialUSBClass write, unsigned long timeout=millis()+5;
Try to increase it.

Sorry I was not able to reproduce the error, but do not have time to thoroughly test this.


victor_pv
Sun Aug 13, 2017 1:24 pm
Pito I have not tested with Generic yet, only libmaple F4. I will test with generic when I have time again.

In libmaple the buffers are defined in another line, but with the same effect. Those buffers are 2KB each by default in the latest repo. I increased them without much effect; they only make a difference if I set them really low.
Playing with the buffers in CommEcho had a more definite effect: the best results came when the transmission total, each individual chunk, and the CommEcho buffers were all multiples of one another, i.e. 1 million total bytes, sent in chunks of 100 at a time, with CommEcho buffers of 1000 each. Still, the overall speed is almost the same whether sending 1 byte at a time or 100 bytes at a time. I think the bottleneck was in CommEcho receiving and sending back, but at least it confirmed that everything got to Windows and back.
If I try to send 1000 bytes at a time, I lose data, because I think it fills the libmaple buffer faster than Windows is taking it, and then the TX timeout kicks in and it doesn’t send the total it should.
My F4 libmaple is modified so it does not dump RX bytes ever. The current repo copy will dump RX data if the sketch is not reading it at the same speed as it comes from the USB bus, and what’s worse, it does so without properly moving the tail in the RX fifo buffer, so you end up with very corrupted data: not just lost, but bytes received later would be read by the application before the older ones.

I’ll try to compile the sketch like it is with the Generic core and see what it does.


Pito
Sun Aug 13, 2017 3:07 pm
@victor: do not use buffers in the sketch, do experiment with CDC_SERIAL_BUFFER_SIZE only plz.

Here it works only with this (Serial.readBytes() times out after 1 sec)
for (i = 0; i < (TXCHARS); i++) {
  Serial.write(x);
  //n+= Serial.readBytes(buf, bufsize);
  while (Serial.available() > 0) {
    char dummy = Serial.read();
    n++;
  }
}


vitor_boss
Sun Aug 13, 2017 3:15 pm
W10 x64 creators update. I have disabled PERS buffers on COM advanced settings for both tests.

EDIT: I’m using default CDC_SERIAL_BUFFER_SIZE


victor_pv
Sun Aug 13, 2017 9:05 pm
Looks like the generic core dumps RX bytes if the app doesn’t pick them up, like libmaple F4 and unlike libmaple F1.
The function is here:
https://github.com/danieleff/STM32GENER … B.cpp#L150

In libmaple F4, it would overwrite what’s in the buffer. In the GENERIC core it will just not save the incoming packet to the buffer if there is no capacity, so if the buffer has let’s say 10 bytes free, and a packet of 40 bytes comes in, it will write the first 10 bytes, and return. The packet will get over written with the next incoming packet since the communication is not stopped with the host and the 30 bytes that did not make it to the buffer will be lost forever.
So this will dump RX if the host sends faster than the sketch reads them.
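
The arithmetic of that example, as a tiny host-side model. This is only a sketch of the behaviour as I described it, not the GENERIC source, and `storePacketAlwaysAck` is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct RxResult {
    size_t stored;  // bytes copied into the RX buffer
    size_t lost;    // bytes left in the endpoint and later overwritten
};

// Models a receive callback that always ACKs: it copies what fits and
// silently leaves the remainder of the packet behind.
RxResult storePacketAlwaysAck(size_t freeSpace, size_t packetLen) {
    size_t stored = std::min(freeSpace, packetLen);
    return RxResult{stored, packetLen - stored};
}
```

For 10 bytes free and a 40-byte packet this gives 10 stored and 30 lost, and nothing tells the host or the sketch that it happened.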

On TX, it will write until the buffer fills up.

But if the buffer has capacity for only part of what we want to send, I believe it will write that part in the buffer, but not return the correct number of bytes that were buffered, instead return 0.
I.E. We want to send 100 bytes. Buffer has capacity for 40. It will buffer 40, and return 0. It should return 40, so the sketch can know what happened and can continue sending the rest.
That is not according to the Arduino API:
https://www.arduino.cc/en/Serial/Write
write() will return the number of bytes written, though reading that number is optional

So that will cause the sketches to end up sending corrupted data to the host if the buffer ever fills during a transmission.
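
If write() did report how many bytes it accepted, a sketch that cares could resend the remainder. A minimal sketch of that idea follows. It assumes a write-like callable standing in for Serial.write(buf, len); `WriteFn`, `writeAll` and `maxAttempts` are hypothetical names, not part of the Arduino API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Stand-in for Serial.write(buf, len): returns how many bytes were accepted.
using WriteFn = std::function<size_t(const uint8_t *, size_t)>;

// Keeps calling writeFn with the unsent remainder until everything is
// accepted or maxAttempts calls have been made. Returns the bytes sent.
size_t writeAll(const WriteFn &writeFn, const uint8_t *buf, size_t len,
                unsigned maxAttempts) {
    size_t sent = 0;
    for (unsigned attempt = 0; attempt < maxAttempts && sent < len; ++attempt) {
        sent += writeFn(buf + sent, len - sent);
    }
    return sent;
}
```

This only works if the driver returns the true count. With a write() that returns 0 after buffering 40 bytes, the same loop would send those 40 bytes twice.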

https://github.com/danieleff/STM32GENER … SB.cpp#L98
This is the part of the function doing that (breaking and returning 0 even if some bytes made it to the buffer):
for(size_t i=0; i < size; i++) {

  tx_buffer.buffer[tx_buffer.iHead] = *buffer;
  tx_buffer.iHead = (tx_buffer.iHead + 1) % sizeof(tx_buffer.buffer);
  buffer++;

  while(tx_buffer.iHead == tx_buffer.iTail && millis()<timeout);
  if (tx_buffer.iHead == tx_buffer.iTail) break;
}


vitor_boss
Mon Aug 14, 2017 4:13 am
[victor_pv – Sun Aug 13, 2017 9:05 pm] – …
I think it could be corrected like this:
while(size--) {

  tx_buffer.buffer[tx_buffer.iHead] = *buffer;
  tx_buffer.iHead = (tx_buffer.iHead + 1) % sizeof(tx_buffer.buffer);
  buffer++;

  while(tx_buffer.iHead == tx_buffer.iTail && millis()<timeout);
  if (tx_buffer.iHead == tx_buffer.iTail) break;
}
return size;


victor_pv
Mon Aug 14, 2017 4:30 am
[vitor_boss – Mon Aug 14, 2017 4:13 am] –

It could be easier like this:
if( i<size) { return i; }
else { return size; }
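
Put together, a buffered write that reports its real progress might look like the host-side model below. It is not the STM32GENERIC source: `TxQueue` is a hypothetical type, there is no timeout or interrupt, and one slot is kept free so that head == tail only ever means "empty".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct TxQueue {
    static constexpr size_t SIZE = 128;  // same as the default CDC_SERIAL_BUFFER_SIZE
    uint8_t buffer[SIZE];
    size_t iHead = 0;  // next slot to fill
    size_t iTail = 0;  // next slot the USB side will send

    // Free slots, keeping one slot unused to tell full from empty.
    size_t freeSpace() const { return (iTail + SIZE - iHead - 1) % SIZE; }

    // Queues as many bytes as fit and returns that count (possibly 0).
    size_t write(const uint8_t *src, size_t size) {
        size_t i = 0;
        while (i < size && freeSpace() > 0) {
            buffer[iHead] = src[i++];
            iHead = (iHead + 1) % SIZE;
        }
        return i;
    }
};
```

In the 100-bytes-into-40-free case this returns 40, so the caller can tell that 60 bytes still need sending.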

vitor_boss
Mon Aug 14, 2017 4:50 am
[victor_pv – Mon Aug 14, 2017 4:30 am] –

[vitor_boss – Mon Aug 14, 2017 4:13 am] –

It could be easier like this:
if( i<size) { return i; }
else { return size; }

Pito
Mon Aug 14, 2017 5:57 am
Frankly, I do not understand how we can “lose” (or why we must drop, dump or overwrite) bytes while transmitting or receiving via USB.

The USB communication is based on “packets” where the RX/TX control is done via handshaking (ACK/NACK/STALL handshake packets), where packet sizes for control transfer stuff (like command and status) are 8 to 64 bytes in size, and the payload packet size is max 1024 bytes (actually 8, 16, 32, 64, 512, 1023 or 1024 based on the type of a transfer).

There are no bigger packet sizes, afaik (plz correct me).

When we set the RX buffer to 1024 and TX buffer to 1024 (perfectly feasible sizes for any stm32) we must not drop/dump/overwrite any bytes and thus we cannot lose any bytes when doing TX or RX via USB..
:?


victor_pv
Mon Aug 14, 2017 3:27 pm
[Pito – Mon Aug 14, 2017 5:57 am] –

You are correct Pito, that is how it should work, but it is not always implemented like that. In the Generic core every packet that the host sends is ACKed immediately even if it doesn’t fit in the buffer. So if the packet doesn’t fit, it stays in the USB device RAM to be overwritten by the next packet sent by the host.
The libmaple F1 will stop ACKing packets when the buffer is full, so the host has to hold off. In the libmaple F4 SerialUSB, which is not really libmaple since it was added by AeroQuad from the Standard Peripheral Library, every packet is ACKed whether it fits in the buffer or not.
I have modified that already (libmaple F4) so it does not ACK when the buffer is full, and as soon as the buffer has capacity for 1 more packet, then it ACKs the previous one and the host can continue sending.

TX in all the cores (libmaple F1, F4 and Generic) has a timeout: if the transmission can’t be completed within X ms, it returns. But the behaviour varies on that too. I need to confirm, but I believe it is as follows:
Libmaple F1 & F4: Returns the number of bytes correctly queued for send.
Generic F4: Returns 0 even if some bytes were queued. Additionally, it has a “transmission” variable, and I can’t manage to understand what exactly it is intended to do. I think the intention is for it to indicate how many bytes are in the out buffer waiting to be sent, but that should rather be calculated from the head and tail of the queue. Also, I am not sure that the transmission variable gets the correct value on every path the code takes. That may be me not understanding it correctly, although your confirmation that TX is corrupted and missing bytes is, I think, confirmation that it doesn’t work right.

Given that the Generic core is based in the HAL, it may be a good idea to replace the SerialUSB code with the one from STM.


danieleff
Tue Aug 15, 2017 6:21 am
Has anyone actually tried to just up the timeout from milliseconds to seconds?

replace the SerialUSB code with the one from STM
SerialUSB sits on top of the whole STM code.
And the STM CDC code does not have buffered writes (CDC_Transmit_FS() sends immediately, which was the initial problem). That is why I had to hack USBSerial_Tx_Handler into the STM CDC code: when the current TX is finished, it checks if there are more things to send and sends them from the USB interrupt instead of from SerialUSBClass::write. SerialUSBClass::write will not send if it knows there is an ongoing transmission (the transmission variable).
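
Roughly, the scheme can be modelled on the host like this. Treat it as a simplified sketch and not the actual code: `BufferedTx` is a made-up type, `startTransfer` stands in for the CDC_Transmit_FS() call, `onTxComplete` stands in for the hook called from the USB interrupt, and `wire` just records what the host would receive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BufferedTx {
    static constexpr size_t SIZE = 16;
    uint8_t buffer[SIZE];
    size_t iHead = 0, iTail = 0;
    size_t transmitting = 0;    // bytes in the transfer in flight, 0 if idle
    std::vector<uint8_t> wire;  // what the "host" has received so far

    // Sends one contiguous chunk starting at iTail (up to head or buffer end).
    void startTransfer() {
        if (transmitting != 0 || iHead == iTail) return;
        transmitting = (iHead > iTail) ? iHead - iTail : SIZE - iTail;
        wire.insert(wire.end(), buffer + iTail, buffer + iTail + transmitting);
    }

    // Called when the in-flight transfer finishes: free the chunk, chain the next.
    void onTxComplete() {
        iTail = (iTail + transmitting) % SIZE;
        transmitting = 0;
        startTransfer();
    }

    // Queues what fits, then kicks off a transfer only if the link is idle.
    size_t write(const uint8_t *src, size_t size) {
        size_t i = 0;
        while (i < size && (iHead + 1) % SIZE != iTail) {
            buffer[iHead] = src[i++];
            iHead = (iHead + 1) % SIZE;
        }
        startTransfer();
        return i;
    }
};
```

The tail only moves in onTxComplete(), so bytes belonging to a transfer in flight cannot be overwritten by a later write().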


victor_pv
Wed Aug 16, 2017 2:07 am
[danieleff – Tue Aug 15, 2017 6:21 am] –

Can’t we use the head and tail to determine if there is more in the buffer rather than the transmission variable?
If we only have 1 function pulling data (TX_Handler) and 1 function adding data (SerialUSB::write), and we don’t allow the head to hit the tail, then we can always know what’s currently in the buffer even if we have interrupts and whatnot.

About the code, I didn’t know that’s what STM uses since it shows Vassilis as the author, I thought STM had written their own.


victor_pv
Wed Aug 16, 2017 4:46 pm
Everyone in this thread, can we first agree what’s the desirable behavior for USB TX and RX in case the buffers fill?
My preference:
For TX, write should return right away and the return value should indicate how many bytes it could queue. If 0 bytes, then return 0. For X bytes, return X. (So this involves taking out the timeout and leaving the timeout or retries to the application.)

For RX, if buffer is full, NAK the last host packet so it does not send another one. Once the buffer has enough capacity to receive at least 1 more packet, issue the ACK for the previous packet so the host can send a new one. If the application starts reading bytes from the RX buffer at the point it gets enough space for another packet, then issue the NAK and keep going.

The above is how I have modified Steve’s F4 RX code to work (not the TX as for now).
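
The RX rule boils down to "take the whole packet or NAK it". A host-side model of just that rule is below. The names (`RxFlowControl`, `onPacket`, `consume`) are hypothetical, and this is not the modified F4 code, only a sketch of the intended behaviour.

```cpp
#include <cassert>
#include <cstddef>

enum class Handshake { ACK, NAK };

struct RxFlowControl {
    size_t capacity;  // total RX buffer size in bytes
    size_t used = 0;  // bytes waiting for the sketch to read

    explicit RxFlowControl(size_t cap) : capacity(cap) {}

    // Host offers a packet: accept all of it or none of it.
    Handshake onPacket(size_t packetLen) {
        if (capacity - used < packetLen) return Handshake::NAK;  // host must retry
        used += packetLen;
        return Handshake::ACK;
    }

    // Sketch reads n bytes, freeing room for later packets.
    void consume(size_t n) { used -= (n < used ? n : used); }
};
```

With a 128-byte buffer and 64-byte packets, the third packet is NAKed until the sketch has read at least 64 bytes, so nothing is dropped and the host simply slows down.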


danieleff
Thu Aug 17, 2017 7:33 am
[victor_pv – Wed Aug 16, 2017 4:46 pm] –
Everyone in this thread, can we first agree what’s the desirable behavior for USB TX and RX in case the buffers fill?
My preference:
For TX, write should return right away and the return value should indicate how many bytes it could queue. If 0 bytes, then return 0. For X bytes, return X. (So this involves taking out the timeout and leaving the timeout or retries to the application.)

The return values are OK, but a small timeout should be there. Nobody will ever ever do retries with Serial.print/write(…).

[victor_pv – Wed Aug 16, 2017 4:46 pm] –
For RX, if buffer is full, NAK the last host packet so it does not send another one. Once the buffer has enough capacity to receive at least 1 more packet, issue the ACK for the previous packet so the host can send a new one. If the application starts reading bytes from the RX buffer at the point it gets enough space for another packet, then issue the NAK and keep going.

The above is how I have modified Steve’s F4 RX code to work (not the TX as for now).

You can try to do this, but that is deeply inside STM CDC code. (I think. I did not check actually)

As for the dropped data, at last I was able to set up a test so I can actually see the problem. (Using `for (i = 0; i < TXCHARS; i++) Serial.write('0' + (i % 10));` plus TeraTerm I do not need to log, and can see the 0123456789 pattern get corrupted.)
BTW the problem persists even if I comment out the timeout, so it’s not that.
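
The same check could be automated on a logged capture. A minimal sketch, assuming the log holds only the payload produced by `'0' + (i % 10)`; `countPatternBreaks` is a hypothetical helper and not part of any tool mentioned here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts adjacent pairs where the second character is not the successor
// (mod 10) of the first. A pair containing a non-digit also counts as a break.
size_t countPatternBreaks(const std::string &log) {
    size_t breaks = 0;
    for (size_t i = 1; i < log.size(); ++i) {
        char prev = log[i - 1], cur = log[i];
        bool digits = prev >= '0' && prev <= '9' && cur >= '0' && cur <= '9';
        if (!digits || (prev - '0' + 1) % 10 != cur - '0') ++breaks;
    }
    return breaks;
}
```

A drop of an exact multiple of 10 bytes would go unnoticed by this check, so the total byte count still has to be compared as well.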


danieleff
Thu Aug 17, 2017 6:06 pm
The following code:
// USBSERIAL TX PROBLEM DEMONSTRATION
// Pito 7/2017

#include "Arduino.h"

void setup() {
  Serial.begin(115200);
  delay(3000);
}

#define TXCHARS 1000000

void loop() {
  uint32_t i;
  uint32_t elapsed = micros();

  for (i = 0; i < TXCHARS; i++) {
    Serial.write('0' + (i % 10));
  }

  elapsed = micros() - elapsed;
  Serial.println("***");
  Serial.print("USB TX speed = ");
  Serial.print((1000.0 * TXCHARS) / elapsed, 2);
  Serial.println(" KBytes/sec");
  delay(5000);
}


victor_pv
Thu Aug 17, 2017 6:47 pm
Different applications may pull the RX packets at different rates from the host buffer, so it’s possible that the buffer fills up when using Teraterm and then the TX code drops bytes, while the Arduino IDE may be pulling them faster and never letting the TX buffer fill up.
We can add some flag in the TX code to see if the buffer ever gets full.

When using the Arduino IDE, do you ever see byte sequences out of order? (probably just replacing 0123456789 for “” or “ok” and seeing what you have left)


danieleff
Fri Aug 18, 2017 5:09 am
This is not about the TX buffer. The following code bypasses the whole SerialUSB class, no Serial.xxx call at all:
#include "Arduino.h"

char buffer[200];

#define TX 1000

void setup() {
  Serial.begin(115200);
  delay(3000);

  memset(buffer, '.', sizeof(buffer));

  for (size_t i = 0; i < sizeof(buffer) / 10; i++) {
    buffer[i * 10] = '0' + (i % 10);
  }

  buffer[sizeof(buffer) - 2] = '\r';
  buffer[sizeof(buffer) - 1] = '\n';
}

void loop() {
  for (size_t i = 0; i < TX; i++) {
    // Overwrites the start of the buffer with a counter and a timestamp
    sprintf(buffer, "[%6u %10lu]", (unsigned)i, (unsigned long)micros());

    while (CDC_Transmit_FS((uint8_t*)buffer, 200) != USBD_OK);

    USBD_CDC_HandleTypeDef *hcdc = (USBD_CDC_HandleTypeDef*)hUsbDeviceFS.pClassData;
    while (hcdc->TxState != 0); // Wait for USB transfer to finish
  }

  delay(5000);
}


stevestrong
Fri Aug 18, 2017 7:00 am
I agree, too, that no one will ever try to resend data when outputting over serial.
That is why I think sending over serial should be kept as simple as possible: either blocking or non-blocking.

The F4 serial USB was originally configured as non-blocking, but I changed this because I was missing some data on the host side. After that change, reception on the host side was OK.

The chip should send data as fast as it can.
Also, the host should read that data as fast as it can. Failing to do that will result of course in data loss.
If at least one host application is able to read all data, this means that the chip is working fine.

It looks like teraterm has problems on Win10?


Pito
Fri Aug 18, 2017 6:22 pm
CommEcho is losing data too (see above from my Win7_64bit).
Could you try on your machines – the CommEcho – whether it returns 1mil chars ok, plz?

Update:
with latest libmaple F1 I get 1mil chars from Teraterm when logged into file (217kB/s when not logged, 102kB/s when logged into file).
with latest libmaple F4 I get 1mil chars from Teraterm when logged into file (170kB/s when not logged, 102kB/s when logged into file).


victor_pv
Fri Aug 18, 2017 8:50 pm
The problem with blocking TX is this: on a serial port, even if there is nothing on the other end, sending the data takes a finite amount of time at the set baud rate, and then the call returns, so it will not block permanently.
But on USB, if the other end is not receiving, it could block forever unless we have the timeout. So, larger or smaller, I think some timeout is needed. We could set the timeout as a multiple of the time per byte for a certain minimum rate, so the timeout is not the same when sending 1 byte as when sending 1000. That would more closely resemble what the UART driver would do.

On the other hand, with the timeout we may lose bytes if the host is slow getting data. But I think anyone not wanting to lose bytes should check the returned value to confirm the data was sent. If it is not critical, then ignore the return value.
For example if we decide to use a timeout to simulate a minimum 100KB/s rate, then it would be 10uS per byte. If the host is slower than that (or disconnected) then transmissions may timeout, but if the host keeps at least that rate, they will complete within the allowed timeout period.
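
As a sketch of that calculation (the function name and the rounding choice are mine, nothing like this exists in the cores as far as I know):

```cpp
#include <cassert>
#include <cstdint>

// Timeout in microseconds for sending nBytes if the host is expected to
// sustain at least minBytesPerSec. Rounds up, so any non-empty write gets
// a timeout of at least 1 us.
uint32_t txTimeoutUs(uint32_t nBytes, uint32_t minBytesPerSec) {
    assert(minBytesPerSec > 0);
    uint64_t us = (static_cast<uint64_t>(nBytes) * 1000000ULL + minBytesPerSec - 1)
                  / minBytesPerSec;
    return static_cast<uint32_t>(us);
}
```

At a 100KB/s minimum rate this gives 10 us for 1 byte and 10000 us (10 ms) for 1000 bytes, so a larger write is allowed proportionally more time.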

But on RX, unless we stop sending ACKs to the host until there is room in the buffer, it is very possible that data will be lost no matter what the application does, since the host is potentially much faster.
The F1 core does that, and in the libmaple F4 I have it modified locally to that, I will send a PR so more people can test that RX.

On Generic TX I will try to repeat some of the tests Pito has done and see if there is any difference, but I am not convinced is just teraterm.


Pito
Fri Aug 18, 2017 10:01 pm
http://www.beyondlogic.org/usbnutshell/usb1.shtml

The host is responsible for managing the bandwidth of the bus. This is done at enumeration when configuring Isochronous and Interrupt Endpoints and throughout the operation of the bus.


victor_pv
Fri Aug 18, 2017 10:44 pm
Yes, but keep reading:
http://www.beyondlogic.org/usbnutshell/ … sochronous

OUT: When the host wants to send the function a bulk data packet, it issues an OUT token followed by a data packet containing the bulk data. If any part of the OUT token or data packet is corrupt then the function ignores the packet. If the function’s endpoint buffer was empty and it has clocked the data into the endpoint buffer it issues an ACK informing the host it has successfully received the data. If the endpoint buffer is not empty due to processing a previous packet, then the function returns an NAK. However if the endpoint has had an error and its halt bit has been set, it returns a STALL.

Also these I think are important to note:
Bulk Transfers
Used to transfer large bursty data.
Error detection via CRC, with guarantee of delivery.
No guarantee of bandwidth or minimum latency.
Stream Pipe – Unidirectional
Full & high speed modes only.

We are supposed to guarantee delivery with the handshake, but there is no guarantee of bandwidth, so we should not drop packets just because the application on the host or on the MCU is not as fast as the other end.


Pito
Fri Aug 18, 2017 11:18 pm
My naive “user” observation (based on the above results with libmaple F1 and F4) is following:

The libmaple’s USB TX “understood the TeraTerm’s (the Host) handshaking commands”, as it had transferred 1mil chars ok with 102kB/s for both F1 and F4 while TT had logged the data into a file.

It is obvious the libmaple’s TX had been orchestrated by the Host, as the total TX speed achieved is the same for F1 and F4 (the packet’s speed is the same because the usb clock is the same, but the overhead F4/F1 would have made a difference in total TX speed when the packets were not synced by the Host).

With Daniel’s TX we can achieve Nx higher speeds (with larger CDC buffer sizes) but we lose say 70% of data – thus it seems the stm32generic TX ignores the TT Host’s handshake commands..

The Arduino’s serial monitor is perhaps much faster than TT, therefore it captures 1mil chars without proper handshaking, or, it uses a different handshaking model the stm32generic TX understands better..


RogerClark
Sat Aug 19, 2017 1:09 am
Libmaple has code that checks one of the handshaking signals (DTR I think), and if it can’t send data to the host, then the code “blocks” in the write() function

It has been noted that this differs from the Due, which doesn’t seem to check if the Host is ready for the data


victor_pv
Sat Aug 19, 2017 1:09 am
@Pito, I pretty much agree with you on that theory.

victor_pv
Sat Aug 19, 2017 1:18 am
[RogerClark – Sat Aug 19, 2017 1:09 am] –

Roger, I looked a lot at the F1 code when writing readBytes to try to achieve the max speeds. The libmaple F1 works with the host using ACK and NACK in both TX and RX, so when the buffer on one side is full, the other side will hold off on further packets. When the buffer gets space, an ACK is sent to the other end and the transmission continues, until a buffer fills and that end sends NACK again.
As far as I saw, DTR is used only to detect the reset magic word, not for normal handshaking. The handshaking with ACK and NACK is what’s supposed to happen, and it seems to be a good implementation.

The F4 in RX did not have that. It would always send ACK, even if a buffer was full. I changed that and it works pretty well now. I can get 500KB/s, similar to the F1 (the MCU is faster, but the code is based on the SPL, with much more overhead than libmaple).
The F4 TX did work fine in my tests and I didn’t make any change to it.

I haven’t had a chance to dig deep into the one used by the Generic core to see what exactly it does, but I suspect the handshaking is not right, as Pito suspects.

Personally I think that an implementation that loses bytes in either direction is just pointless. If I care to send something one way, it is because I want to receive it at the other end.
I understand that for sprinkling serial debug prints here and there it may not matter much, but for transferring images, tables, sensor data, or whatever else is sent, I wouldn’t want to lose a single byte unless the link is down or one end is not responding for too long.
I’d rather have a lossless 100KB/s than 500KB/s with losses.
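
To illustrate the difference between the two RX behaviours described above, here is a minimal host-side C++ simulation. It is not code from libmaple or the Generic core: `RxEndpoint`, `offerPacket` and `simulateTransfer` are hypothetical names, and the model assumes the buffer capacity is at least one packet and that the sketch reads a non-zero number of bytes each round.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model of a device-side RX buffer (illustration only).
struct RxEndpoint {
    std::size_t capacity;             // bytes the device buffer can hold
    bool nakWhenFull;                 // true: NAK when full, false: always ACK
    std::deque<std::uint8_t> buffer;  // bytes waiting for the sketch
    std::size_t dropped;              // bytes ACKed but thrown away

    RxEndpoint(std::size_t cap, bool nak)
        : capacity(cap), nakWhenFull(nak), dropped(0) {}

    // Host offers one packet. Returns true for ACK, false for NAK.
    bool offerPacket(const std::uint8_t* data, std::size_t len) {
        std::size_t freeSpace = capacity - buffer.size();
        if (len > freeSpace && nakWhenFull) {
            return false;             // host keeps the packet and retries
        }
        std::size_t stored = std::min(len, freeSpace);
        buffer.insert(buffer.end(), data, data + stored);
        dropped += len - stored;      // lost without the host knowing
        return true;
    }

    // The sketch reads up to n bytes out of the buffer.
    std::size_t read(std::size_t n) {
        std::size_t count = std::min(n, buffer.size());
        buffer.erase(buffer.begin(),
                     buffer.begin() + static_cast<std::ptrdiff_t>(count));
        return count;
    }
};

// Host sends `total` bytes in packets of `packetSize`; after every host
// attempt the sketch reads `readPerRound` bytes (must be > 0).
// Returns how many bytes the sketch actually received.
std::size_t simulateTransfer(RxEndpoint& ep, std::size_t total,
                             std::size_t packetSize, std::size_t readPerRound) {
    std::vector<std::uint8_t> packet(packetSize, 0x55);  // 'U'
    std::size_t sent = 0;
    std::size_t received = 0;
    while (sent < total) {
        std::size_t len = std::min(packetSize, total - sent);
        if (ep.offerPacket(packet.data(), len)) {
            sent += len;              // ACKed: host moves on to the next packet
        }
        received += ep.read(readPerRound);
    }
    while (!ep.buffer.empty()) {      // drain what is still buffered
        received += ep.read(readPerRound);
    }
    return received;
}
```

With a 128-byte buffer, 64-byte packets and a sketch that reads 16 bytes per round, the NAK-when-full endpoint delivers every byte, while the always-ACK endpoint ends up with part of the transfer counted in `dropped`.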


vitor_boss
Sat Aug 19, 2017 4:26 am
I’m not a programmer, just curious; this is what I could find:
...
CDC_Transmit_FS(&tx_buffer.buffer[tx_buffer.iTail], transmitting); //Set buffer and begin transmission
...

danieleff
Sat Aug 19, 2017 6:05 am
Libmaple latest from repo, Maple Mini, TeraTerm log, test code from first post (GCC 6-2017-q1-update), the USB seems to drop data (All ***’s should be on same column):

teraterm3.png

Pito
Sat Aug 19, 2017 7:12 am
Hmm, what I’ve done this morning – installed Wireshark – free open network analyzer, which includes “USB Packet capture” analyzer.

https://www.wireshark.org/

Testing via the loop sending “U” (0x55) to TeraTerm (Win7_64b), from Black F407:

Libmaple F4 – the data packets are identified as “5555” and “Ethernet II” – “Malformed packet”; the data packet protocol is always “5555”, with some header data followed by 2048x “U”, and in between there is a small packet with protocol type “Ethernet packet”, labelled “28 URB Bulk in Malformed” and containing 1x “U”. No handshake visible (unless the malformed packet is the handshake packet).

32Generic – the data packets are identified as “5555” or “Ethernet II”, a single packet containing 16-100 “U”s, no handshake visible.

As a proof it somehow works I’ve tried for fun:
I saved the file with UUUs to a Corsair flash drive, and read it into an editor.
It is a different device (mass storage) but all data packets were marked “Good”, up to 65535 bytes large each, with a handshake (2 <-> transactions between Host and device) between the data packets. The packet protocols were USB related.

Libmaple:

Libmaple USB capture.JPG

stevestrong
Sat Aug 19, 2017 7:57 am
Pito, very interesting.
The libmaple F4 USB CDC is St SPL based.
Can you please test libmaple F1?

Pito
Sat Aug 19, 2017 8:05 am
Libmaple F1 – similar to 32Generic but much more UUs inside a packet:

Libmaple F1 capture.JPG

Pito
Sat Aug 19, 2017 10:15 am
Libmaple F1 against HyperTerminal (227kB/s)

Libmaple F1 against HyperTerminal.JPG

zmemw16
Sat Aug 19, 2017 12:04 pm
if last packet isn’t full maybe try that ?
wondering if sending packets all 1 byte below max size would change something ?
srp
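
For reference on the full-packet question: in USB, a bulk transfer ends when the host's requested length is reached or when a packet shorter than the endpoint's max packet size (including a zero-length one) arrives. Below is a minimal sketch of the resulting packet sizes, assuming the host has requested more data than is being sent. `bulkPacketSizes` is a hypothetical helper, and `maxPacket` must be non-zero. Whether this plays any role in the behaviour seen in this thread is not established.

```cpp
#include <cstddef>
#include <vector>

// Sizes of the packets needed to finish a bulk transfer of `len` bytes on an
// endpoint with max packet size `maxPacket` (must be > 0), assuming the host
// asked for more than `len` bytes. The transfer is terminated by a short
// packet, so an exact multiple of `maxPacket` gets a trailing zero-length
// packet (size 0).
std::vector<std::size_t> bulkPacketSizes(std::size_t len, std::size_t maxPacket) {
    std::vector<std::size_t> sizes;
    while (len >= maxPacket) {
        sizes.push_back(maxPacket);   // full packet, transfer not finished yet
        len -= maxPacket;
    }
    sizes.push_back(len);             // short packet; 0 if len was a multiple
    return sizes;
}
```

For example, 130 bytes on a 64-byte endpoint go out as 64 + 64 + 2, 63 bytes go out as a single short packet, and exactly 64 bytes need a 64-byte packet followed by a zero-length packet before the host sees the transfer as complete.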

Pito
Sat Aug 19, 2017 12:38 pm
FYI: Upload via DFU (maple_loader v0.1, host, from Sloeber) into MapleMini (dev 2.39.0) – the payload packets (1052bytes) with some handshaking (only a small chunk of data shown, no anomalies there till the end, all clean, only 2 Malformed packets out of about ~60 during enum/init phase):

MapleMini DFU BIN UPLOAD.JPG

victor_pv
Sat Aug 19, 2017 2:59 pm
Guys, with all the cores there is currently a TX timeout: if the host is not pulling packets as fast as the MCU can send them, there will be losses, since the timeout will expire and the TX function will return before it has sent everything.

That’s one of the reasons I modified Pito’s test sketch to send blocks rather than 1-byte writes, so I was doing bigger transactions rather than many very small ones.

I think for testing we could comment out the timeout checking on TX, so we know that’s not a factor affecting any packet drop.

EDIT:
I also confirmed that the GENERIC RX code will dump bytes if the sketch doesn’t read them at the same rate as the host is sending. It acknowledges every packet back to the host no matter what.

Easy test:
Open the port with Serial.begin();
Then loop, never reading from the port.
Then send from the host to the board; the host will never block because the MCU will acknowledge all packets even if it is not reading them.

You can also test by opening the port with Serial.begin().
Then delay for a long enough time.
Then print what’s read.
Then from the host send anything longer than the GENERIC buffer size. The host will act like everything was sent, but the MCU will only print what fit in the buffer. Everything else after that was dumped without the host’s knowledge.

If you are trying to send something large, like an image, as fast as possible, it’s likely that you will lose data unless your sketch can process it faster than the host can send.

From my point of view this is not desirable: since on the host the serial port baud rate has no effect on how fast the application will try to send data, either you implement some handshaking within your applications (on the MCU and the computer) or you never know when the buffer may be full and dumping data.

EDIT2:
I modified the code to behave like libmaple F1: if the buffer is full it will not acknowledge packets back to the host. It’s in this branch, which is in sync with Daniel’s other than those changes.
https://github.com/victorpv/STM32GENERI … imizations

I still need to do further tests to confirm there is no data corruption anywhere, but I have confirmed that if the sketch is busy doing something and doesn’t pull data from the buffer, it will pause the host when the buffer is full, and resume when it has capacity.


Pito
Sat Aug 19, 2017 6:32 pm
TX: What if the Serial.write (and friends) will check whether the Host is NAKing, and it timeouts while returning 0 ??

victor_pv
Sat Aug 19, 2017 7:20 pm
[Pito – Sat Aug 19, 2017 6:32 pm] –
TX: What if the Serial.write (and friends) will check whether the Host is NAKing, and it timeouts while returning 0 ??

You mean without buffering even if there is capacity in the buffer?

So the TX code currently is supposed to wait for NAK before pulling more data from the TX buffer and sending it. And the write() function, as it writes to the buffer and not the USB peripheral, will return 0 if the buffer is full, but not until that point.
Whether this is working right, or whether the timeout period is long enough, is up for debate.
I would favor a timeout that’s proportional to what the app is trying to send. Waiting for buffer space for 1 byte is not the same as waiting for 30 or 300.


Pito
Sun Aug 20, 2017 10:47 am
From user’s point of view I do not see any buffers (the internal one in the stm core or in the Host) and I “do not care about what buffers are inside the black box(es)”.

I have the Host, which can be unable to receive fast (for any reason).
I do Serial.something (1byte or 1kB or 1MB) from my sketch.

When Serial.something returns 0, or returns a number which is less than the amount the user intended to send, it means for the user that the data were not “received” by the Host (the other side has not accepted them all, or accepted just a part of them).
Me – the user – I have to take care at my sketch level of what should happen in such a situation.
In that way you cannot lose any TXed data.

Imagine you are going to upload a picture out of your 100kB array (the array in your sketch, the user_array) via USB Serial to a PC Host. It should work such that if the PC Host is able to receive, say, only 1kB per minute via USB, you will upload that user_array to the Host in 100 minutes, successfully and without any data lost..
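
The sketch-level handling described above could look roughly like the following. This is a minimal host-side C++ sketch, not code from any of the cores: `ByteSink` is a hypothetical stand-in for a Serial-like object whose `write()` returns the number of bytes it accepted, and `writeAll`, `SlowSink` and `maxIdleAttempts` are names invented for the illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Serial: write() returns how many bytes it accepted.
struct ByteSink {
    virtual std::size_t write(const std::uint8_t* data, std::size_t len) = 0;
    virtual ~ByteSink() {}
};

// Keep calling write() with the unsent remainder until everything has been
// accepted, or until `maxIdleAttempts` consecutive calls accepted nothing.
// Returns the number of bytes the sink actually accepted.
std::size_t writeAll(ByteSink& sink, const std::uint8_t* data,
                     std::size_t len, unsigned maxIdleAttempts) {
    std::size_t done = 0;
    unsigned idle = 0;
    while (done < len && idle < maxIdleAttempts) {
        std::size_t n = sink.write(data + done, len - done);
        if (n == 0) {
            ++idle;                   // nothing accepted this time
        } else {
            idle = 0;                 // progress was made, reset the counter
        }
        done += n;
    }
    return done;
}

// Test double: a slow host that refuses every other call and otherwise
// accepts at most `chunk` bytes.
struct SlowSink : ByteSink {
    std::size_t chunk;
    unsigned calls = 0;
    std::size_t total = 0;            // bytes accepted so far

    explicit SlowSink(std::size_t c) : chunk(c) {}

    std::size_t write(const std::uint8_t*, std::size_t len) override {
        if (++calls % 2 == 1) {
            return 0;                 // pretend the host is busy
        }
        std::size_t n = len < chunk ? len : chunk;
        total += n;
        return n;
    }
};
```

The caller still has to decide what to do when `writeAll` returns less than `len`, for instance retry later or report an error; the point is only that no byte is silently discarded.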


victor_pv
Sun Aug 20, 2017 4:21 pm
But then you have other people thinking differently:
viewtopic.php?f=51&t=2354&start=30#p33133

The way I see it, we have two options:
1.- Guaranteed delivery with blocking.
2.- Non-guaranteed delivery: after a certain timeout, fail to send and return the number of bytes that were successfully sent (or buffered).

But it looks like we can’t settle on one or the other.
My option is 2, with the application checking the return value. That would resemble what happens with a physical USART. The code will not block and just outputs the data at the given rate.

Also, as a more advanced feature, we could use the baud parameter that we currently ignore to manage the timeout.
So if we do a Serial.begin(115200), then the timeout should be around 70uS per byte.
That way, if you have a slow application to which you need to send no faster than 57kbps, you can set the baud rate to that, and the USB write function will block only for enough time for that rate, and return the number of bytes actually sent if it takes longer than that.
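
A worked version of that arithmetic, as a minimal sketch: the figure of about 70uS per byte corresponds to counting 8 bits per byte at 115200 baud, while with 8N1 framing (start bit + 8 data bits + stop bit) a byte takes about 87uS on a real UART. `microsPerByte` and `writeTimeoutMicros` are hypothetical helpers, not functions from any of the cores.

```cpp
#include <cstdint>

// Time one byte occupies on a UART running at `baud`, in microseconds.
// bitsPerByte = 10 models 8N1 framing; 8 counts the data bits only.
double microsPerByte(std::uint32_t baud, unsigned bitsPerByte) {
    return 1e6 * bitsPerByte / baud;
}

// Hypothetical write timeout for `len` bytes, scaled from the baud parameter
// and rounded to the nearest microsecond.
std::uint32_t writeTimeoutMicros(std::uint32_t baud, std::uint32_t len,
                                 unsigned bitsPerByte = 10) {
    return static_cast<std::uint32_t>(microsPerByte(baud, bitsPerByte) * len + 0.5);
}
```

Scaling the timeout with `len` also matches the earlier remark that waiting for buffer space for 1 byte is not the same as waiting for 30 or 300.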


martinayotte
Sun Aug 20, 2017 4:54 pm
See my post here : viewtopic.php?f=3&p=33241#p33241

Pito
Mon Aug 21, 2017 9:43 am
The Serial.something comes historically from RS232-like serial UART communication, where the handshaking with the Host is managed by RTS/CTS, DTR/DSR etc., or via XON/XOFF, or done at application level via XYZMODEM/KERMIT etc.
Without such handshaking, data loss can occur in RX/TX. But that handshaking is usually not used with duinos. So Serial.something can lose data when the internal ring buffers are full or are read too slowly, or something like that.

SerialUSB uses a sophisticated USB protocol, which inherently includes handshaking. So the handshaking at “packets” level is there. It needs to be used, and then you cannot “lose” any information during RX/TX.
People may discuss what should happen when we use the USB layer for “Serial” emulation – whether we should propagate RTS/CTS or that kind of signals, etc. From what I’ve seen in various discussions on this topic – even though USB CDC includes this kind of flags (OUT command 0x22 (SET_CONTROL_LINE_STATE), RTS/DTR bits in CDC_ACM) – it is not necessary with USB, as you can tell the other side to “stop sending, I cannot read the new packet” by NAK’ing; the other side will repeat until ACKed.

FYI: Similar discussion at PaulS forum – an .ino sketch and Python script for testing and monitoring the CDC serial..
https://forum.pjrc.com/threads/33167-US … -detection
maybe we can reuse that somehow.

About NAK’ing IN/OUT USB packets:
http://nuttx.org/doku.php?id=wiki:nxint … sb-out-nak


Pito
Mon Aug 21, 2017 6:24 pm
FYI – this is how the “USBlyzer” (not free, 33d eval) shows the transfer between MMini and TeraTerm (Libmaple F1).
USBlyzer provides a huge amount of info, so I had to cut out a small chunk only – one which fits as the attachment.
You may see the payload UUUU data are fragmented into chunks 1-16 or 414-426 bytes large. No idea what “4096 buffer” means.
The transactions are marked successful.

USBlyzer 3.JPG

Rick Kimball
Mon Aug 21, 2017 7:15 pm
@Pito have you tried it with “putty” instead of TeraTerm?

victor_pv
Tue Aug 22, 2017 5:19 am
Pito, the Windows driver is definitely doing some buffering on its end, since the bulk transfer is only for 64 bytes, hard coded in the libmaple driver.
I wonder if the 4096 is some information the host sends to the board to indicate how much buffer space it currently has.

Pito
Tue Aug 22, 2017 9:08 am
@Rick: I tried with HyperTerminal (the same result – see above) and with minicom (ubuntu in virtualbox). With minicom I did not make any captures.

@Victor: the “4096b buffer” could be some misinterpretation of data by the USBlyzer, or handshake..
UPDATE: it seems the usbser.sys requests 4kB bulk IN..

The Seq. numbers always refer to packets like 128-127, where the 127 was the “4096b buffer” packet, so it could be that the 128th packet is the response to the 127th..

What is interesting is the result – the payload chunks sizes – are similar to what I got from Wireshark.
It sends in ~1ms (the packet period) ~16 or ~420bytes (“random size”). That is something which needs to be understood.

While doing Serial.write(‘U’) in a loop you fill the buffers fast, so I would expect the outgoing payload packets to always contain an amount of UUUs equal to the lowest layer buffer size.
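
To put the observed chunk sizes into perspective, here is a small arithmetic sketch. It assumes a full-speed link with 1 ms frames and 64-byte bulk packets, and uses the USB 2.0 specification's upper bound of 19 such packets per frame; the function names are hypothetical.

```cpp
// Full-speed USB divides time into 1 ms frames.
constexpr double kFramesPerSecond = 1000.0;
// Bulk endpoint packet size used here, and the USB 2.0 specification's upper
// bound on how many such packets fit into one full-speed frame.
constexpr unsigned kBulkPacketSize = 64;
constexpr unsigned kMaxBulkPacketsPerFrame = 19;

// Throughput in kB/s (1 kB = 1000 bytes) if every frame carries this payload.
double throughputKBps(double bytesPerFrame) {
    return bytesPerFrame * kFramesPerSecond / 1000.0;
}

// Largest bulk payload one frame can carry under the bound above (1216 bytes).
double maxBytesPerFrame() {
    return kBulkPacketSize * kMaxBulkPacketsPerFrame;
}

// Fraction of that ceiling used by a given payload per frame.
double busShare(double bytesPerFrame) {
    return bytesPerFrame / maxBytesPerFrame();
}
```

Under these assumptions a frame carrying about 420 bytes corresponds to about 420 kB/s, roughly a third of the 1216 kB/s ceiling, and a frame carrying 16 bytes corresponds to only 16 kB/s.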

UPDATE: MMini Libmaple against Putty (Win7_64b) – the same results as the above with Wireshark and USBlyzer.

Putty.JPG

Pito
Tue Aug 22, 2017 1:19 pm
This is with MMini, Libmaple F1, with this vitor_boss’ mod of my sketch (Serial.write(buf..):
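void loop() {
  uint32_t i;
  uint8_t x = 85;            // 'U' (0x55)
  uint32_t elapsed;
  uint8_t buf[bufsize];      // bufsize must be #defined before loop()

  // fill the block once, outside the timed section
  for (i = 0; i < bufsize; i++) { buf[i] = x; }

  elapsed = micros();
  // send TXCHARS bytes in blocks of bufsize bytes
  for (i = 0; i < TXCHARS; i += bufsize) {
    Serial.write(buf, bufsize);
  }
  elapsed = micros() - elapsed;
  // ... (result printout as in the first sketch)
}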


Pito
Tue Aug 22, 2017 1:52 pm
Problem: STM32Generic with CDC_BUF 128, and bufsize = 256 (speed ???)
3seconds between packets

STM32Generic CDC_BUFF 128 bufsize 256.JPG

victor_pv
Tue Aug 22, 2017 2:28 pm
Why did you mark this as problem?
Problem: STM32Generic with CDC_BUF 128, and bufsize = 256 (speed ???)
3seconds between packets

Pito
Tue Aug 22, 2017 4:48 pm
Because of 3secs between the packets when the bufsize in the sketch is 256bytes (or 512, 1024 etc)..
That is even visible in TT as the 4kB chunks always nap for 3secs..

Another test – counting bytes – logged into TT file.
To make it simple I’ve TXed 1.280.000 bytes
#define bufsize 64
#define TXCHARS (bufsize * 20000)


danieleff
Wed Aug 23, 2017 4:39 pm
[Pito – Tue Aug 22, 2017 4:48 pm] –
Libmaple F1, bufsize 64 in sketch, logged 1.280.000 bytes -> OK.

My problem is that I do get losses with libmaple maple mini TeraTerm (4.95 (SVN# 6761) May 31 2017 20:25:51) (Win10, usbser.sys 10.0.14393.0). Data received in log file from 1272912 to 1279552.

Edit: If I run other CPU intensive program simultaneously (https://www.mersenne.org/download/), the problem goes away, and I get all the data from the above test.


Pito
Thu Aug 24, 2017 7:06 pm
@daniel: Interesting! My hypothesis: by running the above test (mersenne CPU stress test) you are raising the internal temperature in your box so that your usb subsystem works better (ie usb clock??, voltages??).

danieleff
Fri Aug 25, 2017 5:11 am
Using this code https://stackoverflow.com/a/6037377/834966, with larger `buffer[1024]`
* original tight loop: all data received
* `Sleep(10/100/1000)` in the loop: losing the same amount(!) of data. (The TX timeout in arduino code is commented out, blocking all the way!)

Code from first post on Teensy 3.5 in TeraTerm: losing some data.

I am starting to suspect it is usbser.sys. Are there any alternative drivers? I will need to run a linux live CD on this machine.

(I also get the 128/256 buffer 3sec problem, but that will be something else entirely.)


Pito
Fri Aug 25, 2017 6:43 am
In a discussion I’ve seen they recommend WinUSB instead of usbser.sys..
