Very lean Serial.printf()

Slammer
Wed Apr 13, 2016 11:58 pm
I very much miss the old-school printf() function….
I am tired of writing Serial.print() commands again and again to display some values. Unfortunately the typical printf function is very big: the libc version can take almost half or more of the memory of an F103C8. It is not only the function itself, it is the number of functions that the linker pulls in when this library is called (floating point, helpers, stdio stuff, etc.).
I remember that years ago there were very optimised versions of printf() for use with 8051s, taking no more than 2KB. For this reason I opened my old notebooks and found some versions of these functions. I tried to adapt them to the STM32 environment, inside the Print class. I tested 6-8 functions and ended up with the version from the sdcc compiler. I removed all the MCU-specific stuff, and after some changes the code is running on STM32.

What I have until now:
– A printf function that supports integer (long, short, byte), string, character and pointer variables.
– ‘ ‘, -, +, b, l, c, s, p, d, i, o, u, and x format specifiers
– No static variables, no dynamic allocation, no globals, just plain functions running with local variables.
– Print.printf member function
– Very small size : 1080 bytes (for the 3 core functions) and 64 bytes for Print.printf() encapsulation.
– Until now I can't get rid of the temporary buffer declared inside Print.printf(), but I am sure that the buffer is not needed (WIP…). For this reason the output of the function is limited by the size of this buffer, currently 64 bytes (there is no overflow check!!!).
– No floating point but it is possible to include this functionality, I have to examine the impact of this.

I think that code needs some cleanup, but it is a good start.

The code is contained in one file called print_format.c, which includes the main function and some helpers.
Two of our core files must be changed.
First, Print.h for the declaration of the function. Roger already has the function there with header guards, so I wrote a new line:
int printf(const char * format, ...);


ahull
Thu Apr 14, 2016 11:52 am
That looks much more compact than any of the other printf() variants I have seen in arduinoland. :D Good work.

Slammer
Thu Apr 14, 2016 12:22 pm
I am thinking about some optimizations to reduce the memory further… e.g.
– remove the dedicated pointer printing (it is possible to print pointers as unsigned long integers with the integer specifiers)
– call back directly into the Print.write functions without buffering (very important for Serial or LCD printing)

I am also studying some algorithms to convert a floating point number to a string without float functions. This is essential for float support without the overhead of the floating point libraries. Of course, there are some limitations, because these stripped-down functions do not support scientific notation, have a fixed precision, etc.
My best so far is about 1500 bytes more… but this is WIP….
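
For readers who want to picture the fixed-precision idea, here is a minimal host-side sketch. It is not the algorithm under study: the name formatFixed is hypothetical, and it still uses one float subtract and one float multiply (so the compiler's soft-float helpers would be linked on an F103), but all digit extraction is done with integers and there is no scientific notation. It builds a std::string only so it can be tried on a PC; on the target each character would go to write() instead.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from this thread): formats `value` with a fixed
// number of decimals. No scientific notation, no NaN/Inf handling, and the
// integer part must fit in 32 bits.
std::string formatFixed(float value, unsigned decimals)
{
    std::string out;
    if (value < 0) {
        out += '-';
        value = -value;
    }

    // Scale factor 10^decimals, computed with integers only.
    uint32_t scale = 1;
    for (unsigned i = 0; i < decimals; ++i) scale *= 10;

    uint32_t whole = static_cast<uint32_t>(value);
    // Fractional part scaled to an integer and rounded to nearest.
    uint32_t frac = static_cast<uint32_t>((value - whole) * scale + 0.5f);
    if (frac >= scale) {            // rounding carried into the integer part
        frac -= scale;
        ++whole;
    }

    out += std::to_string(whole);
    if (decimals > 0) {
        out += '.';
        // Emit the fraction with leading zeros, most significant digit first.
        for (uint32_t div = scale / 10; div > 0; div /= 10) {
            out += static_cast<char>('0' + (frac / div) % 10);
        }
    }
    return out;
}
```

With this sketch, formatFixed(3.14159f, 2) gives "3.14" and formatFixed(0.999f, 2) rounds up to "1.00".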


mrburnette
Thu Apr 14, 2016 12:27 pm
Looks good! On the STM32F103, 1K flash is not an inordinate quantity of storage to lose to a useful function. I’ll have to play around with this later and get the SRAM component, but with bootloader 2.0, 20K of SRAM is usually an ample amount even when large demands are placed on the uC.

But, when SRAM and flash are very tight (maybe a UNO, Nano…) then Mikal Hart’s Streaming macro is a no-load way of handling print output formatting. http://arduiniana.org/libraries/streaming/
One can even pull off simple “logic” within the stream:

Example by Rob Tillaart:
#include <Streaming.h>
// .....
int h = 14;
int m = 6;
Serial << ((h<10)?"0":"") << h << ":" << ((m<10)?"0":"") << m << endl;


Slammer
Thu Apr 14, 2016 12:39 pm
Apart from the 64-byte buffer in the Print.printf() member function, the core function uses local variables on the stack. If I am counting the bytes correctly, this is about 26 bytes…
Furthermore, if only Serial.printf is used in the program, without the different versions of Serial.print()/Serial.println(), the total size of the program may be smaller.

mrburnette
Thu Apr 14, 2016 12:55 pm
Slammer wrote:
<…>

martinayotte
Thu Apr 14, 2016 1:45 pm
Congrats! You should submit a PR for that!

(On ESP it has been there for a while, but it was much simpler to implement, since there was enough space to use vsnprintf() …)


Rick Kimball
Thu Apr 14, 2016 2:17 pm
The down side to printf is that formatting is done using runtime code. If you use the streaming approach or just plain Serial.print all the decisions on how to format are done at compile time. If you are looking for just that little bit extra speed, you might want to stick to the standard stuff.

Over on forum.43oh.com we had long discussions about printf. One user there, Opossum, had a unique approach that resulted in small code that used no buffering. See this post: http://forum.43oh.com/topic/1289-tiny-printf-c-version/


Slammer
Thu Apr 14, 2016 2:51 pm
Thanks for the info.
I evaluated 6-8 different implementations of small printf() functions, with additional modifications.

I want something with a small footprint (about 1K is OK), no static variables (we need reentrancy), long int support, the 0 and (space) specifiers, full support for all integer types, optional float support but without linking the floating point libraries (OK, I can give 1-1.5 KB more for that), and integration into the Print class.


Slammer
Thu Apr 14, 2016 11:20 pm
While trying to solve the problem with the buffer in printf(), I realised that I had to move in a different direction.
Instead of trying to encapsulate the C function and the callback inside the Print class (which does not actually have a write() function, only a virtual one), it is better to write printf() as a native C++ member function of Print (i.e. Arduino style). The same approach is used for the other functions of Print anyway.

Now there is no C code; all the printf functionality is in one function inside the Print class. There is no need for a callback to write something, because it is very easy to call the virtual function write() of Print. I also need a small internal function to calculate the digits of numerical values.

The total size of these 2 functions is 0x22+0x318 = 0x33A = 826 bytes (no buffers, no static variables).
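
To illustrate the design choice only, here is a small host-side sketch of a member function that emits characters straight through a virtual write(), with no output buffer and no callback pointer. MiniPrint, StringSink and printUnsigned are hypothetical stand-ins, not the real Print class or any function of the core.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Simplified stand-in for the Arduino Print class (assumption: only the
// single-byte virtual write() matters for this illustration).
class MiniPrint {
public:
    virtual ~MiniPrint() {}
    virtual size_t write(uint8_t c) = 0;

    // Emits an unsigned decimal number one character at a time through
    // write(). The only storage is a small digit array on the stack.
    size_t printUnsigned(uint32_t n)
    {
        char digits[10];            // a 32-bit value has at most 10 digits
        size_t count = 0;
        do {                        // collect digits, least significant first
            digits[count++] = static_cast<char>('0' + n % 10);
            n /= 10;
        } while (n != 0);

        size_t written = 0;
        while (count > 0) {         // send them most significant first
            written += write(static_cast<uint8_t>(digits[--count]));
        }
        return written;
    }
};

// Host-side sink used only for the demonstration; on the target this role
// is played by Serial, an LCD driver, and so on.
class StringSink : public MiniPrint {
public:
    std::string text;
    size_t write(uint8_t c) override
    {
        text += static_cast<char>(c);
        return 1;
    }
};
```

Calling printUnsigned(826) on a StringSink leaves "826" in its text member and returns 3. Any class derived from the base gets the formatting for free, which is the point of putting printf() inside Print.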

To use it, add this code at the end of Print.cpp:

//------------------------------------------------
#include <stdarg.h> // va_list, va_start, va_arg, va_end
#include <string.h> // strlen
#ifdef toupper
#undef toupper
#endif
#ifdef tolower
#undef tolower
#endif
#ifdef islower
#undef islower
#endif
#ifdef isdigit
#undef isdigit
#endif

#define toupper(c) ((c)&=0xDF)
#define tolower(c) ((c)|=0x20)
#define islower(c) ((unsigned char)c >= (unsigned char)'a' && (unsigned char)c <= (unsigned char)'z')
#define isdigit(c) ((unsigned char)c >= (unsigned char)'0' && (unsigned char)c <= (unsigned char)'9')

typedef union {
unsigned char byte[5];
long l;
unsigned long ul;
float f;
const char *ptr;
} value_t;

size_t Print::printDigit(unsigned char n, bool lower_case)
{
register unsigned char c = n + (unsigned char)'0';

if (c > (unsigned char)'9') {
c += (unsigned char)('A' - '0' - 10);
if (lower_case)
c += (unsigned char)('a' - 'A');
}
return write(c);
}

static void calculateDigit (value_t* value, unsigned char radix)
{
unsigned long ul = value->ul;
unsigned char* pb4 = &value->byte[4];
unsigned char i = 32;

do {
*pb4 = (*pb4 << 1) | ((ul >> 31) & 0x01);
ul <<= 1;

if (radix <= *pb4 ) {
*pb4 -= radix;
ul |= 1;
}
} while (--i);
value->ul = ul;
}

size_t Print::printf(const char *format, ...)
{
va_list ap;
bool left_justify;
bool zero_padding;
bool prefix_sign;
bool prefix_space;
bool signed_argument;
bool char_argument;
bool long_argument;
bool lower_case;
value_t value;
int charsOutputted;
bool lsd;

unsigned char radix;
unsigned char width;
signed char decimals;
unsigned char length;
char c;
// reset output chars
charsOutputted = 0;

va_start(ap, format);
while( c=*format++ ) {
if ( c=='%' ) {
left_justify = 0;
zero_padding = 0;
prefix_sign = 0;
prefix_space = 0;
signed_argument = 0;
char_argument = 0;
long_argument = 0;
radix = 0;
width = 0;
decimals = -1;

get_conversion_spec:
c = *format++;

if (c=='%') {
charsOutputted+=write(c);
continue;
}

if (isdigit(c)) {
if (decimals==-1) {
width = 10*width + c - '0';
if (width == 0) {
zero_padding = 1;
}
} else {
decimals = 10*decimals + c - '0';
}
goto get_conversion_spec;
}
if (c=='.') {
if (decimals==-1)
decimals=0;
else
; // duplicate, ignore
goto get_conversion_spec;
}
if (islower(c)) {
c = toupper(c);
lower_case = 1;
} else
lower_case = 0;

switch( c ) {
case '-':
left_justify = 1;
goto get_conversion_spec;
case '+':
prefix_sign = 1;
goto get_conversion_spec;
case ' ':
prefix_space = 1;
goto get_conversion_spec;
case 'B': /* byte */
char_argument = 1;
goto get_conversion_spec;
// case '#': /* not supported */
case 'H': /* short */
case 'J': /* intmax_t */
case 'T': /* ptrdiff_t */
case 'Z': /* size_t */
goto get_conversion_spec;
case 'L': /* long */
long_argument = 1;
goto get_conversion_spec;

case 'C':
// a char argument is promoted to int when passed through "...",
// so it must always be read back as int
c = va_arg(ap,int);
charsOutputted+=write(c);
break;

case 'S':
value.ptr = va_arg(ap,const char *);

length = strlen(value.ptr);
if ( decimals == -1 ) {
decimals = length;
}
if ( ( !left_justify ) && (length < width) ) {
width -= length;
while( width-- != 0 ) {
charsOutputted+=write(' ');
}
}

while ( (c = *value.ptr) && (decimals-- > 0)) {
charsOutputted+=write(c);
value.ptr++;
}

if ( left_justify && (length < width)) {
width -= length;
while( width-- != 0 ) {
charsOutputted+=write(' ');
}
}
break;

case 'D':
case 'I':
signed_argument = 1;
radix = 10;
break;

case 'O':
radix = 8;
break;

case 'U':
radix = 10;
break;

case 'X':
radix = 16;
break;

default:
// nothing special, just output the character
charsOutputted+=write(c);
break;
}

if (radix != 0) {
unsigned char store[6];
unsigned char *pstore = &store[5];

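if (char_argument) {
// char is promoted to int when passed through "...", so read it as int
value.l = va_arg(ap, int);
if (!signed_argument) {
value.l &= 0xFF;
}
} else if (long_argument) {
value.l = va_arg(ap, long);
} else if (signed_argument) {
value.l = va_arg(ap, int);
} else {
// read unsigned int directly; a 0xFFFF mask would truncate
// the value where int is 32 bits wide (STM32)
value.ul = va_arg(ap, unsigned int);
}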

if ( signed_argument ) {
if (value.l < 0)
value.l = -value.l;
else
signed_argument = 0;
}

length=0;
lsd = 1;

do {
value.byte[4] = 0;
calculateDigit(&value, radix);
if (!lsd) {
*pstore = (value.byte[4] << 4) | (value.byte[4] >> 4) | *pstore;
pstore--;
} else {
*pstore = value.byte[4];
}
length++;
lsd = !lsd;
} while( value.ul );
if (width == 0) {
// default width. We set it to 1 to output
// at least one character in case the value itself
// is zero (i.e. length==0)
width = 1;
}
/* prepend spaces if needed */
if (!zero_padding && !left_justify) {
while ( width > (unsigned char) (length+1) ) {
charsOutputted+=write(' ');
width--;
}
}
if (signed_argument) { // this now means the original value was negative
charsOutputted+=write('-');
// adjust width to compensate for this character
width--;
} else if (length != 0) {
// value > 0
if (prefix_sign) {
charsOutputted+=write('+');
// adjust width to compensate for this character
width--;
} else if (prefix_space) {
charsOutputted+=write(' ');
// adjust width to compensate for this character
width--;
}
}
/* prepend zeroes/spaces if needed */
if (!left_justify) {
while ( width-- > length ) {
charsOutputted+=write( zero_padding ? '0' : ' ');
}
} else {
/* spaces are appended after the digits */
if (width > length)
width -= length;
else
width = 0;
}
/* output the digits */
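while( length-- ) {
lsd = !lsd;
if (!lsd) {
pstore++;
value.byte[4] = *pstore >> 4;
} else {
value.byte[4] = *pstore & 0x0F;
}
charsOutputted+=printDigit(value.byte[4], lower_case);
}
/* append the spaces computed above when left-justified */
if (left_justify) {
while ( width != 0 ) {
charsOutputted+=write(' ');
width--;
}
}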
}
} else {
charsOutputted+=write(c);
}
}
va_end(ap);
return (size_t)charsOutputted;
}
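
A note on calculateDigit(), as I read the code above: it is a shift-and-subtract (restoring) division of the 32-bit value by the radix. Each call leaves the quotient in value->ul and the remainder, which is the next digit, in value->byte[4]. That relies on long being 4 bytes, so that byte[4] sits just past it in the union; this holds on AVR and on 32-bit ARM, but not on a 64-bit PC. The sketch below does the same loop with the remainder in its own variable so it can be tried on any host. DivStep and divideByRadix are hypothetical names used only here.

```cpp
#include <cstdint>

// Result of one division step: the quotient plus the digit that fell out.
struct DivStep {
    uint32_t quotient;
    uint8_t  remainder;
};

// Restoring division of a 32-bit value by a small radix (2..16).
// Mirrors the loop in calculateDigit(), but keeps the remainder in a
// separate variable instead of the fifth byte of a union.
DivStep divideByRadix(uint32_t value, uint8_t radix)
{
    uint8_t remainder = 0;
    for (int i = 0; i < 32; ++i) {
        // Shift the top bit of the value into the remainder.
        remainder = static_cast<uint8_t>((remainder << 1) | (value >> 31));
        value <<= 1;
        // If the radix fits, subtract it and record a 1 bit in the quotient.
        if (remainder >= radix) {
            remainder -= radix;
            value |= 1;
        }
    }
    return {value, remainder};
}
```

For example, divideByRadix(1234, 10) yields quotient 123 and remainder 4. Repeating the call on the quotient until it reaches zero produces the digits from least to most significant, which is why printf() stores them first and outputs them in reverse.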


mrburnette
Thu Apr 14, 2016 11:32 pm
Slammer wrote:
<…>
No other implementation is so small. I tried almost everything; there are smaller implementations, but they don’t support all types of integers or width specifiers, or they use static variables….

Slammer
Fri Apr 15, 2016 12:21 am
And for those who say that STM32 wastes more memory than AVR: the same function on the Uno needs 0x40C+0x2A = 1078 bytes.

mrburnette
Fri Apr 15, 2016 2:31 pm
mrburnette wrote:
<…>
Now, how fast is it?

Ray


Slammer
Fri Apr 15, 2016 9:21 pm
Actually this method does not measure the printf itself…. but the time that the MCU needs to write the characters to the UART.
Our implementation of the serial write (usart_putc) is not interrupt based and does not use a buffer. As a result, while writing a character to serial the MCU just waits for the transmission to end (I don't know the internals of the STM32, but it waits either for the current char to be pushed out or for the previous one… the result is almost the same if you want to push multiple bytes to the UART).
In a typical application a ring buffer should be used: usart_putc is normally the entry point to the ring buffer and does not block. The TX interrupt triggers the sending of the next character until the buffer is empty.
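
A rough sketch of such a ring buffer, for illustration only. TxRing and its size are hypothetical and not taken from the core; the sketch assumes a single producer (the put side, called from the main code) and a single consumer (the get side, called from the TX interrupt).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical transmit ring buffer; names and size are illustrative only.
class TxRing {
public:
    // Non-blocking: returns false when the buffer is full instead of waiting.
    bool put(uint8_t c)
    {
        size_t next = (head_ + 1) % kSize;
        if (next == tail_) return false;    // full
        data_[head_] = c;
        head_ = next;
        return true;
    }

    // Meant to be called from the "transmit register empty" interrupt:
    // fetches the next byte to send, or returns false when nothing is queued.
    bool get(uint8_t &c)
    {
        if (tail_ == head_) return false;   // empty
        c = data_[tail_];
        tail_ = (tail_ + 1) % kSize;
        return true;
    }

private:
    static const size_t kSize = 64;         // holds up to kSize - 1 bytes
    uint8_t data_[kSize];
    volatile size_t head_ = 0;              // advanced only by put()
    volatile size_t tail_ = 0;              // advanced only by get()
};
```

In this arrangement the putc routine would call put() and enable the TX interrupt, and the interrupt handler would call get() and disable itself once get() returns false.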
The time that a character needs to leave the UART is not small. At 115200 baud a character needs almost 1/11520 sec = 87 usec (10 bits per character with 8N1 framing), which is a really long time for a 72MHz MCU (at 9600 it is an eternity….)
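
The arithmetic behind these numbers, as a small sketch. It assumes 8N1 framing (1 start bit, 8 data bits, 1 stop bit); the function names are made up for the example.

```cpp
#include <cstdint>

// Time to shift one character out of a UART, in microseconds.
// Assumes 8N1 framing: 1 start + 8 data + 1 stop = 10 bits per character.
constexpr double charTimeMicros(uint32_t baud, unsigned bitsPerChar = 10)
{
    return 1e6 * bitsPerChar / baud;
}

// CPU cycles that elapse while one character is on the wire.
constexpr double cyclesPerChar(uint32_t baud, uint32_t cpuHz)
{
    return charTimeMicros(baud) * cpuHz / 1e6;
}
```

charTimeMicros(115200) is about 86.8 usec, and cyclesPerChar(115200, 72000000) is about 6250 cycles spent waiting per character if the write blocks. At 9600 baud the same calculation gives roughly 1042 usec, or 75000 cycles.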

mrburnette
Fri Apr 15, 2016 9:36 pm
Slammer wrote:
<…>

Slammer
Fri Apr 15, 2016 9:56 pm
The NodeMCU is really the beast for 2$.
The program memory is so huge for MCU applications that a full 50K-60K version of printf is almost nothing…. that’s why Print.printf() is included in the esp8266 core.
On the other hand, the MCUs I have used most in the last 25 years of my professional life are the ATmega8 and 89C52… On these machines even the Print class is a luxury… you have to live with basic itoa and ltoa….
Anyway, it may be more useful for our community to try to improve some core functions, like buffered transmit on the UART… (lol, I want something to keep my nights busy….)

PS: I am afraid that measuring the transmit time with Serial.write is more complex. My previous post about the UART is technically correct, but the Serial.XXX functions do not use a real UART but an emulated device over USB that acts as a UART. Timing this device is not an easy task…
The concept is the same because of the blocking nature of uart_putc, but the timing is unknown.


mrburnette
Fri Apr 15, 2016 10:10 pm
Slammer wrote:
<…>

Slammer
Fri Apr 15, 2016 11:00 pm
Look at the results of this code (loop only):

long uScount;
digitalWrite(BOARD_LED_PIN, HIGH); delay(500); // LED_on + half-second
uScount = micros();
Serial1.println("This is my big fat text.... 50 characters long....");
uScount = micros() - uScount;
Serial.print("\t\t\t\t printf Serial1 : uS="); Serial.println( uScount);
uScount = micros();
Serial.println("This is my big fat text.... 50 characters long....");
uScount = micros() - uScount;
Serial.print("\t\t\t\t printf USB : uS="); Serial.println( uScount);
digitalWrite(BOARD_LED_PIN, LOW); delay(500); // LED_off + half-second


mrburnette
Sat Apr 16, 2016 2:12 am
The result reveals the truth about Uart/USB timing

Yes, agree.

Ray


victor_pv
Sun Jan 15, 2017 4:44 pm
Slammer, I just used your printf function in a small SWO library, I hope you don’t mind, I gave you credit ;)

Now, about printf in the print class, I understand Ray’s point, that adding 1KB of code here, 1 there, eventually adds up to a good amount, and people may not need it, but I thought if a sketch doesn’t use printf, that printf would not be included by the compiler/linker, so unless it’s actually used by the sketch, there is no difference whether the function is present in the print class or not.
Am I missing something?

Besides that, one other observation: I tested using sprintf to convert a message to a string, to then be able to print it with println, in a test sketch for SWO, and it increases the sketch by about 15KB of flash and 1.5KB of RAM. By comparison, Slammer's printf increases the size by only about 1KB.
Given that, if printf doesn’t take space when not used, and adding it to the core may save people from having to use sprintf, my vote would go to including it in the core.

EDIT: I compiled my SWO test sketch using println and using printf. In both cases printf is part of the SWO class. It really seems it doesn’t add to the code unless used.
Then I went one step further and included it in the print class instead, and I get a similar result: the code size does not grow when it is not used. So I see no harm in adding it to the core. I am not sure why Ray’s test was showing increased size when not used.
printf in SWO class but not used in the sketch (only println):
Sketch uses 15,940 bytes (3%) of program storage space. Maximum is 524,288 bytes.
Global variables use 2,952 bytes of dynamic memory.

printf in SWO class and used:
Sketch uses 16,900 bytes (3%) of program storage space. Maximum is 524,288 bytes.
Global variables use 2,952 bytes of dynamic memory.

printf in print class and used:
Sketch uses 16,908 bytes (3%) of program storage space. Maximum is 524,288 bytes.
Global variables use 2,952 bytes of dynamic memory.

printf in print class and not used:
Sketch uses 15,940 bytes (3%) of program storage space. Maximum is 524,288 bytes.
Global variables use 2,952 bytes of dynamic memory.

