ATtiny85 flash memory unknown usage

ridoluc · May 25, 2020, 12:23pm

I’ve been testing some buzzer functions on an ATtiny85 and I noticed a weird flash memory usage fluctuation when changing constant arguments of a function.

In more detail: when I compile the following code, it occupies 162 bytes of flash.

int main(){
  beep_once(1000,500);
  silence(1000);
  beep_once(1000,500);
  silence(200);
  for(;;){ }
}

While just changing the first beep_once(1000,500); to beep_once(1500,500); the program size becomes 1378 bytes.

Using the memory inspection in PlatformIO, I can see that roughly 1 kB of memory is attributed to “unknown”. What does that mean?

I hope someone can help me understand what is going on and where the problem lies: PlatformIO, the compiler, or me.

Here is the full code used for the test:

#include <avr/interrupt.h>

#define F_CLK 1000000

void silence(unsigned int);
void beep_once(unsigned int,  unsigned int);

int main(){
  beep_once(1500,500);
  silence(1000);
  beep_once(1000,500);
  silence(200);
  
  for(;;){ }
}


void silence(unsigned int duration)
{
    TCCR1 = 5; // Set the prescaler to 16 (CK/16).
    OCR1A = 61; // Tops at 1ms with prescaler of 16x 
    unsigned int counter=duration;
    do
    {
        if ((TIFR >> OCF1A) & 0x01)
        {
            counter--;
            TCNT1 = 0;  // Reset counter to 0
            TIFR |= 1 << OCF1A; // Clear the flag
        }
    } while (counter);
    TCCR1 &= ~((1 << CS12) | (1 << CS11) | (1 << CS10));    // Stop the timer
}

void beep_once(unsigned int frequency /*Hertz*/, const unsigned int duration /*milliseconds*/)
{
    /*
    *  The frequency is given by:
    *  Freq = F_CLK / (prescaler * (1 + OCR1C))
    */

    TCCR1 |= (5 << CS10) |(1 << CTC1);   // Set the prescaler to 16 (for 1MHz)(freq range from 31kHz to 250 Hz)
    OCR1C = (uint8_t)((F_CLK / frequency) >> 5); // >> 5 divides by 32, i.e. twice the prescaler (16)
    TIMSK &= ~(1 << OCIE1B); // Disable timer compare interrupt
    GTCCR |= 1 << COM1B0; // Timer Counter Comparator B connected to output pin OC1B.
    unsigned int counter  = duration * ((float)frequency / 1000.) * 2;
    do
    {
        if ((TIFR >> OCF1B) & 0x01)
        {
            counter--;
            TIFR |= 1 << OCF1A; // Clear the flag
        }
    } while (counter);
    TCCR1 &= ~((1 << CS12) | (1 << CS11) | (1 << CS10));    // Stop the timer
    GTCCR &= ~(1 << COM1B0);           // Timer Counter Comparator B disconnected from output pin OC1B.
}

maxgerhardt · May 25, 2020, 4:21pm

My best guess: compiler optimization. If you only call beep_once(1000,500); then frequency = 1000, and the compiler can precompute that 1000 / 1000.0 = 1.0 and simply substitute that known value. With 1500 as a second, different argument, however, it can no longer precompute it. It now has to synthesize a floating-point division in software, which is expensive as hell in code size on an 8-bit microcontroller without an FPU. Oh, and not only a divide: since you also multiply the float result by 2, you need a float multiply operation as well. You can, however, just write (float)frequency/500 instead of that /1000 * 2.

I’m taking another look at the compiled output, but that’s pretty much my best guess.

You can ask yourself: do you really need a floating-point division in this place? Can’t you do it with integer divisions? Multiplications and divisions by powers of two are cheap, since they are just left or right shifts (x * 4 == x << 2 and x / 512 == x >> 9 for unsigned x).
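As a small illustration of the shift remark above, here is a minimal sketch for unsigned 16-bit values. The helper names times_four and divide_by_512 are made up for this example and are not part of the original code.

```cpp
#include <cassert>
#include <cstdint>

// Multiplying an unsigned value by 2^n is a left shift by n bits.
// (Hypothetical helper name, for illustration only.)
constexpr std::uint16_t times_four(std::uint16_t x) {
    return static_cast<std::uint16_t>(x << 2);
}

// Dividing an unsigned value by 2^n is a right shift by n bits.
// (Hypothetical helper name, for illustration only.)
constexpr std::uint16_t divide_by_512(std::uint16_t x) {
    return static_cast<std::uint16_t>(x >> 9);
}

// Checked by the compiler, so nothing is computed at run time here.
static_assert(times_four(25) == 25 * 4, "x << 2 equals x * 4");
static_assert(divide_by_512(1500) == 1500 / 512, "x >> 9 equals x / 512");
```

For signed values the right shift and the division can round differently for negative numbers, so the equivalence is only claimed for unsigned types here.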

maxgerhardt · May 25, 2020, 4:39pm

Yeah, as I thought. If you keep the firmware at only 1 possible argument value for frequency, you get a firmware which has these functions:

avr-objdump.exe -d C:\Users\Maxi\Desktop\Programming_stuff\playground\.pio\build\attiny85\firmware.elf

C:\Users\Maxi\Desktop\Programming_stuff\playground\.pio\build\attiny85\firmware.elf:     file format elf32-avr


Disassembly of section .text:

00000000 <__vectors>:
<vector table>

0000001e <__ctors_end>:
<low-level-init> 
0000002e <__bad_interrupt>:
  2e:   e8 cf           rjmp    .-48            ; 0x0 <__vectors>

00000030 <beep_once.constprop.0>:
  30:   80 b7           in      r24, 0x30       ; 48
  32:   85 68           ori     r24, 0x85       ; 133
  34:   80 bf           out     0x30, r24       ; 48
  36:   8f e1           ldi     r24, 0x1F       ; 31
  38:   8d bd           out     0x2d, r24       ; 45
  3a:   89 b7           in      r24, 0x39       ; 57
  3c:   8f 7d           andi    r24, 0xDF       ; 223
  3e:   89 bf           out     0x39, r24       ; 57
  40:   8c b5           in      r24, 0x2c       ; 44
  42:   80 61           ori     r24, 0x10       ; 16
  44:   8c bd           out     0x2c, r24       ; 44
  46:   88 ee           ldi     r24, 0xE8       ; 232
  48:   93 e0           ldi     r25, 0x03       ; 3
  4a:   08 b6           in      r0, 0x38        ; 56
  4c:   05 fe           sbrs    r0, 5
  4e:   fd cf           rjmp    .-6             ; 0x4a <__SREG__+0xb>
  50:   28 b7           in      r18, 0x38       ; 56
  52:   20 64           ori     r18, 0x40       ; 64
  54:   28 bf           out     0x38, r18       ; 56
  56:   01 97           sbiw    r24, 0x01       ; 1
  58:   c1 f7           brne    .-16            ; 0x4a <__SREG__+0xb>
  5a:   80 b7           in      r24, 0x30       ; 48
  5c:   88 7f           andi    r24, 0xF8       ; 248
  5e:   80 bf           out     0x30, r24       ; 48
  60:   8c b5           in      r24, 0x2c       ; 44
  62:   8f 7e           andi    r24, 0xEF       ; 239
  64:   8c bd           out     0x2c, r24       ; 44
  66:   08 95           ret

00000068 <silence>:
...

0000008c <main>:
  8c:   d1 df           rcall   .-94            ; 0x30 <beep_once.constprop.0>
  8e:   88 ee           ldi     r24, 0xE8       ; 232
  90:   93 e0           ldi     r25, 0x03       ; 3
  92:   ea df           rcall   .-44            ; 0x68 <silence>
  94:   cd df           rcall   .-102           ; 0x30 <beep_once.constprop.0>
  96:   88 ec           ldi     r24, 0xC8       ; 200
  98:   90 e0           ldi     r25, 0x00       ; 0
  9a:   e6 df           rcall   .-52            ; 0x68 <silence>
  9c:   ff cf           rjmp    .-2             ; 0x9c <main+0x10>

0000009e <_exit>:
  9e:   f8 94           cli

000000a0 <__stop_program>:
  a0:   ff cf           rjmp    .-2             ; 0xa0 <__stop_program>

Notice how there’s a function called

00000030 <beep_once.constprop.0>:

“constant propagation” is exactly the optimization step which I talked about earlier: Just use the known constant, precompute the needed values in the function. The compiler did that here for you.

Now let’s use one argument with 1000 and one 1500 and the firmware now has these functions:

>avr-objdump.exe -d C:\Users\Maxi\Desktop\Programming_stuff\playground\.pio\build\attiny85\firmware.elf  | grep ">:"
00000000 <__vectors>:
0000001e <__ctors_end>:
0000002e <__bad_interrupt>:
00000030 <beep_once.constprop.0>:
000000c0 <silence>:
000000e4 <main>:
000000fe <__subsf3>:
00000100 <__addsf3>:
00000122 <__addsf3x>:
000001c8 <__divsf3>:
000001e2 <__divsf3x>:
000001e6 <__divsf3_pse>:
00000298 <__fixunssfsi>:
000002f0 <__floatunsisf>:
000002f4 <__floatsisf>:
0000036a <__fp_inf>:
00000376 <__fp_nan>:
0000037c <__fp_pscA>:
0000038a <__fp_pscB>:
00000398 <__fp_round>:
000003ba <__fp_split3>:
000003ca <__fp_splitA>:
000003fe <__fp_zero>:
00000400 <__fp_szero>:
0000040c <__mulsf3>:
00000422 <__mulsf3x>:
00000426 <__mulsf3_pse>:
000004e2 <__divmodsi4>:
000004fa <__divmodsi4_neg2>:
00000508 <__divmodsi4_exit>:
0000050a <__negsi2>:
0000051a <__udivmodsi4>:
00000526 <__udivmodsi4_loop>:
00000540 <__udivmodsi4_ep>:
0000055e <_exit>:
00000560 <__stop_program>:

You can clearly see how there are a trillion compiler-integrated functions for floating point operations for add, sub, div, mul, negate, NaNs, infinity… all the required stuff to do floating point operations that is.

And that’s why your firmware size explodes.

maxgerhardt · May 25, 2020, 4:44pm

Now, regarding a solution: Try to avoid general floating point operations during runtime at all costs.

It’s totally okay if the constant can be computed during compile-time:

#define COUNTER_VALUE_1000_HZ ((int)(1000.0 / 1000.0 * 2.0))

can be computed at compile time (if optimizations are turned on) and the MCU doesn’t need to do that.

For example, why not have a lookup-value of only permitted frequencies which you actually need? You can put those in a lookup table:

typedef enum {
   FREQ_1000_HZ = 0,
   FREQ_1500_HZ = 1
   //...
} freq_t;

const unsigned int counter_lookup[] = {
    (unsigned) (1000.0 / 1000.0 * 2),
    (unsigned) (1500.0 / 1000.0 * 2)
};

//... in function 
void beep_once(freq_t frequency /*enum value*/, const unsigned int duration /*milliseconds*/) {
// ...
unsigned int counter  = duration * counter_lookup[frequency];
//..

and the runtime floating-ops are gone.
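Put together, the lookup-table idea might look roughly like the sketch below. It is written so that it also compiles on a PC; the FREQ_COUNT sentinel, the fixed-width types and the helper name beep_counter are assumptions added for the example, not part of the original code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical set of permitted frequencies; extend as needed.
enum freq_t : std::uint8_t {
    FREQ_1000_HZ = 0,
    FREQ_1500_HZ = 1,
    FREQ_COUNT  // number of entries (assumed sentinel, not in the original)
};

// Per-millisecond factor for each entry: frequency / 1000 * 2.
// The floating-point expressions are constant and folded by the compiler.
constexpr std::uint16_t counter_lookup[FREQ_COUNT] = {
    static_cast<std::uint16_t>(1000.0 / 1000.0 * 2),  // 2
    static_cast<std::uint16_t>(1500.0 / 1000.0 * 2),  // 3
};

// Number of timer events to wait for; only an integer multiply at run time.
std::uint16_t beep_counter(freq_t frequency, std::uint16_t duration_ms) {
    assert(frequency < FREQ_COUNT);
    return static_cast<std::uint16_t>(duration_ms * counter_lookup[frequency]);
}
```

With these two entries, beep_counter(FREQ_1500_HZ, 500) yields 1500, the same value the float formula produces. Note that the table entries are truncated to integers, so a frequency such as 1250 Hz (factor 2.5) would lose precision with this layout.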

ridoluc · May 25, 2020, 5:13pm

Wow! Thanks for the exhaustive reply. There are many valuable suggestions to think of (for me).
I understand what is happening now: the compiler is including various routines to perform the floating-point calculations.
I was aware this is a demanding task for this 8-bit MCU but never expected that could take around 1kb of firmware!
I will definitely keep in mind the look-up table suggestion.
I have a lot to learn.

maxgerhardt · May 25, 2020, 7:22pm

Ah indeed there is a better way to optimize it. I see that you’re doing a float-division but the resulting variable is an integer type – thus the decimal part of the result is not needed anyways. Then you could also try to write it

//use 32-bit integer divide and multiply instead of floating point divide and multiply
//32-bit instead of 16-bit because the multiplication of two 16-bit values can quickly overflow
//a 16-bit variable.
//also simplify / 1000 * 2 to / 500.
unsigned int counter = (unsigned int)((unsigned long)duration * frequency / 500UL);

32-bit integer math is easier than 32-bit floating point math.

The result if only 1 possible argument value of frequency is used is still

RAM:   [          ]   0.0% (used 0 bytes from 512 bytes)
Flash: [          ]   2.0% (used 162 bytes from 8192 bytes)

but with two different arguments we have

RAM:   [          ]   0.0% (used 0 bytes from 512 bytes)
Flash: [=         ]   5.2% (used 422 bytes from 8192 bytes)

which is indeed better than 1378 bytes.
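To check that the integer version gives the same counter as the original float formula for the values used in this thread, something like the following can be run on a PC. It is only a sketch: the function names counter_float and counter_int are invented here, and uint16_t stands in for the 16-bit unsigned int of the AVR.

```cpp
#include <cassert>
#include <cstdint>

// Original formula from the question: int-to-float conversion,
// a floating-point divide and floating-point multiplies.
std::uint16_t counter_float(std::uint16_t frequency, std::uint16_t duration_ms) {
    return static_cast<std::uint16_t>(
        duration_ms * (static_cast<float>(frequency) / 1000.) * 2);
}

// Integer replacement: widen to 32 bits before multiplying, because
// e.g. 500 * 1500 = 750000 does not fit into 16 bits, then divide by 500.
std::uint16_t counter_int(std::uint16_t frequency, std::uint16_t duration_ms) {
    return static_cast<std::uint16_t>(
        static_cast<std::uint32_t>(duration_ms) * frequency / 500UL);
}
```

For (1000, 500) and (1500, 500) both functions return 1000 and 1500 respectively. For other inputs the two may differ by one, since the float path truncates after rounding errors while the integer path truncates an exact quotient.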

ridoluc · May 25, 2020, 7:49pm

Understood and very helpful.
Are you able to give me a sense of the difference in computational times between integer and floating 32bit math? (i.e. clock cycles)
My thought is: let’s say I need to use a floating-point operation in another part of the code. So the additional space required for the floating-point routines is there anyway.
Is it still worth using the 32-bit integer operation, or should I just go for the float division?
I can guess the answer but don’t have evidence to justify it.

pfeerick · May 26, 2020, 8:49am

Oh boy… this takes me back to some code I was working on when I first tried out the cppcheck code checker tool that was added to PIO… and it told me I should be using constexpr instead of const … and I was going… WTH is constexpr … I’m familiar with const, but constexpr??? … well, turns out it was added to C++ in 2011 (hence C++11), and allows for a constant (unchanged value) to be determined at compile time… rather than at run time… (why would you want it const? Because it saves SRAM, and can’t be changed accidentally - perfect for formula ‘magic numbers’ etc.)
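To tie constexpr back to the buzzer code, here is a minimal sketch (assuming a C++11 compiler) in which the counter formula is evaluated by the compiler. The names beep_counter_constexpr and COUNTER_1500_HZ_500_MS are made up for the example.

```cpp
#include <cassert>

// Same formula as in beep_once(), marked constexpr so the compiler
// can evaluate it when the arguments are compile-time constants.
// (Hypothetical name, for illustration only.)
constexpr unsigned int beep_counter_constexpr(unsigned int frequency,
                                              unsigned int duration_ms) {
    return static_cast<unsigned int>(
        duration_ms * (static_cast<float>(frequency) / 1000.0f) * 2);
}

// Initializing a constexpr variable forces compile-time evaluation,
// so this value does not require floating-point routines at run time.
constexpr unsigned int COUNTER_1500_HZ_500_MS = beep_counter_constexpr(1500, 500);
static_assert(COUNTER_1500_HZ_500_MS == 1500, "computed at compile time");
```

If the function is called with values that are only known at run time, it behaves like a normal function and the floating-point code is needed again.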

When I learnt C++ it was probably with the C++98 rule book… and life was much simpler!

maxgerhardt · May 26, 2020, 5:06pm

The AVR sadly lacks a clock counter in hardware, but you can either

- analyze it statically, i.e. the number of instructions that are executed in the program flow given some values, plus the number of cycles used for each instruction
- analyze it dynamically by executing the function a few thousand times and measuring the time taken (either by toggling a GPIO which can be measured externally with a logic analyzer, or by using internal timers / millis()) to get a time value

But the size difference alone shows that there’s a lot more code now which, if all of it is executed, must take longer. Benchmarking is, however, the only true way to verify it.

That’s really dependent on the project and the specific functions. Remember: if you can avoid a floating-point operation and do integer math instead, it is pretty much guaranteed to be faster. That’s especially important in code which must be responsive or is expected to have finished in under a certain amount of time (real-time systems, interrupt routines, …).

If the execution time is negligible and rewriting it with approximate integer math would make it more complicated and unreadable, then floating-point math can be used, but it should be kept to a minimum. Certain optimization rules like “instead of dividing, multiply by the inverse, so no div operation is needed” still apply here.
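The “multiply by the inverse” rule can be sketched as follows. The constant INV_500 and the function names are invented for this example; the divisor 500 is taken from the simplified formula earlier in the thread.

```cpp
#include <cassert>
#include <cmath>

// Reciprocal of the divisor, computed once by the compiler.
// (Hypothetical constant name, for illustration only.)
constexpr float INV_500 = 1.0f / 500.0f;

// Straightforward version: one floating-point division per call.
float scale_div(float x) { return x / 500.0f; }

// Rewritten version: one floating-point multiplication per call.
float scale_mul(float x) { return x * INV_500; }
```

The two results can differ in the last bit because 1/500 is not exactly representable as a float, so this rewrite is only appropriate where such a small error is acceptable.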

Btw, you can also choose a new compiler or use different flags. See e.g. docs for selecting certain toolchain-atmelavr versions for the avr-gcc. For example, adding

platform_packages = 
    toolchain-atmelavr @ 1.70300.191015

in the platformio.ini makes it use GCC 7.3.0 instead of 5.4.0.

Another neat trick: Fast-math optimization. If you are really, really, really sure that certain operations don’t produce floating-point edge cases like NaNs (not-a-number), positive or negative infinity and certain rounding edge cases etc, you can give the compiler the -ffast-math option, which is an alias for -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, -fcx-limited-range and -fexcess-precision=fast – as you can see from the names, this can break code if not used carefully.

E.g. using this in the platformio.ini by adding

build_flags = -ffast-math

reduces your initial code from

RAM:   [          ]   0.0% (used 0 bytes from 512 bytes)
Flash: [==        ]  16.8% (used 1378 bytes from 8192 bytes)

to

RAM:   [          ]   0.0% (used 0 bytes from 512 bytes)
Flash: [          ]   4.1% (used 334 bytes from 8192 bytes)

very significantly.