FW built around Zephyr, linking to C++ static library - unexpected crash at runtime

Hey there,

As my system is growing in many directions (HW, SW, features…), I am currently considering moving to PlatformIO to add more flexibility/stability in my system infrastructure without changing the code (C code for FW, C++ for libraries) which is building/working fine on my current environment.

Unfortunately, I am experiencing crashes (arm_fault) after running the libraries for a while (a few seconds to a few minutes). This seems to be related to my C++ libraries, possibly dynamic allocation, possibly memory alignment problems, … I’m running out of ideas on directions to investigate.
More details below.

-Do you have any clue what could be causing the problem? I know it’s difficult without a reproducible example at hand, but even suggesting other directions to look in would help.
-I know that Zephyr C++ support is not fully comprehensive yet and so I could be hitting some limits here. Have you any experience doing similar things on your own project and could you point me in the right direction?
-As far as I investigated I have the feeling it’s related to dynamic memory allocation/reallocation or memory alignment problems. Given that I use the default linker file and default libc-hooks (for Newlib C), do you think I should provide these on my own as well? How did you do yourself?

Environment description
My stable environment (on the left) → PlatformIO environment (not stable on the right)

More or less I considered the following porting:

  • Custom Makefiles → PlatformIO (using platformio.ini)
  • FreeRTOS → Zephyr RTOS (using CMSIS RTOS v2 to abstract differences)
  • Custom HW description/Init → Zephyr board abstraction + drivers
  • C++ libraries built using CMake → Same libraries build triggered by PlatformIO

My current MCU is an ARM Cortex-M7, very standard and I am able to proceed to building, uploading, debugging and unit-testing.
The board I use is custom but very similar to Nucleo-H743ZI.

platform = ststm32
board = nucleo_h743zi
framework = zephyr

The current toolchain (from the PlatformIO packages) used to build both FW and libraries is toolchain-gccarmnoneeabi@1.80201.0 (8.2.1). I reproduced the same faulty behavior with 9.2.1 and even with an external arm-gcc 10.2 toolchain.

The C++ libraries are custom C++ archives (.a), statically linked to my FW.
I cannot disclose the real usage, but they are computer-vision libraries like Eigen and ZXing, using the FPU, dynamic memory allocation and possibly C++ exceptions, but no HW access, threading or anything else. Only scientific computation.
I am aware that using such C++ features is generally not recommended in an embedded context, but so far I have encountered no problems with them on the stable setup.

Problem description
I reduced the problem to a minimal example running in a unit-test (using Unity framework).
A single main thread simply calls my library function foo in a while loop with the same known inputs and waits for a crash, which happens fast but usually only after hundreds of iterations.
The custom transportation for the test is using USB_CDC_ACM to be able to show traces.
The problem happens both in release and debug mode. Console activated or not.

#include <libfoo.hpp>
Inputs_t inputs;
foo(inputs); // After some iterations, arm_fault is called → MCU spins endlessly

I actually discovered the problem when raising the main stack size (CONFIG_MAIN_STACK_SIZE in Zephyr prj.conf) from 30KB (everything runs fine without any problems) to >100KB (crash as described).

The stack is located at the end of my zephyr_prebuilt.map file, after all code/data symbols. Only the heap memory comes after it and fills up the rest of the current SRAM section (512KB).
This leads me to the conclusion that, beyond a certain stack size, some sort of corruption becomes fatal for my system. But I have no clue where it happens.

Basically my SRAM section is divided into:

  • starts @0x24000000
  • ~30KB of code
  • 30KB (no problem) → >100KB of stack memory (faulty)
  • rest up to 512KB of heap memory.
  • __kernel__ram_end @0x24080000

The heap usage is basically around 20KB which is totally fine in my setup with ~350KB of available heap memory.
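As a quick sanity check of the figures above, the remaining heap can be computed from the section bounds. This is only a sketch: heap_bytes_left and the two macros are hypothetical names, and it assumes the code/data size is exactly 30KB and that nothing else is placed in the section.

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_SECTION_START 0x24000000u /* start address quoted in the post */
#define SRAM_SECTION_END   0x24080000u /* end address quoted in the post */

/* Bytes left for the heap once code/data and the main stack are placed.
 * Hypothetical helper: the real layout is decided by the linker script. */
uint32_t heap_bytes_left(uint32_t code_bytes, uint32_t stack_bytes)
{
    uint32_t total = SRAM_SECTION_END - SRAM_SECTION_START; /* 512KB */

    assert(code_bytes + stack_bytes <= total);
    return total - code_bytes - stack_bytes;
}
```

With 30KB of code and CONFIG_MAIN_STACK_SIZE=130000, this leaves 363568 bytes (about 355KB), which is consistent with the ~350KB of available heap mentioned above, so running out of heap looks unlikely.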

I have tried to investigate the arm fault by accessing the exception stack frame and the dedicated fault registers. It showed me that the fault was mainly categorized as BUS_FAULT (imprecise error). When I inspect the stacked context at the exception, it usually shows symbols related to memory allocation, or possibly reentrancy, like malloc/malloc_r/free/free_r, but I have not been able to reproduce the problem in a separate test case making intensive use of such symbols. I also don’t use threads at all in this context (single main thread), while I have multithreading activated in the regular full FW.

I was not able to get a deeper stack trace to point to the code causing the error, but I assume it’s related to memory allocation of some sort, like reallocating/resizing a matrix or so. I have not been able to reproduce it with specific code smaller than what I showed (1 library call).
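For reference, the kind of decoding described above can be sketched as follows. This is not the code from the post: classify_bus_fault, stacked_pc and the enum are hypothetical names, and the bit positions are the BusFault status bits of the ARMv7-M CFSR register. The basic exception frame is assumed to be r0, r1, r2, r3, r12, lr, pc, xPSR (no FPU context in front of it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* BusFault status bits inside the CFSR register (ARMv7-M). */
#define CFSR_IBUSERR     (1u << 8)
#define CFSR_PRECISERR   (1u << 9)
#define CFSR_IMPRECISERR (1u << 10)
#define CFSR_UNSTKERR    (1u << 11)
#define CFSR_STKERR      (1u << 12)
#define CFSR_LSPERR      (1u << 13)
#define CFSR_BFARVALID   (1u << 15)

typedef enum {
    BUS_FAULT_NONE,
    BUS_FAULT_PRECISE,
    BUS_FAULT_IMPRECISE,
    BUS_FAULT_OTHER
} bus_fault_kind_t;

/* Classify a raw CFSR value; imprecise is reported first if several bits are set. */
bus_fault_kind_t classify_bus_fault(uint32_t cfsr)
{
    if (cfsr & CFSR_IMPRECISERR) {
        return BUS_FAULT_IMPRECISE;
    }
    if (cfsr & CFSR_PRECISERR) {
        return BUS_FAULT_PRECISE;
    }
    if (cfsr & (CFSR_IBUSERR | CFSR_UNSTKERR | CFSR_STKERR | CFSR_LSPERR)) {
        return BUS_FAULT_OTHER;
    }
    return BUS_FAULT_NONE;
}

/* Return the stacked PC from a basic exception frame (index 6 of 8 words). */
uint32_t stacked_pc(const uint32_t *frame)
{
    assert(frame != NULL);
    return frame[6];
}
```

For an imprecise fault the stacked PC only points near the faulting instruction, which matches the observation that the reported symbols (malloc, free) were not necessarily the real culprit.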

Here are a bunch of meaningful additional configuration properties, compiler flags, options that I use (most are auto-generated by platformio/zephyr). I gathered them from a custom script accessing Environment definition (using Scons python module).

I compile FW and libraries with the exact same set of options and flags.

I trigger the library compilation (using CMake) from a custom target for my board and bridge every options to CMake, toolchain included.

-std=gnu99 -std=gnu++11 // Using these to make my libraries compile
-std=c99 -std=c++11

Zephyr prj.conf

CONFIG_MAIN_STACK_SIZE=130000 // -> faulty

Interesting. My initial thought is that the problem may have to do with the smaller amount of heap memory that results from increasing the stack, but if you say you only use ~20kB at peak, that’s unlikely. Alternatively, if there are bugs where out-of-bounds array writes are made, the larger stack may mean these writes now overwrite variables on the stack or global variables (which were shifted due to the change in the memory layout).

An imprecise error can be made precise by disabling write buffering (the DISDEFWBUF bit of the ACTLR register on Cortex-M3/M4 cores): *(volatile uint32_t*)0xE000E008 = (*(volatile uint32_t*)0xE000E008 | 1<<1);.

All in all, it’s extremely hard to say something meaningful about an error when the code is not available for reproduction.

A general tip, though, would be to

  1. Check that all versions are equal. The latest ST STM32 platform uses Zephyr 2.6.0 and arm-none-eabi-gcc 8.2.1.
  2. Check that the used Zephyr board variant is equal. With board = nucleo_h743zi in the platformio.ini that will be nucleo_h743zi.
  3. Compare all compiler flags from the working and non-working example. The project task “Advanced → Verbose Compilation” can be used in PlatformIO to get all compiler invocations. I assume a similar thing exists for the west build tool.

Sure, I also think it’s very unlikely. We haven’t observed any corruption from heap into stack or from stack into heap.

  1. Yes it’s the case. Zephyr 2.6.0 and latest STSTM32. Latest PIO also.

  2. This is interesting, and might be our best guess here. I retried running the same code on my Nucleo-H743ZI and it runs fine in both configurations (stable and faulty).
    No crash observed. It is still crashing on the custom board, of course.
    I have seen no differences in the compiler flags used, nor in the defconfig files. No differences either in the autoconf.h autogenerated by Zephyr.
    Only the main clock frequency seems different, and probably the connected sensor, even if it is not used in this particular example.
    My custom board is running at 400MHz while the Nucleo is running at 96MHz. Same CPU, but two different revisions (not that relevant, I think).

I have tried modifying the DTS files in order to align the two board configurations. But even when raising both to 240MHz, the custom board is still faulty and the Nucleo is OK.

I am wondering if a sensor in its default state (wrongly set or not set) could cause this kind of corruption, because it’s the biggest difference between the two boards. One has a camera sensor connected to it, the Nucleo does not. For your reference, my custom board is an OpenMV Cam H7 board.

  3. I mostly did this between my two projects.
    At first I replicated the exact same flags I was using in my custom Makefile. Afterwards I realised that PlatformIO/Zephyr was generating a bunch more (seen using Import("env") → env["CFLAGS"]) and not always accounting for the ones I set in platformio.ini.
    I finally removed mine and now use only the default generated ones.
    I have not tried using Zephyr directly with west (without PlatformIO); I only checked a couple of sample projects to tackle CMSIS, threading and the USB console. Usually they worked correctly on the Nucleo board. I was missing a board description to try on the custom board, but I could give it a try now.

If you are using a custom Zephyr variant, it is best to make PlatformIO use that too. Carefully read through “Enabling PlatformIO and Zephyr on custom hardware” (PlatformIO Labs) on how to do that. In the most minimalistic way (not creating a custom board definition), you’d need to copy the Zephyr board folder into zephyr/boards/arm/<your zephyr board folder> with a unique name and add board_build.zephyr.variant = <unique zephyr board name> in the platformio.ini.

Thanks @maxgerhardt for your suggestions.
I will try testing these “intermediate steps” (in-between of the two complete boards descriptions that I have), hoping I can narrow down to the problem/conflict/difference(s) between the two boards.
FYI I have followed the tutorial you pointed to create my boards. Maybe I created some conflicts or failed to change

I was not aware of board_build.zephyr.variant. It seems useful (at least for the case I’m facing), but I doubt I can use it further since my custom board does not use exactly the same CPU, nor the same pinmuxing, … (H743ZI for the Nucleo, H743VI for the custom board). I also have some shield overlays. So I’ll probably stick to two different board structures in the future.

Something I tried as well is to flash the firmware.elf generated by the Nucleo-H743ZI environment (working fine on the Nucleo board) to the custom board. It still crashes at runtime in the same way, indicating there is definitely something wrong at the HW/DTS level, possibly at the USB/CDC level. I’ll also check the errata for the two MCUs to maybe spot something.

I found the cause of the problem, and also the workaround for it, in the STM32 errata for the particular revisions. That explains why the problem does not show up on the Nucleo board (H743ZI, rev V), while it is there on the OpenMV board (H743VI, rev Y).

Section 2.2.9 Reading from AXI SRAM may lead to data read corruption

I implemented a quick workaround, exactly as stated in the errata:
*((volatile uint32_t *)0x51008108) = 0x00000001; // Set READ_ISS_OVERRIDE bit in the AXI_TARG7_FN_MOD register

This solves the problem on both boards and also explains why the problem was showing as Bus fault.
I haven’t checked the performance impact yet.
I can also consider moving most of my code/data out of SRAM1 (AXI SRAM) such as HEAP section using a specific ldscript.
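If the workaround should only be applied on the affected silicon, it could be guarded by a revision check. The following is a rough sketch, not the code I actually run: needs_axi_read_workaround and apply_axi_read_workaround are hypothetical names, and it assumes that the revision ID sits in the upper 16 bits of the DBGMCU IDCODE value and that revisions before X/V read below 0x2000 there. Please verify both against the reference manual before relying on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AXI_TARG7_FN_MOD_ADDR 0x51008108u /* register address from the errata workaround */
#define READ_ISS_OVERRIDE     0x00000001u /* bit set by the workaround */

/* Assumption: REV_ID is IDCODE[31:16], and early revisions (e.g. rev Y)
 * read below 0x2000 while later ones (e.g. rev V) read 0x2000 or above. */
bool needs_axi_read_workaround(uint32_t idcode)
{
    return (idcode >> 16) < 0x2000u;
}

/* The register pointer is a parameter so the logic can be exercised off-target.
 * On target it would be (volatile uint32_t *)AXI_TARG7_FN_MOD_ADDR. */
void apply_axi_read_workaround(volatile uint32_t *reg, uint32_t idcode)
{
    assert(reg != NULL);
    if (needs_axi_read_workaround(idcode)) {
        *reg = READ_ISS_OVERRIDE;
    }
}
```

On target this would be called once, early at boot and before the AXI SRAM is used heavily, with the IDCODE value read from the debug MCU block.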

Could you recommend a way to submit this workaround/patch to PlatformIO and/or Zephyr?
I think it could be of interest to patch the STM32 HAL for the particular variant revision.

Thanks for your help though

Primarily in Zephyr please (GitHub - zephyrproject-rtos/zephyr). PlatformIO just takes released Zephyr versions and bundles them with the PlatformIO builder script. Usually no code fixes are made.

Nice find.

This totally makes sense.
I’ll try to post there then.

Thanks again for your help.
I close the issue.