Watchdog trips in async_tcp

Well, it looks like ESPAsyncWebServer is not very asynchronous. I have a project that uses a six-page web server hosted on an ESP32 to set up and control a machine. A top navigation bar lets the user move between pages at will. Each page is served from SPIFFS through an AsyncWebServerRequest handler. Each page contains a mix of numeric and textual inputs, numeric outputs and buttons. The button states are serviced from a websocket NotifyClients function, while the input and output states are sent in a single AsyncWebServerRequest response and received by a window.addEventListener “getValues” function in each page’s JavaScript file.
It all works fine when changing parameters within a particular page, but it has the following fatal flaw, which occurs occasionally after switching between pages and typically after a NotifyClients call:

E (1202639) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (1202639) task_wdt: - async_tcp (CPU 0/1)
E (1202639) task_wdt: Tasks currently running:
E (1202639) task_wdt: CPU 0: IDLE
E (1202639) task_wdt: CPU 1: loopTask
E (1202639) task_wdt: Aborting.
abort() was called at PC 0x400f11a1 on core 0
Backtrace: 0x40083db1:0x3ffbee7c |<-CORRUPTED
ELF file SHA256: 657a305a3c17c4f0
Rebooting…

The code that sends the variables is:

   server.on("/values", HTTP_GET, [](AsyncWebServerRequest *request)
   {
      Serial.println("/values");
      String json = Control.GetCurrentInputValues();
      json.concat(GetCurrentVariableValues());
      json.replace("}{", ",");
//      Serial.println(json);
      request->send(200, "application/json", json);
   });
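
One side note on the handler above: `json.replace("}{", ",")` rewrites every `}{` in the string, including one that happens to sit inside a string value. A minimal sketch of joining only at the boundary between the two objects follows. `mergeJsonObjects` is a hypothetical helper, not something from the project or from ESPAsyncWebServer, and it uses `std::string` so it can be tried on a PC; it assumes both inputs are well-formed JSON objects with no surrounding whitespace and does not deal with duplicate keys.

```cpp
#include <string>

// Hypothetical helper: joins {"a":1} and {"b":2} into {"a":1,"b":2}.
// Assumes both inputs are well-formed JSON objects without surrounding
// whitespace. Duplicate keys are not detected.
std::string mergeJsonObjects(const std::string& first, const std::string& second) {
    // If either input does not look like an object, fall back to the other one.
    if (first.size() < 2 || first.front() != '{' || first.back() != '}') return second;
    if (second.size() < 2 || second.front() != '{' || second.back() != '}') return first;

    // An empty object contributes nothing, so no comma is needed.
    if (first == "{}") return second;
    if (second == "{}") return first;

    std::string merged = first.substr(0, first.size() - 1);  // drop trailing '}'
    merged += ',';
    merged.append(second, 1, std::string::npos);              // skip leading '{'
    return merged;
}
```

This probably has nothing to do with the watchdog trip; it only removes one way for the JSON to be silently corrupted.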

The button states are sent as a JSON string using:

ws.textAll(state);

I had thought of using a mutex to ensure that the transfer of one update was complete before sending the next, but I could not see how to tell when a ws.textAll() or request->send() had completed.

Is there something I am missing here? Is there something I am doing wrong?

Regards, Ron

It looks to me like the problem described in this post on Stack Overflow or this one

Thanks for that suggestion, however I have read that answer and although it had the same problem (a watchdog trip), I don’t think it is the same.

In my case I can watch the data transfers both at the ESP32 and in the client code. All transactions have completed, and it is several seconds after that when the FreeRTOS watchdog trips. What I do not understand is which bit of code is hanging so that esp_task_wdt_reset() is not executed.

Can you share a smallest possible project to reproduce the error?

It’s over 4000 lines of code, but I’ll try and encapsulate just the page transactions. May take a day or two.

I am still creating a small sample, but further testing suggests a possible cause of the FreeRTOS watchdog trip: a new page is selected (and of course starts requesting its data) before the previous page has completed its update. This is quite a plausible user action.
It could be prevented by serialising data to the client, but I cannot find any way to tell whether a previous request->send() or ws.textAll() has completed, which is a necessary requirement for implementing serialisation.
Any suggestions welcome.

Control.GetCurrentInputValues();

How is this implemented? Maybe this function takes too long and triggers the watchdog.
If there is a loop inside this function, calling yield() inside the loop may solve your issue.

I am working my way through the code, deleting bits in order to find the minimum that causes this failure. The GetCurrentInputValues function is called from within a FreeRTOS task waiting on a queue that is normally written once the sample code that interrogates various sensors in the machine has completed. I have turned the sample code off, so presumably that task only runs for as long as it takes to sense that the queue is still empty.
I should mention that all four tasks have priority 0.

??

But the function is called from within the AsyncWebServer callback?!

   server.on("/values", HTTP_GET, [](AsyncWebServerRequest *request)
   {
      Serial.println("/values");
      String json = Control.GetCurrentInputValues();
      json.concat(GetCurrentVariableValues());
      json.replace("}{", ",");
//      Serial.println(json);
      request->send(200, "application/json", json);
   });

Yes. My mistake. I remembered that about 5 mins after posting my last reply.
The GetCurrentInputValues function consecutively reads 25 entries from SPIFFS via ReadFile and, following up on sivar2311’s suggestion, I have interleaved these reads with some vTaskDelay calls. At first this seemed to be the fix. However, I still get the occasional async_tcp trip.
After much more testing I think I have found the situation. The timed task that runs the sample code every 20 seconds eventually sends a queue message to a task waiting on that queue. That task wakes up and executes this code:

  jsonString = GetCurrentVariableValues();
  Serial.print(" Sending to web site: ");
  events.send("ping", NULL, millis());
  events.send(jsonString.c_str(), "new_values", millis());  // send to client
  Serial.println("Sent");

The GetCurrentVariableValues function assembles the JSON string directly from the sensor readings. After the code above runs, another task writes the values to the SD card (but that is currently commented out).
It is important that the user sees these updated readings.
Normally this works fine. The trouble seems to occur if this event fires while the user is stepping to another page, which is also being updated from the aforementioned AsyncWebServer callback. I suppose I could hack around this by not letting the above code run until a second or two after the page update, but this situation cannot be that rare?
By the way, how do you format the code?
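
The "wait a second or two after the update" workaround described above could be sketched roughly as follows. `QuietPeriodGate` is a hypothetical helper, not part of ESPAsyncWebServer, and it only avoids the overlap rather than explaining why async_tcp stalls. It is written in standard C++ so it can be tried on a PC; on the ESP32 the timestamps would come from `millis()`.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: suppresses periodic pushes for a short time after the
// most recent HTTP request, so a push does not overlap a page load.
class QuietPeriodGate {
public:
    explicit QuietPeriodGate(uint32_t quietMs) : quietMs_(quietMs) {}

    // Call from the request handler with the current millis() value.
    void noteRequest(uint32_t nowMs) {
        lastRequestMs_.store(nowMs);
        seenRequest_.store(true);
    }

    // Call from the sampling task before events.send(); returns true when
    // no request has been seen within the quiet period.
    bool mayPush(uint32_t nowMs) const {
        if (!seenRequest_.load()) return true;
        // Unsigned subtraction stays correct across millis() wrap-around.
        return static_cast<uint32_t>(nowMs - lastRequestMs_.load()) >= quietMs_;
    }

private:
    const uint32_t quietMs_;
    std::atomic<uint32_t> lastRequestMs_{0};
    std::atomic<bool> seenRequest_{false};
};
```

The "/values" handler would call `noteRequest(millis())`, and the task that sends "new_values" would check `mayPush(millis())` and retry a little later if it returns false. The length of the quiet period is a guess that would need tuning.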

It sounds to me as if, in the worst case, the file is being read and written by several tasks at the same time!

Without knowing the code in more detail and not knowing the dependencies of the processes, it is very hard to answer your question.

Edit: I would keep the current samples in memory. These can then be read directly by the AsyncWebServer callback and elsewhere. This avoids having to re-read the file.

The samples can then be written to a file from time to time.
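
A minimal sketch of that idea, assuming the latest values are kept as one JSON string: `SampleCache` is a hypothetical class, not taken from the project. It uses `std::mutex`, which I would expect to be available in the Arduino-ESP32 toolchain, but a FreeRTOS semaphore would serve the same purpose.

```cpp
#include <mutex>
#include <string>
#include <utility>

// Hypothetical helper: holds the most recent JSON snapshot in RAM so that
// request handlers never have to touch the filesystem.
class SampleCache {
public:
    // Called by the sampling task after each new set of readings.
    void update(std::string json) {
        std::lock_guard<std::mutex> lock(mutex_);
        latest_ = std::move(json);
        ++version_;
    }

    // Called by the web callback; returns a copy so the lock is held only
    // for the duration of the copy.
    std::string snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return latest_;
    }

    // Lets a background task decide whether the file needs rewriting.
    unsigned version() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return version_;
    }

private:
    mutable std::mutex mutex_;
    std::string latest_ = "{}";
    unsigned version_ = 0;
};
```

The "/values" callback would then send `cache.snapshot()` instead of calling the function that reads 25 entries from the filesystem, and a low-priority task could write the snapshot to the file whenever `version()` has changed.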

Please see How to post logs and code in PlatformIO Community Forum

Thanks, folks, for your suggestions. However, async_tcp still fails.
I have been chopping bits out of the various files to get to a minimum that still fails. The SD card is no longer required, nor are any of the classes that managed the machine and data gathering. It is now a classic Arduino sketch with everything in one file and much less white space. I will include the five web pages and associated JavaScript files as they are, since it is while moving between these pages that the async_tcp module triggers the watchdog.
It can take quite a few minutes of browsing between the pages to get it to fail, but nothing drastic is needed, just cycling between pages at a reasonable speed. It typically, but not always, fails a few seconds after I stop browsing the pages. This is fatal for my project, as the program must run for many hours monitoring the machine and recording its status. Continuity of the time record is important.
The code is down to 800 lines plus the few hundred in the HTML and JS, so it is still a hefty lump of code. It is built in PlatformIO with the Arduino framework. The board is a sparkfun_esp32_iot_redboard, but it also runs happily on a MicroMod.
I greatly appreciate this assistance. What should I do? Zip it up and post it somewhere? Please let me know the next steps.
Ron

Download link to ZIP is fine.
GitHub repo is perfect.

Many thanks for having a look at this. The bones of a PlatformIO folder containing the sample code are in the debug.zip file in the AsyncWebserver_bug repository on my GitHub account “ronkrem”.

Can you share the project itself as a repo, not a zipped project?
My antivirus complains about “basic.js”, maybe a false positive.

Never mind, I was able to download the zip file.

I’m unable to reproduce the error.

The sketch has been running for more than an hour on my ESP32-DevkitC (board=esp32dev) without any problems.

I have switched back and forth between pages, clicked on buttons, closed the browser, opened the pages again, quickly switched between pages…

The only thing I notice is that the pages load very slowly.
I suspect that this could be the reason for your issues.

Unfortunately, the test sketch is not as trivial as I had hoped; it is not yet the absolute minimum example needed to reproduce the error.

No specific version of the espressif32 platform is specified in the platformio.ini file.
Which platform version do you have installed on your system?
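
If no version is pinned, PlatformIO builds with whichever espressif32 release happens to be installed locally, so two machines can build the same project against different framework versions. A pinned entry might look like the fragment below. The version number is only an illustration, not the one used in this project, and the board ID is the one mentioned earlier in the thread.

```ini
; platformio.ini (illustrative fragment, version number is an example only)
[env:sparkfun_esp32_iot_redboard]
platform = espressif32@6.4.0
board = sparkfun_esp32_iot_redboard
framework = arduino
```

Running `pio pkg list` in the project folder should show which platform version is actually in use.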