ESP32, Ethernet.h (W5500): connected() not correct

I got another issue with the Ethernet.h library.

I got two ESP32 DevkitC V4 connected to the same network by two identical W5500 modules. Both are using the Ethernet library gracefully provided by @maxgerhardt.

While working okay most of the time, things get wrong when one side uses stop() to close the socket. After that, the partner’s connected() call still is returning true. I know that by law connected() still will report true. if there is still data waiting. I made sure that this is not the case, but still the call returns true.

If I do a stop() myself, everything is aligned, but the intention was to check if the connection is stiill open to use it again.

I had a look at the EthernetClient::stop() code and saw it first will send a FIN and wait for the connection to be closed on the opposite side. This FIN signal seems to get lost and finally the socket is closed hard.

Is this behaviour known/intended?

This is for TCP. Or are you using UDP?

Can you see with 100% certainty what TCP packets are being sent with the FIN flag, e.g. with a wireshark / tcpdump on the router?

In the function

what value does s have? The W5500 chip itself keeps track of connection states and the library just queries it. If it has received the FIN, it must have SnSR::CLOSED here as status.

Sorry, this is going to be lengthy now :man_shrugging:

Tried first to print out the socketStatus in EthernetClient::connected(). I had a lot of application stuff in the Serial output - I edited the log to only show the relevant. Lines starting >>> are remarks I added:

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:8896
load:0x40080400,len:5828
entry 0x400806ac
__ OK __
Resetting Wiz W5500 Ethernet Board...  Done.
My IP address: 192.168.178.73

>>> Boot sequence finished. I pressed a button to fire off the first two requests. Note. No connection established yet
TCP request bridge, 20, 0x03, 0/10
TCP request bridge, 4, 0x03, 1/10 

Target: 192.168.178.77/6502       
Connect returns 1

>>> Connection was established, the following line stems from EthernetClient::connected() - ESTABLISHED
socketStatus=23

>>> Here followed a print-out of the returned data

Target: 192.168.178.77/6502

>>> Next request, same target. Note the missing "Connect returns...", since the app knows the previous had the same target
>>> connected() is called twice here to check the result of a potential re-connect: ESTABLISHED still
socketStatus=23
socketStatus=23

>>> Again data output removed by me

>> Next two requests sent - #3/#4:
TCP request bridge, 20, 0x03, 0/10
TCP request bridge, 4, 0x03, 1/10

>>> Target is still the same
Target: 192.168.178.77/6502

>>> A different socketStatus now: CLOSE_WAIT
socketStatus=28

>>> As reaction, another connection is opened, resulting in socketStatus==ESTABLISHED 
Connect returns 1
socketStatus=23

>>> Data output removed

>>> Request #4, connection is found open (ESTABLISHED)
Target: 192.168.178.77/6502
socketStatus=23
socketStatus=23

>>Data output removed

>>> Two more requests - #5 and #6
TCP request bridge, 20, 0x03, 0/10
TCP request bridge, 4, 0x03, 1/10

>>> Request #5. Target remains the same, but socketStatus is CLOSE_WAIT again, so I open another connection
Target: 192.168.178.77/6502
socketStatus=28
Connect returns 1
socketStatus=23

>>> Data output removed

>>> Request #6, socketStatus=ESTABLISHED
Target: 192.168.178.77/6502
socketStatus=23
socketStatus=23

>>> Data output removed

>>> Next request pair #7/#8:
TCP request bridge, 20, 0x03, 0/10
TCP request bridge, 4, 0x03, 1/10

>>> Request #7: socketStatus CLOSE_WAIT
Target: 192.168.178.77/6502
socketStatus=28

>>> Re-connect is failing, as all following are.
Connect returns 0
   Error: IP connection failed Token: 7
Target: 192.168.178.77/6502
Connect returns 0
   Error: IP connection failed Token: 8
TCP request bridge, 20, 0x03, 0/10
TCP request bridge, 4, 0x03, 1/10
Target: 192.168.178.77/6502
Connect returns 0
   Error: IP connection failed Token: 9
Target: 192.168.178.77/6502
Connect returns 0
   Error: IP connection failed Token: A

The addressed target ESP32 has two connections blocked for internal reasons, so obviously the communication comes to an end as soon as the remaining 6 sockets of the W5500 are consumed.

Next I tried it with a WiFiClient instead of the EthernetClient. That works on and on and on.

So it really seems to be caused by the EthernetClient not reacting correctly on remote socket closes.

I was not able to do a tcpdump unfortunately. I only have a locked FritzBox owned by my Internet provider… :sob:

Since the issue still lasts, I am digging the Internet for similar cases and found this: url(What does client.stop() do and how does if(client) work from the server side? - Programming Questions - Arduino Forum)[]. While it seems to match my issue, the explanation is somewhat less satisfying. This should not be a matter of user code but handled in the lib IMHO.

You made that sure by explicitly emptying the RX buffer with a while(client.available()) client.read(); loop?

Have you tried the recommendation in the thread about not doing a pure .stop() but sending something before the stop?

    client.print('A');
    client.stop();

Also calling .flush() might be a good idea.

  • Yes, I cleared the RX buffer exactly like you suggested to no avail.
  • Yes again, it indeed does work with the redundant write - but see my remarks below.
  • Yes for a third time, that is why I have a client.flush() in the code :wink:

Although the client.write('A'); client.stop(); pattern seems to work*(see foot note added) for now, it is a clumsy workaround for a problem lying below. Why does the stop require a previous write? There is no logic behind that, but it rather looks like the stop action is not correct nor complete.

It may be this is already buggy in the W5500 firmware, but why should the user of Ethernet.h care at all?

The library code in EthernetClient.cpp is not that complicated (you will know of course):

void EthernetClient::stop()
{
	if (sockindex >= MAX_SOCK_NUM) return;

	// attempt to close the connection gracefully (send a FIN to other side)
	Ethernet.socketDisconnect(sockindex);
	unsigned long start = millis();

	// wait up to a second for the connection to close
	do {
		if (Ethernet.socketStatus(sockindex) == SnSR::CLOSED) {
			sockindex = MAX_SOCK_NUM;
			return; // exit the loop
		}
		delay(1);
	} while (millis() - start < _timeout);

	// if it hasn't closed, close it forcefully
	Ethernet.socketClose(sockindex);
	sockindex = MAX_SOCK_NUM;
}

plus in socket.cpp:

// Gracefully disconnect a TCP connection.
//
void EthernetClass::socketDisconnect(uint8_t s)
{
	SPI.beginTransaction(SPI_ETHERNET_SETTINGS);
	W5100.execCmdSn(s, Sock_DISCON);
	SPI.endTransaction();
}

One thing I noticed is the “wait up to a second for the connection to close” bit - it may occur in my code to try again connecting in shorter intervals. What will happen then? It does work, though, with the client.write('A'); in.

On WizNET I found a post from WizNET’s wizard Eugeny stating:

but you must ensure that after server completes communicating from its side it performs DISCON and then (without waiting) performs CLOSE, and then OPEN, and LISTEN. Faster you do this process, less time you will have with socket being offline.

This is different from the implementation in EthernetClient.cpp as it suggests to not wait between Sock_DISCON and Sock_CLOSE.

I am lacking the skills to decide if that really applies here.

  • Addendum: the joy was short… Over time, no connection can be made anymore. Even restarting the client does not change this. Connections from a Linux box to the same server do work without problems.

Argh. I am really getting p*ssed. The additional ‘A’ before the stop() is caught with the next connect() sometimes, crashing my data protocol. So this is definitely no solution for me.

It looks like the sockets are closed unreliably, sometimes getting re-used although being told to close and finally cleaned up by some janitor routine after a long time. I can observe sockets seemingly being available again after 2 minutes or so, but not constantly. I had to run down three stairs several times to reset the server :angry:.

In parallel, the Linux box can open connections as it likes - no hangup, no stuck connections.

I’ve scanned through the original library repo and there is an issue which may match yours: Reuse of EthernetClient instances don't work when check connected() before stop() · Issue #7 · arduino-libraries/Ethernet · GitHub

There’s also a possible fix description inside.

I think it would also be a good idea to open an issue there in reference to this thread, the arduino forum thread and the other issue if it matches.

Thanks for the pointer. The issue described there does not really match mine, as I would like to keep the connection open if it is established. Pseudo-code-wise it should be like

get target host/port
if same as before AND connected
  use connection
else
  connect to it
  use connection just made
endif

The server I am connecting to is closing the connection after all data has been sent, so I would expect to always get into the else path. But I will land in the if path mostly, as connected() returns true despite the server has closed the connection.

With the odd client.write('A') before client.stop() I indeed get into the else path, but with disastrous effects to my protocol. Second, this seems to be wasting sockets as well (albeit not 1:1), so I get stuck with no more sockets available on the server at some time anyway.

I will open an issue for arduino-libraries as you are suggesting.

I tried to understand in detail what will happen when the server calls stop().

First a DISCON command is issued on the W5500. That is supposed to send a FIN to the client and shall wait for the answering FIN in return to then do a CLOSE.

This call to DISCON should be blocking, as execCmdSn() will loop until the command register is cleared. The block in fact seems to be for the send of the FIN only, as we else could expect the socket status to be CLOSED in all cases - which is not.

If we assume that the client W5500 does receive the FIN, some magic in the W5500 needs to work to answer it with the responding FIN. I did not find any documentation on this,neither in the Ethernet code nor in the Wiznet data sheet.

Upon receipt of the answering FIN the server shall set the socket to CLOSED, while the client supposedly will do so after sending the answering FIN.

In reality l am getting ESTABLISHED and CLOSE_WAIT only as status values in connected(), so it looks like something is not happening that is supposed to happen.

I see two potential causes for this:

  • the FIN is never sent by the server or
  • the FIN is not recognized or not properly handled on the client side.

Is the sending of the FIN subject to timeout and retry on the W5500? If so, that could explain the long time the socket gets blocked on the server, until it finally is closed hard.

Closing a TCP socket is definitely a 3-step protocol per Transmission Control Protocol - Wikipedia with some quirks (timeout, half-open connection). The entire TCP stack should be handled according to its specification in hardware by the W5500 chip. Though there might still be cases where its influenced by the software and the termination doesn’t go through completely?

You might want to use a spare router or make your microcontroller talk to your PC as the server (some simple python script for acting as a server, or a tool like hercules in combination with Wireshark) to observe the TCP flow. Though this is getting quite deep into debugging low-level TCP stuff now :sweat_smile:.

1 Like

That is what I did :grin:
I got a EdgeRouterX that is able to do tcpdump and hooked it between the server and the client. I collected a sequence of 3 requests and responses - working okay, since all devices were restarted.
The first is directed to another Server in the same net, requests #2 and #3 are sent to the W5500 server.

Wireshark could read the pcap file and produced this output:


(The pcap file is not uploadable here)

I can see pairs of FIN packets being sent back and forth, with the exception of the last, which has only the one sent from the server.

The data parts are matching those I sent and received, so I can confirm these are the three requests I sent.

I am lacking knowledge of all the other info shown, though…

Any suggestions of what to capture else?

I got a second pcap file now that contains the complete sequence of initial successful requests, more and more lines with a “RST” in it (marked red by Wireshark) and finally the grinding to a complete stop where only TCP retransmissions are to be seen any more.

The file is 46 kB, but unfortunately not to be linked here. So I put it on Google Drive in case you want to have a look at it: Google Drive: bridge_2.pcap

I suppose there are some timing issues visible, where a previous connection is not yet finished, but the partner is already starting a new or something similar.

I think I am onto something now.

I had a short wait period after the last byte read from the socket to make sure I will not miss anything trailing. This was measured in micros() and set to 20000 initially, aka 20 milliseconds. I now tried reducing that more and more and the hangups got less. I then figured I will be reading off the RX buffer of the W5500 anyway, so I will not have to take care for stray bytes, as the W5500 will guarantee the complete packet was assembled there.

This is a leap ahead; what I now have left is a repeating pattern of the server losing track of the FIN:

As you see here, the connection to remote socket 50433 is closed by the server (FIN in line 207) and is ACK’d by the client in line 209.
The client then immediately starts the next connection 50434 in line 210, but the server was missing something and sends a RST in line 211 for the past connection 50433.

Do you have an idea what that would be?

But at least I got no complete hangups any more (and will meekly close the issue at the Ethernet library).

The way I read the packet flow is that packet nr. 207 is the FIN, ACK from the server to the client to close the connection between port 502 (server side) and 50433 (client side), this gets ACKed by 208. Then approx 1 second later (209) the client suddenly wants to close the connection on port 50433 again, and after that opens a new connection request from port 50434. The server reacts on the close-request on 50433, which was already closed by the server beforehand, with a RST, ACK.

As the wiki says (previous link)

If the host actively closes a connection, while still having unread incoming data available, the host sends the signal RST (losing any received data) instead of FIN.

So this might be what’s going on.

It seems to be the client code calls .stop() / disconnect on the old socket right before open a new socket, and the server didn’t like getting a FIN,ACK on a socket it has already closed server side.

Though in theory it shouldn’t be problem, since TCP allows these “half-open” connections (closed from server->client but not other way round). Hm.

Ah, so that also means, if the server has data for the client, but already sent the FIN,ACK to it, then the client will receive the RST, ACK upon its closing request because there might be indeed still data waiting. If that is the dummy data there’s no problem after all.

Or it a problem which only occurs when talking between two W5500 chips with their specific TCP implementations. Like, a dilema in which the W5500 needs to send the FIN,ACK to close the socket on its side, but the W5500 in the server role sends a RST, ACK when the clients wants to close a connection that was already closed server-side. Testing the firmware and module against a PC-side implemented test server might be worth looking into.

The Wiznet docs are stating that a DISCON from the server is starting the closing dance by sending a FIN to the client. The client has to react with another FIN, after which the connection is considered closed.

And there is in fact no stop() at all in the client code that could trigger the spurious FIN.

I still believe the two FINs are to be seen as the closing process going on.

I now dropped the wait after the read loop in the client completely and added another before the next connection, if the server remains the same. By trial and error I found that a 100ms gap between requests/connections is required to keep everything in order. No more hangups, lost or corrupted packets between these two W5500 any more.

I will have an eagle’s eye on that for quite a while, but I think the knot is broken now.

Thank you (once again…) for your helpful tips!