Insight into Handshake Timeout

Blockchain is built on a peer-to-peer architecture, where the peers, or nodes, communicate regularly with one another to exchange information about the latest network updates. The communication begins by exchanging signals, a process commonly referred to as “a handshake”. A “handshake timeout” defines a time period in which a node should establish a connection with another node. A handshake timeout issue will occur when the establishment fails within this period.

The handshake timeout issue, which I’ll discuss in this post, theoretically leads to nodes becoming isolated from one another, eventually resulting in the collapse of the entire peer-to-peer network.

I’ll begin with three timeout observations in real-world scenarios. Next, I’ll explain my speculations, show you how to narrow down the scope, and finally find the solution. This post should be of interest to blockchain engineers, especially those building infrastructures. If you struggle with any TCP/IP terminology while reading, this glossary will help.

Handshake Timeout

In Tentacle: the Network Layer of CKB, I explained Tentacle, the network layer of Nevos Network. It’s a layered design that contains the encryption layer secio, the multiplexing layer yamux, and the protocol abstraction atop.

The handshake timeout takes place on the secio layer, and I noticed a service error monitored in one test server. The following service error was reported repeatedly in the log:

service error: HandshakeError(Timeout("deadline has elapsed"))

Upon checking several older versions, nothing similar has occurred. The error originated on the secio layer due to a handshake timeout. This error occurs when the timer expires after the default timeout of 10s, implying that this handshake exceeded 10s, which is unusual.

To identify the cause of the timeout error, these thoughts came to me:

Is the timeout error the result of the delayed transmission due to the multiple layers? As our service is deployed on Kubernetes, and more layers cause more consumption.
Is it related to certain policies of the cloud service provider?
Is packet sniffing on the public network not supported by the protocol?
Are there any bugs in the libraries?

Troubleshooting

I found in the debug logs that the IP in the handshake timeout is the one that the client intended to connect to. And Netcat (a networking utility for reading from and writing to network connections using TCP or UDP) indicated that the network was open. Secio is an encryption protocol based on a Matryoshka-like structure of TCP, so the handshake is a regular ECDH procedure with a message size of only 200-400 bytes in transmission. Given such an environment, the likelihood of a timeout is theoretically negligible.

I also examined the trace logs to obtain more handshake details and I noticed that on both the client and server side, the ECDH process stopped after sending the first proposal until the timeout occurred. This suggests that the error report makes sense.

But why does it occur?

The code is identical on both ends, so there is no incompatibility across multiple implementations. At this time, without being able to reproduce the error or collecting more information, I had to suspend and wait.

Observation 1: Multiple Acknowledgments (ACKs) with Sequence Number = 1

About half a month later, connection timeout was reported again and frequently occurred just like before. This time I used tcpdump, a command-line packet analyzer tool, to make sure if anything was successfully sent from the server.

Here you can see the most relevant data from the server, showing the process of a connection from establishment to disruption. I removed the IPs and the rest is all here:

1   TCP 80  36766 → 50020 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2685571870 TSecr=0 WS=512
2   TCP 80  50020 → 36766 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=2685571857 TSecr=2685571870 WS=512
3   TCP 72  36766 → 50020 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=2685571872 TSecr=2685571857
4   TCP 249 36766 → 50020 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=177 TSval=2685571873 TSecr=2685571857
5   TCP 249 [TCP Retransmission] 36766 → 50020 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=177 TSval=2685572077 TSecr=2685571857
6   TCP 249 [TCP Retransmission] 36766 → 50020 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=177 TSval=2685572283 TSecr=2685571857
7   TCP 249 [TCP Retransmission] 36766 → 50020 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=177 TSval=2685572696 TSecr=2685571857
8   TCP 80  [TCP Retransmission] 50020 → 36766 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=2685572860 TSecr=2685572696 WS=512
9   TCP 72  [TCP Dup ACK 3#1] 36766 → 50020 [ACK] Seq=178 Ack=1 Win=29696 Len=0 TSval=2685572874 TSecr=2685571857
10  TCP 249 [TCP Retransmission] 36766 → 50020 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=177 TSval=2685573522 TSecr=2685571857
11  TCP 80  [TCP Retransmission] 50020 → 36766 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=2685575060 TSecr=2685573522 WS=512
12  TCP 72  [TCP Dup ACK 3#2] 36766 → 50020 [ACK] Seq=178 Ack=1 Win=29696 Len=0 TSval=2685575074 TSecr=2685571857
13  TCP 249 [TCP Retransmission] 36766 → 50020 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=177 TSval=2685575176 TSecr=2685571857
14  TCP 249 [TCP Retransmission] 36766 → 50020 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=177 TSval=2685578480 TSecr=2685571857
15  TCP 72  36766 → 50020 [FIN, ACK] Seq=178 Ack=1 Win=29696 Len=0 TSval=2685581873 TSecr=2685571857
16  TCP 60  50020 → 36766 [RST] Seq=1 Win=0 Len=0car

You can find TCP handshake process and transition diagrams here.

So where does the problem lie?

Attempt to explain the ACKs with Seq = 1

Let’s look carefully at Line 8. It is a [SYN, ACK] (SYN refers to “synchronize”) retry process with the sequence number (Seq) of zero.

Line 8 looks rather questionable. As we observed from Line 4 to Line 7, the client has retransmitted 249 bytes of handshake data multiple times. So, why is the server still trying to respond to SYN, as shown in Line 8?

Speculation 1: Server failed to respond

This doesn’t make sense. If the server failed to respond to the [SYN ACK], the client won't consider that the TCP connection had been successfully established.

Speculation 2: Server self-identified failure to respond

Server identified that it had failed to send [SYN, ACK], so it started retrying repeatedly. Meanwhile, the client has received [SYN, ACK], established the connection, and started the handshake. So, how does the TCP protocol know that the server knows that the client has received the [SYN, ACK]? The answer is that the client will reply to that ACK with another ACK. This following ACK is what you see in Line 3, Line 9, and Line 12. This means that the client responds to the ACK’s arrival at the server, but then drops it, therefore the construction of the TCP Stream is not complete!

The analysis above suggests that the timeout problem is fundamentally irrelevant to the encryption layer secio, but rather to the underlying TCP. On the server’s end, if the ACK of [SYN, ACK] fails to be received, the TCP handshake cannot be completed, meaning the handshake remains half-opened. So, for users, it is inoperable. In other words, it is unwritable. The handshake stays in that half-opened queue in the operating system (OS), waiting for further responses.

Now you may ask, what makes the OS drop ACKs after receiving them?

I have two speculations in mind: OS errors and TCP-related parameter setting problems.

Observation 2: TCP Listen Queue Overflowed

I used nstat -s | rg overflowed to examine on the server and spotted a leak:

$ nstat -s | rg overflowed
10138 times the listen queue of a socket overflowed

TCP Listen has a backlog parameter for setting the queue size, to receive stream (large data blocks) in the time window from the half-opened state until the handshake is completed. Normally, the backlog parameter is fixed at 1024. Such a leak is theoretically impossible for most computers today, as long as they are running on correctly-written programs. This is why I found it quite strange but had no idea how to proceed.

Observation 3: TCP Sockets in CLOSE_WAIT State

A further report came online later, stating that a large number of sockets on some machines were hanging in the CLOSE_WAIT state and could not be closed automatically. Meanwhile, connection with these services could not be established and the backlogs overflowed.

This third observation differs from the first two in that, the two previous times we were unable to identify the causes and we had to use rebooting as an expedient solution. But this time we have a more practical solution. Hope this time we can locate the root of the problem.

What is CLOSE_WAIT then?

Some of you may think of FIN_WAIT (where FIN refers to “finish”), another state that may occur during the TCP connection. Be sure not to confuse CLOSE_WAIT with FIN_WAIT_1 or FIN_WAIT_2. They are three completely different states. Let's me clarify:

CLOSE_WAIT is a state that the receiver/server enters after receiving the first FIN. If things are working well, the receiver will respond with ACK + FIN and transfer to LAST_ACK, then properly close.
FIN_WAIT_1 is a state that the sender/client enters after sending FIN and waiting for the ACK.
FIN_WAIT_2 is a state that the sender/client enters after sending FIN and having received ACK, waiting for the receiver/server to respond with FIN.

So what causes a large number of connections stuck in CLOSE_WAIT and cannot be properly closed? The receiver normally closes the process as soon as FIN arrives. The entire TCP protocol is implemented in the kernel and usually not affected by the upper layer applications, unless an abnormality gets in the way and stops the process on the OS.

My goal was to identify such anomalies. Without access to the environment, I could still use screenshots (as the packets no longer worked) to get some additional information from the OS.

Locate the Root of Backlog Overflow

Let’s revisit the second observation, the backlog overflow. This is key. Backlog is the kernel's accept queue for storing the newly generated TCP stream which has completed the handshake and is waiting to be accepted by the user. We assume that the backlog overflow was probably the result of the user’s failure to accept in time.

Possible causes for such failures are as follows:

The application is inside the tokio asynchronous runtime, where a large amount of code may be executing the blocking operation, thus occupying Worker. Consequently, scheduling fails and no longer accepts, causing the backlog to overflow.
The port may be disrupted by the DoS (Denial of Service) due to tasks overload, causing the backlog to accumulate much faster than the acceptance rate.
Possible vulnerability that prevents the events added to wake up accept, leading to the backlog cannot be cleaned up.

I cannot accurately analyze the first two guesses because of lacking access to the production environment. But for the last one, at least one thing I can be sure of: there were bugs like this in previous versions, which had been fixed several months ago. Likewise, could it be that the libraries of this environment are outdated?

After the recheck, as I predicted, the library is outdated. For now, the tentative conclusion is that this bug caused the TCP connection timeout. However, what is the relationship between the CLOSE_WAIT state and the client not accepting Stream in the backlog?

As I noted before, CLOSE_WAIT is the state the receiver/server enters after receiving FIN. If a handshake is successfully established but the stream still remains in the receiver's (server's) backlog, the requestor (client) will forcibly disconnect due to no response from the receiver until timeout, followed by sending FIN. Once the stream in the backlog receives the FIN, it enters the CLOSE_WAIT state and will remain there until a user tries to accept. Then, the stream will be removed from the queue, followed by a “broken pipe” error. At this point, the file descriptors, the identifier for the pipe, get discarded; If no one takes out the stream, the file descriptors would never be deleted. That is what we saw in the first phenomenon where a large number of TCP connections were hanging up in the CLOSE_WAIT state.

Simply put, because the server no longer accepts TCP stream from the backlog, TCP connections in the backlog enter into the CLOSE_WAIT status and the backlog is overflown.

It is not coincident that similar problems have been recorded in other blockchain projects, such as Cardano’s socket leak.

Prerequisite: Understand `waker` Mechanism in Rust’s `Future` Trait

To grasp the rational behind the solutions (which will be shown in the next part), first we need to comprehend how Rust’s Future trait works. The Future trait is the main subject of Rust‘s asynchronous programming. A Future is an asynchronous computation that can produce a value.

I like to use the Notice of Readiness Model (NOR Model) to better comprehend Rust's Future mechanism. The NOR Model originates from logistics terminology, and refers to the message from the ship captain informing the terminal operator that the vessel is ready.

In the context of Rust programming, for any event, once ready, a waker (the handle in the context to wake up tasks) is used to inform the executor to accept the corresponding Future and execute the poll method. The poll attempts to resolve the Future into a final value. If Future returns ready, it suggests that the task is complete and can be cleaned up; If Future returns pending, this indicates that the task is not yet complete and needs to wait for the waker to call the poll method again.

Looks simple, right? But when it comes to the implementation of Future, we need to consider at least these following scenarios:

The Future depends on the underlying i/o events, such as network related i/o events.
The Future needs self-notification, independent of the underlying i/o, such as channels (the media that allows communication to take place between two or more threads).
The Future has its own intermediate state, requiring notifications of the underlying i/o events and of the upper-level read/write events when its state changes.

It is also important to remember that waker is implemented by the specific runtime itself. In a multi-threaded runtime implementation, each poll of the same Future may result in a different waker due to the separation of reads and writes of the same Stream by the upper layer. In other words, the intermediate layer should at least consider separating reads from writes, i.e., storing the waker of reads and writes separately. Why? To eliminate the possibility of not informing the upper layers in case the waker is lost.

Here are some examples to help you understand. A common Stream which has implemented AsyncRead and AsyncWrite, will automatically implement a split method. When splitting into two structures, Readable and Writeable require different wake objects. Normally, both waker are reserved for later use in respective scenarios. For instance, the sub-stream of yamux keeps a separate writable waker to notify the upper layer that the object becomes writable. This waker will be registered only if the write window is zero, at the same time as the returned pending. Another example is the waker proxy mechanism in tokio-tungstenite. Please feel free to read if you’d like to dive deeper. But to keep this post at an ideal length, I‘d leave the details out for now.

It's not just about the read/write separation. Depending on the implementation, even the same operation can produce different waker, including but not limited to context switches caused by thread switching. At this point, the intermediate layer needs to consider that the corresponding waker should be saved for the next wake-up only when pending occurs, meaning that the underlying waker is determined by the context brought by each poll method. In most cases, the underlying layer will not keep the waker which entered for the first time permanently in use. The assumption that the waker will be permanently reserved is completely wrong.

In particular, in the multi-producer, single-consumer (mpsc) channel implementation, when the receiver is blocked resulting in sending failure, the implementer must ensure that the receiver can successively wake up the multiple sender in pending state, once the queue resumes consumption. At this point, the waker in the sender is registered in a linked list based on the waker in the context. Following the "first-in-first-out" rule, the wakeup will be executed using the waker of its sender.

Regarding fairness, it is also relevant to the implementation of asynchronous locks. Tokio's lock is based on asynchronous semaphores. The queuing mechanism in asynchronous locks is similar to that of the mpsc channel. In this case, the requested waker gets stored in a linked list and will be woken up sequentially.

How to Fix

First, let’s take a look at this pull request, regardless of its title and main idea which is apparently irrelevant to the problem we’re discussing. Besides fixing the problem in its issue, the pull request also fixed the problem in which the accept method could not be called due to an anomaly. This suggests that there are some hidden details in this pull request.

The main idea of this fix is that the call to peer_addr that generated an error and returned pending was changed to a loop. This loop ignored the error and returned to the pending state until the underlying layer returned pending.

The reason for this fix has been explained in the last section. Here I‘d like to briefly rephrase as follows:

The upper layer is allowed to return pending only if the underlying layer returns pending.
The upper layer does not necessarily return pending when the underlying layer does.
It’s highly likely that the underlying waker was saved by the poll when the pending occurred (not excluding the possibility that every poll gets registered).

Before the problem was fixed, the Future could not be informed when the underlying event occurred because the pending was returned when it was not supposed to. This led to . Put differently, the Future of accept could never be woken up in such a state, therefore the backlog overflow occurred.

How to Reproduce `peer_addr` Call Error

We have learned that calling peer_addr error can reproduce the problem, but how can we reproduce peer_addr call error?

The mechanism of peer_addr is the same as getpeername, which returns the remote address. In a normal state, no error would be reported, except in the case where the connection received from accept is already interrupted. To reproduce this problem (can be either peer-addr error or timeout), we can follow the steps below:

Client initiates a SYN request and allows Tentacle to temporarily not to accept the Future due to busy network or other reasons.
Once the TCP handshake is complete, the client sends the FIN.
After receiving the FIN, Tentacle will retrieve a Stream with CLOSE_WAIT state from the backlog.
Tentacle returns pending, ignoring any subsequent client access requests.
Multiple clients send SYN requests, wait for a timeout or send FIN to disconnect. This corresponds to what we saw in the first observation, where multiple ACKs occurred.

Now you can see that Stream is stuck in CLOSE_WAIT and not able to be further processed. Sockets hanging in the CLOSE_WAIT state (in the third observation) are thus reproduced. As for the listen queue overflow in the second observation, just repeat these steps until the leak is reported.

The procedure above can be tested locally, as long as we are aware of the problem (be it peer-addr error or handshake timeout), local latency is negligible; however, in a real-world network environment, a single jitter may cause latency. But this is another problem beyond the scope of this post.

Closing Thoughts

Tracking and fixing this handshake timeout issue took months. The problem could not be consistently reproduced until adequate observations were made. The final solution was not the result of proper locating of the problem, but was coincidentally found in the process of fixing other problems.

All in all, I’d like to add the final note that the key to locating root causes is to narrow down the scope, make bold guesses, and reproduce the problem. It's all about being bold when conjecturing while being cautious when verifying.

✍🏻 Witten by: Chao Luo

By the same author:

Tentacle: the Network Layer of CKB

You might also be interested in: