Link Set

Repository RustNetworkingState Machine

During the development of the Spider project, there was an issue with how connections were managed. In that project, two nodes are capable of connecting to each other using different connection providers. For example, they could connect through plain TCP, as well as through the Routing Chord. Using multiple channels like this requires extra management to ensure that the messages’ order is maintained, and the best channel available is used at any given moment. Also, multiple channels muddy the concept of what it means to be connected or disconnected.

Link Set is a crate that solves these problems. It manages a list of connections (links), that are assumed to be between it and a single other process running a corresponding Link Set. When sending a message, the Link Set chooses one of its links and sends the message across it. It tracks the sequence number of the message and will resend the message if it detects it is not received. Normally TCP will take care of message order and failed delivery, but the combined connection loses this property even if using two links built on top of TCP. A message could be sent down one TCP Link, which then immediately closes, dropping the message. The other TCP Link has no knowledge of this message and therefore cannot retry it.

The Link Set can be in one of six different states. This keeps the behavior for each mode separate, and easy to write. The general lifecycle is to start at Disconnected. Then attempt to connect by moving to Connecting. Once connected, the Link Set and its peer negotiate an epoch in the EpochMismatch state. On successful negotiation the Link Set moves to Connected, where it exchanges messages. If all Links are lost the Link Set moves to either Reconnecting to actively restore the link, or GracePeriod to passively allow its peer to reconnect. Finally, if all else fails and auto connect is disabled, the Link Set moves back to Disconnected.

The first state is Disconnected. This is the starting state, and the simplest state. It waits until it is told to connect, then transitions to Connecting. If a new Link is received, it transitions to EpochMismatch. Any other state can transition to Disconnected if the Link Set is given the command to disconnect.

Next is Connecting. The Link Set contains a list of connectors, and potential peer addresses. When it attempts to establish a connection, it feeds each address to each connector to see if the connector yields a Link. This state continues until it times out, upon which it transitions back to Disconnected. If this state receives a new Link, either from the connector or by happenstance, it transitions to EpochMismatch.

The EpochMismatch state is one of the most complex states. It is also the first state to have an active Link to its peer. The Link Set considers itself to be connected for any duration where it has at least one Link. It considers itself disconnected only when it has lost all Links. The connected duration that spans multiple links is an ‘epoch’.

However, this presents an issue where if the peer dies and is able to establish a new Link before the old Link times out, one side would believe its in the middle of an active session and the other doesn’t know whats going on. Therefore each epoch is numbered, and each message carries this number. In this state, each side presents what it thinks the next epoch should be to its peer. The larger of the two numbers is chosen. When both peers present equal epochs to each other, they are in agreement and transition to Connected. The Link Set emits a connected message to its caller to inform it that it can send messages.

If instead the link dies during this negotiation, the Link Set transitions to Disconnected, or if auto connect is enabled to Connecting.

Once in the Connected state the Link Set and its peer exchange messages with each other. Here is where the resend and message ordering logic is applied. In addition to an epoch number, each message also has a sequence number. Messages are kept until a sequence number ahead of them is acknowledged as received. Every so often, the first message is resent to ensure delivery. Incoming messages are also stored, and only emitted to the application once all previous sequence numbers have been emitted.

The Link Set maintains multiple Links, and every so often does a partial sort to push lower latency Links towards the beginning, and higher latency ones towards the end. When sending a message, it chooses a Link from the beginning of the list and sends the message through it. This way when a Link is dropped, it can automatically choose the next best one from the list.

If a message arrives with an unexpected epoch the Link Set concludes that its peer must have reset, and it transitions back to EpochMismatch to renegotiate the epoch. Since this means that the epoch has ended, The Link Set emits a disconnect message locally so its caller is informed.

If all Links are lost the Link Set transitions to either Disconnected, Reconnecting or, GracePeriod depending on its configuration settings.

In the Reconnecting state, the current epoch from the last Connected state is maintained. This is to allow brief transient network disruptions to recover without having to bother the application. The Link Set attempts to create a new Link by using the same list of connectors and peer addresses it used during the Connecting phase. If a new Link is established before Reconnecting times out, the Link Set transitions directly back to Connecting without renegotiating the epoch.

If instead the time limit elapses without establishing a new Link, the Link Set transitions to Disconnected or if auto reconnect is enabled, back to Connecting. The distinction between Reconnecting and Connecting is whether the epoch is maintained. Leaving this state means the epoch reset, and therefore a disconnected message is returned to the caller.

The final state GracePeriod exists as a counterpart to Reconnecting. If one side entered Reconnecting but the other side didn’t, the epoch would be reset by the other side. However, sometimes one side might not have the ability to reestablish a Link. The GracePeriod allows the epoch to last a little longer without any attempt to reconnect. If there is an incoming connection within its timeframe it is able to transition back to Connected, otherwise it follows the same behavior as when Reconnecting’s timer expires.

Future work

One of the goals at the start of this project was to avoid reimplementing too much of the TCP protocol. TCP is already written and tested by decades of use. However, as mentioned before, the existence of multiple channels hamstrings the TCP protocol’s efforts to provide message ordering and resending. Therefore some of the future work on this project is likely to look more and more like a multichannel version of TCP.

Another benefit to implementing more of TCP’s internals is greater control over the aggregate behavior of the Link Set. For example, TCP measures lost packets to determine the transmission rate and how much backpressure to apply. This would be directly useful to how the Link Set implementation chooses which Link to transmit across. If there are multiple Links, some of which are implemented through different overlay networks, that are all operating across the same physical network link, no performance can be gained by splitting the load across them. However, if there are multiple Links that utilize different physical links, some performance may be gained. Taking inspiration from TCP’s implementation could be a good guide for the design here.

As more of TCP’s features are added, there will be more and more benefit to switching the default Link implementation from TCP to UDP to avoid a TCP over TCP scenario. A switch to UDP also has the benefit of making NAT hole punching easier.