During the development of the Spider project, there was an issue with how connections were managed. In that project, two nodes are capable of connecting to each other using different connection providers. For example, they could connect through plain TCP, as well as through the Routing Chord. Using multiple channels like this requires extra management to ensure that the messages’ order is maintained, and the best channel available is used at any given moment. Also, multiple channels muddy the concept of what it means to be connected or disconnected.
Link Set is a crate that solves these problems. It manages a list of connections (links), that are assumed to be between it and a single other process running a corresponding Link Set. When sending a message, the Link Set chooses one of its links and sends the message across it. It tracks the sequence number of the message and will resend the message if it detects it is not received. Normally TCP will take care of message order and failed delivery, but the combined connection loses this property even if using two links built on top of TCP. A message could be sent down one TCP Link, which then immediately closes, dropping the message. The other TCP Link has no knowledge of this message and therefore cannot retry it.
The Link Set can be in one of six different states. This keeps the behavior for
each mode separate, and easy to write. The general lifecycle is to start at
Disconnected. Then attempt to connect by moving to Connecting. Once
connected, the Link Set and its peer negotiate an epoch in the EpochMismatch
state. On successful negotiation the Link Set moves to Connected, where it
exchanges messages. If all Links are lost the Link Set moves to either
Reconnecting to actively restore the link, or GracePeriod to passively allow
its peer to reconnect. Finally, if all else fails and auto connect is disabled,
the Link Set moves back to Disconnected.
The first state is Disconnected. This is the starting state, and the simplest
state. It waits until it is told to connect, then transitions to Connecting.
If a new Link is received, it transitions to EpochMismatch. Any other state
can transition to Disconnected if the Link Set is given the command to
disconnect.
Next is Connecting. The Link Set contains a list of connectors, and potential
peer addresses. When it attempts to establish a connection, it feeds each
address to each connector to see if the connector yields a Link. This state
continues until it times out, upon which it transitions back to Disconnected.
If this state receives a new Link, either from the connector or by happenstance,
it transitions to EpochMismatch.
The EpochMismatch state is one of the most complex states. It is also the
first state to have an active Link to its peer. The Link Set considers itself to
be connected for any duration where it has at least one Link. It considers
itself disconnected only when it has lost all Links. The connected duration that
spans multiple links is an ‘epoch’.
However, this presents an issue where if the peer dies and is able to establish
a new Link before the old Link times out, one side would believe its in the
middle of an active session and the other doesn’t know whats going on. Therefore
each epoch is numbered, and each message carries this number. In this state,
each side presents what it thinks the next epoch should be to its peer. The
larger of the two numbers is chosen. When both peers present equal epochs to
each other, they are in agreement and transition to Connected. The Link Set
emits a connected message to its caller to inform it that it can send messages.
If instead the link dies during this negotiation, the Link Set transitions to
Disconnected, or if auto connect is enabled to Connecting.
Once in the Connected state the Link Set and its peer exchange messages with
each other. Here is where the resend and message ordering logic is applied. In
addition to an epoch number, each message also has a sequence number. Messages
are kept until a sequence number ahead of them is acknowledged as received.
Every so often, the first message is resent to ensure delivery. Incoming
messages are also stored, and only emitted to the application once all previous
sequence numbers have been emitted.
The Link Set maintains multiple Links, and every so often does a partial sort to push lower latency Links towards the beginning, and higher latency ones towards the end. When sending a message, it chooses a Link from the beginning of the list and sends the message through it. This way when a Link is dropped, it can automatically choose the next best one from the list.
If a message arrives with an unexpected epoch the Link Set concludes that its
peer must have reset, and it transitions back to EpochMismatch to renegotiate
the epoch. Since this means that the epoch has ended, The Link Set emits a
disconnect message locally so its caller is informed.
If all Links are lost the Link Set transitions to either Disconnected,
Reconnecting or, GracePeriod depending on its configuration settings.
In the Reconnecting state, the current epoch from the last Connected state
is maintained. This is to allow brief transient network disruptions to recover
without having to bother the application. The Link Set attempts to create a new
Link by using the same list of connectors and peer addresses it used during the
Connecting phase. If a new Link is established before Reconnecting times
out, the Link Set transitions directly back to Connecting without
renegotiating the epoch.
If instead the time limit elapses without establishing a new Link, the Link Set
transitions to Disconnected or if auto reconnect is enabled, back to
Connecting. The distinction between Reconnecting and Connecting is whether
the epoch is maintained. Leaving this state means the epoch reset, and therefore
a disconnected message is returned to the caller.
The final state GracePeriod exists as a counterpart to Reconnecting. If one
side entered Reconnecting but the other side didn’t, the epoch would be reset
by the other side. However, sometimes one side might not have the ability to
reestablish a Link. The GracePeriod allows the epoch to last a little longer
without any attempt to reconnect. If there is an incoming connection within its
timeframe it is able to transition back to Connected, otherwise it follows the
same behavior as when Reconnecting’s timer expires.
One of the goals at the start of this project was to avoid reimplementing too much of the TCP protocol. TCP is already written and tested by decades of use. However, as mentioned before, the existence of multiple channels hamstrings the TCP protocol’s efforts to provide message ordering and resending. Therefore some of the future work on this project is likely to look more and more like a multichannel version of TCP.
Another benefit to implementing more of TCP’s internals is greater control over the aggregate behavior of the Link Set. For example, TCP measures lost packets to determine the transmission rate and how much backpressure to apply. This would be directly useful to how the Link Set implementation chooses which Link to transmit across. If there are multiple Links, some of which are implemented through different overlay networks, that are all operating across the same physical network link, no performance can be gained by splitting the load across them. However, if there are multiple Links that utilize different physical links, some performance may be gained. Taking inspiration from TCP’s implementation could be a good guide for the design here.
As more of TCP’s features are added, there will be more and more benefit to switching the default Link implementation from TCP to UDP to avoid a TCP over TCP scenario. A switch to UDP also has the benefit of making NAT hole punching easier.