Two Generals – A Story Of Nodes And Networks

I ran into an interesting problem the other day that I just couldn’t solve. It turns out that I could not solve it because it is an unsolvable problem.

The idea behind the project is simple enough; I want to create a distributed application that can run across multiple machines and update at a rate that would be acceptable for a real-time video game loop. In order to achieve this I figured that I would split the game space into nodes which can be linked to other nodes, generally by geographical nearness. Each server would host a set of nodes for which it would hold the master copy, and replicated nodes for those on the edge that are held by another machine. Through a priority rule each node can determine for itself if it should be the next one in it’s cluster of connections to execute (I’ll probably write more about this once I get it working). In theory, this can work because each node “owns” it’s own data.

Where the trouble comes in is creating and destroying connections between nodes. The challenge here is that neither node really “owns” the link, and in creating or destroying a link you don’t really want to commit to it unless both nodes are able to commit to it, at the same time. The trouble with messaging is that certain guarantees are impossible. By using a reply and acknowledgement pattern you can know with certainty that your partner did receive a message when you receive an acknowledge back, but you can’t know if they didn’t receive it because their not receiving it and your not receiving an acknowledgement looks the same to you; no acknowledgement. This scenario is generally pretty easy to solve; you send a message, and if you don’t get an acknowledgement within the time you expected to or your partner continues to send you other messages and no acknowledgement then you send it again. On the partner’s side, you need to be able to handle the possibility of getting the same message twice; you can do this by having a message id, message order indicator, or by being able to process messages in a way that is idempotent.

Back to the links between nodes; we don’t just need to know that our partner got the message, we also need our partner to know that we know that they got the message, and that we are now both committed to the action. As I came to discover this problem is more commonly known as the two generals problem and it has been formally proven to be unsolvable. On a network without 100% reliability (all networks) there is no way to ensure that both servers are committed to some change in data, and therefor no way to really have two equal and independent master copies. Given this constraint I have two options. One solution is to use a series of messages to make the possibility of something going wrong increasingly small until there is a tolerable possibility of failure. This is basically how GUIDs work – it is not that it is not possible that they would ever collide; it is simply that it is so unlikely that you just don’t worry about the possibility. The other solution is to rework my design to make it so that one of the nodes owns the master copy of the link and just keeps messaging the other node until it confirms the change.

I’d love to conclude with the solution here, but honestly I am still working through the idea. The links are integral to how each node updates, and if it has the wrong set for even a single cycle it could put the process into an invalid state.

Leave a Reply

Your email address will not be published. Required fields are marked *