The Ethereum 2.0 testnet consensus failure appears to be fatal, with the network still not up and running three days after a bug brought down nodes.
That bug instantly crashed all Prysm nodes, which were the vast majority, but it also increased the resource requirements of other clients, with Lighthouse nodes, for example, needing 70GB of storage and some 14GB of memory at one point.
This caused the chain to break into four or more forks, and the only solution now apparently is a hardfork. Explaining fork choice, Vitalik Buterin previously said:
“One important thing to note is that the fork choice is not a pure function; that is, what you accept as a canonical chain does not depend just on what data you also have, but also when you received it.
The main reason this is done is to enforce finality: if you accept a block as finalized, then you will never revert it, even if you later see a conflicting block as finalized.
Such a situation would only happen in cases where there is an active >1/3 attack on the chain; in such cases, we expect extra-protocol measures to be required to get all clients back on the same chain.”
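To make Buterin's point concrete, here is a toy Python sketch, not client code, of a node that never reverts a block it has accepted as finalized. The block tree, the `ToyNode` class and the `final_block` helper are all hypothetical; the only rule modeled is that a conflicting finality proof arriving later is ignored, so two nodes fed the same data in a different order end up on different chains.

```python
from typing import Dict, List, Optional

# Hypothetical block tree (child -> parent). "A" and "B" sit on two
# conflicting forks; "A2" builds on top of "A".
PARENT: Dict[str, Optional[str]] = {
    "genesis": None,
    "A": "genesis",
    "B": "genesis",
    "A2": "A",
}


def is_descendant(block: Optional[str], ancestor: str) -> bool:
    """Walk parent links from block back towards genesis looking for ancestor."""
    while block is not None:
        if block == ancestor:
            return True
        block = PARENT[block]
    return False


class ToyNode:
    """Toy node that never reverts a block once it accepts it as finalized."""

    def __init__(self) -> None:
        self.finalized = "genesis"

    def on_finalized(self, block: str) -> None:
        if is_descendant(block, self.finalized):
            # Extends what we already finalized, so it is safe to move forward.
            self.finalized = block
        # Otherwise it conflicts with our finalized block and is ignored,
        # even if its finality proof looks just as valid.


def final_block(arrival_order: List[str]) -> str:
    """Feed finality proofs to a fresh node in the given order."""
    node = ToyNode()
    for block in arrival_order:
        node.on_finalized(block)
    return node.finalized


# Same two finality proofs, different arrival order, different chains.
print(final_block(["A", "B"]), final_block(["B", "A"]))  # A B
print(final_block(["A", "A2"]))  # A2
```

Neither node is buggy here; each follows the rule, which is why getting all clients back on the same chain needs the extra-protocol measures Buterin mentions.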
The explanation does not say what these extra-protocol measures are, but in his only, subtle comment on the events, Buterin links to a 2016 article that basically says that if there’s a 1/3 attack, you just hardfork.
In this case there isn’t an attack as such, or at least not one we know of; it is probably just Peter Todd doing some free testing. But the code obviously can’t tell the difference between a ‘malicious’ attack and all these nodes suddenly falling over due to an ‘innocent’ bug.
That is, it can’t tell an attack from an accident, so it treats both the same.
Certain complex mechanisms have now kicked in, with one part of the protocol, FFG, the finality gadget that provides safety, preventing block finalization, while the other part, the Greedy Heaviest Observed Subtree (GHOST) fork choice rule, keeps counting votes.
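FFG stalls because it needs a two-thirds supermajority of stake to vote before it can justify and finalize, which is also where the 1/3 attack figure comes from. The toy sketch below uses the integer form of that check (attesting stake times 3 compared with total stake times 2, the same shape as the comparison in the phase 0 spec) with made-up client stake shares, which are not the real testnet distribution. It shows why a crash of the majority client and an attack withholding the same votes look identical: the check only ever sees how much stake voted.

```python
from typing import Dict, Set

# Made-up stake shares per client, for illustration only; these are not
# the real testnet client distribution.
STAKE: Dict[str, int] = {"prysm": 65, "lighthouse": 20, "teku": 10, "nimbus": 5}


def attesting_stake(stake: Dict[str, int], voting: Set[str]) -> int:
    """Sum the stake of clients whose votes actually arrived."""
    return sum(amount for client, amount in stake.items() if client in voting)


def can_justify(attesting: int, total: int) -> bool:
    """Integer form of attesting / total >= 2/3 (no floating point)."""
    return attesting * 3 >= total * 2


total = sum(STAKE.values())
everyone = set(STAKE)
without_prysm = everyone - {"prysm"}

# Prysm nodes crashed by a bug, or Prysm validators deliberately
# withholding votes: the check sees exactly the same missing stake.
crashed = without_prysm
attacking = without_prysm

print(can_justify(attesting_stake(STAKE, everyone), total))   # True
print(can_justify(attesting_stake(STAKE, crashed), total))    # False
print(can_justify(attesting_stake(STAKE, attacking), total))  # False
```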
“Since GHOST is live, but not safe, it may change its mind about the head of the chain – this is because new blocks are continually added to the chain, which means nodes keep learning new information,” Carl Beekhuizen, an eth 2 dev, said previously.
In short, GHOST kind of has no clue about what’s going on, and FFG, knowing as much, is thus saying the chain can’t move forward to finality.
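A toy version of GHOST helps show why it keeps changing its mind. Eth 2 actually uses a latest-message-driven variant (LMD GHOST), and the sketch below only assumes its broad shape: start at the root and, at every fork, step into the child whose subtree carries the most latest votes. The block tree, validator names and one-vote-per-validator weighting are hypothetical (real clients weigh votes by stake), and the tie-break by block id only loosely mirrors the spec's tie-break by block root.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical block tree (child -> parent) with two competing forks.
PARENT: Dict[str, str] = {"A": "genesis", "B": "genesis", "A1": "A", "B1": "B"}


def children_of(parent: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert the child -> parent map into parent -> list of children."""
    kids: Dict[str, List[str]] = {}
    for child, par in parent.items():
        kids.setdefault(par, []).append(child)
    return kids


def subtree_weight(block: str, kids: Dict[str, List[str]], counts: Counter) -> int:
    """Votes for this block plus votes for any of its descendants."""
    return counts.get(block, 0) + sum(
        subtree_weight(child, kids, counts) for child in kids.get(block, [])
    )


def ghost_head(root: str, parent: Dict[str, str], latest_votes: Dict[str, str]) -> str:
    """Walk down from root, always taking the child with the heaviest subtree."""
    kids = children_of(parent)
    # Only each validator's latest vote counts (one unit of weight each here).
    counts = Counter(latest_votes.values())
    head = root
    while kids.get(head):
        # Heaviest subtree wins; ties go to the larger block id.
        head = max(kids[head], key=lambda c: (subtree_weight(c, kids, counts), c))
    return head


votes = {"v1": "A1", "v2": "A", "v3": "B1"}
print(ghost_head("genesis", PARENT, votes))  # A1

# New information arrives: v2 switches forks and a new validator votes.
votes["v2"] = "B1"
votes["v4"] = "B1"
print(ghost_head("genesis", PARENT, votes))  # B1
```

Nothing about the first head was wrong; the rule simply learned more, which is what being live but not safe means in practice.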
Some devs suggest that once all clients sync, they can see everything that’s going on and GHOST stops changing its mind, but it looks like even when clients sync to the tip, they still fall behind now and then. Raul Jordan, an eth 2.0 dev for Prysm, says:
“We’re looking today on why nodes are lagging behind once they are synced to chain head. We are gathering as much data as possible…
We have several unanswered questions here that we need information about:
1. Why do nodes sync to head just fine but then end up lagging behind eventually? (It always happens, it’s just a matter of time).
2. Does this lag correlate with other events, such as resource consumption, the peers you’re connected to, etc.
3. Look into propagation of objects and fork resolution.
4. Look into the status of parent block requests and how quickly we are able to resolve them. Gonna be looking at charts a lot today.”
There is no confirmation from devs that this has to hardfork; that is our own suggestion based on the available information and previous statements. The situation seems a bit like a bitcoin upgrade going wrong, as in 2013 or 2015, leaving two chains that both initially appear valid until one of them ‘wins.’
You could of course let this process run its course and wait weeks or more, during which time the network is unsafe to use, or you could use some social coordination to pick one fork or client and discard the rest.
Since bugs happen, and consensus failures have happened and can happen in a live environment, sprucing up that process now should make the live run a lot smoother.
So there’s no rush, as this is a testnet, and gathering all the information on whether everything ran and is running as it should is important. Still, even though this is a testnet, they do have to get it right, because we’re sure Todd et al are watching.
The questions here are which fork to choose and how, and also how this hardfork client would work and what it would change.
Those are for others to answer, perhaps including Vitalik Buterin himself, who is still an eth 2 dev, as well as other eth 2 protocol or design devs.
In addition, days of downtime can be tolerated on a testnet, but in a live environment even an hour or two is too much once we get past phase 1.5. Until then it is still kind of a testnet, but with real eth, in as far as there are no transactions or value exchanges, and therefore no full history or full ledger as such.
So at that point there will presumably have to be a sort of backup hardfork client that goes live as soon as it becomes clear there has been a consensus failure.
So although the optics here aren’t great, in a way this does not seem much different from what would happen in bitcoin, depending of course on how devs now resolve the situation on the testnet.
That also means that, conceptually, there shouldn’t be any delay to the live launch, as everything seems to have worked as it should as far as is known so far. The bug that kicked it all off was itself very tiny and was solved almost instantly days ago, while the rest just seems to be the protocol doing its job.