A brief celebration over the ethereum 2.0 testnet finally getting a justified block, and then a finalized block, gave way to nodes crashing.
“I just had a WHOLE bunch of peers disconnect,” said one ethereum 2.0 testnet node runner.
“My beacon node just went bonkers,” said another. “Yeah got a lot of invalid finalized root,” commented a third.
“All teku nodes went down that’s why you lost peers… at the same time upon finality,” said Raul Jordan of Prysmatic.
“I dropped from 850 to 268 peers,” came another comment. “I’m getting way out of sync again. 10 slots behind everyone now,” added one more.
“Seems like a disagreement on what is the finalized root,” came what for now may be the conclusion.
“Were there THAT many teku nodes as part of the network? I lost about 300-400 peers out of 400-500 (don’t remember exact #s),” asked someone.
“No I don’t think it’s just teku,” Preston Van Loon of prysm said. “For some reason peers are not responding with the correct finalized head.”
Participation has dropped again, at times to 25% or so, as nodes kick each other out, each thinking everyone else is on the wrong fork.
As it happens, it is probable they are all on the wrong fork. The network has experienced a total failure and has therefore broken down at the protocol level, with “extra-protocol measures” most likely needed now, as the developers themselves explain in their fork-choice documentation.
As such, this instant crash on finalization should have been expected, you’d think, but the eth 2 coordinator Danny Ryan appears to have thought otherwise, while Vitalik Buterin is kind enough to just sit and watch despite ostensibly designing this whole thing.
Jordan, however, says “there was no disagreement on the finalized block between nodes. The peer disconnections were due to a different bug at the time of finality that had to do with inefficiencies in processing at the database layer. There is a fix in progress in prysm and in teku as we speak.”
Apparently a bit of code meant to migrate 30k slots’ worth of states to the database failed.
“With the order of logic here, we had updated our finalized checkpoint in memory but not updated it on the database.
When Prysm verified status messages from peers, it checked that the block was finalized in the database which it was not marked as such because of the above error,” Van Loon says.
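To make the ordering problem Van Loon describes concrete, here is a minimal sketch in Python. It is not Prysm's code (Prysm is written in Go), and every name in it (`Node`, `migrate_states`, `on_finality_memory_first`, `on_finality_db_first`, `accepts_peer_status`) is hypothetical. It assumes only what the quote states: the finalized checkpoint was updated in memory, a migration step failed before the database was marked, and peer status checks consulted the database. The reordered handler shows one possible way to avoid the mismatch, not necessarily the fix the teams shipped.

```python
class MigrationError(Exception):
    """Raised when the (hypothetical) state migration step fails."""


class Node:
    def __init__(self):
        # Root of the latest finalized checkpoint held in memory.
        self.finalized_in_memory = None
        # Block roots that the database has marked as finalized.
        self.finalized_in_db = set()

    def migrate_states(self, fail):
        # Stand-in for migrating many slots' worth of states to the database.
        if fail:
            raise MigrationError("state migration failed")

    def on_finality_memory_first(self, root, migration_fails):
        # Ordering described in the quote: memory is updated first, so a
        # failed migration leaves the database unaware of the finalized block.
        self.finalized_in_memory = root
        self.migrate_states(migration_fails)
        self.finalized_in_db.add(root)

    def on_finality_db_first(self, root, migration_fails):
        # Assumed alternative ordering: record finality in the database
        # before attempting the heavy migration.
        self.finalized_in_db.add(root)
        self.finalized_in_memory = root
        self.migrate_states(migration_fails)

    def accepts_peer_status(self, peer_finalized_root):
        # Peer status messages are checked against the database, not memory.
        return peer_finalized_root in self.finalized_in_db


def simulate(handler_name):
    """Run one finality event with a failing migration; report node state."""
    node = Node()
    handler = getattr(node, handler_name)
    try:
        handler("0xabc", migration_fails=True)
    except MigrationError:
        pass  # the node keeps running with whatever state it reached
    return node.finalized_in_memory, node.accepts_peer_status("0xabc")


if __name__ == "__main__":
    print(simulate("on_finality_memory_first"))  # ('0xabc', False)
    print(simulate("on_finality_db_first"))      # ('0xabc', True)
```

In the first case the node believes the block is finalized yet rejects honest peers that report the same root, which matches the mass disconnections described above without any actual disagreement on the finalized block.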
So maybe this is just a simple bug rather than a question of ‘how do nodes know what fork,’ but exactly what it is remains to be seen, depending on how it is resolved.