There are currently four or five major forks on the ethereum 2.0 testnet (pictured) following a bug that brought down Prysm nodes and the whole network.
“We need to get all the clients to agree on the current head. It’s also causing sync havoc with peers joining claiming different views of the chain. There are a range of edge cases and fixes we will be deploying throughout the week to address all these,” Age Manning of Lighthouse says.
“It jumps up and down all over the place… it’s like it can’t decide,” someone running the prysm client says, with Nishant Das, a prysm dev, stating: “Yeah a lot of re-orgs.”
“There are many different forks happening right now and some nodes are stuck very far behind, so you get all these parent block requests to try and resolve it, but the major one is currently shown on eth2stats which has consensus between lighthouse and prysm,” says Raul Jordan, another prysm dev.
Due to all these forks, RAM requirements skyrocketed around midnight today London time:
“The most effective database compression techniques happen after finality,” Paul Hauner of Lighthouse says. “We’ve also seen some issues with the database that prevent pruning, but I’m not sure that’s playing a part here.”
The situation appears to have improved significantly over the past 12 hours or so, with more nodes reaching the chain head.
Node runners are asked to just let it run if they can, instead of restarting as that just loses all the sync up to that point. Also:
“I’m using --block-batch-limit=512 & --p2p-max-peers=200 seem to be chugging along,” says a node runner.
The max peer parameter isn’t part of the recommendation by devs, but the block batch limit is, with Das stating:
“So when you get stuck, it’s cycling through your peers to try to get unstuck, using larger batch sizes will cycle through those peers faster.”
A number of individuals say they are getting an error about requesting a parent block. They’re asked to just ignore it, as the node is working through all the forks, with Preston Van Loon publishing a tree of the network.
Apparently you can get this tree by going to localhost:8080/tree, which kind of allows you to see how the chain is running along.
As you might expect, it initially shows one chain running nicely, then two, each with its own chain, all of which eventually drop off, leaving one chain running again.
New nodes apparently just need to go through the syncing, and they need to become aware of these forks, and then they drop off those forks with the node then jumping to the valid tip.
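To picture what such a tree shows, here is a minimal sketch that counts the competing chain tips in a block tree. The block names and the parent map are made up for illustration, and this is not how prysm builds its tree or picks a head; it only shows that every block without a child is the tip of some fork.

```python
from typing import Dict, List, Optional


def chain_tips(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return blocks that no other block builds on (the tips of each fork)."""
    # Every block that appears as somebody's parent has at least one child.
    has_child = {p for p in parent_of.values() if p is not None}
    return sorted(b for b in parent_of if b not in has_child)


# Hypothetical tree: one chain (A <- B), then a split at B into two forks.
blocks = {
    "A": None,   # root of the tree
    "B": "A",
    "C": "B",    # fork 1
    "D": "B",    # fork 2
    "E": "D",    # fork 2 keeps growing
}

tips = chain_tips(blocks)
print(tips, "->", len(tips), "competing tips")
```

One tip means everyone sees a single chain; several tips mean nodes still have forks to work through.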
Apparently the network needs to get to a participation rate of 66%, with those that have dropped off getting slashed until then:
This etherean was nicely making quite a bit of money by effectively doing nothing after setting up the node.
Moreover, even after the slashing he is still in profit, but you can see the way up was a lot slower than the way down. He was earning by the day, he’s now losing by the hour.
That means he has very big incentives to get back to sync as the sooner he and others can reach the chain tip, the quicker he gets back to earning rather than losing money.
The devs are doing their best to help him on the way there, trying different tricks and fixes while also assisting anyone that needs help on their discord.
A few proudly announced they reached the tip, with what was a domino effect on the way down now potentially being a domino effect on the way up as the more nodes are at the tip, the more nodes can sync to it.
“Medalla is far from dead, it can be fixed,” Van Loon says, and it has to be fixed because this could potentially happen live as well.
Bitcoin for example has had chain-split forks twice or more on mainnet after some bug during an upgrade caused miners to be on different versions.
The bitcoin network keeps running during that time however, leading to social media announcements telling people to wait for 100 confirmations or more.
Here, by contrast, if 34% are knocked off, then the network stops running until they behave.
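The arithmetic behind that 34% figure can be sketched as follows. This is a simplification that assumes a two-thirds supermajority of staked balance has to take part for the chain to finalize; the function name and the round numbers are illustrative, not taken from any client.

```python
def reaches_supermajority(participating: int, total: int) -> bool:
    """True if participating stake is at least two thirds of total stake.

    Integer arithmetic avoids rounding trouble with 2/3.
    """
    return participating * 3 >= total * 2


total_stake = 100  # illustrative units, not real balances

for offline in (33, 34):
    online = total_stake - offline
    ok = reaches_supermajority(online, total_stake)
    print(f"{offline}% offline, {online}% online: supermajority reached = {ok}")
```

With 33 out of 100 offline, the remaining 67 still clear two thirds; with 34 offline, the remaining 66 fall just short.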
This is a developing story, so how they get to behave exactly is yet to be seen because no new slots/blocks have been found as the chain is currently not finalizing.
All this makes for a drill of what could potentially happen live, as bugs happen despite the greatest care, with lessons being learned, like having some method to quickly export keys to other clients. A prysm dev says:
“The whole point of having multiple clients is that you can switch to an alternative in the event that something is not working properly in your main client.
When we had the roughtime issues yesterday, it could have been a good idea to switch to another client to avoid liveness penalty.”
Roughtime is a supposedly decentralized way to sync time, which turned out to be not very decentralized, as six different time sources for some reason made prysm nodes jump ahead four hours, producing an error and thus the network crash. Das says:
“The arriving block coming in also has a slot number to it through which we determine if its valid or not. Basically genesis_time + slot_Num * slot_time.
If a node thinks it is 4 hours into the future, it rejects that block since it appears it is coming from the past.
This also messes prysm’s validators block proposal since now the local clock according to the validator is 4 hours ahead.”
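The formula Das mentions can be sketched in a few lines. This assumes 12-second slots; the genesis timestamp, the tolerance window and the looks_stale check are made up for illustration and are not prysm's actual validation code.

```python
SECONDS_PER_SLOT = 12  # assumed slot duration


def slot_start_time(genesis_time: int, slot: int) -> int:
    """genesis_time + slot_num * slot_time, as in the quote above."""
    return genesis_time + slot * SECONDS_PER_SLOT


def current_slot(genesis_time: int, now: int) -> int:
    """The slot the local clock believes the network is in."""
    return (now - genesis_time) // SECONDS_PER_SLOT


def looks_stale(genesis_time: int, block_slot: int, local_now: int,
                tolerance_slots: int = 32) -> bool:
    """Hypothetical check: block is too far behind the local clock's slot."""
    return current_slot(genesis_time, local_now) - block_slot > tolerance_slots


GENESIS = 1_600_000_000                           # made-up genesis timestamp
block_slot = 100_000
true_now = slot_start_time(GENESIS, block_slot) + 3   # 3 seconds into the slot
skewed_now = true_now + 4 * 60 * 60               # clock jumped 4 hours ahead

print(looks_stale(GENESIS, block_slot, true_now))     # correct clock
print(looks_stale(GENESIS, block_slot, skewed_now))   # skewed clock
print(current_slot(GENESIS, skewed_now) - block_slot, "slots of drift")
```

Four hours at 12 seconds per slot is 1,200 slots, so to a node with the skewed clock a perfectly fresh block looks 1,200 slots old.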
The fix for that was fairly simple: just not defaulting to roughtime. That led to the interesting developments now unfolding on the testnet, which runs on fake eth, so nothing is being lost and much is being gained.
In addition, if the network does come back to life through the coded mechanisms kicking in, then this shouldn’t delay the live launch, as everything would have happened as it is meant to happen. The bug itself is very tiny, presuming there are no code-related complications with the mechanisms that kicked in.