The ethereum 2.0 testnet crashed this Friday and is currently stuck, unable to reach finality.
A time-related bug brought down Prysm, which, quite astonishingly, is being used by the vast majority of validators despite there being five clients.
It seems Prysm was the only client to provide a nice onboarding tutorial, so nearly everyone ignored our repeated suggestion to use different clients, a suggestion that follows from how slashing works.
Everyone on Prysm got slashed here because a time synchronizer jumped four hours into the future, giving the error:
“WARN roughtime: Roughtime reports your clock is off by more than 2 seconds offset=4h0m0.028854657s.”
Apparently “the nodes connect to a NTP server to sync their time and they returned wrong values. Currently they use 6 NTP servers to mitigate this, but it seems this was not enough as they all returned wrong timings.”
According to the diagnosis report: “The cloudflare roughtime servers all returned wrong information, and Prysm nodes did not properly fallback from this situation.”
Raul Jordan, an eth 2.0 dev for Prysmatic, explained further that the current participation rate is apparently not correct because:
“Almost no one is synced to the chain head, so unless you have a node that is synced to head, we can’t get participation reliably. Not even sure if there’s > 0% participation.”
Nishant Das, another eth 2.0 dev for Prysm, explained that some Prysm nodes are synced to the tip, but too many people are trying to sync at the same time, so nodes trying to onboard are getting errors, as pictured in the featured image. Explaining further, Jordan says:
“Time is critical for eth2. Without synchronized time, then network cannot function properly. You can rely on system time, which will invariably drift away. We use Cloudflare’s roughtime as a way to adjust your local clock if it is off.
However, roughtime was off by 4 hours yesterday, which led to chaos. The solution was to not forcibly adjust people’s time based on roughtime but instead log errors telling them their time is off.”
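To make that fix concrete, here is a minimal sketch of a "warn, don't adjust" check. It is not Prysm's code: the function name `check_clock` and the example timestamps are hypothetical, and the 2 second threshold is simply lifted from the log line quoted above. The point is only that the check compares the local clock to a reference and reports the gap, without ever touching the system time.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, taken from the "more than 2 seconds" wording in the quoted log line.
MAX_OFFSET = timedelta(seconds=2)


def check_clock(local_time, reference_time, max_offset=MAX_OFFSET):
    """Compare local time to a reference and return (offset, warning).

    The warning is None when the two clocks agree to within max_offset.
    Nothing here changes the system clock; the caller only logs the result.
    """
    offset = reference_time - local_time
    if abs(offset) > max_offset:
        seconds = max_offset.total_seconds()
        return offset, f"clock is off by more than {seconds:g} seconds offset={offset}"
    return offset, None


# Arbitrary example timestamp; the reference is four hours (and a bit) ahead.
local = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
reference = local + timedelta(hours=4, seconds=0.028854657)

offset, warning = check_clock(local, reference)
if warning:
    print("WARN", warning)
```

If the reference source itself is the one that is wrong, as happened here, a node running this kind of check gets a noisy log but keeps its own time.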
So this little bug brought the whole thing down, with the last slot (block) bearing yesterday’s date:
Other clients are fine, so one solution is to switch to another client, although the bug has now been fixed. However, Jordan says:
“In fixing this bug, we accidentally removed all critical features for Prysm nodes to function, making the issue infinitely worse.”
That’s the joy of testnet. It’s fun and games here, but it reminds us a bit of Peter Szilagyi and other eth devs hacking at DDoS code in the middle of the night, just hours before Devcon was to open in Shanghai in 2016.
That was on mainnet. Here, thank god, this happened on a testnet with fake eth, although everything else is in the configuration of the mainnet.
There are some 30,000 validators on it, staking about one million eth, who have now seen first hand why they should run away from the most used client: the ethereum 2.0 incentives are designed to favor small or obscure, but secure, clients, operating systems, and anything else that makes up part of the setup.
Because if there’s some bug on, say, Windows that affects eth 2.0 somehow, then all validators running Windows go down and get slashed, while those on Linux and other operating systems are not affected.
That’s a lesson that needs to be drilled in, but a different sort of bug has been discovered here: people follow tutorials, and therefore tutorials need to be diversified as well, not just clients.
Another discovery is that the network has just stopped. Ethereum has never stopped. Whether during that 2016 event mentioned above or the DAO hack or the fork, whatever happened, blocks kept coming. Here, it has stopped.
Going by memory, problems start once around 30% of validators fall, get bigger after about 50%, and then the network clearly just stops after 70%.
Some sort of complex process kicks in at that point that rebalances with the aim of restarting the network, all of which we’ll probably get familiar with in the coming days.
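The percentages above are from memory, but the one hard number in the eth 2.0 design is that finalizing a checkpoint needs attestations from at least two-thirds of the staked eth, so finality is lost once more than a third of the stake goes offline. A rough sketch of that arithmetic follows. The function names are ours, the one million eth total is the round figure mentioned earlier, and the offline shares are hypothetical rather than measured.

```python
from fractions import Fraction

# Two-thirds supermajority of stake needed to finalize a checkpoint.
FINALITY_THRESHOLD = Fraction(2, 3)


def can_finalize(online_stake, total_stake):
    """True if the online stake (whole eth) meets the two-thirds threshold."""
    if total_stake <= 0:
        raise ValueError("total_stake must be positive")
    return Fraction(online_stake, total_stake) >= FINALITY_THRESHOLD


def min_online_for_finality(total_stake):
    """Smallest whole-eth online stake that still reaches two-thirds."""
    return -(-2 * total_stake // 3)  # ceiling division


TOTAL_STAKE = 1_000_000  # roughly one million eth, per the figure above

# Hypothetical offline shares, for illustration only.
for offline_pct in (30, 50, 70):
    online = TOTAL_STAKE * (100 - offline_pct) // 100
    print(f"{offline_pct}% offline -> can finalize: {can_finalize(online, TOTAL_STAKE)}")
```

On these assumptions a 30% outage sits just on the safe side of the line, while anything past a third does not, which fits the picture of a single dominant client being able to halt finality on its own.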
But for now the trick seems to be either to move to another client or to just wait a bit instead of syncing straight away, because everyone syncing at once seems to have a DDoS effect.
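The "wait a bit" advice amounts to spreading restarts out in time. One common way to do that, sketched here as an illustration and not as anything the Prysm team has said it uses, is exponential backoff with random jitter: each node waits a random amount before retrying, and the window grows with every failed attempt. The function name `resync_delay` and the base and cap values are made up for the example.

```python
import random


def resync_delay(attempt, base=5.0, cap=300.0, rng=random):
    """Seconds to wait before sync attempt number `attempt` (0-based).

    The waiting window doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly from that window so that nodes which failed at
    the same moment do not all retry at the same moment.
    """
    window = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, window)


# Seeded generator so the example prints the same delays on every run.
rng = random.Random(42)
for attempt in range(5):
    print(f"attempt {attempt}: wait {resync_delay(attempt, rng=rng):.1f}s")
```

In practice a node operator gets most of the benefit by simply not restarting the moment everyone else does.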