Hey vexr, are you on domain 0 (Nova) on all these operators? I’ve just successfully got to chainhead on domain 1 (auto-id) with sep-03.
Here are some logs if you want to compare. I did spot all the "Failed to dial peer" messages during bootstrapping at the start, but don’t have time right now to go through them properly.
I am running multiple nodes for different purposes, each running on the same system configuration within its own VM environment. Of those that didn’t reach the chainhead, two were non-domain nodes, two were on domain 0, and two were on domain 1.
Although this issue seems specific to the v2 release in a VM environment, I wanted to share my experience. The log I provided was brief, but I updated almost immediately and dealt with the issue for about four hours before reporting it.
Since my original post, I’ve been running sep-03 continuously on half of the nodes (one on each domain), but the sep-03 builds haven’t been able to keep up, while I’ve had no issues on jul-29.
Currently, the domain blocks are all derived locally from the consensus blocks, so if consensus chain sync is slow or paused, the domain chain will also progress slowly or pause. Since there are also consensus nodes having this issue, I think it is related to consensus chain syncing.
From the above consensus node log, after the node is started it progresses slowly for the first 10 minutes (#3131593 → #3131614), then gets stuck at #3131614 with these errors:
2024-09-04T06:53:31.728618Z WARN Consensus: sync: 💔 Error importing block 0x6baed6897ecbbff758ce476805fc1c7c17dc08189cf34fb99c6c83df0173d56d: block has an unknown parent
2024-09-04T06:53:31.729262Z WARN Consensus: sync: 💔 Error importing block 0x9e8bcb6d6cdd6ac944f76f7e2a9c264f31555ba6412b95e56bf8e9fedcbeeb19: block has an unknown parent
2024-09-04T06:53:31.729751Z WARN Consensus: sync: 💔 Error importing block 0x2f6d9a419c515cde0b5588b60a6b8daf7a8ac4be243800c5b167f74d5a7425fa: block has an unknown parent
...
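For anyone who wants to pull these errors out of a longer log, here is a minimal sketch in Python. It assumes the line layout shown above (timestamp, WARN, then "Error importing block <hash>: <reason>"). The names `IMPORT_ERROR`, `parse_import_errors` and `count_by_reason` are made up for this example and are not part of the node.

```python
import re
from collections import Counter

# Assumed line layout, taken from the three log lines quoted above:
# <timestamp> WARN Consensus: sync: 💔 Error importing block <hash>: <reason>
IMPORT_ERROR = re.compile(
    r"^(?P<ts>\S+)\s+WARN\s+.*Error importing block "
    r"(?P<hash>0x[0-9a-fA-F]+): (?P<reason>.+)$"
)


def parse_import_errors(lines):
    """Return a (timestamp, block_hash, reason) tuple for each import-error line."""
    errors = []
    for line in lines:
        match = IMPORT_ERROR.match(line.strip())
        if match:
            errors.append((match["ts"], match["hash"], match["reason"]))
    return errors


def count_by_reason(errors):
    """Count how many import errors were logged for each reason."""
    return Counter(reason for _, _, reason in errors)


# The three lines quoted in this post, used as sample input.
sample = [
    "2024-09-04T06:53:31.728618Z WARN Consensus: sync: 💔 Error importing block 0x6baed6897ecbbff758ce476805fc1c7c17dc08189cf34fb99c6c83df0173d56d: block has an unknown parent",
    "2024-09-04T06:53:31.729262Z WARN Consensus: sync: 💔 Error importing block 0x9e8bcb6d6cdd6ac944f76f7e2a9c264f31555ba6412b95e56bf8e9fedcbeeb19: block has an unknown parent",
    "2024-09-04T06:53:31.729751Z WARN Consensus: sync: 💔 Error importing block 0x2f6d9a419c515cde0b5588b60a6b8daf7a8ac4be243800c5b167f74d5a7425fa: block has an unknown parent",
]

errors = parse_import_errors(sample)
for ts, block_hash, reason in errors:
    print(ts, block_hash, reason)
print(count_by_reason(errors))
```

The extracted hashes can then be looked up on an RPC node to see which heights they belong to.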
I checked against the gemini-3h consensus RPC node: these error logs point to the blocks following #3131614, i.e. #3131616, #3131617, #3131618, etc., but not #3131615. The node then synced to #3131615 (0x38fc…270f), which is on a stale fork, because according to the RPC node the hash of #3131615 is 0xe5c6509154fa7ab6295421fec2ade76b71d3305439808e7915f18d86bbbac82e. After that the node stayed stuck at #3131615 until the end of the log.
So I guess there is an issue with consensus chain networking, because the node can’t fetch the correct #3131615 from the canonical fork. @nazar-pc or @shamil, please take a look.
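To make the stale-fork check repeatable, here is a rough sketch in Python of the comparison I did by hand. It works on two height → hash maps, one from the local node and one from a reference RPC node. In practice the hashes could be fetched with the standard Substrate `chain_getBlockHash` RPC call, but the sketch stays offline. `first_divergence` is a made-up helper name, and the hash for #3131614 is a hypothetical placeholder since it does not appear in the logs.

```python
def first_divergence(local, reference):
    """Return the lowest height where both maps have a hash and they differ.

    Returns None when every shared height has the same hash on both sides.
    """
    for height in sorted(local.keys() & reference.keys()):
        if local[height] != reference[height]:
            return height
    return None


# Hypothetical placeholder: the real hash of #3131614 is not given in this thread.
SHARED_3131614 = "0x" + "00" * 32

local_hashes = {
    3131614: SHARED_3131614,
    # Only the abbreviated form is quoted in this post; it is kept as-is.
    # Its prefix already differs from the canonical hash below.
    3131615: "0x38fc…270f",
}

reference_hashes = {
    3131614: SHARED_3131614,
    # Hash of #3131615 as reported by the gemini-3h RPC node.
    3131615: "0xe5c6509154fa7ab6295421fec2ade76b71d3305439808e7915f18d86bbbac82e",
}

fork_height = first_divergence(local_hashes, reference_hashes)
print("first diverging height:", fork_height)
```

If the function returns a height, the local node is on a different fork from that block onwards. `None` means the two nodes agree on every height they were both asked about.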
I tried running it again (sep-03) and let it go for six hours. The log shows that while it downloads some blocks (albeit slowly), it never fully catches up. However, after reverting to the jul-29 version, it processed the entire backlog almost immediately.
Try to copy the database and run just a consensus node with that copy (but with the same sync/pruning options).
We did upgrade Substrate and there could be some upstream or downstream changes affecting this, but the first thing is to identify if domains have anything to do with it.
Also, I see you have 128 connections instead of the normal 40. Removing the customization of the number of peers might help as well, depending on what the root cause of the issue is. There shouldn’t be any reason for you to benefit from it anyway.
I should have clarified earlier: this latest test was conducted on a consensus node only. I’m encountering the same issue with domain nodes as well, but the log provided is from a consensus node.
I am running the test again at chainhead (obtained with jul-29), but this time without any custom peer connections; only the specified sync/pruning options are applied. I’ll provide an update in a few hours with a new log, but so far it seems that syncing is still failing.
I am seeing high CPU usage. You can see the VM running at < 3% CPU until I started testing again with sep-03. The system is pegged with this release while trying to sync.
So far can’t reproduce. Had archival node data with height 2430237 blocks (I believe synced with one of the June releases), then synced from DSN using jul-29, but it then failed to finalize for some reason, so I switched to sep-03 and it finished sync shortly afterwards and has no issues staying in sync.
I used --chain gemini-3h --sync full --blocks-pruning archive-canonical --state-pruning archive-canonical.