Node sync issues on sep-03

I have six nodes that are unable to reach the block tip on sep-03 but have no issues on jul-29.

Two are consensus nodes and four are domain nodes.

Environment

  • Operating System: Ubuntu 24.04 (VM), dual Xeon 2680 v3
  • Space Acres/Advanced CLI/Docker: subspace-node-ubuntu-x86_64-v2-gemini-3h-2024-sep-03

Problem

It does sync a few blocks, but it is not able to keep up with the network and just falls behind. No issues when reverting to the previous release (jul-29).

I do not have this issue running on subspace-node-ubuntu-x86_64-skylake-gemini-3h-2024-sep-03 with a non-domain node.

Parameters:
--sync full
--blocks-pruning archive-canonical
--state-pruning archive-canonical
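For context, a full invocation with these flags might look roughly like the following sketch (the `run` subcommand and base path are assumptions; the chain name and flags are the ones from this thread):

```shell
# Sketch only: adjust the subcommand, binary name, and base path to your release/setup.
./subspace-node run \
  --chain gemini-3h \
  --base-path /path/to/node-data \
  --sync full \
  --blocks-pruning archive-canonical \
  --state-pruning archive-canonical
```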

Log

Hey vexr, are you on domain 0 (Nova) on all these operators? I’ve just successfully got to chainhead on domain 1 (auto-id) with sep-03.

Here are some logs if you want to compare. I did spot all the Failed to dial peer during bootstrapping at the start but don’t have time right now to go through them properly.

I am running multiple nodes for different purposes, each running on the same system configuration within its own VM environment. Of those that didn’t reach the chainhead, two were non-domain nodes, two were on domain 0, and two were on domain 1.

Although this issue seems specific to the v2 release in a VM environment, I wanted to share my experience. The log I provided was brief, but I updated almost immediately and dealt with the issue for about four hours before reporting it.

Since my original post, I’ve been running sep-03 continuously on half of the nodes (one on each domain), but the sep-03 builds haven’t been able to keep up, while I’ve had no issues on jul-29.

Interesting. I am running the same parameters but am on Skylake bare metal. Thanks for reporting!

@ning any ideas what could be going on here?

Currently, domain blocks are all derived locally from consensus blocks, so if consensus-chain sync is slow or paused, the domain chain will also progress slowly or pause. Since there are also consensus nodes having this issue, I think the issue is related to consensus-chain syncing.

From the above consensus node log, after the node is started it progresses slowly for the first 10 minutes (#3131593 to #3131614), then gets stuck at #3131614 with errors like:

2024-09-04T06:53:31.728618Z  WARN Consensus: sync: 💔 Error importing block 0x6baed6897ecbbff758ce476805fc1c7c17dc08189cf34fb99c6c83df0173d56d: block has an unknown parent    
2024-09-04T06:53:31.729262Z  WARN Consensus: sync: 💔 Error importing block 0x9e8bcb6d6cdd6ac944f76f7e2a9c264f31555ba6412b95e56bf8e9fedcbeeb19: block has an unknown parent    
2024-09-04T06:53:31.729751Z  WARN Consensus: sync: 💔 Error importing block 0x2f6d9a419c515cde0b5588b60a6b8daf7a8ac4be243800c5b167f74d5a7425fa: block has an unknown parent 
...

I checked from the gemini-3h consensus RPC node: these error logs point to the blocks following #3131614, i.e. #3131616, #3131617, #3131618, etc., but not #3131615. The node then synced to #3131615 (0x38fc…270f), which is on a stale fork, because per the RPC node the canonical hash of #3131615 is 0xe5c6509154fa7ab6295421fec2ade76b71d3305439808e7915f18d86bbbac82e. After that, the node stays stuck at #3131615 until the end of the log.
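Anyone wanting to repeat this check can do so against any trusted RPC node. Here is a sketch using the standard Substrate `chain_getBlockHash` method; the RPC URL is an assumption, and the sample response below is illustrative rather than a live query:

```shell
# Canonical hash of #3131615 as reported earlier in this thread.
CANONICAL="0xe5c6509154fa7ab6295421fec2ade76b71d3305439808e7915f18d86bbbac82e"

# Against a live node you would fetch the hash like this (URL is an assumption):
#   curl -s -H 'Content-Type: application/json' \
#     -d '{"jsonrpc":"2.0","id":1,"method":"chain_getBlockHash","params":[3131615]}' \
#     http://127.0.0.1:9944
# Here we parse a sample response of the same shape instead of hitting the network.
RESPONSE='{"jsonrpc":"2.0","result":"0xe5c6509154fa7ab6295421fec2ade76b71d3305439808e7915f18d86bbbac82e","id":1}'
HASH=$(printf '%s' "$RESPONSE" | sed -n 's/.*"result":"\(0x[0-9a-f]*\)".*/\1/p')

if [ "$HASH" = "$CANONICAL" ]; then
  echo "block #3131615 matches the canonical chain"
else
  echo "block #3131615 is on a stale fork: $HASH"
fi
```

If your node reports a different hash for the same height, it is sitting on the stale fork.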

So I guess there is an issue with the consensus-chain networking, because the node can't fetch the correct #3131615 on the canonical fork. @nazar-pc or @shamil, please take a look.

This looks suspiciously similar to Blocks downloaded from peers undergoing reorg are treated as extending canonical chain and fail to import · Issue #2094 · paritytech/polkadot-sdk · GitHub, which is a long-standing known issue in Substrate. It should be possible to step over it once archiving processes this block, because sync from DSN is not affected by this issue, which is why we eventually closed Synced onto a fork · Issue #1744 · autonomys/subspace · GitHub.

I tried running it again (sep-03) and let it go for six hours. The log shows that while it downloads some blocks (albeit slowly), it never fully catches up. However, after reverting to the jul-29 version, it quickly processed all the backlog almost immediately.

Any thoughts on how to resolve?

Log

Try copying the database and running just the consensus node with that copy (but with the same sync/pruning options).
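A rough sketch of that isolation test, assuming a typical data-directory layout (the paths are placeholders, and the commented launch command reuses the flags from this thread but may need adjusting per release):

```shell
# Copy the existing database so the original node data stays untouched.
SRC="${SRC:-$HOME/.local/share/subspace-node}"        # assumed data directory
DST="${DST:-$HOME/.local/share/subspace-node-copy}"

if [ -d "$SRC" ]; then
  mkdir -p "$DST"
  cp -a "$SRC/." "$DST/"    # -a preserves permissions and timestamps
  echo "copied database to $DST"
else
  echo "source data directory not found: $SRC"
fi

# Then start a consensus-only node against the copy with the same options:
#   ./subspace-node run --chain gemini-3h --base-path "$DST" \
#     --sync full --blocks-pruning archive-canonical --state-pruning archive-canonical
```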

We did upgrade Substrate and there could be some upstream or downstream changes affecting this, but the first thing is to identify if domains have anything to do with it.

Also, I see you have 128 connections instead of the default 40; removing the peer-count customization might help as well, depending on what the root cause of the issue is. There shouldn't be any reason for you to benefit from it anyway.

I should have clarified earlier: this latest test was conducted on a consensus node only. I’m encountering the same issue with domain nodes as well, but the log provided is from a consensus node.

I am running the test again at chainhead (obtained with jul-29), but this time without any custom peer connections; only the specified sync/pruning options are applied. I'll provide an update in a few hours with a new log, but so far it seems that syncing is still failing.

Do you see high CPU usage in the process or not?
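One quick way to check, for reference (the process name here is an assumption):

```shell
# Report CPU/memory usage of the node process, if one is running.
PID=$(pgrep -f subspace-node | head -n1)
if [ -n "$PID" ]; then
  ps -o pid,%cpu,%mem,etime,args -p "$PID"
else
  echo "no subspace-node process found"
fi
```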

I am seeing high CPU usage. You can see the VM running at < 3% CPU until I started testing again with sep-03. The system is pegged with this release while trying to sync.



Interesting, I’ll check it on my end then; high CPU usage indicates it is probably doing something it shouldn’t.

Here is the latest log. It starts with being able to sync on jul-29 and the restart with sep-03 at timestamp 2024-09-06T06:55:34.248339Z.

It was never able to fully sync with just over two hours of runtime.

Log

So far I can’t reproduce. I had archival node data at height 2430237 (I believe synced with one of the June releases), then synced from DSN using jul-29, but it then failed to finalize for some reason, so I switched to sep-03, which finished syncing shortly afterwards and has had no issues staying in sync.

I used --chain gemini-3h --sync full --blocks-pruning archive-canonical --state-pruning archive-canonical.