Node out of sync

Issue Report

Environment

  • Operating System: Ubuntu 22.04.3 LTS
  • Advanced CLI

Problem

When the following error appears in the node logs, the node no longer stays in sync:

```
2024-06-09T14:54:11.405124Z ERROR sc_utils::mpsc: The number of unprocessed messages in channel `subspace_acknowledgement` exceeded 100.
The channel was created at:
   0: sc_utils::mpsc::tracing_unbounded
   1: sc_consensus_subspace::archiver::send_archived_segment_notification::{{closure}}.124193
   2: subspace_service::new_full::{{closure}}::{{closure}}.124192
   3: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
   4: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
   5: tokio::runtime::task::raw::poll
   6: std::sys_common::backtrace::__rust_begin_short_backtrace
   7: core::ops::function::FnOnce::call_once{{vtable.shim}}
   8: std::sys::pal::unix::thread::Thread::new::thread_start
   9: <unknown>
  10: <unknown>

Last message was sent from:
   0: <sc_consensus_subspace_rpc::SubspaceRpc<Block,Client,SO,AS> as sc_consensus_subspace_rpc::SubspaceRpcApiServer>::acknowledge_archived_segment_header::{{closure}}
   1: jsonrpsee_core::server::rpc_module::RpcModule<Context>::register_async_method::{{closure}}::{{closure}}
   2: tokio::runtime::task::core::Core<T,S>::poll
   3: tokio::runtime::task::raw::poll
   4: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   5: tokio::runtime::scheduler::multi_thread::worker::run
   6: tokio::runtime::task::raw::poll
   7: std::sys_common::backtrace::__rust_begin_short_backtrace
   8: core::ops::function::FnOnce::call_once{{vtable.shim}}
   9: std::sys::pal::unix::thread::Thread::new::thread_start
  10: <unknown>
  11: <unknown>
```
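For context, this error comes from Substrate's `sc_utils::mpsc::tracing_unbounded` channel (visible at frame 0 of the first backtrace), which wraps an unbounded mpsc channel and logs the backtraces above once the number of queued, unprocessed messages crosses a threshold (100 here). A minimal sketch of that pattern, using tokio's unbounded channel; the `TracingSender`/`TracingReceiver` names and internals are illustrative, not the actual `sc_utils` implementation:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use tokio::sync::mpsc;

/// Warning threshold mirroring the `exceeded 100` message in the log above.
const WARNING_THRESHOLD: usize = 100;

/// Sender half: counts queued messages and warns past the threshold
/// (hypothetical sketch, not the real `tracing_unbounded`).
#[derive(Clone)]
struct TracingSender<T> {
    inner: mpsc::UnboundedSender<T>,
    queued: Arc<AtomicUsize>,
    name: &'static str,
}

impl<T> TracingSender<T> {
    fn send(&self, message: T) -> Result<(), mpsc::error::SendError<T>> {
        let queued = self.queued.fetch_add(1, Ordering::Relaxed) + 1;
        if queued > WARNING_THRESHOLD {
            // The real implementation also prints backtraces of where the
            // channel was created and where the last message was sent from.
            eprintln!(
                "ERROR: {queued} unprocessed messages in channel `{}` exceeded {WARNING_THRESHOLD}",
                self.name
            );
        }
        self.inner.send(message)
    }
}

/// Receiver half: decrements the counter as messages are consumed, so the
/// warning only fires when the consumer genuinely falls behind.
struct TracingReceiver<T> {
    inner: mpsc::UnboundedReceiver<T>,
    queued: Arc<AtomicUsize>,
}

impl<T> TracingReceiver<T> {
    async fn recv(&mut self) -> Option<T> {
        let message = self.inner.recv().await;
        if message.is_some() {
            self.queued.fetch_sub(1, Ordering::Relaxed);
        }
        message
    }
}

fn tracing_unbounded<T>(name: &'static str) -> (TracingSender<T>, TracingReceiver<T>) {
    let (tx, rx) = mpsc::unbounded_channel();
    let queued = Arc::new(AtomicUsize::new(0));
    (
        TracingSender { inner: tx, queued: Arc::clone(&queued), name },
        TracingReceiver { inner: rx, queued },
    )
}
```

The warning itself is diagnostic rather than fatal: it means the consumer of `subspace_acknowledgement` messages has fallen at least 100 messages behind.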

How many farms do you have connected to the node? Are there any other details you can provide about the setup?

There are 80 farmers connected to this node. Is it because there are too many farmers connected? How many farmers should a single node handle?

In theory there are no hard limits, and the above message should be non-blocking; I will take a closer look at what the root cause might be. You may also want to consider the newly added farming cluster as a more scalable option for this kind of use case, especially with the upcoming release, which contains a bunch of improvements.

BTW you mentioned 80 farmers, but how many farms is it?

Sorry, I meant there are 80 devices connected to this node, each with 8 farm directories, making a total of 640 farms.

I see, that is quite a large setup. Can you provide the full node logs since the node started, rather than just the single error, please? There might be additional information in there as well.

The node logs have been cleared, so I can’t provide them at the moment. This error has been occurring daily recently; when it happens again, I will provide the complete logs.

I think this might happen if one of the farmers crashed and disconnected after a new segment of archival history was created. The upcoming release will increase performance and may help with this, while the farming cluster is even less likely to end up in this situation due to its architecture.

Add segment acknowledgement timeout by nazar-pc · Pull Request #2835 · subspace/subspace · GitHub will at least prevent the node from getting stuck completely, though the recommendation is still to switch to the farming cluster and add some redundancy with more nodes.
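Conceptually, that PR bounds how long the node waits for each acknowledgement, so a farmer that crashed or disconnected mid-segment can no longer wedge the archiver indefinitely. A rough sketch of the idea using `tokio::time::timeout`; the function name, channel shape, and 60-second duration are assumptions for illustration, not the PR's actual code:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

/// How long to wait for a single acknowledgement before giving up
/// (the value used by the PR may differ).
const ACKNOWLEDGEMENT_TIMEOUT: Duration = Duration::from_secs(60);

/// Wait for acknowledgements of an archived segment, but never block
/// forever: a subscriber that crashed after the segment was created
/// simply stops counting instead of stalling the node.
async fn wait_for_acknowledgements(mut acks: mpsc::Receiver<()>, expected: usize) {
    for received in 0..expected {
        match timeout(ACKNOWLEDGEMENT_TIMEOUT, acks.recv()).await {
            Ok(Some(())) => {} // acknowledgement arrived in time
            Ok(None) => {
                // All senders dropped; no more acknowledgements will come.
                eprintln!("acknowledgement channel closed after {received} acks");
                break;
            }
            Err(_) => {
                // Timed out: likely a farmer disconnected without acking.
                eprintln!("timed out waiting for acknowledgement {}", received + 1);
                break;
            }
        }
    }
}
```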

My two nodes have encountered this error again. The complete log files are as follows:

node-01
node-02

Right, a new segment was just archived, so it happens every time. BTW, I see you have changed the number of peers. Not sure why; it should not be necessary and can even be harmful in certain cases.

Excuse me, what does “a new segment was just archived” mean, and how can I avoid this error? Nodes frequently falling out of sync is a serious problem for me.

Every time the node archives a new segment of history, it notifies all connected farmers and waits for acknowledgements, which is where it can get stuck. I already gave you a few options for avoiding this (more nodes, farming cluster), and the next release will also improve the situation for such cases. The farmer was supposed to run with a node on the same machine; 80 farmers connected to the same node was never the intended setup, even though I know many do it.

Okay, I plan to switch everything over to the farming cluster soon.