[Cluster] Plotters do not resume after Farmer update / restart

Issue Report

As requested, creating a report for the issue of plotters not restarting if farmer is updated.

Environment

Linux (Unraid), clustered on docker (taurus-2024-oct-24)
Main host: Node / Farmer / Cache / Controller / NATS
2x Plotters: dual Plotters

Problem

Plotters do not resume after Farmer update / restart.
In sample screen shots, plotters were given 10min to start plotting, sectors were requested but never fulfilled.
Once plotter was restarted, sectors requested again and plotting begins immediately.

plotter #1 not restarted… no response

2024-10-26T11:45:48.100597Z  INFO {public_key=0c6a9718ae802380f45097b6f9b0bd85d32ed7405de7cd685bd74477b486c857 sector_index=466}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2024-10-26T11:45:48.219310Z  INFO {public_key=0c6a9718ae802380f45097b6f9b0bd85d32ed7405de7cd685bd74477b486c857 sector_index=470}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-26T11:46:13.798454Z  INFO {public_key=0c6a9718ae802380f45097b6f9b0bd85d32ed7405de7cd685bd74477b486c857 sector_index=469}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2024-10-26T11:46:28.129135Z  INFO {public_key=0c6a9718ae802380f45097b6f9b0bd85d32ed7405de7cd685bd74477b486c857 sector_index=470}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2024-10-26T11:57:04.226656Z  INFO {public_key=d26957076b4db7b7c1ad8d9454c6906180483a835107c7e899736bb72262d318 sector_index=473}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-26T11:57:04.226722Z  INFO {public_key=0c6a9718ae802380f45097b6f9b0bd85d32ed7405de7cd685bd74477b486c857 sector_index=473}: subspace_farmer::cluster::plotter: Plot sector request

Plotter #2 restarted (you can see no reply here also)

2024-10-26T11:46:38.190383Z  INFO {public_key=0c6a9718ae802380f45097b6f9b0bd85d32ed7405de7cd685bd74477b486c857 sector_index=472}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2024-10-26T11:57:04.226801Z  INFO {public_key=20775bd896ed375be39dc8831c539c4d705d22146381e7979d6ddc1963b99b34 sector_index=473}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-26T11:57:04.226913Z  INFO {public_key=20775bd896ed375be39dc8831c539c4d705d22146381e7979d6ddc1963b99b34 sector_index=474}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-26T11:57:04.227779Z  INFO {public_key=0c6a9718ae802380f45097b6f9b0bd85d32ed7405de7cd685bd74477b486c857 sector_index=474}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-26T12:06:11.193065Z  INFO subspace_farmer::utils: Received SIGTERM, shutting down farmer...
2024-10-26T12:06:11.943182Z  INFO async_nats: event: connected
2024-10-26T12:06:12.424105Z  INFO subspace_farmer::commands::cluster::plotter: Using CUDA GPUs used_cuda_devices=[0]
2024-10-26T12:06:12.424405Z  INFO subspace_farmer::commands::cluster::plotter: CPU plotting was explicitly disabled
2024-10-26T12:06:19.540031Z  INFO {public_key=20775bd896ed375be39dc8831c539c4d705d22146381e7979d6ddc1963b99b34 sector_index=475}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-26T12:06:19.541598Z  INFO {public_key=20775bd896ed375be39dc8831c539c4d705d22146381e7979d6ddc1963b99b34 sector_index=476}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-26T12:06:47.811904Z  INFO {public_key=20775bd896ed375be39dc8831c539c4d705d22146381e7979d6ddc1963b99b34 sector_index=475}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2024-10-26T12:06:49.087901Z  INFO {public_key=20775bd896ed375be39dc8831c539c4d705d22146381e7979d6ddc1963b99b34 sector_index=477}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-26T12:07:03.300915Z  INFO {public_key=20775bd896ed375be39dc8831c539c4d705d22146381e7979d6ddc1963b99b34 sector_index=476}: subspace_farmer::cluster::plotter: Finished plotting sector successfully

farmer/controller logs since updated config

2024-10-26T11:56:48.792720Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://subspace-node:9944
2024-10-26T11:56:48.792742Z  INFO async_nats: event: connected
2024-10-26T11:56:48.808004Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloading all segment headers from node...
2024-10-26T11:56:48.808739Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloaded all segment headers from node successfully
2024-10-26T11:56:48.810566Z  INFO subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWFu78siF2M7fyAJ71z5EbgdDmosSSo7fPLZegheTXpF16 protocol_version=/subspace/2/295aeafca762a304d92ee1505548695091f6082d3f0aa4d092ac3cd6397a6c5e
2024-10-26T11:56:48.811254Z  INFO libp2p_swarm: local_peer_id=12D3KooWFu78siF2M7fyAJ71z5EbgdDmosSSo7fPLZegheTXpF16
2024-10-26T11:56:49.399351Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloading all segment headers from node...
2024-10-26T11:56:49.400678Z  INFO subspace_farmer::commands::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JB3SXM9C633PWJDMGVZ8BQQJ
2024-10-26T11:56:49.400905Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloaded all segment headers from node successfully
2024-10-26T11:56:49.401331Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip4/127.0.0.1/tcp/30533/p2p/12D3KooWFu78siF2M7fyAJ71z5EbgdDmosSSo7fPLZegheTXpF16
2024-10-26T11:56:49.402775Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip4/10.32.54.110/tcp/30533/p2p/12D3KooWFu78siF2M7fyAJ71z5EbgdDmosSSo7fPLZegheTXpF16
2024-10-26T11:56:50.021029Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm: Farm size has changed old_space=465.7 GiB new_space=698.5 GiB
2024-10-26T11:56:50.021927Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm: Farm size has changed old_space=465.7 GiB new_space=698.5 GiB
2024-10-26T11:56:50.022415Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm: Farm size has changed old_space=465.7 GiB new_space=698.5 GiB
2024-10-26T11:56:50.696546Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::plot_cache: Checking plot cache contents, this can take a while
2024-10-26T11:56:50.700391Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::plot_cache: Finished checking plot cache contents
2024-10-26T11:56:50.728271Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::plot_cache: Checking plot cache contents, this can take a while
2024-10-26T11:56:50.731612Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::plot_cache: Finished checking plot cache contents
2024-10-26T11:56:50.813092Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::plot_cache: Checking plot cache contents, this can take a while
2024-10-26T11:56:50.816340Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::plot_cache: Finished checking plot cache contents
2024-10-26T11:56:50.879974Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm: Benchmarking faster proving method
2024-10-26T11:56:55.510409Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm: Faster proving method found fastest_mode=ConcurrentChunks
2024-10-26T11:56:55.626297Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm: Benchmarking faster proving method
2024-10-26T11:56:59.838875Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm: Faster proving method found fastest_mode=ConcurrentChunks
2024-10-26T11:56:59.967373Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm: Benchmarking faster proving method
2024-10-26T11:57:04.078003Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm: Faster proving method found fastest_mode=ConcurrentChunks
2024-10-26T11:57:04.218742Z  INFO {farm_index=1}: subspace_farmer::commands::cluster::farmer: Farm 1:
2024-10-26T11:57:04.218770Z  INFO {farm_index=1}: subspace_farmer::commands::cluster::farmer:   ID: 01JB306H622EBMJC9T552W2N8F
2024-10-26T11:57:04.218800Z  INFO {farm_index=1}: subspace_farmer::commands::cluster::farmer:   Genesis hash: 0x295aeafca762a304d92ee1505548695091f6082d3f0aa4d092ac3cd6397a6c5e
2024-10-26T11:57:04.218804Z  INFO {farm_index=1}: subspace_farmer::commands::cluster::farmer:   Public key: 0x20775bd896ed375be39dc8831c539c4d705d22146381e7979d6ddc1963b99b34
2024-10-26T11:57:04.218821Z  INFO {farm_index=1}: subspace_farmer::commands::cluster::farmer:   Allocated space: 698.5 GiB (750.0 GB)
2024-10-26T11:57:04.218828Z  INFO {farm_index=1}: subspace_farmer::commands::cluster::farmer:   Directory: /farmer/plot-ds10b
2024-10-26T11:57:04.219221Z  INFO {farm_index=2}: subspace_farmer::commands::cluster::farmer: Farm 2:
2024-10-26T11:57:04.219230Z  INFO {farm_index=2}: subspace_farmer::commands::cluster::farmer:   ID: 01JB306H6241RBX7YYCGVWYCC8
2024-10-26T11:57:04.219234Z  INFO {farm_index=2}: subspace_farmer::commands::cluster::farmer:   Genesis hash: 0x295aeafca762a304d92ee1505548695091f6082d3f0aa4d092ac3cd6397a6c5e
2024-10-26T11:57:04.219237Z  INFO {farm_index=2}: subspace_farmer::commands::cluster::farmer:   Public key: 0xd26957076b4db7b7c1ad8d9454c6906180483a835107c7e899736bb72262d318
2024-10-26T11:57:04.219243Z  INFO {farm_index=2}: subspace_farmer::commands::cluster::farmer:   Allocated space: 698.5 GiB (750.0 GB)
2024-10-26T11:57:04.219246Z  INFO {farm_index=2}: subspace_farmer::commands::cluster::farmer:   Directory: /farmer/plot-ds10c
2024-10-26T11:57:04.220058Z  INFO {farm_index=0}: subspace_farmer::commands::cluster::farmer: Farm 0:
2024-10-26T11:57:04.220071Z  INFO {farm_index=0}: subspace_farmer::commands::cluster::farmer:   ID: 01JB306H62VDCB6QTGR7H7SQ1M
2024-10-26T11:57:04.220075Z  INFO {farm_index=0}: subspace_farmer::commands::cluster::farmer:   Genesis hash: 0x295aeafca762a304d92ee1505548695091f6082d3f0aa4d092ac3cd6397a6c5e
2024-10-26T11:57:04.220078Z  INFO {farm_index=0}: subspace_farmer::commands::cluster::farmer:   Public key: 0x0c6a9718ae802380f45097b6f9b0bd85d32ed7405de7cd685bd74477b486c857
2024-10-26T11:57:04.220084Z  INFO {farm_index=0}: subspace_farmer::commands::cluster::farmer:   Allocated space: 698.5 GiB (750.0 GB)
2024-10-26T11:57:04.220088Z  INFO {farm_index=0}: subspace_farmer::commands::cluster::farmer:   Directory: /farmer/plot-ds10a
2024-10-26T11:57:04.220530Z  INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9081]
2024-10-26T11:57:04.220546Z  INFO actix_server::builder: starting 2 workers
2024-10-26T11:57:04.221246Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2024-10-26T11:57:04.221294Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2024-10-26T11:57:04.221399Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2024-10-26T11:57:04.221447Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2024-10-26T11:57:04.221458Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2024-10-26T11:57:04.221464Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2024-10-26T11:57:04.221492Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2024-10-26T11:57:04.221505Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2024-10-26T11:57:04.221514Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2024-10-26T11:57:04.221874Z  INFO subspace_farmer::commands::cluster::controller::farms: Discovered new farm, initializing farm_index=0 farm_id=01JB306H62VDCB6QTGR7H7SQ1M
2024-10-26T11:57:04.221999Z  INFO subspace_farmer::commands::cluster::controller::farms: Discovered new farm, initializing farm_index=1 farm_id=01JB306H622EBMJC9T552W2N8F
2024-10-26T11:57:04.222009Z  INFO subspace_farmer::commands::cluster::controller::farms: Discovered new farm, initializing farm_index=2 farm_id=01JB306H6241RBX7YYCGVWYCC8
2024-10-26T11:57:04.222988Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2024-10-26T11:57:04.223027Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9081", workers: 2, listening on: 0.0.0.0:9081
2024-10-26T11:57:04.223357Z  INFO {farm_index=1}:{sector_index=473}: subspace_farmer::single_disk_farm::plotting: Plotting sector (66.71% complete)
2024-10-26T11:57:04.223414Z  INFO {farm_index=2}:{sector_index=473}: subspace_farmer::single_disk_farm::plotting: Plotting sector (66.71% complete)
2024-10-26T11:57:04.223418Z  INFO {farm_index=0}:{sector_index=473}: subspace_farmer::single_disk_farm::plotting: Plotting sector (66.71% complete)
2024-10-26T11:57:04.224275Z  INFO {farm_index=1}:{sector_index=474}: subspace_farmer::single_disk_farm::plotting: Plotting sector (66.85% complete)
2024-10-26T11:57:04.224541Z  INFO {farm_index=2}:{sector_index=474}: subspace_farmer::single_disk_farm::plotting: Plotting sector (66.85% complete)
2024-10-26T11:57:04.224970Z  INFO {farm_index=0}:{sector_index=474}: subspace_farmer::single_disk_farm::plotting: Plotting sector (66.85% complete)
2024-10-26T11:57:04.658596Z  INFO {farm_index=2}:{sector_index=475}: subspace_farmer::single_disk_farm::plotting: Plotting sector (67.00% complete)
2024-10-26T11:57:04.788087Z  INFO subspace_farmer::commands::cluster::controller::farms: Farm initialized successfully farm_index=0 farm_id=01JB306H62VDCB6QTGR7H7SQ1M
2024-10-26T11:57:05.345742Z  INFO subspace_farmer::commands::cluster::controller::farms: Farm initialized successfully farm_index=1 farm_id=01JB306H622EBMJC9T552W2N8F
2024-10-26T11:57:05.895736Z  INFO subspace_farmer::commands::cluster::controller::farms: Farm initialized successfully farm_index=2 farm_id=01JB306H6241RBX7YYCGVWYCC8
2024-10-26T11:57:19.402758Z  INFO subspace_farmer::farmer_cache: Initializing piece cache
2024-10-26T11:57:22.566626Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::reward_signing: Successfully signed reward hash 0x495f5e72b1245e4abcf0c05a6d6b092d6569658b59b24cb90aa42033060a05b8
2024-10-26T11:57:50.785663Z  INFO subspace_farmer::farmer_cache: Synchronizing piece cache
2024-10-26T11:57:50.789872Z  INFO subspace_farmer::farmer_cache: Finished piece cache synchronization
2024-10-26T12:04:55.148663Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::reward_signing: Successfully signed reward hash 0xc5818b6060a32b9e70aa977675851cd7886ebcfd152c0dd0697b39dd898982ac
2024-10-26T12:06:19.537867Z  INFO {farm_index=1}:{sector_index=475}: subspace_farmer::single_disk_farm::plotting: Plotting sector (67.00% complete)
2024-10-26T12:06:19.539623Z  INFO {farm_index=1}:{sector_index=476}: subspace_farmer::single_disk_farm::plotting: Plotting sector (67.14% complete)
2024-10-26T12:06:49.085285Z  INFO {farm_index=1}:{sector_index=477}: subspace_farmer::single_disk_farm::plotting: Plotting sector (67.28% complete)
2024-10-26T12:07:04.227474Z  WARN {farm_index=1}:{sector_index=474}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-10-26T12:07:04.227520Z  WARN {farm_index=1}:{sector_index=473}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-10-26T12:07:04.371745Z  INFO {farm_index=1}:{sector_index=478}: subspace_farmer::single_disk_farm::plotting: Plotting sector (67.42% complete)
2024-10-26T12:07:14.367265Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JB4AZT3SGJHGPWPR1G812M6D expected_index=377 request_type=subspace_farmer::cluster::controller::ClusterControllerPiecesRequest response_type=(subspace_core_primitives::pieces::PieceIndex, subspace_core_primitives::pieces::Piece)
2024-10-26T12:07:14.625731Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JB4AZT3QKFS888AX208D93ZX expected_index=405 request_type=subspace_farmer::cluster::controller::ClusterControllerPiecesRequest response_type=(subspace_core_primitives::pieces::PieceIndex, subspace_core_primitives::pieces::Piece)
2024-10-26T12:07:19.364278Z  INFO {farm_index=1}:{sector_index=479}: subspace_farmer::single_disk_farm::plotting: Plotting sector (67.56% complete)
2024-10-26T12:07:35.192621Z  INFO {farm_index=1}:{sector_index=480}: subspace_farmer::single_disk_farm::plotting: Plotting sector (67.70% complete)
2024-10-26T12:07:51.443375Z  INFO {farm_index=1}:{sector_index=481}: subspace_farmer::single_disk_farm::plotting: Plotting sector (67.84% complete)

One possible reason I can think of is I have Controller on same container as farmer.
So when farmer goes down, so does controller.

I found a post from Nazar which mentions ideal start order being:
controller > plotter > cache > farmer
I’ll try splitting controller out again and see if same occurs

Ok that fixed the issue.
Problem was caused by controller restarting at same time as farmer (due to shared container).
Separating the 2 allowed the plotters to begin immediately.

Previous architecture:

  • node
  • nats
  • cache
  • farmer+controller
  • plotter(s)

New architecture:

  • node
  • nats
  • controller+cache
  • farmer
  • plotter(s)

I would note that I did find plotters eventually came online under previous architecture, but took as long as 15min.
Now under the new architecture, they come online immediately.

1 Like