Farmer (jun-11) sometimes stops farming after replotting starts

The cache component crashed immediately after starting the controller and had to be restarted:

Cache

thread 'tokio-runtime-worker' panicked at /home/ubuntu/.rustup/toolchains/nightly-2024-04-22-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/time.rs:417:33:
overflow when adding duration to instant
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: Cache service failed: task 267 panicked

farmer1
Update: Performing a scrub on farm index 8 to see if there is an issue that just started with that plot.

2024-06-16T04:06:44.435099Z ERROR subspace_farmer::commands::cluster::farmer: Farm errored and stopped farm_index=8 error=Low-level auditing error: Failed read s-bucket 1736 of sector 305: Input/output error (os error 5)
2024-06-16T04:07:14.436495Z ERROR subspace_farmer::commands::cluster::farmer: Farm errored and stopped farm_index=8 error=Low-level auditing error: Failed read s-bucket 1736 of sector 305: Input/output error (os error 5)
2024-06-16T04:07:44.438381Z ERROR subspace_farmer::commands::cluster::farmer: Farm errored and stopped farm_index=8 error=Low-level auditing error: Failed read s-bucket 1736 of sector 305: Input/output error (os error 5)

farmer6

  • This farmer is only farming about half of its plots, judging by network transfer (the drives are NVMe-oF connected).
  • It also logged the same reward-signing message about 65 times:
2024-06-16T03:42:40.508325Z  INFO {farm_index=8}: subspace_farmer::reward_signing: Successfully signed reward hash 0x1947bad91fe726fa34c181402b04ea4be35226aa0dc60d0685e7efe4268a8f68
2024-06-16T03:42:40.508369Z  INFO {farm_index=8}: subspace_farmer::reward_signing: Successfully signed reward hash 0x1947bad91fe726fa34c181402b04ea4be35226aa0dc60d0685e7efe4268a8f68
2024-06-16T03:42:40.508413Z  INFO {farm_index=8}: subspace_farmer::reward_signing: Successfully signed reward hash 0x1947bad91fe726fa34c181402b04ea4be35226aa0dc60d0685e7efe4268a8f68

Snapshot build #351

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache.log controller.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 109T
Link: 100Gbit

Role: Node (node.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer6.log)
Dual Socket 6136 24 Cores (48 Threads)
Plots: 91TB
Link: 20Gbit

Logs

logs.tar

Did it start fine the next time?

Hm… this was supposed to be fixed in the latest release of tokio that we’ve upgraded to; I’ll change it back to a safer option for now then.
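
For reference, the panic above comes from `Instant + Duration` overflowing, which panics rather than returning an error. A minimal sketch of the failure mode and of `checked_add` as the safer alternative (illustrative only, not the actual tokio or farmer code):

```rust
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();

    // `Instant + Duration` panics with "overflow when adding duration to instant"
    // when the result cannot be represented:
    // let _deadline = now + Duration::MAX; // would panic

    // The safer option is `checked_add`, which returns None instead of panicking,
    // so the caller can clamp to a far-future deadline or skip the timeout entirely.
    match now.checked_add(Duration::MAX) {
        Some(deadline) => println!("deadline at {:?}", deadline),
        None => println!("duration too large, falling back to no deadline"),
    }
}
```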

This looks like a permissions issue to me; I don’t think it has anything to do with that particular test build. This is probably why you see lower NVMe-oF bandwidth.

I believe this should fix the issue with overflow and multiple signing notifications: Snapshot build · subspace/subspace@cc85e31 · GitHub
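
For illustration only (not necessarily how that snapshot implements it): dropping repeated reward-signing notifications can be as simple as remembering which reward hashes were already handled. A hypothetical sketch:

```rust
use std::collections::HashSet;

/// Hypothetical filter that drops reward-signing notifications for hashes
/// that were already processed, so each reward is only signed/logged once.
struct RewardDedup {
    seen: HashSet<[u8; 32]>,
}

impl RewardDedup {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true the first time a given reward hash is observed.
    fn is_new(&mut self, reward_hash: [u8; 32]) -> bool {
        self.seen.insert(reward_hash)
    }
}

fn main() {
    let mut dedup = RewardDedup::new();
    let hash = [0x19u8; 32];

    for _ in 0..3 {
        if dedup.is_new(hash) {
            println!("signing reward hash {:02x?}…", &hash[..4]);
        } // duplicate notifications are silently dropped
    }
}
```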

I can replicate it panicking every time if I start the cache before the controller component. If I start the controller first, it does not have any issues.

This particular drive is throwing errors. Weird timing, as it did not have any issues before, but it is corrected by removing it from the farmer. Sorry about that.

That is very odd, but there is another fix in tokio that was merged but not yet released. I’ll have another look at it once they make a new release.

As expected, with the latest release no drives failed when replotting started…

Nice! I’ll include it in the next release then. I also received a good explanation of what exactly happens upstream, so we’ll be able to improve it further in future releases.

Snapshot build #352

  • Reward signing logs are fixed in this release.
  • Timed out without ping from plotter is the most common error (see the sketch after the Farmer log excerpt below).
  • According to the metrics data, it seems that not a single sector was successfully plotted in a span of about five hours. In that time, roughly 400-450 sectors should have been plotted.

Update: I see that there were only three sectors that finished plotting successfully in the provided logs. In the 30 minutes since reverting to the June 11 build, I have 32 entries for Finished plotting sector successfully.

Farmer

2024-06-17T03:23:23.445355Z  WARN {farm_index=22}:{sector_index=1433}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141409Z  WARN {farm_index=40}:{sector_index=294}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141569Z  WARN {farm_index=13}:{sector_index=271}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141673Z  WARN {farm_index=50}:{sector_index=61}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:42.700254Z  WARN {farm_index=39}:{sector_index=511}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:27:52.843554Z  WARN {farm_index=55}:{sector_index=361}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
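
For context on those warnings: the error name suggests the farmer expects periodic keep-alive pings from the remote plotter while a sector is being plotted, and gives up (then retries) when none arrive in time. A minimal sketch of that pattern, with hypothetical names and timeouts rather than the actual subspace_farmer code:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

/// Hypothetical progress events streamed from a remote plotter.
enum PlotterEvent {
    Ping,
    SectorFinished(Vec<u8>),
}

/// Waits for the plotted sector, treating a long silence as a failure so the
/// caller can retry, similar to the "retrying in 1s" behaviour in the logs.
async fn wait_for_sector(
    mut events: mpsc::Receiver<PlotterEvent>,
    ping_timeout: Duration,
) -> Result<Vec<u8>, &'static str> {
    loop {
        match timeout(ping_timeout, events.recv()).await {
            Ok(Some(PlotterEvent::Ping)) => continue, // plotter is alive, keep waiting
            Ok(Some(PlotterEvent::SectorFinished(sector))) => return Ok(sector),
            Ok(None) => return Err("plotter dropped the connection"),
            Err(_) => return Err("timed out without ping from plotter"),
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(8);
    // Simulate a plotter that pings once and then finishes a sector.
    tokio::spawn(async move {
        tx.send(PlotterEvent::Ping).await.ok();
        tx.send(PlotterEvent::SectorFinished(vec![0u8; 4])).await.ok();
    });

    match wait_for_sector(rx, Duration::from_secs(30)).await {
        Ok(sector) => println!("plotted {} bytes", sector.len()),
        Err(e) => eprintln!("plotting failed: {e}"),
    }
}
```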

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache.log controller.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 109T
Link: 100Gbit

Role: Node (node.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer6.log)
Dual Socket 6136 24 Cores (48 Threads)
Plots: 91TB
Link: 20Gbit

Logs

logs.tar

That does seem to be a problem, I’ll check, thanks!

There is a regression in these test builds that writes broken sectors (fixed in Fix sector write offset by nazar-pc · Pull Request #2859 · subspace/subspace · GitHub), which scrub can find and fix.
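
Purely as an illustration of the bug class (this is not the code from PR #2859): if the offset a sector is written at is computed from the wrong base or size, the write lands on top of a neighbouring sector, which is exactly the kind of corruption a scrub pass can detect by re-reading and verifying each sector. A hypothetical sketch of a correct offset calculation:

```rust
use std::io::{Seek, SeekFrom, Write};

/// Hypothetical fixed-size sector layout inside a plot file.
const SECTOR_SIZE: u64 = 1 << 20; // 1 MiB placeholder, not the real sector size

/// Writes `sector` at the offset derived from its index. If this offset were
/// computed incorrectly, the write would overwrite part of a neighbouring
/// sector, producing the "broken sectors" that scrub later flags.
fn write_sector<F: Write + Seek>(
    plot: &mut F,
    sector_index: u64,
    sector: &[u8],
) -> std::io::Result<()> {
    assert_eq!(sector.len() as u64, SECTOR_SIZE);
    let offset = sector_index
        .checked_mul(SECTOR_SIZE)
        .expect("offset overflow");
    plot.seek(SeekFrom::Start(offset))?;
    plot.write_all(sector)
}

fn main() -> std::io::Result<()> {
    // In-memory stand-in for a plot file with room for three sectors.
    let mut plot = std::io::Cursor::new(vec![0u8; (3 * SECTOR_SIZE) as usize]);
    write_sector(&mut plot, 2, &vec![0xABu8; SECTOR_SIZE as usize])?;
    Ok(())
}
```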

I’m also checking why plotting times out; it happens for me locally as well.

I was able to reproduce some of the issues, but am no longer able to with these changes: Snapshot build · subspace/subspace@1a06fd1 · GitHub

Please let me know as soon as you see anything that doesn’t seem to behave properly.

It also fixes plotting of broken sectors, @reb0rn.

1 Like

Snapshot build #353
Run time: 15 hours

Observations

  • The sector write offset (replotting) issue is resolved.
  • The Cache component panic when it is started before the Controller is resolved.
  • The Timed out without ping from plotter issue is still present, but it is not happening nearly as often.
  • There are still some lingering, but not constant, warnings, listed under WARN: Plotter below.

WARN: Plotter

2024-06-18T04:24:26.955784Z  WARN subspace_farmer::plotter::cpu: Failed to send error progress update error=send failed because receiver is gone
2024-06-18T04:26:13.712415Z  WARN subspace_farmer::plotter::cpu: Failed to send error progress update error=send failed because receiver is gone
2024-06-18T04:13:45.894942Z  WARN {public_key=22c4285f85f314aa82440faaa1d635297deb16de2d67af0a32c3da0a41c97640 sector_index=1491}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-18T04:13:46.578953Z  WARN {public_key=2a6a499fcad2e99ccf2dd53762a2a7c55f46e44898c6a1dca10126a7a29eea29 sector_index=578}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J0MR5K0EKFVNFA7AC8E98X8G expected_index=62 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-06-18T04:13:46.579005Z  WARN {public_key=2a6a499fcad2e99ccf2dd53762a2a7c55f46e44898c6a1dca10126a7a29eea29 sector_index=578}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-18T04:13:46.579014Z  WARN {public_key=f61f1ea8b31425c7ada584a116e48f993cdf0ab1801eba2b514a4c85ff659351 sector_index=1492}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J0MRDH5C0M3453WJ28A2PQF4 expected_index=37 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

WARN: Farmer

2024-06-17T03:23:23.445355Z  WARN {farm_index=22}:{sector_index=1433}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141409Z  WARN {farm_index=40}:{sector_index=294}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141569Z  WARN {farm_index=13}:{sector_index=271}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141673Z  WARN {farm_index=50}:{sector_index=61}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:42.700254Z  WARN {farm_index=39}:{sector_index=511}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:27:52.843554Z  WARN {farm_index=55}:{sector_index=361}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache.log controller.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 109T
Link: 100Gbit

Role: Node (node.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer6.log)
Dual Socket 6136 24 Cores (48 Threads)
Plots: 91TB
Link: 20Gbit

Logs

Cluster

Great! We’ll chip away at those timeouts separately then. I’m testing the upcoming release code and will prepare the actual release soon.

Fix shipped in Release gemini-3h-2024-jun-18 · subspace/subspace · GitHub

Please try Snapshot build · subspace/subspace@cec9c1e · GitHub (it contains Node client optimization by nazar-pc · Pull Request #2865 · subspace/subspace · GitHub and some other improvements on top of jun-18); it should drastically reduce the number of messages between the controller and the node, as well as between the farmer and the controller, which may help with the logs above. If not, please create another thread with updated logs.

The reason I think it’ll help is this:

[1] 2024/06/17 16:02:16.198649 [INF] 10.20.120.11:54150 - cid:7 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 132 chunks of 8390396 total bytes.

Some messages were dropped, which resulted in some plotted sectors being aborted and retried later. By sending fewer messages (and given how many individual farms you have per instance, the reduction should be significant), we have a better chance of avoiding that.
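
As a rough illustration of why fewer, larger messages help with the slow-consumer warning above (a minimal sketch with hypothetical names, not the actual change from PR #2865): instead of publishing one message per event, events can be accumulated and flushed as a single batch on a short interval.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::interval;

/// Placeholder for whatever transport is used (e.g. a NATS publish);
/// the point is only that it is called once per batch instead of per event.
async fn publish_batch(batch: &[String]) {
    println!("publishing 1 message carrying {} events", batch.len());
}

/// Collects events and flushes them at most once per `flush_every`,
/// reducing the number of messages on the wire.
async fn batching_publisher(mut events: mpsc::Receiver<String>, flush_every: Duration) {
    let mut ticker = interval(flush_every);
    let mut batch = Vec::new();
    loop {
        tokio::select! {
            maybe_event = events.recv() => match maybe_event {
                Some(event) => batch.push(event),
                None => break, // senders are gone, flush what is left
            },
            _ = ticker.tick() => {
                if !batch.is_empty() {
                    publish_batch(&batch).await;
                    batch.clear();
                }
            }
        }
    }
    if !batch.is_empty() {
        publish_batch(&batch).await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(1024);
    tokio::spawn(async move {
        for i in 0..100 {
            tx.send(format!("event {i}")).await.ok();
        }
    });
    batching_publisher(rx, Duration::from_millis(100)).await;
}
```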
