The cache component crashed immediately after starting the controller and had to be restarted:
Cache
thread 'tokio-runtime-worker' panicked at /home/ubuntu/.rustup/toolchains/nightly-2024-04-22-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/time.rs:417:33:
overflow when adding duration to instant
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: Cache service failed: task 267 panicked
farmer1
Update: Performing a scrub on farm index 8 to see if there is an issue that just started with that plot.
This farmer is only farming about half of the plots, judging by network transfer (NVMeoF-connected drives).
It also spammed the same reward signing message about 65 times.
2024-06-16T03:42:40.508325Z INFO {farm_index=8}: subspace_farmer::reward_signing: Successfully signed reward hash 0x1947bad91fe726fa34c181402b04ea4be35226aa0dc60d0685e7efe4268a8f68
2024-06-16T03:42:40.508369Z INFO {farm_index=8}: subspace_farmer::reward_signing: Successfully signed reward hash 0x1947bad91fe726fa34c181402b04ea4be35226aa0dc60d0685e7efe4268a8f68
2024-06-16T03:42:40.508413Z INFO {farm_index=8}: subspace_farmer::reward_signing: Successfully signed reward hash 0x1947bad91fe726fa34c181402b04ea4be35226aa0dc60d0685e7efe4268a8f68
Hm… this was supposed to be fixed in the latest release of tokio that we've upgraded to. I'll change it back to a safer option for now, then.
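For context, here is a minimal sketch (not the actual farmer code) of what I understand the failure mode to be, assuming the panic comes from converting an effectively unbounded timeout into an `Instant` deadline; the `safe_deadline` helper and its 30-year cap are purely illustrative:

```rust
use std::time::{Duration, Instant};

// Plausible source of the panic above: `Instant + Duration` overflows when an
// effectively "infinite" timeout is turned into a deadline, e.g.
//
//     let deadline = Instant::now() + Duration::MAX;
//     // panics: "overflow when adding duration to instant"
//
// A safer option is to use `checked_add` and fall back to a capped,
// far-future deadline (the 30-year cap here is arbitrary, for illustration).
fn safe_deadline(timeout: Duration) -> Instant {
    let now = Instant::now();
    now.checked_add(timeout)
        .unwrap_or_else(|| now + Duration::from_secs(30 * 365 * 24 * 60 * 60))
}

fn main() {
    // A plain `Instant::now() + Duration::MAX` would panic; this caps it instead.
    let deadline = safe_deadline(Duration::MAX);
    println!(
        "capped deadline is ~{}s in the future",
        deadline.duration_since(Instant::now()).as_secs()
    );
}
```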
This looks like a permissions issue to me; I don't think it has anything to do with that particular test build. This is probably why you see lower NVMeoF bandwidth.
I can replicate it panicking every time if I start the cache before the controller component. If I start the controller first, it does not have any issues.
This particular drive is throwing errors. Weird timing, as it did not have any issues before, but removing it from the farmer corrects the problem. Sorry about that.
That is very odd, but there is another fix in tokio that was merged but not released yet. I'll have another look at it once they make a new release.
Nice! I'll include it in the next release then. I also received a good explanation upstream of what exactly happens, so we'll be able to improve it further in future releases.
"Timed out without ping from plotter" is the most common error.
It seems that, according to metrics data, not a single sector was successfully plotted in a span of about five hours? In that time, ~400-450 sectors should have been plotted.
Update: I see that only three sectors finished plotting successfully in the provided logs. In the 30 minutes since reverting to June-11, I have 32 entries for "Finished plotting sector successfully".
Farmer
2024-06-17T03:23:23.445355Z WARN {farm_index=22}:{sector_index=1433}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141409Z WARN {farm_index=40}:{sector_index=294}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141569Z WARN {farm_index=13}:{sector_index=271}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:33.141673Z WARN {farm_index=50}:{sector_index=61}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:25:42.700254Z WARN {farm_index=39}:{sector_index=511}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-17T03:27:52.843554Z WARN {farm_index=55}:{sector_index=361}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
The "Cache component panics if started before the Controller" issue is resolved.
The "Timed out without ping from plotter" issue is still present, but it is not happening nearly as often.
There are still some lingering (though not constant) warnings, listed under the Plotter section below.
WARN: Plotter
2024-06-18T04:24:26.955784Z WARN subspace_farmer::plotter::cpu: Failed to send error progress update error=send failed because receiver is gone
2024-06-18T04:26:13.712415Z WARN subspace_farmer::plotter::cpu: Failed to send error progress update error=send failed because receiver is gone
The reason I think it'll help is this:
[1] 2024/06/17 16:02:16.198649 [INF] 10.20.120.11:54150 - cid:7 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 132 chunks of 8390396 total bytes.
Some messages were dropped, which resulted in some plotted sectors being aborted and retried later. By sending fewer messages (given how many individual farms you have per instance, the reduction should be significant), we should have a better chance of avoiding it.
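To illustrate the idea (a generic sketch, not the farmer's actual implementation), progress updates could be coalesced per farm/sector and flushed on an interval, so that one publish carries many updates instead of one message per event. `ProgressUpdate` and `publish` below are hypothetical stand-ins:

```rust
use std::collections::HashMap;
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time;

// Hypothetical shape of a per-sector progress event.
#[derive(Debug, Clone)]
struct ProgressUpdate {
    farm_index: usize,
    sector_index: u16,
}

// Hypothetical stand-in for a single publish carrying a whole batch.
async fn publish(batch: Vec<ProgressUpdate>) {
    println!("publishing {} coalesced updates", batch.len());
}

async fn coalesce_and_publish(mut rx: mpsc::Receiver<ProgressUpdate>) {
    let mut pending: HashMap<(usize, u16), ProgressUpdate> = HashMap::new();
    let mut ticker = time::interval(Duration::from_millis(500));

    loop {
        tokio::select! {
            maybe_update = rx.recv() => match maybe_update {
                // Keep only the latest update per (farm, sector) between flushes.
                Some(update) => {
                    pending.insert((update.farm_index, update.sector_index), update);
                }
                None => break,
            },
            _ = ticker.tick() => {
                if !pending.is_empty() {
                    publish(pending.drain().map(|(_, update)| update).collect()).await;
                }
            }
        }
    }

    // Flush whatever is left when the sending side shuts down.
    if !pending.is_empty() {
        publish(pending.into_values().collect()).await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(1024);
    let worker = tokio::spawn(coalesce_and_publish(rx));

    // Simulate a burst of per-farm events arriving at once.
    for farm_index in 0..60 {
        tx.send(ProgressUpdate { farm_index, sector_index: 0 }).await.unwrap();
    }
    drop(tx);
    worker.await.unwrap();
}
```

With many farms per instance, a burst like the one above goes out as a single publish instead of dozens, which is the kind of reduction that should help keep the connection within its write deadline.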