Timed out without ping from plotter

vexr · June 20, 2024, 11:53pm

Snapshot build: #356

Observations

Timed out without ping from plotter warnings appear once replotting starts.
Once replotting began I started to see more slow blocks than I have on previous versions. Previous versions are around 1 miss per day and I am currently at 15 misses over the past 24 hours. It may be unrelated, but everything else has remained the same.

WARN: Farmer

2024-06-20T09:58:25.244453Z  WARN {farm_index=57}: subspace_farmer::single_disk_farm::farming: Proving for solution skipped due to farming time limit slot=12031922 sector_index=236
2024-06-20T10:01:03.427559Z  WARN {farm_index=4}:{sector_index=493}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-20T10:01:03.514342Z  WARN {farm_index=49}:{sector_index=825}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-20T10:01:03.514418Z  WARN {farm_index=49}:{sector_index=225}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-20T10:01:03.514430Z  WARN {farm_index=49}:{sector_index=1206}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

WARN: Plotter

2024-06-20T23:18:15.993066Z  WARN {public_key=3c6a74c0bdb8063795d23f4ffe76fcd20bd7bcbf04802dab8a22daee2fd86e17 sector_index=2514}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-20T23:18:15.993184Z  WARN {public_key=2cebdd6938352d560b4aa3a3dd0d6838d388d5379e7a479edf1caaba5e96d60c sector_index=1361}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J0VY6P4YN92V85JZE94CZ4E6 expected_index=89 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-06-20T23:18:15.993304Z  WARN {public_key=2cebdd6938352d560b4aa3a3dd0d6838d388d5379e7a479edf1caaba5e96d60c sector_index=1361}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-20T23:18:16.364731Z  WARN {public_key=eccc8ae4300899473a8d40c48bcccef6047a1969dda5c6f52c40a94b94855021 sector_index=1196}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J0VXZ2KY8F5BNEPXEC44012J expected_index=114 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache.log controller.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 109T
Link: 100Gbit

Role: Node (node.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots:189T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots:189T
Link: 20Gbit

Role: Farmer (farmer6.log)
Dual Socket 6136 24 Cores (48 Threads)
Plots 91TB
Link: 20Gbit

Logs
Cluster

nazar-pc · June 24, 2024, 10:59am

I found a few things to optimize based on those logs, but the challenge is that there was no trace logs for cache, farmer and plotter components, so I don’t know the root cause.

The primary indication of the problem, likely related to cache, is this:

2024-06-20T09:57:47.506914Z TRACE subspace_farmer::cluster::controller: Find piece request request=ClusterControllerFindPieceInCacheRequest { piece_index: PieceIndex(79071) }
2024-06-20T09:57:47.507637Z TRACE subspace_farmer::cluster::controller: Find piece request request=ClusterControllerFindPieceInCacheRequest { piece_index: PieceIndex(7870) }
2024-06-20T09:57:48.861917Z TRACE subspace_farmer::commands::cluster::controller::caches: Received identification for already known cache cache_id=01J0SX3WAJKR9ZFFQVJ8K1HN7G
2024-06-20T09:58:10.504158Z TRACE subspace_farmer::commands::cluster::controller::caches: Received identification for already known cache cache_id=01J0SX3WAJKR9ZFFQVJ8K1HN7G
2024-06-20T09:58:10.504170Z TRACE subspace_farmer::commands::cluster::controller::caches: Received identification for already known cache cache_id=01J0SX3WAJKR9ZFFQVJ8K1HN7G
2024-06-20T09:58:10.504174Z TRACE subspace_farmer::commands::cluster::controller::caches: Received identification for already known cache cache_id=01J0SX3WAJKR9ZFFQVJ8K1HN7G
2024-06-20T09:58:10.504179Z TRACE subspace_farmer::commands::cluster::controller::caches: Received identification for already known cache cache_id=01J0SX3WAJKR9ZFFQVJ8K1HN7G

Basically there is ~20s gap where very few things happening in the cluster (there are some rewards signed, but that is it). Also cache instance was clearly not able to send messages in time or something was blocking them because 4 messages that should have arrived with 5s interval were received all at once at the same millisecond.

There is a chance that communication with NATS was saturated, but we were getting slots and rewards, so it is more likely that cache instance was overloaded instead. It seems unlikely that controller was so busy it couldn’t receive messages or else it would also not process slot info and rewards, though still possible.

The architecture you have is nice for debugging of these issues, but more generally suboptimal, if you places NATS on the same machine as cache and controller you’d skip A LOT of unnecessary bandwidth between machines.

If you could run all farmer components with RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace it’d help to narrow things down further. I’ll look into the code closer in the meantime.

I initiated following build that optimizes some bottlenecks: Snapshot build · subspace/subspace@3f9bc0b · GitHub

While I don’t think it’ll fix the issue, it should help, at least to a degree. I’ll probably wait for updated logs before doing further tweaks.

It is likely that previously those would have been skipped altogether (there was just a debug message in those cases). Once we fix most of the issues I’ll make completely skipped slots (meaning farmer didn’t even try to audit) proper warnings rather than debug messages.

vexr · June 26, 2024, 6:21am

Snapshot build: #356

I was able to replicate the warnings again with the log levels you requested. The logs are large and compressed as individual tar files.

Added a second cache and controller (cache2.tar and controller2.tar). The rest of the logs also are named after the systems found in System specs below.

There are a few restarts in node and plotter components as the node dropped out of sync at around block #2,116,786 and required a few prior snapshot attempts to recover. This happened around the time replotting started.

WARN: Farmer

2024-06-25T18:12:25.342794Z  WARN {farm_index=43}:{sector_index=488}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-25T18:12:28.824522Z  WARN {farm_index=27}:{sector_index=272}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-25T18:12:28.888172Z  WARN {farm_index=39}:{sector_index=434}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-25T18:12:29.396469Z  WARN {farm_index=6}:{sector_index=81}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

WARN: Plotter

2024-06-26T03:21:42.108453Z  WARN {public_key=98307e384887f42a595960177fe08667271570fda9226d2620fc9713036ce45d sector_index=441}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-26T03:21:42.108480Z  WARN {public_key=5ea0804344fed9942b10d49b159556225acda5567fd94b142989e530f5db5c78 sector_index=593}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-26T03:21:42.108513Z  WARN {public_key=8a47123bc8a54a54c00e6a0b0a5f50dad0793b1e892f2edfd64f902bc811e118 sector_index=1189}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-26T03:21:42.108534Z  WARN {public_key=98b7b1583aee1641b87868a0a6ffe22f0beb94e2177d0c76952faa65eef8dd03 sector_index=15}: subspace_farmer::cluster::plotter: Response sending ended early

2024-06-26T03:21:42.108520Z  WARN {public_key=98b7b1583aee1641b87868a0a6ffe22f0beb94e2177d0c76952faa65eef8dd03 sector_index=15}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J1989GWV0D7GZ8GJPGGNWWHG expected_index=71 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-06-26T03:21:42.154668Z  WARN {public_key=086285aa36ec0b83243f0943e0fb38142945b5761d23bca008c27e4f6041ba68 sector_index=795}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J198SDQ5RZ7BVMEFQKGSA4HX expected_index=18 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-06-26T03:21:43.210282Z  WARN {public_key=54e227b01db2b9cb44d9dd72465a622543fc4af1791f5ab4bba7ce96de791f63 sector_index=1576}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J1980Q7BYBWDXA1BX96RCRCJ expected_index=99 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-06-26T03:21:43.278599Z  WARN {public_key=1a5868818c85f27c4028ac212b874824e40c2ef28d873143fe7180ec92707a2a sector_index=709}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J197PY8WA8HKETZ03FEGVCTF expected_index=132 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Node (node.log cache2.log controller2.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots 113TB
Link: 20Gbit

Logs

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  NATS
  Node
  Plotter

vexr · June 27, 2024, 8:46am

Snapshot build: #358

I was able to replicate the warnings again with the log levels you requested on the new #358 snapshot. The logs are large and compressed as individual tar files.

This was again run with second cache and controller (cache2.tar and controller2.tar). The rest of the logs also are named after the component and system found in System specs below.

I restarted farmer 1, 3, 4, and 5 as they did not replot 4, 6, 6, and 2 respectively AboutToExpire sectors. They each replotted successfully the second time around after restarting just the farmer component on each machine.

WARN: Farmer

2024-06-27T01:10:42.083177Z  WARN {farm_index=18}:{sector_index=1536}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-27T01:10:42.194949Z  WARN {farm_index=25}:{sector_index=707}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-27T01:10:44.987205Z  WARN {farm_index=9}:{sector_index=760}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-27T01:10:45.399167Z  WARN {farm_index=39}:{sector_index=955}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

WARN: Plotter

2024-06-27T03:08:27.322450Z  WARN {public_key=da054d1f66cbf3f5272615cfe409309f986963bc2df88695ea452c4e593ccf5d sector_index=1679}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-27T03:08:31.290277Z  WARN {public_key=a61f38cc6ddfd93a28477a2cf68718e10a6ff232bda279e173b360bc58303d71 sector_index=1458}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J1BTGEMM37ZRQZD2XMGDTDM1 expected_index=9 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-06-27T03:08:31.290328Z  WARN {public_key=a61f38cc6ddfd93a28477a2cf68718e10a6ff232bda279e173b360bc58303d71 sector_index=1458}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-27T03:08:31.369662Z  WARN {public_key=44b04141f61d7bbcf5264d70c3af6d2e6b221dacf4693fbd298696b844fe3d42 sector_index=1590}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J1BTGEQ6AJTMNYFWJQFBGG6F expected_index=11 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

WARN: Cache

2024-06-27T02:52:50.560472Z TRACE subspace_farmer::cluster::cache: Read piece request offset=77582 reply_subject=_INBOX.DofCNtogemLkEDqLBC2fZB.DofCNtogemLkEDqLBWarNl

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Node (node.log cache2.log controller2.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots 113TB
Link: 20Gbit

Logs (856MiB Compressed)

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  NATS
  Node
  Plotter

nazar-pc · June 27, 2024, 1:32pm

Is NATS server network overloaded when this happens? I’m not sure how it is all wired together, but I suspect 20Gbps link might be saturated for some amount of time when new segment is archived, which in turn would result in observed timeouts.

The issue now is that some apps can’t read incoming messages fast enough:

[1] 2024/06/26 09:34:51.765631 [INF] 10.20.120.11:56032 - cid:14 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 329 chunks of 20975696 total bytes.
[1] 2024/06/27 01:09:59.669177 [INF] 10.20.120.11:35648 - cid:15 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 199 chunks of 12586484 total bytes.
[1] 2024/06/27 02:43:25.291054 [INF] 10.20.120.11:42400 - cid:16 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 199 chunks of 12586290 total bytes.
[1] 2024/06/27 02:53:15.340809 [INF] 10.20.120.11:48870 - cid:17 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 199 chunks of 12586049 total bytes.

I found some edge-cases (that caused plotting to appear to be “stuck”) that I fixed in Skip duplicate node client notifications by nazar-pc · Pull Request #2882 · subspace/subspace · GitHub, but I don’t think any of it is critical here.

There are a few improvements and additional logs messages that should be included in the next release, I think we should resume testing with that version once it is out.

P.S. Zip files (if possible) are better than Tar, they allow random reads, while Tar files have to be read sequentially during any operation.

nazar-pc · June 27, 2024, 2:03pm

Well, I initiatest test build already that includes those changes: Snapshot build · subspace/subspace@f6ed626 · GitHub

Note that new farms initialized with this test build are compatible with future GPU plotter and can’t be used with older versions of the software.

vexr · June 27, 2024, 7:32pm

The highest usage peak I see on that link is 12.9Gbps, so saturation should not be an issue.

Can you explain this a bit more? I have no problem testing, and can stay on test builds until this feature is rolled out to stable releases, but will this impact existing plots in any way such as replotting for example?

nazar-pc · June 27, 2024, 7:38pm

I’m already on latest test release and node is a bit broken there, but farmer is fine and worth testing. Just use node from jun-18 for now.

It shouldn’t impact anything on existing farms, but new farms that are created will be a tiny bit different to be incompatible. There is a sorting algorithm internally that has changed, such that it is compatible with upcoming GPU implementation. Without new format GPU will not be usable for plotting/replotting of farms created with older releases. While new farmer will default to new format, it will still support the old one just fine. Old farmer will refuse to work with new farms because it doesn’t know what new format means yet.

vexr · June 29, 2024, 8:26am

Farmer: Snapshot build: #359
Node: jun-18

I seem to be encountering the same warns as the previous snapshot.

Farmer 3 had 3 AboutToExpire sectors it did not finish.
Farmer 4 had 4 AboutToExpire sectors it did not finish.

Logs are compressed as zip as requested.

WARN: Farmer

2024-06-29T03:57:04.985743Z  WARN {farm_index=45}:{sector_index=1257}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-29T03:57:05.379975Z  WARN {farm_index=45}:{sector_index=1645}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-29T04:09:46.453724Z  WARN {farm_index=45}:{sector_index=1645}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-06-29T04:42:33.828935Z  WARN {farm_index=45}:{sector_index=1054}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

WARN: Plotter

2024-06-29T06:01:56.997735Z  WARN {public_key=3e536f7b3ff426098421b479f62814d25580402fb596c783f8a7b379c329ac14 sector_index=3488}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-06-29T06:01:57.148474Z  WARN {public_key=443e92d0657d1ab21db4b3236a24a5ec51b0602ad3f82e6277ea2838a476702d sector_index=1637}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-06-29T06:01:57.827972Z  WARN {public_key=708b73b32d7a6cf701053fec452208b675079a1b22dc0d3811a868f82bd8cd09 sector_index=3331}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-06-29T06:01:58.440089Z  WARN {public_key=1e884369f35785d13dc6b04267da4a685655865fe773c00de7cff3975bca3726 sector_index=1704}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone

2024-06-29T05:50:12.094135Z  WARN {public_key=443e92d0657d1ab21db4b3236a24a5ec51b0602ad3f82e6277ea2838a476702d sector_index=1637}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-29T05:50:12.794105Z  WARN {public_key=f277ea3ca0dd751aeddb4e439e28295cf0489e2714f87cc3ba7f883eb891891c sector_index=3737}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J1H8E9D5GR59NDR5ZX6ED6X1 expected_index=21 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-06-29T05:50:12.794174Z  WARN {public_key=f277ea3ca0dd751aeddb4e439e28295cf0489e2714f87cc3ba7f883eb891891c sector_index=3737}: subspace_farmer::cluster::plotter: Response sending ended early
2024-06-29T05:50:13.113902Z  WARN {public_key=50c945dae1bf540899834cb24bcda91cfd6fcd79ba5df2c21fa7935f8fdc7645 sector_index=1425}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J1H8F6VV1S3CDNMXY2R5PRTF expected_index=19 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

WARN: Cache

2024-06-28T10:32:38.344074Z TRACE subspace_farmer::cluster::nats_client: Processing request request_type=subspace_farmer::cluster::cache::ClusterCacheReadPieceRequest request=ClusterCacheReadPieceRequest { offset: PieceCacheOffset(48138) } reply_subject=_INBOX.njwF68S6SqqDb03CWh57Ch.njwF68S6SqqDb03CWuWarn
2024-06-28T13:45:09.562591Z TRACE subspace_farmer::cluster::nats_client: Processing request request_type=subspace_farmer::cluster::cache::ClusterCacheReadPieceRequest request=ClusterCacheReadPieceRequest { offset: PieceCacheOffset(45443) } reply_subject=_INBOX.njwF68S6SqqDb03CWh57Ch.njwF68S6SqqDb03CX7WaRn
2024-06-28T20:20:42.559366Z TRACE subspace_farmer::cluster::nats_client: Processing request request_type=subspace_farmer::cluster::cache::ClusterCacheReadPieceRequest request=ClusterCacheReadPieceRequest { offset: PieceCacheOffset(65465) } reply_subject=_INBOX.njwF68S6SqqDb03CWh57Ch.njwF68S6SqqDb03CXXwArN

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Node (node.log cache2.log controller2.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots 113TB
Link: 20Gbit

Logs (1.7GiB)

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  NATS
  Node
  Plotter

nazar-pc · July 2, 2024, 11:44pm

I read the logs and created upstream bug report at Silent connection breakage · Issue #1283 · nats-io/nats.rs · GitHub, which I am not necessarily expecting will result in any upstream action.

The logs are detailed and very helpful. You can see the description in above issue, but in short plotter seems to have lost the connection to NATS server for quite some time. In fact it was reconnecting a few times according to logs:

2024-06-28T07:27:00.596690Z  INFO async_nats: event: connected
2024-06-28T08:04:30.541474Z  INFO async_nats: event: disconnected
2024-06-28T08:04:30.561621Z  INFO async_nats: event: connected
2024-06-28T10:12:35.894032Z  INFO async_nats: event: disconnected
2024-06-28T10:12:35.907651Z  INFO async_nats: event: connected
2024-06-28T20:35:33.275481Z  INFO async_nats: event: disconnected
2024-06-28T20:35:33.283930Z  INFO async_nats: event: connected
2024-06-28T22:28:14.615647Z  INFO async_nats: event: disconnected
2024-06-28T22:28:14.751726Z  INFO async_nats: event: connected
2024-06-28T22:41:56.147174Z  INFO async_nats: event: disconnected
2024-06-28T22:41:56.152369Z  INFO async_nats: event: connected
2024-06-29T04:06:53.291356Z  INFO async_nats: event: disconnected
2024-06-29T04:06:53.300005Z  INFO async_nats: event: connected
2024-06-29T04:19:46.389589Z  INFO async_nats: event: disconnected
2024-06-29T04:19:46.478738Z  INFO async_nats: event: connected
2024-06-29T04:51:31.464021Z  INFO async_nats: event: disconnected
2024-06-29T04:51:31.472670Z  INFO async_nats: event: connected
2024-06-29T05:06:08.236118Z  INFO async_nats: event: disconnected
2024-06-29T05:06:08.242358Z  INFO async_nats: event: connected
2024-06-29T06:01:50.180457Z  INFO async_nats: event: disconnected
2024-06-29T06:01:50.210329Z  INFO async_nats: event: connected

Which actually followed NATS server’s inability to deliver messages to the plotter:

[1] 2024/06/28 07:56:22.738998 [INF] 10.20.120.11:55398 - cid:7 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 183 chunks of 11537598 total bytes.
[1] 2024/06/28 10:02:18.279896 [INF] 10.20.120.11:58370 - cid:14 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 214 chunks of 13634212 total bytes.
[1] 2024/06/28 20:22:14.744967 [INF] 10.20.120.11:34418 - cid:15 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 51 chunks of 3146983 total bytes.
[1] 2024/06/28 22:18:31.472940 [INF] 10.20.120.11:46410 - cid:16 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 181 chunks of 11536468 total bytes.
[1] 2024/06/28 22:30:23.702239 [INF] 10.20.120.11:43940 - cid:17 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 180 chunks of 11536316 total bytes.
[1] 2024/06/29 03:56:20.572175 [INF] 10.20.120.11:55322 - cid:18 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 165 chunks of 10487950 total bytes.
[1] 2024/06/29 04:08:56.475206 [INF] 10.20.120.11:38598 - cid:19 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 100 chunks of 6293121 total bytes.
[1] 2024/06/29 04:41:49.926066 [INF] 10.20.120.11:43264 - cid:20 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 214 chunks of 13634457 total bytes.
[1] 2024/06/29 04:53:30.023128 [INF] 10.20.120.11:50512 - cid:21 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 182 chunks of 11537336 total bytes.

The only plausible explanation I have so far is that we have an async tokio task that is unexpectedly blocking for a while and it sometimes happens to be sharing the same thread as NATS client (because tokio runs multiple tasks per core due to work stealing architecture).

No action needed from you at this moment, I’ll analyze all the task spawning code we have and will try to find a source of this.

Provided logs helped immensely!

nazar-pc · July 3, 2024, 12:08am

I think I found a plausible cause for tokio executor stalling that can result in NATS communication issues, RPC timeouts and basically anything else async you can imagine not working the way one would expect it to. This build addresses that issue, please give it a try : Snapshot build · subspace/subspace@8da255b · GitHub

vexr · July 4, 2024, 6:43am

Farmer: Snapshot build: #361
Node: jun-18

Still seem to be encountering the same warns as the previous snapshot.
There does seem to be fewer INFO async_nats: event: disconnected instances.
Slight change in architecture, where node was moved to the same server as NATS.

WARN: Farmer

2024-07-03T14:15:01.337901Z  WARN {farm_index=1}:{sector_index=1798}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-07-03T17:01:40.564376Z  WARN {farm_index=37}:{sector_index=1057}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-07-03T17:01:40.589024Z  WARN {farm_index=36}:{sector_index=1039}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-07-03T17:01:40.643086Z  WARN {farm_index=26}:{sector_index=515}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-07-03T17:02:15.437576Z  WARN {farm_index=5}:{sector_index=808}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

WARN: Plotter

2024-07-03T21:54:59.518878Z  WARN {public_key=22eb9d8a7261137928cb310928f52f1645541934ef0e629a0a7c827f0bdceb4b sector_index=1587}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-07-03T21:55:02.765533Z  WARN {public_key=f23705ae08f399143dcb25ec4b6e6bcfd6ac8098165ceac23180dd72a3b0b93d sector_index=670}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-07-03T21:55:16.870898Z  WARN {public_key=004f977b1ea9aba6bc3d808d636f9d95b85014ccd64332af113a5f987b465453 sector_index=840}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-07-03T21:55:35.837597Z  WARN {public_key=64eed8ef8e8053247cf20551f42f1a76d5d89f7b3702f9c96ea019952b848c13 sector_index=1160}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone

2024-07-03T21:34:03.404972Z  WARN {public_key=f23705ae08f399143dcb25ec4b6e6bcfd6ac8098165ceac23180dd72a3b0b93d sector_index=670}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-03T21:34:03.404984Z  WARN {public_key=004f977b1ea9aba6bc3d808d636f9d95b85014ccd64332af113a5f987b465453 sector_index=840}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-03T21:34:31.223038Z  WARN {public_key=708b73b32d7a6cf701053fec452208b675079a1b22dc0d3811a868f82bd8cd09 sector_index=2065}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-03T21:34:31.941318Z  WARN {public_key=eccc8ae4300899473a8d40c48bcccef6047a1969dda5c6f52c40a94b94855021 sector_index=1638}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-03T21:34:32.752765Z  WARN {public_key=50c945dae1bf540899834cb24bcda91cfd6fcd79ba5df2c21fa7935f8fdc7645 sector_index=788}: subspace_farmer::cluster::plotter: Response sending ended early

System Specs

Role: NATS, Node (nats.log node.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Cache, Controller (cache2.log controller2.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots 113TB
Link: 20Gbit

Logs (1.86GiB)

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  NATS
  Node
  Plotter

nazar-pc · July 8, 2024, 5:38am

Okay, so the issue is in these lines:

2024-07-03T11:24:16.106925Z DEBUG {public_key=3e536f7b3ff426098421b479f62814d25580402fb596c783f8a7b379c329ac14 sector_index=97}: subspace_farmer::cluster::nats_client: Publishing stream response messages response_subject=stream-response.01J1W455837Z5XPXG1P8Z2W8E5 num_messages=1 index=250 is_done=false
2024-07-03T11:24:30.439486Z TRACE {public_key=3e536f7b3ff426098421b479f62814d25580402fb596c783f8a7b379c329ac14 sector_index=97}: subspace_farmer::cluster::nats_client: Waiting for acknowledgement response_subject=stream-response.01J1W455837Z5XPXG1P8Z2W8E5 expected_index=249

These lines are next to each other in logs, yet they belong to plotting of the same farm and sector.

In code they are next to each other, meaning that publishing to NATS took that many seconds, which in turn is a buffered channel with capacity of 2048 messages.

This effectivey means NATS client wasn’t publishing any messages (nor was it receiving any) for many seconds for some reason. This looks like a bug in NATS client because I checked logs carefully and I don’t see any plotting processes starting right before that, meaning it is most likely not the tokio executor that was blocked (and I don’t see reasons for that in the code).

At this point I’d like to collect more logs from NATS client itself, so please run plotter with RUST_LOG=info,subspace_farmer=trace,async_nats=trace (others are fine as is).

I’ll share findings with NATS client maintainers in the meantime.

vexr · July 9, 2024, 5:34am

Farmer: gemini-3h-2024-jul-05 Latest
Node: gemini-3h-2024-jul-05 Latest

Was able to duplicate the same warnings as before with updated logging on plotter as requested.
I stopped all farmers at around the 2024-07-08T22:00 mark to complete sectors that did not complete on their own with the plotter sitting idle. You can see those sectors finished after the restart of just the farmers as expected.

WARN: Farmer

2024-07-09T03:37:46.809915Z  WARN {farm_index=0}:{sector_index=1628}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-07-09T03:37:46.811929Z  WARN {farm_index=2}:{sector_index=1303}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-07-09T03:37:46.811987Z  WARN {farm_index=39}:{sector_index=1185}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-07-09T03:38:19.671520Z  WARN {farm_index=23}:{sector_index=1165}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

WARN: Plotter

2024-07-09T03:52:45.685919Z  WARN {public_key=743fb740e408c122de6ce499d15e5fe2479674c3b6a486a753a4f0c13f472f30 sector_index=327}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-07-09T03:52:53.657664Z  WARN {public_key=7ac9e6b954216cce51a789e8cab4affb9f08eadd0d4fda928b17173897ab107f sector_index=6}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-07-09T03:54:32.335814Z  WARN {public_key=88e351dfe57d4a12d12edc05b2bde9847526a82755b89fe2cc81bc14e324b90e sector_index=1165}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone
2024-07-09T03:58:35.854705Z  WARN {public_key=524621f002bbea6edd2aeb62a90bf24370ed54d12d65957e3f679cd3feb48e70 sector_index=1628}: subspace_farmer::plotter::cpu: Failed to send progress update error=send failed because receiver is gone

2024-07-09T03:38:44.357735Z  WARN {public_key=78c8ae66f84067698a7167f2fb26685f163c82a77bd9bb44d7c9f7dc500e3b75 sector_index=922}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-09T03:38:44.879646Z  WARN {public_key=524621f002bbea6edd2aeb62a90bf24370ed54d12d65957e3f679cd3feb48e70 sector_index=1628}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-09T03:38:46.043000Z  WARN {public_key=743fb740e408c122de6ce499d15e5fe2479674c3b6a486a753a4f0c13f472f30 sector_index=327}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-09T04:23:36.608676Z  WARN {public_key=14eb930764afdb2e2dfa22f4e99a9ca69c88690d97c5dcd7214565f0e81e5954 sector_index=701}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-09T04:24:13.408993Z  WARN {public_key=aab3c470eb8497285e0870eb4b6de522d1364cf877085980643dce65a4bd6568 sector_index=880}: subspace_farmer::cluster::plotter: Response sending ended early

System Specs

Role: NATS, Node (nats.log node.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Cache, Controller (cache2.log controller2.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots 113TB
Link: 20Gbit

Logs (1.54GiB)

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  NATS
  Node
  Plotter

nazar-pc · July 9, 2024, 2:36pm

Some additional details, but still inconclusive. Maybe try to enable debug logs on NATS server with --debug CLI option? Maybe that will shine some more light on this.

Really not sure what cause this. It might be something related to thread scheduling in the kernel given how many cores you have and how loaded most of them are during active plotting.

Given that disconnections are relatively infrequent and last a short amount of time, maybe it would be worth implementing some kind of retry system for these plotting notifications

Still looking into plotting stuff and some other things I noticed.

nazar-pc · July 9, 2024, 3:18pm

Can you clarify what you mean by this? Which farmer and which farms you think had issues?

vexr · July 9, 2024, 5:56pm

There have been ongoing issues with the plotter failing to complete all AboutToExpire sectors as expected. While I do not have the exact count from the latest set of logs (some previous reports contained counts), each farmer consistently shows up to 20 sectors listed as AboutToExpire in Prometheus that never complete after each replotting round. The fact that these sectors remain incomplete while the plotter sits idle suggests a bug.

I suspect these sectors may be linked to the send failed because receiver is gone errors, or other warnings and that these sectors are not being retried.

Around that timestamp, I stopped all farmer components in the cluster (farmer1.log, farmer3.log, farmer4.log, and farmer5.log) and restarted them as each farmer had sectors that were not replotting, yet showed AboutToExpire. Subsequently, those sectors were able to successfully complete.

I have logging set to debug in NATS and will provide on the next update.

nazar-pc · July 10, 2024, 12:34am

Indeed, some sectors finish plotting successfully, then some chunks are received and written to disk and… nothing, it just hangs in there. I’d suspect the disk has hanged, but no, farming and plotting of other sector worked fine (even after one of them hanged on writes).

In fact there was a slow down on disk due to sector being written and some slots were skipped completely (Slow farming, skipping slot), but then it normalized again, while sector 1653 just got stuck and never made further progress after that, while sector 1466 finished plotting later and was successfully written to disk:

2024-07-08T09:24:44.714548Z TRACE {farm_index=36}:{sector_index=1653}: subspace_farmer::cluster::plotter: Processing plotting response notification (sector chunk)
2024-07-08T09:24:44.806791Z TRACE {farm_index=36}:{sector_index=1653}: subspace_farmer::cluster::plotter: Processing plotting response notification (sector chunk)
2024-07-08T09:24:45.045560Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584784, global_challenge: [118, 135, 213, 4, 234, 161, 90, 167, 207, 156, 200, 83, 109, 185, 51, 6, 151, 123, 143, 90, 70, 79, 10, 151, 29, 227, 103, 18, 108, 85, 111, 117], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:45.046970Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Reading sectors slot=13584784 sector_count=1813
2024-07-08T09:24:45.142752Z TRACE {farm_index=36}:{sector_index=1653}: subspace_farmer::cluster::plotter: Processing plotting response notification (sector chunk)
2024-07-08T09:24:46.094130Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584785, global_challenge: [206, 180, 67, 103, 48, 240, 52, 157, 195, 27, 31, 73, 125, 123, 146, 11, 102, 253, 9, 125, 91, 235, 114, 195, 44, 249, 203, 86, 80, 220, 159, 27], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:47.105552Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584786, global_challenge: [101, 62, 14, 176, 75, 115, 80, 178, 16, 157, 96, 187, 189, 113, 130, 98, 12, 18, 27, 91, 16, 73, 35, 102, 154, 54, 238, 211, 140, 84, 144, 225], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:47.105556Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584786
2024-07-08T09:24:48.052897Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584787, global_challenge: [192, 68, 225, 144, 135, 73, 124, 132, 225, 215, 143, 209, 91, 154, 199, 180, 130, 167, 211, 242, 35, 141, 32, 191, 53, 179, 136, 109, 47, 227, 209, 110], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:48.052902Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584787
2024-07-08T09:24:49.048287Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584788, global_challenge: [215, 156, 119, 163, 195, 159, 237, 121, 22, 168, 247, 10, 156, 60, 69, 79, 50, 136, 50, 151, 101, 178, 115, 223, 191, 67, 155, 197, 158, 94, 203, 89], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:49.048291Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584788
2024-07-08T09:24:50.066853Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584789, global_challenge: [44, 59, 187, 117, 132, 15, 155, 193, 229, 70, 245, 182, 46, 97, 164, 243, 37, 83, 176, 211, 72, 24, 72, 178, 13, 241, 98, 118, 100, 174, 129, 36], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:50.066857Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584789
2024-07-08T09:24:51.082080Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584790, global_challenge: [249, 78, 120, 21, 156, 154, 169, 187, 59, 3, 48, 252, 206, 157, 250, 135, 24, 38, 174, 107, 46, 142, 154, 134, 144, 246, 184, 85, 185, 89, 249, 109], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:51.082083Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584790
2024-07-08T09:24:52.032467Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584791, global_challenge: [81, 119, 67, 215, 154, 193, 62, 133, 98, 60, 252, 55, 190, 89, 202, 202, 167, 155, 168, 155, 75, 205, 52, 145, 141, 145, 114, 39, 134, 33, 178, 207], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:52.032471Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584791
2024-07-08T09:24:53.029417Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584792, global_challenge: [238, 48, 69, 150, 209, 180, 123, 22, 205, 23, 19, 195, 239, 83, 223, 211, 239, 19, 31, 58, 29, 2, 71, 108, 230, 206, 46, 145, 149, 149, 128, 174], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:53.029421Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584792
2024-07-08T09:24:54.056406Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584793, global_challenge: [73, 35, 197, 246, 248, 172, 251, 114, 221, 195, 179, 168, 219, 91, 215, 218, 38, 53, 157, 99, 164, 7, 249, 163, 73, 146, 4, 98, 37, 9, 62, 222], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:54.056410Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584793
2024-07-08T09:24:55.044074Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584794, global_challenge: [215, 111, 193, 144, 121, 71, 8, 1, 10, 1, 79, 62, 6, 19, 242, 70, 186, 43, 168, 211, 168, 148, 195, 186, 228, 94, 23, 140, 199, 218, 117, 220], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:55.044089Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584794
2024-07-08T09:24:56.059822Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584795, global_challenge: [96, 49, 180, 28, 192, 173, 253, 170, 233, 48, 174, 21, 2, 55, 67, 176, 184, 47, 155, 101, 248, 92, 116, 76, 110, 128, 85, 15, 82, 117, 27, 30], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:56.059826Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584795
2024-07-08T09:24:57.082102Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584796, global_challenge: [86, 162, 216, 104, 16, 188, 65, 153, 59, 243, 48, 144, 105, 147, 151, 240, 106, 164, 164, 137, 214, 242, 21, 11, 24, 39, 187, 134, 199, 32, 118, 142], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:57.082106Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584796
2024-07-08T09:24:58.058746Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584797, global_challenge: [149, 148, 245, 153, 150, 113, 69, 128, 187, 176, 64, 113, 113, 47, 88, 66, 6, 82, 115, 173, 155, 145, 251, 96, 245, 254, 3, 209, 147, 140, 129, 169], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:58.058751Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584797
2024-07-08T09:24:59.056199Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584798, global_challenge: [42, 68, 96, 235, 38, 180, 7, 239, 232, 109, 175, 70, 122, 34, 93, 107, 164, 173, 225, 61, 112, 191, 136, 54, 231, 113, 123, 57, 91, 99, 228, 74], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:24:59.056205Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584798
2024-07-08T09:24:59.468734Z TRACE {farm_index=36}:{sector_index=1466}: subspace_farmer::cluster::plotter: Processing plotting response notification response=Ping
2024-07-08T09:25:00.112820Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584799, global_challenge: [130, 25, 165, 201, 157, 179, 126, 175, 105, 157, 65, 157, 130, 214, 56, 249, 158, 134, 185, 172, 178, 28, 62, 97, 41, 127, 202, 78, 145, 94, 164, 28], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:25:00.112847Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584799
2024-07-08T09:25:00.443839Z TRACE {farm_index=36}:{sector_index=1466}: subspace_farmer::cluster::plotter: Processing plotting response notification response=Ping
2024-07-08T09:25:01.063193Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584800, global_challenge: [186, 176, 90, 114, 71, 67, 138, 253, 112, 144, 28, 10, 7, 95, 162, 104, 54, 136, 243, 225, 26, 36, 152, 61, 69, 51, 171, 92, 13, 235, 45, 94], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:25:01.063198Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584800
2024-07-08T09:25:02.081086Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584801, global_challenge: [4, 186, 20, 66, 6, 248, 117, 38, 132, 201, 237, 31, 6, 161, 151, 216, 168, 139, 79, 148, 219, 125, 125, 86, 188, 229, 93, 185, 205, 43, 16, 195], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:25:02.081091Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Slow farming, skipping slot slot=13584801
2024-07-08T09:25:02.833868Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Reading sectors slot=13584785 sector_count=1813
2024-07-08T09:25:03.051899Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584802, global_challenge: [228, 60, 53, 178, 198, 228, 231, 8, 172, 180, 76, 176, 234, 108, 121, 34, 6, 42, 51, 253, 163, 9, 244, 108, 226, 164, 57, 69, 93, 28, 39, 102], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:25:03.051946Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Reading sectors slot=13584802 sector_count=1813
2024-07-08T09:25:04.059804Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: New slot slot_info=SlotInfo { slot_number: 13584803, global_challenge: [66, 10, 106, 222, 30, 178, 29, 166, 99, 57, 71, 76, 176, 8, 206, 38, 104, 212, 148, 121, 144, 252, 65, 116, 71, 20, 253, 130, 145, 218, 18, 123], solution_range: 42072681, voting_solution_range: 420726810 }
2024-07-08T09:25:04.059829Z DEBUG {farm_index=36}: subspace_farmer::single_disk_farm::farming: Reading sectors slot=13584803 sector_count=1813

I have reported "malloc_consolidate(): unaligned fastbin chunk detected" when attempting to block single-threaded tokio runtime · Issue #2863 · rust-lang/futures-rs · GitHub before and we do use FuturesOrdered/FuturesUnordered heavily, so if there is some kind of memory safety issue there it may potentially cause this, though I’m having a hard time getting to the source of it and the fact that it is the reason here is fairly low.

I have added two low-level trace messages in farmer code in Snapshot build · subspace/subspace@27b377e · GitHub, it’ll hopefully help me narrow down whether it is tokio’s timer that is not working when NATS communication with plotter breaks or spawning blocking task that writes plotted sector to disk or something else.

I strongly suspect it is not our code to blame, but we’ll see.

vexr · July 10, 2024, 8:21am

Was there any other changes in this snapshot aside from additional trace logging options? It appears the plotter component never started working on any sectors.

Around the 2024-07-10T07:11:00 mark, I restarted all the farmer components and it still did not start processing sectors.

Farmer 1: 20 AboutToExpire sectors
Farmer 3: 187 AboutToExpire sectors
Farmer 4: 174 AboutToExpire sectors + 1 Expired sectors
Farmer 5: 58 AboutToExpire sectors

I unfortunately was unable to capture the nats log for this round of testing. If this is needed, I can rerun the test.

After reverting to gemini-3h-2024-jul-05 the plotter has started processing sectors again.

–

Farmer: Snapshot build #363
Node: Snapshot build #363

System Specs

Role: NATS, Node (nats.log node.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Cache, Controller (cache2.log controller2.log)
EPYC 7F72 24 Cores (48 Threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 Cores (64 Threads)
Plots:190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots 113TB
Link: 20Gbit

Logs (17.7MiB)

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  Node
  Plotter

nazar-pc · July 10, 2024, 3:16pm

Yeah, looks like there was a small bug in one of the changes. This build should fix that: Snapshot build · subspace/subspace@265db19 · GitHub

Topic		Replies	Views
Jun-18 subspace farmer plotter ping timeout Support	7	83	June 26, 2024
Plotting Stops after Error (taurus-oct-2024) Support	10	138	November 1, 2024
There are too many warning logs in the farmer cluster Incentivized Testnet farmer	23	246	June 11, 2024
Proving for solution skipped due to farming time limit when plotting Support	7	147	February 29, 2024
Failed to get block Support	3	854	May 1, 2022

Timed out without ping from plotter

Related topics