Plotting Stops after Error (taurus-oct-2024)

Issue Report

Environment

  • Operating System: Ubuntu 24.04
  • Space Acres/Advanced CLI/Docker: Docker

Problem

After receiving the error below, my plotter no longer creates any new sectors. There are no errors on NATS and no errors on the farmer; plotting simply does not progress any further. Restarting the plotters and farmers fixes the issue.

I noticed on Discord that 3-4 other people have also had this same issue. I can add more logs if needed.

2024-10-29T20:38:59.462267Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JBCZZDZV8HG1P9DAHF6GWDJ8 expected_index=6 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-10-29T20:38:51.535683Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JBCZZ3QPBJ92X29FMZW0420T expected_index=6 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

Do you see any CPU usage on plotters after this?

Some of the causes of this were fixed, but others might still remain; I’m trying to understand how to reproduce it.

If it happens again I will take a look.

Just happened again. CPU usage is almost 0% after this happens. The plotter Docker container is still up, but the last log line is just the WARN above.

Also, for more information on my setup in case it helps: I have two plotter hosts with 4090s. One runs NATS, a controller, a cache, and a plotter. The other runs a controller, a cache, and a plotter, and connects to NATS on the first host. My farmers are 6 hosts with 16T each; they connect to NATS on the plotter host and run only a farmer.

Also, one thing I just noticed: I only need to restart the farmer and it starts plotting again.

I am also seeing this behaviour. It may be a complete coincidence, but I also have a 4090 plotter (more data points needed, I suspect). Plotting halts after these messages, and restarting the farmer gets it going again.

2024-10-30T16:45:26.595544Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JBF50EPDY3J1F9GANAJEP0F6 expected_index=6 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-10-30T16:45:39.896439Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JBF50VNQB7WS5KWYYY25VVYF expected_index=6 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-10-30T17:17:14.700998Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JBF6TJFARK5S5PSBQ34PS7Z5 expected_index=6 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-10-30T17:17:15.776517Z  INFO {public_key=eec1f7b750de081123daf1b57d5a88f2fada58fd033498e82bf0f95e30f3bd51 sector_index=33}: subspace_farmer::cluster::plotter: Plot sector request
2024-10-30T17:17:19.985461Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JBF6TJC5X8EX7W1SQCQXF9GJ expected_index=8 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

(the middle line above is from a different farmer than the one that stopped)

This happened twice in an hour earlier. I am on Windows binaries, not Docker.

UPDATE: Just happened again.

2024-10-30T21:16:07.411009Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JBFMFZBXXPGH6Q7DZTDFKWP6 expected_index=6 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-10-30T21:16:12.681265Z  WARN subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01JBFMG072BES07W8NTEP64JBB expected_index=8 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

Okay, so something hangs on the farmer.

Now there are two things I’d like to request.

First, start the farmer with RUST_LOG=info,subspace_farmer=trace. This will generate a lot more logs, but should also help with debugging (make sure Docker is configured to retain at least a couple of GB of logs).
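For the Docker case, a minimal sketch of how this could look (the image tag and the trailing farmer arguments are placeholders for whatever you already run, not taken from this thread):

docker run -d --name farmer \
  -e RUST_LOG=info,subspace_farmer=trace \
  --log-driver json-file \
  --log-opt max-size=500m --log-opt max-file=4 \
  ghcr.io/autonomys/farmer:whatever-tag-you-used \
  <existing farmer arguments unchanged>

The json-file driver options cap each log file at 500 MB and keep up to 4 rotated files, so roughly 2 GB of logs are retained.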

Second, it’d help to dump backtraces of threads with gdb when this happens.
Here is how I’d do it:

docker run --rm -it --pid host --privileged --user root --entrypoint /bin/bash ghcr.io/autonomys/farmer:whatever-tag-you-used

In the container:

apt-get update && apt-get install -y --no-install-recommends gdb
gdb /subspace-farmer PID_OF_FARMER_PROCESS_ON_THE_HOST

Now in gdb type the following:

set pagination off
thread apply all bt

And send me the whole output.

Then you can exit gdb and container.

This should hopefully provide a hint about what got stuck and where.
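If the interactive session is inconvenient, the same backtraces can be captured non-interactively from the debug container once gdb is installed (same PID placeholder as above):

gdb -batch -ex 'set pagination off' -ex 'thread apply all bt' \
  /subspace-farmer PID_OF_FARMER_PROCESS_ON_THE_HOST > farmer-backtraces.txt 2>&1

That writes all thread backtraces to farmer-backtraces.txt, which you can then attach here.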


@Jim-Autonomys I’m not sure how to debug apps so easily on Windows, but enabling and collecting trace logs might help there as well.
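For the Windows binaries, one way to get the same trace logs would be to set the environment variable in PowerShell before launching the farmer and tee its output to a file (the executable name and arguments below are placeholders for whatever you already run):

$env:RUST_LOG = "info,subspace_farmer=trace"
.\subspace-farmer.exe <existing farmer arguments> *>&1 | Tee-Object -FilePath farmer-trace.log

Tee-Object keeps the console output visible while also writing it to farmer-trace.log for later upload.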

Okay, I’ve added the RUST_LOG setting and will report back when it happens again. Another interesting thing I noticed: I have 6 farmers attached to the plotter, but ALL farmers stop when the plotter throws that WARN. Restarting any one of the farmers gets plotting going again for just that farmer.

I think I located the issue. Snapshot build · autonomys/subspace@9d3ba5d · GitHub should fix it; please give it a try once it is built and let me know how it goes.

I will give the Windows version a try.


Windows and Linux builds have been running for >12 hours without incident. Previously, I would not make it 2 hours without requiring a restart.

The fix was shipped in Release taurus-2024-nov-01 · autonomys/subspace · GitHub
