Farming cluster

Your plotter and farmer logs literally say sector plotting both started and finished successfully. It would be helpful to know what hardware you’re running on and other details. See how vexr was providing feedback above.

The issue in your case is that communication between the plotter and farmer didn’t quite work correctly, so it would help to know what you’re running each component on and what logs you have on the NATS server, because I suspect it dropped some messages.

UPD: I’m particularly interested in reproducing those ping/acknowledgment timeouts locally so I can fix them.

System Specs

Role: NATS
CPU: Dual Socket 2680v3 24 cores (48 threads)
Link: 10g

Role: Cache, Controller, Plotter, Farmer (Farmer-1)
CPU: Dual Socket 7742 128 cores (256 threads)
Cache: 200GiB
Plots: 1.9TB * 32
Link: 100g

Role: Farmer (Farmer-2)
CPU: Dual Socket 2687v4 24 cores (48 threads)
Plots: 1.9TB * 32
Link: 10g

Errors

Plotter

  • subspace_farmer::cluster::plotter: Response sending ended early

  • subspace_farmer::cluster::nats_client: Acknowledgement wait timed out request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

  • WARN subspace_farmer::plotter::cpu: Failed to send error progress update error=send failed because receiver is gone

  • WARN {public_key=e864a74422319ea6c26790b2407c7b0cc47ba4ac168e10d8a147b846c77a9040 sector_index=1860}: subspace_farmer::cluster::nats_client: Unexpected acknowledgement index received_index=1 expected_index=0 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress 01000000

Controller

  • 2024-05-11T20:44:33.690066Z WARN subspace_farmer::utils::piece_validator: Received invalid piece from peer piece_index=62270 source_peer_id=12D3KooWRYcpaSsfmYm3YfinT7auP7XiBYT2KA9A1P98Mt5p8wQ2

Observations

Cache, Controller, Plotter, Farmer

  • 16 * INFO async_nats: event: connected is displayed at the start of each component’s log. If these are separate threads, perhaps combine them, or number them so it is clear why the same message is repeated.

Other

  • If not already implemented, bumping the replotting thread pool size on the Plotter to match the plotting thread pool size by default would make sense, if the idea is to better consolidate resources with clustering.

Logs

NATS (Only the last three lines are from this test)
Cache
Controller
Farmer-1
Farmer-2
Plotter

New build is cooking: Snapshot build · subspace/subspace@25ab72c · GitHub

It should improve the situation with NATS-related messaging that results in acknowledgement/ping errors. This is achieved by limiting piece retrieval concurrency to 100 pieces at any time, and is also correlated with the size of the connection pool to NATS (reduced from 16 in the previous build to 4).

It is also possible to run multiple components in the same process, if desired, by separating them with -- (they will all share the same pool of connections to NATS as well). Example of running cache and plotter together:

subspace-farmer cluster --nats-server nats://IP:4222 \
    cache \
        path=/path/to/cache,size=SIZE \
    -- \
    plotter

Plotter warnings are the same as before and should be resolved in the new build.

This is not cluster-specific and rarely happens, nothing to worry about.

There is a pool of connections to NATS now and it is configurable with a corresponding CLI option, though I do not expect that you’ll need to change it.
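A minimal sketch of what that looks like, assuming the option is --nats-pool-size and that it sits on the cluster subcommand (the value here is purely illustrative):

subspace-farmer cluster --nats-server nats://IP:4222 --nats-pool-size 8 \
    plotter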

There are no replotting CLI options for the farming cluster plotter because the plotter doesn’t know or care whether a request is for initial plotting or replotting. By default the plotter will use all CPU cores fully all the time.

Let me know if you have a strong reason to dislike this behavior, but the expectation is that you will have some dedicated resources for plotting purposes (either a separate machine, or a dedicated set of CPU cores on a machine that is also used for other purposes).
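If you do go the dedicated-cores route on a shared machine, one generic way to do it (plain Linux, nothing cluster-specific, and the core range below is only an example) is to pin the plotter process with taskset:

# pin the plotter to cores 0-63, leaving the remaining cores for other workloads
taskset -c 0-63 subspace-farmer cluster --nats-server nats://IP:4222 \
    plotter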

This is perfect. The expectation is that the plotter would run with all of the resources you provide it.

System Specs

Role: NATS
CPU: Dual Socket 2680v3 24 cores (48 threads)
Link: 10g

Role: Cache, Controller, Plotter, Farmer (Farmer-1) & (Cluster)
CPU: Dual Socket 7742 128 cores (256 threads)
Cache: 200GiB
Plots: 46 (81T)
Link: 100g

Role: Farmer (Farmer-2)
CPU: Dual Socket 6136 24 cores (48 threads)
Plots: 24 (50T)
Link: 10g

Role: Farmer (Farmer-3)
CPU: Dual Socket 2687v4 24 cores (48 threads)
Plots: 44 (100T)
Link: 10g

Errors

  • INFO async_nats: event: slow consumers for subscription 1. This error almost continuously scrolls the screen and is captured in the cluster log.

  • WARN {farm_index=3}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s sector_index=3735 error=Low-level plotting error: Timed out without ping from plotter. I’m not able to reliably get this to occur. I could log with trace, but the logs needed to catch it would be pretty massive. Let me know if this would be helpful.

Logs
NATS
Cluster
Farmer-1
Farmer-2
Farmer-3

For now it is not helping that you are running everything together in one process, but even if you didn’t, there doesn’t seem to be a great way to debug this. I created an upstream feature request to improve the debugging experience and will take a look into it. I’ll likely have a custom build to test the reason for that issue. I think I know what it could be, but I will have to experiment.

Still, separating components for cleaner logs would be a bit better.

It is both great and not so great that you have that much space; you are likely stretching some of the components to their limits.

If you can look at per-thread CPU usage in something like htop (enable custom thread names there) and see whether there is anything other than plotting threads using a lot of CPU, like 90%+, I’d like to know those thread names.
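If htop isn’t handy, a rough command-line equivalent (standard Linux tooling, nothing cluster-specific) is:

# show per-thread CPU usage and thread names for the first subspace-farmer process found
top -H -p "$(pgrep -f subspace-farmer | head -n 1)"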

It is possible that if you have just a single controller and cache instance, you’re massively overwhelming it with requests that it is unable to process in time, but that needs to be confirmed.

The top process is the cluster and the second is the plotter. Not seeing much aside from those and the plotting threads.

I have been moving my farm to clustering one machine at a time. Farmer-3 should be the heaviest load as it is plotting the ~1% of space reclaimed by removing piece_cache.bin from each plot. Farmer-1 and Farmer-2 should be mostly idle except for any new sectors that need to be replotted.

There does not seem to be any iowait on the cache drive, but I could easily migrate that to RAM if necessary.

I have --sector-encoding-concurrency 16 set on the plotter to maximize its CPU usage. I can reduce this a bit for testing if you’d like.

Edit: Overall system usage on the NATS server is extremely low. Farmer-2 and Farmer-3 are averaging less than 50% CPU utilization as well.

Did you scroll down all the way? I’m less interested in processes and more in threads. Slow consumer likely means one of the threads is having a hard time processing messages, so it is dropping them instead.

It is not about iowait.

I do not care about that; what I do care about is whether you can reproduce the issue and how easy it is to do. If it’s easy, then I can create a debug build that should print more details about what is happening.

If you can run the instance printing event: slow consumers for subscription 1 with trace level for a few minutes (and confirm that it still prints the same message in the meantime), that log should help me.

Running this build with RUST_LOG=info,subspace_farmer=debug should print a bunch of additional information about subscriptions specifically, hopefully helping to identify what is going on: Snapshot build · subspace/subspace@9caa831 · GitHub
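As a concrete sketch of how to capture that (same placeholder invocation style as the cache/plotter example earlier in the thread; the component and log file name are arbitrary):

RUST_LOG=info,subspace_farmer=debug \
subspace-farmer cluster --nats-server nats://IP:4222 \
    plotter \
    2>&1 | tee plotter-debug.log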

System Specs

Role: NATS
CPU: Dual Socket 2680v3 24 cores (48 threads)
Link: 10g

Role: Cache, Controller, Plotter, Farmer (Farmer-1)
CPU: Dual Socket 7742 128 cores (256 threads)
Cache: 200GiB
Plots: 46 (81T)
Link: 100g

Role: Farmer (Farmer-2)
CPU: Dual Socket 6136 24 cores (48 threads)
Plots: 24 (50T)
Link: 10g

Role: Farmer (Farmer-3)
CPU: Dual Socket 6136 24 cores (48 threads)
Plots: 24 (50T)
Link: 10g

Role: Farmer (Farmer-4)
CPU: Dual Socket 2687v4 24 cores (48 threads)
Plots: 44 (100T)
Link: 10g

Errors

  • WARN {farm_index=5}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s sector_index=1860 error=Low-level plotting error: Timed out without ping from plotter

Observations

I was able to reproduce this by adding a new farmer (+50T) with some sectors to plot. These errors originated on Farmer-3. I was not able to reproduce it until I had the other three farmers running at the same time (I did not include those logs).

Logs (55MiB)

NATS
Cache
Controller
Plotter
Farmer-1
Farmer-2
Farmer-3
Farmer-4

I was not able to reproduce these errors after restarting (even on the previous 321 build). I can continue to watch for them, but to be sure I capture them, what is the environment variable needed?

New build: Snapshot build · subspace/subspace@6575cb2 · GitHub

It improves support for multiple controllers and implements concurrency for app info and segment header requests, which I didn’t anticipate would be made this frequently.

@vexr thanks for providing the trace log, I wouldn’t have been able to see it otherwise. The latest build should help you significantly.


Snapshot build #324

System Specs

Role: NATS
CPU: Dual Socket 2680v3 24 cores (48 threads)
Link: 10g

Role: Cache, Controller, Plotter, Farmer (Farmer-1)
CPU: Dual Socket 7742 128 cores (256 threads)
Cache: 200GiB
Plots: 46 (81T)
Link: 100g

Role: Farmer (Farmer-2)
CPU: Dual Socket 6136 24 cores (48 threads)
Plots: 24 (50T)
Link: 10g

Role: Farmer (Farmer-3)
CPU: Dual Socket 6136 24 cores (48 threads)
Plots: 24 (50T)
Link: 10g

Role: Farmer (Farmer-4)
CPU: Dual Socket 2687v4 24 cores (48 threads)
Plots: 44 (100T)
Link: 10g

Errors

Controller

  • WARN subspace_farmer::cluster::controller: Failed to send solution response error=ErrorObject { code: ServerError(9001), message: "Solution was ignored", data: None }

Plotter

  • WARN {public_key=f277ea3ca0dd751aeddb4e439e28295cf0489e2714f87cc3ba7f883eb891891c sector_index=3756}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

Farmer 3

  • 2024-05-14T03:05:29.213529Z WARN {farm_index=2}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s sector_index=3756 error=Low-level plotting error: Timed out without ping from plotter. Seems to still be present.

Questions

  • NATS Server max_payload. What is the benefit/drawback to increasing beyond 2MB?

  • --nats-pool-size <NATS_POOL_SIZE> What is the benefit/drawback to increasing beyond the default of 8? Do some components of the cluster benefit from more connections to the NATS server than others? If you run the cluster components together using --, it seems to use the pool size you choose for every component, rather than allowing the more granular control you get by running the components separately. Would it make more sense to split this out per cache, plotter, farmer, and controller rather than a one-size-fits-all setting that requires running them all separately to fine tune?

Logs

NATS
Cache
Controller
Plotter
Farmer-1
Farmer-2
Farmer-3
Farmer-4

This basically means something somewhere was too slow. I see it doesn’t happen very often, but it does need to be investigated. Trace logs might help with that, and checking the CPU usage of the controller’s and farmer’s threads might reveal bottlenecks.

Same as before. I see it is way less frequent now, but still happening. Trace logs might help to find the root cause. Though it is not happening on NATS anymore and it doesn’t seem to be caused by messages dropped locally. It might just be an overload of some of the threads on the controller, though it is not obvious what exactly it could be.

Increasing it further will likely increase NATS server memory usage. It could cut the time for delivering plotted sectors to farmers, but it will also have a higher chance of getting clogged, resulting in the above-mentioned errors. There are a few parameters that are interdependent on each other, and I’d recommend leaving it at 2MB.
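For reference, in the NATS server configuration file that is just this one setting (a minimal sketch; the 2MB value matches the recommendation above):

max_payload = 2MB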

There should not be a huge reason to increase it either. It will likely help make more concurrent outgoing requests if you tweak piece getter parameters, but at the same time it’ll also cause more load on the components responding to requests, which might cause overload, and messages will be dropped with logs similar to those we’ve already observed.

If you run multiple components, they indeed share the same pool of connections to NATS; that is by design. If you want separate connections for separate components, then you simply run completely separate application instances like you do right now and fine-tune however you like, but ideally after we resolve all the issues first.
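For example (a sketch only; the flag placement and pool sizes are purely illustrative):

# cache instance using the default pool size
subspace-farmer cluster --nats-server nats://IP:4222 \
    cache \
        path=/path/to/cache,size=SIZE

# separate plotter instance with a larger pool
subspace-farmer cluster --nats-server nats://IP:4222 --nats-pool-size 16 \
    plotter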


How do you handle the following two problems in cluster development?
Both the plotter and the farmer need to read/write the plot.bin file.

  1. If they run on the same node, a Linux file can be opened by only one process for exclusive write.
  2. If the two processes run on different servers, one sees a local file while the other sees a network file.

There is no such problem because the plotter doesn’t need access to plot.bin; see the examples provided earlier in this thread.

Often I see the network usage maxing out my NIC’s 1G speed (mostly read) for extended periods of time. Is this expected? The 12 systems on my wall are all sharing a 1G switch, so if I have two or more of them plotting, it results in timeouts on both.

Now I do have a new switch coming in a few days with a 10G uplink port, so the problem might go away on its own. Vexr wouldn’t have encountered this as he is running 10G or 100G everywhere, but I would expect many people are still running 1G networks.

If you plot a new sector, the plotter will need to retrieve ~1G of pieces from the cache through the controller to start plotting, and then upload a ~1G sector to the farmer. At 1Gbps it will take about 10 seconds to complete. With more sectors being plotted concurrently it can get a bit congested indeed, so network planning is somewhat important here.
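As a rough back-of-envelope (assuming ~1GiB per transfer and ignoring protocol overhead):

1 GiB ≈ 8.6 Gbit
8.6 Gbit / 1 Gbps ≈ 8-9 seconds for the piece download, plus about the same again for the sector upload

so with two or more sectors in flight on a shared 1G link, those transfers start queuing behind each other.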

I finally got this running! Seems amazing! Just got all the PC’s (linux and win mixed) hooked up to the nats server. If there is anything I can do to help testing just let me know.

Farming cluster plotting speed test.

  1. An x86 server (controller, cache, plotter, farmer) takes 10 minutes to plot a new sector.
  2. With the same x86 server (controller, cache, plotter, farmer) plus another ARM server (controller, cache, plotter), the x86 server’s farmer took 85 minutes to plot a new sector.

It is supposed to help the x86 server plot quickly, but it actually slows down a lot.
What’s the reason, and what is the benefit of the farming-cluster feature?