Farming cluster

I’ll pull that build and turn on trace logging for all the cluster components.

I can see from your logs where the issue is. It appears during higher plotting concurrency. It will eventually time out and resolve itself, but I’ll think about how to handle it more gracefully.

If you have a lot of plotting capacity, you should consider running more than one controller with additional cache to spread the load across multiple instances so plotters are not so bottlenecked.
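
For example, a rough sketch of a second controller/cache pair in its own cache group (the NATS address, paths, and group names are illustrative; the flags mirror those that appear later in this thread):

# First controller + cache instance (cache group "group-a")
subspace-farmer cluster --nats-server nats://nats-host:4222 \
    controller --base-path /controller-a --node-rpc-url ws://node-host:9944 --cache-group group-a
subspace-farmer cluster --nats-server nats://nats-host:4222 \
    cache --cache-group group-a path=/cache-a,size=200GB

# Second controller + cache instance in its own cache group ("group-b")
subspace-farmer cluster --nats-server nats://nats-host:4222 \
    controller --base-path /controller-b --node-rpc-url ws://node-host:9944 --cache-group group-b
subspace-farmer cluster --nats-server nats://nats-host:4222 \
    cache --cache-group group-b path=/cache-b,size=200GB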

Okay, I had Constrain farmer piece getter concurrency by nazar-pc · Pull Request #2830 · subspace/subspace · GitHub sitting in a branch for some time and wasn’t sure whether it would be helpful, but I think it will be, and it will give you more flexibility if you decide to add even more plotting capacity in the future.

Snapshot build · subspace/subspace@e5a03f7 · GitHub is a test build that includes the above PR and most of the other improvements made so far. I believe it should help with your situation.

I can do that, but currently (and in those logs) I am only running a single plotter instance on a single server. It is essentially just keeping up with the farms’ replots.

Would you still recommend running multiple controllers/cache groups for a single plotter?

A single plotter should be fine, especially with the latest build I mentioned (still building).

It is less about the number of servers and more about the number of independent plotting instances (number of cache groups). With the last test build that will change to per-instance, so it will scale better, though it may impact the time it takes to retrieve pieces for plotting. I guess we’ll see how it goes.

I have not yet been able to replicate slow consumers for subscription 1 on build 344. I will continue to monitor for that.

You may be aware of the following errors, but I’m providing them for completeness. All cluster logs were captured with trace.

Let me know if you’d like other tests run.

Snapshot build #344

Controller

  • WARN subspace_farmer::cluster::controller: Failed to send solution response error=ErrorObject { code: ServerError(9001), message: "Solution was ignored", data: None } slot=10924024 public_key=54b82ede17e5e9fa110c1728ba560143b58c8b77ad7d00e4acb5a15c329c3875 sector_index=423

Farmer

  • WARN {farm_index=34}:{sector_index=443}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

Plotter

  • WARN {public_key=82ca88898e9348a81fa6b72f98cf95e9fda29ab4fd86fc46283dc1aa8502a021 sector_index=85}: subspace_farmer::cluster::plotter: Response sending ended early

  • WARN {public_key=82ca88898e9348a81fa6b72f98cf95e9fda29ab4fd86fc46283dc1aa8502a021 sector_index=85}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

  • WARN subspace_farmer::plotter::cpu: Failed to send error progress update error=send failed because receiver is gone

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3, 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache.log, controller.log, plotter.log, farmer1.log)
Dual Socket 7742, 128 cores (256 threads)
Cache: 100GiB
Plots: 109T
Link: 100Gbit

Role: Node (node.log)
EPYC 7F72, 24 cores (48 threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4, 24 cores (48 threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A, 32 cores (64 threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer6.log)
Dual Socket 6136, 24 cores (48 threads)
Plots: 91T
Link: 20Gbit

Logs

I was able to replicate INFO async_nats: event: slow consumers for subscription 2 on gemini-3h-2024-jun-11.

The logs are large but should provide a good picture of what happened. Unfortunately, farmer6.log was not captured with debug or trace.

Build: gemini-3h-2024-jun-11

System Specs

Role: NATS (nats.log)
Dual Socket 2680v3, 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache.log, controller.log, plotter.log, farmer1.log)
Dual Socket 7742, 128 cores (256 threads)
Cache: 100GiB
Plots: 109T
Link: 100Gbit

Role: Node (node.log)
EPYC 7F72, 24 cores (48 threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4, 24 cores (48 threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A, 32 cores (64 threads)
Plots: 189T
Link: 20Gbit

Role: Farmer (farmer6.log)
Dual Socket 6136, 24 cores (48 threads)
Plots: 91T
Link: 20Gbit

Logs

I just noticed that the farmer cluster stopped signing rewards as expected. The issue seems to have been with the controller, as none of the farmers appeared to be doing anything.

I was able to notice this (without any cluster components stopping) because I have about 370TB of storage presented over the LAN (NVMe-oF), and the traffic from each of the farmers (normally greater than 50Gbps) stopped. This is the second time I’ve seen this with the June 11th release, and I haven’t seen it with any previous releases.

Again, none of the cluster components failed; however, once the controller component was restarted, everything began working again immediately.

The only thing that stood out was that the controller log was looping the following:

2024-06-13T04:30:46.575363Z TRACE subspace_farmer::utils::farmer_piece_getter: Getting piece from node piece_index=77484
2024-06-13T04:30:46.575868Z TRACE subspace_farmer::utils::farmer_piece_getter: Couldn't get a piece fast, retrying... piece_index=77484 current_attempt=4
2024-06-13T04:30:46.821118Z TRACE subspace_farmer::utils::farmer_piece_getter: Got piece from DSN L2 cache piece_index=77354
2024-06-13T04:30:46.821135Z TRACE subspace_farmer::utils::farmer_piece_getter: Got piece fast piece_index=77354 current_attempt=0
2024-06-13T04:30:46.821142Z TRACE subspace_farmer::utils::farmer_piece_getter: Got piece from cache successfully piece_index=77354

Same system configuration as in the previous post.

Update: Looking over some of the logs, it appears that this happened both times around the same time replotting kicked off.

Cluster Log

I also had an issue that restarting the controller seemed to fix. I was trying to get it to happen a second time before reporting it. It seems like it had something to do with my piece cache expiring and a new one needing to be created, but the controller didn’t pick up the new one? I’m still trying to understand the cluster, so forgive me if that doesn’t make sense.

@vexr @PuNkYsHuNgRy appreciate the feedback, but going forward let’s create separate threads for discussions. This one is already 90 messages long and is becoming unmanageable and less useful given the number of messages.
I will still look into the logs though.

I can see from the logs which subscription it belongs to and will look into ways to resolve it now. Thank you!

As to signing rewards, I see 2 or 3 reports like that, seemingly from larger farmers. I’ll have to check how that could possibly happen.

Thanks for the corresponding logs, they will help!
Let’s continue discussing this particular issue in Farmer jun 11 stop farming after replot start sometimes

Looks like you were not running the jun-11 node, which probably broke the connection in an interesting way.

If my farmer restarted, should I restart all the plotters? Are there any rules we should keep in mind when operating the cluster?

Generally, different components can be restarted at any time.
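
For example, in a Docker-based setup like the ones discussed later in this thread, restarting a single component is just a matter of restarting its container (the container name is a placeholder):

docker restart autonomys_farmer_controller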

I tested, and even with the equivalent of a thousand farms I don’t see these issues on my machine.
I think you need to dedicate resources to the controller using process prioritization or in some other way; it must not have resources to breathe under plotting load. So far I don’t have a better explanation, as the code path that processes those messages is very fast.
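
As a rough illustration only, this could be done on Linux with ordinary process or container prioritization (the process pattern and container name below are placeholders, not something from your setup):

# Raise the controller's CPU scheduling priority relative to the plotter
sudo renice -n -5 -p "$(pgrep -f 'cluster .*controller')"

# Or, in a Docker-based setup, reduce the plotter container's CPU weight
docker update --cpu-shares 512 autonomys_farmer_plotter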

I am trying to get redundant controllers working, and I can’t seem to keep the controller from crashing.

I’m using the legacy build for one of the servers as it’s a Xeon v2. There are two servers, Alpha and Bravo. I’m using Ubuntu 20.04 + Docker. Alpha runs NATS + the node in addition to the controller and cache, whereas Bravo only runs a controller + cache.

version: "3.8"
services:
  node:
    container_name: autonomys_node
    image: ghcr.io/subspace/node:gemini-3h-2024-jun-18
    volumes:
      - node-data:/var/subspace:rw
    ports:
      - "0.0.0.0:30333:30333/tcp"
      - "0.0.0.0:30433:30433/tcp"
      - "9944:9944"
    restart: unless-stopped
    command:
      [
        "run",
        "--chain", "gemini-3h",
        "--base-path", "/var/subspace",
        "--listen-on", "/ip4/0.0.0.0/tcp/30333",
        "--dsn-listen-on", "/ip4/0.0.0.0/tcp/30433",
        "--rpc-cors", "all",
        "--rpc-methods", "unsafe",
        "--rpc-listen-on", "0.0.0.0:9944",
        "--farmer",
        "--name", "hakehardware",
        "--sync", "snap"
      ]
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.101
    labels:
      com.spaceport.name: "Alpha Node"
    environment:
      - TZ=America/Phoenix

  nats:
    image: nats
    container_name: nats
    restart: unless-stopped
    ports:
      - "4222:4222"
    volumes:
      - /home/hakedev/subspace_cluster/nats/nats.config:/nats.config:ro
    command: ["-c", "/nats.config"]
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.102
    environment:
      - TZ=America/Phoenix
    labels:
      com.spaceport.name: "Alpha NATs"

  farmer_controller:
    depends_on:
      node:
        condition: service_healthy
    container_name: autonomys_farmer_controller
    image: ghcr.io/subspace/farmer:gemini-3h-2024-jun-18
    volumes:
      - /home/hakedev/subspace_cluster/controller:/controller
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.102:4222",
        "controller",
        "--cache-group", "alpha-group",
        "--base-path", "/controller",
        "--node-rpc-url", "ws://172.25.0.101:9944"
      ]
    labels:
      com.hakedev.name: "Alpha Farmer Controller"
    environment:
      - TZ=America/Phoenix
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.103

  farmer_cache:
    container_name: autonomys_farmer_cache
    image: ghcr.io/subspace/farmer:gemini-3h-2024-jun-18
    volumes:
      - /home/hakedev/subspace_cluster/cache:/cache
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.102:4222",
        "cache",
        "--cache-group", "alpha-group",
        "path=/cache,size=200GB"
      ]
    labels:
      com.hakedev.name: "Alpha Farmer Cache"
    environment:
      - TZ=America/Phoenix
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.104

networks:
  autonomys-network:
    external: true
    
volumes:
  node-data:

Bravo

version: "3.8"
services:
  farmer_controller:
    container_name: autonomys_farmer_controller
    image: hakehardware/autonomys-legacy-farmer:latest
    volumes:
      - /media/spare0/subspace_cluster/controller:/controller
    command:
      [
        "cluster",
        "--nats-server", "nats://192.168.69.101:4222",
        "controller",
        "--cache-group", "bravo-group",
        "--base-path", "/controller",
        "--node-rpc-url", "ws://192.168.69.101:9944"
      ]
    labels:
      com.spaceport.name: "Bravo Farmer Controller"
    environment:
      - TZ=America/Phoenix
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.101

  farmer_cache:
    container_name: autonomys_farmer_cache
    image: hakehardware/autonomys-legacy-farmer:latest
    volumes:
      - /media/spare0/subspace_cluster/cache:/cache
    command:
      [
        "cluster",
        "--nats-server", "nats://192.168.69.101:4222",
        "cache",
        "--cache-group", "bravo-group",
        "path=/cache,size=200GB"
      ]
    labels:
      com.spaceport.name: "Bravo Farmer Cache"
    environment:
      - TZ=America/Phoenix
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.102

networks:
  autonomys-network:
    external: true

Alpha Controller logs:

2024-06-25T04:20:50.246496Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.101:9944
2024-06-25T04:20:50.246509Z  INFO async_nats: event: connected
2024-06-25T04:20:50.247369Z  INFO subspace_farmer::node_client::node_rpc_client: Downloading all segment headers from node...
2024-06-25T04:20:50.250451Z  INFO subspace_farmer::node_client::node_rpc_client: Downloaded all segment headers from node successfully
2024-06-25T04:20:50.252527Z  INFO subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWBf7wmyFnqcjLs7qwbvaeHM3oF6Z1p7pjvZSwpDLLaBKX protocol_version=/subspace/2/0c121c75f4ef450f40619e1fca9d1e8e7fbabc42c895bc4790801e85d5a91c34
2024-06-25T04:20:50.252980Z  INFO libp2p_swarm: local_peer_id=12D3KooWBf7wmyFnqcjLs7qwbvaeHM3oF6Z1p7pjvZSwpDLLaBKX
2024-06-25T04:20:50.663866Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip4/127.0.0.1/tcp/30533/p2p/12D3KooWBf7wmyFnqcjLs7qwbvaeHM3oF6Z1p7pjvZSwpDLLaBKX
2024-06-25T04:20:50.689415Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip6/::1/tcp/30533/p2p/12D3KooWBf7wmyFnqcjLs7qwbvaeHM3oF6Z1p7pjvZSwpDLLaBKX
2024-06-25T04:20:50.689913Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip4/172.25.0.103/tcp/30533/p2p/12D3KooWBf7wmyFnqcjLs7qwbvaeHM3oF6Z1p7pjvZSwpDLLaBKX
2024-06-25T04:20:54.749970Z  INFO subspace_farmer::commands::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01J16SZZYMECB5X575HQDK2NNN
2024-06-25T04:20:54.750054Z  INFO subspace_farmer::farmer_cache: Initializing piece cache
2024-06-25T04:21:34.520749Z  INFO subspace_farmer::farmer_cache: Synchronizing piece cache
2024-06-25T04:21:34.605053Z  INFO subspace_farmer::farmer_cache: Finished piece cache synchronization
2024-06-25T04:22:13.667169Z  INFO subspace_farmer::commands::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01J16T2HXH6R26SSHDV2VR170F
2024-06-25T04:22:13.667242Z  INFO subspace_farmer::farmer_cache: Initializing piece cache
2024-06-25T04:22:45.834667Z  INFO subspace_farmer::farmer_cache: Synchronizing piece cache
2024-06-25T04:22:45.911166Z  INFO subspace_farmer::farmer_cache: Finished piece cache synchronization
2024-06-25T04:27:00.665237Z  WARN subspace_farmer::commands::cluster::controller::caches: Cache expired and removed, scheduling reinitialization cache_id=01J16T2HXH6R26SSHDV2VR170F
2024-06-25T04:27:00.665311Z  INFO subspace_farmer::farmer_cache: Initializing piece cache
2024-06-25T04:27:09.820482Z  INFO subspace_farmer::commands::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01J16TBK4HV1XCW36J50XF1A2N
2024-06-25T04:27:30.839846Z  INFO subspace_farmer::farmer_cache: Synchronizing piece cache
2024-06-25T04:27:30.917296Z  INFO subspace_farmer::farmer_cache: Finished piece cache synchronization

Bravo Controller (the one that exits) logs:

2024-06-25T04:22:13.679677Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://192.168.69.101:9944
2024-06-25T04:22:13.679707Z  INFO async_nats: event: connected
2024-06-25T04:22:13.684235Z  INFO subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWFz9Krp8p75Wm3c3KhMTVXWVgQHKyvCYFF1LCkW7iEYG2 protocol_version=/subspace/2/0c121c75f4ef450f40619e1fca9d1e8e7fbabc42c895bc4790801e85d5a91c34
2024-06-25T04:22:13.685009Z  INFO libp2p_swarm: local_peer_id=12D3KooWFz9Krp8p75Wm3c3KhMTVXWVgQHKyvCYFF1LCkW7iEYG2
2024-06-25T04:22:13.685795Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloading all segment headers from node...
2024-06-25T04:22:13.691643Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloaded all segment headers from node successfully
2024-06-25T04:27:09.984331Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://192.168.69.101:9944
2024-06-25T04:27:09.984367Z  INFO async_nats: event: connected
2024-06-25T04:27:09.988013Z  INFO subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWA1Cq6JZGY9U2MhaS5duHpWNRKkSUruVBDM6xUXSURbQL protocol_version=/subspace/2/0c121c75f4ef450f40619e1fca9d1e8e7fbabc42c895bc4790801e85d5a91c34
2024-06-25T04:27:09.988979Z  INFO libp2p_swarm: local_peer_id=12D3KooWA1Cq6JZGY9U2MhaS5duHpWNRKkSUruVBDM6xUXSURbQL
2024-06-25T04:27:09.989602Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloading all segment headers from node...
2024-06-25T04:27:09.995008Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloaded all segment headers from node successfully

It looks like you are attempting to run two separate NATS servers, which will not work. If you want to run more than a single NATS server for redundancy, you will need to correctly set up clustering of the NATS servers themselves. Clustering | NATS Docs
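
A minimal sketch of what a clustered nats.config might look like (the server name, cluster name, and Bravo’s address are placeholders; 192.168.69.101 is Alpha’s address from your compose file):

# Alpha's nats.config (Bravo's would mirror this, with routes pointing back at Alpha)
server_name: alpha
port: 4222

cluster {
  name: autonomys
  listen: 0.0.0.0:6222
  routes: [
    nats://<bravo-lan-ip>:6222
  ]
}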

The configuration looks correct to me, and there was indeed a mismatch in cache group announcement vs. subscriptions, which Snapshot build · subspace/subspace@38767d2 · GitHub should fix.

Let’s create separate threads for future issues though; this thread is already long enough.

For new issues I will create a new thread. Are you able to provide a quick example of how I can build the v2/legacy docker image for this build?

Previously, I did:

git clone https://github.com/subspace/subspace

Then

docker buildx build --progress plain -t hakehardware/autonomys-legacy-farmer:latest -f Dockerfile-farmer .
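
Presumably for this build it would be the same steps pinned to the snapshot commit mentioned above, something like the below, though I’m not sure whether Dockerfile-farmer needs any extra arguments for the v2/legacy variant:

git clone https://github.com/subspace/subspace
cd subspace
# Pin to the snapshot commit referenced above
git checkout 38767d2
docker buildx build --progress plain -t hakehardware/autonomys-legacy-farmer:snapshot -f Dockerfile-farmer .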