The cache component in cluster mode never downloads the cache

Issue Report

Environment

  • Ubuntu 24.04.1 LTS
  • Advanced CLI

The cache component in cluster mode never downloads the cache

Controller logs

2025-01-08T06:30:10.651557Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://127.0.0.1:9944
2025-01-08T06:30:10.651567Z  INFO async_nats: event: connected
2025-01-08T06:30:10.654774Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloading all segment headers from node...
2025-01-08T06:30:10.700594Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloaded all segment headers from node successfully
2025-01-08T06:30:10.702703Z  INFO subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8 protocol_version=/subspace/2/66455a580aabff303720aa83adbe6c44502922251c03ba73686d5245da9e21bd
2025-01-08T06:30:10.703704Z  INFO libp2p_swarm: local_peer_id=12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:11.249614Z  INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9084]
2025-01-08T06:30:11.249638Z  INFO actix_server::builder: starting 2 workers
2025-01-08T06:30:11.249948Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-01-08T06:30:11.250030Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9084", workers: 2, listening on: 0.0.0.0:9084
2025-01-08T06:30:11.250779Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip4/127.0.0.1/tcp/30533/p2p/12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:11.260770Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip6/::1/tcp/30533/p2p/12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:11.261610Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip4/10.10.0.39/tcp/30533/p2p/12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:11.261664Z  INFO subspace_farmer::commands::shared::network: DSN listening on /ip6/fe80::3824:48ff:fe54:6dc8/tcp/30533/p2p/12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:40.851817Z  INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JH29PE51WTH1K4K2N48C3CKP
2025-01-08T06:31:10.852481Z  INFO {cache_group=group1}: subspace_farmer::farmer_cache: Initializing piece cache
2025-01-08T06:33:15.143620Z  INFO {cache_group=group1}: subspace_farmer::farmer_cache: Synchronizing piece cache
2025-01-08T06:33:15.268684Z  INFO {cache_group=group1}: subspace_farmer::farmer_cache: Finished piece cache synchronization

Cache logs

2025-01-08T06:30:10.669149Z  INFO async_nats: event: connected
2025-01-08T06:30:10.849683Z  INFO subspace_farmer::cluster::cache: Created cache cache_id=01JH29PE51WTH1K4K2N48C3CKP max_num_elements=204773
2025-01-08T06:30:10.849852Z  INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9085]
2025-01-08T06:30:10.849876Z  INFO actix_server::builder: starting 2 workers
2025-01-08T06:30:10.850549Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-01-08T06:30:10.850598Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9085", workers: 2, listening on: 0.0.0.0:9085

This indicates your cache is fully synchronized, so there is nothing left to do.

Why do you think it needs to sync again?

I also have this issue. This computer has never run a controller or cache before, and there are no other controllers or caches in my cluster. I am using Docker and have set the appropriate permissions on the cache folder (nobody:nobody).

I am using the 2025-jan-03 image.

Most recently I can’t even get the synchronization to start. I’ve restarted the cache a few times in the hope it might kick-start something.

Controller logs

2025-01-12T23:47:54.620529Z  INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JHE5FJNDD6ZCHDVH5ADD7SCF
2025-01-12T23:48:24.621806Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Initializing piece cache
2025-01-12T23:48:24.621841Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Collecting pieces that were in the cache before
2025-01-12T23:48:35.632110Z  INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JHEEPPNDJ4X0BVBT5RC43PWD
2025-01-12T23:49:54.621015Z  WARN subspace_farmer::cluster::controller::caches: Cache expired and removed, scheduling reinitialization cache_id=01JHE5FJNDD6ZCHDVH5ADD7SCF
2025-01-13T00:00:28.350692Z  INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JHEFCENW57ZQ4P4N5319E7HX
2025-01-13T00:01:54.620837Z  WARN subspace_farmer::cluster::controller::caches: Cache expired and removed, scheduling reinitialization cache_id=01JHEEPPNDJ4X0BVBT5RC43PWD

Nothing happens after that, and the plotter never starts. Previously it at least logged that it had initialized the piece cache, but now it does not even say that.

2025-01-12T13:34:07.404593Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Initializing piece cache
2025-01-12T13:34:23.140207Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Synchronizing piece cache
2025-01-12T13:34:23.170704Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Finished piece cache synchronization

But again, regarding the above logs, there is no way the piece cache synchronized that fast, since this was the first start of the cache and controller and I was starting from nothing.

EDIT: I deleted my piece_cache to force another synchronization, and I got thousands of lines of logs like this:

2025-01-13T00:17:36.052907Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Downloaded piece batch=428 index=13 piece_index=87041
2025-01-13T00:17:36.052909Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Couldn't find piece batch=428 piece_index=87041
2025-01-13T00:17:36.052910Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Downloaded piece batch=428 index=14 piece_index=74794
2025-01-13T00:17:36.052911Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Couldn't find piece batch=428 piece_index=74794
2025-01-13T00:17:36.052913Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Couldn't find piece batch=428 piece_index=78995
2025-01-13T00:17:36.052912Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Downloaded piece batch=428 index=15 piece_index=78995

Can you share the full commands you’re using for all components involved, and full logs from them, please?

This certainly does look like a bug that needs to be fixed.

services:
  nats:
    image: nats
    container_name: nats
    restart: unless-stopped
    ports:
      - "4222:4222"
    volumes:
      - /home/hakehardware/nats/nats.config:/nats.config:ro
    command: ["-c", "/nats.config"]
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.101
    labels:
      com.spaceport.name: "Optimus NATs"
      
  cluster-controller:
    container_name: cluster-controller
    image: ghcr.io/autonomys/farmer:mainnet-2025-jan-03
    restart: unless-stopped
    volumes:
      - /home/hakehardware/autonomys/controller:/controller
    ports:
      - "9081:9081"
      - "0.0.0.0:30533:30533/tcp"
    command:
      [
        "cluster",
        "--nats-server", "nats://nats:4222",
        "--prometheus-listen-on", "0.0.0.0:9081",
        "controller",
        "--base-path", "/controller",
        "--node-rpc-url", "ws://172.25.0.100:9944"
      ]
    labels:
      com.spaceport.name: "Optimus Controller"
    environment:
      - RUST_LOG=info,subspace_farmer::farmer_cache=trace
      - TZ=America/Phoenix
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.102

  cluster-cache:
    container_name: cluster-cache
    image: ghcr.io/autonomys/farmer:mainnet-2025-jan-03
    restart: unless-stopped
    volumes:
      - /home/hakehardware/autonomys/cache:/cache
    ports:
      - "9082:9082"
    command:
      [
        "cluster",
        "--nats-server", "nats://nats:4222",
        "--prometheus-listen-on", "0.0.0.0:9082",
        "cache",
        "path=/cache,size=200GiB"
      ]
    labels:
      com.spaceport.name: "Optimus Cache"
    environment:
      - RUST_LOG=info,subspace_farmer::farmer_cache=trace
      - TZ=America/Phoenix
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.103

  cluster-plotter:
    container_name: cluster-plotter
    image: ghcr.io/autonomys/farmer:mainnet-2025-jan-03
    restart: unless-stopped
    ports:
      - "9084:9084"
    command:
      [
        "cluster",
        "--nats-server", "nats://nats:4222",
        "--prometheus-listen-on", "0.0.0.0:9084",
        "plotter"
      ]
    environment:
      - TZ=America/Phoenix
    labels:
      com.spaceport.name: "Optimus Plotter"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    runtime: nvidia
    networks:
      autonomys-network:
        ipv4_address: 172.25.0.105

networks:
  autonomys-network:
    external: true

Will add logs momentarily. Can’t upload the full controller logs as they are huge because of the error.
_cluster-controller_logs.txt (1.3 MB)
_cluster-cache_logs.txt (2.7 KB)

I checked the metrics, and subspace_farmer_disk_piece_cache_capacity_used_Pieces is always 0.
The GPU plotter and farmer are not working either, so I am sure nothing is cached.
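For reference, a minimal way to spot-check that gauge is to scrape the Prometheus endpoint and filter for the piece cache metric. The sketch below is only an example: it assumes the cache component’s metrics endpoint from the compose file above (0.0.0.0:9082) and matches on the metric name prefix; adjust the address if your setup differs.

use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Plain HTTP/1.0 GET so no extra crates are needed; the server closes the
    // connection after responding, so read_to_string sees the whole body.
    let mut stream = TcpStream::connect("127.0.0.1:9082")?;
    stream.write_all(b"GET /metrics HTTP/1.0\r\nHost: 127.0.0.1\r\n\r\n")?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;

    // A value that stays at 0 long after "Finished piece cache synchronization"
    // suggests nothing was actually written to the cache.
    for line in response.lines() {
        if line.starts_with("subspace_farmer_disk_piece_cache_capacity_used") {
            println!("{line}");
        }
    }
    Ok(())
}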

Mine will eventually plot, about 2 sectors per minute with a 4090, so as expected it’s probably operating as if it has 0 pieces cached.

Okay, piece getting turned out to be completely broken for cache sync purposes, meaning all downloads were going either through previously synced caches or through the DSN, which is obviously impacting plotting speeds.

Here is a test build with a fix: Snapshot build · autonomys/subspace@5619ee3 · GitHub
Here is the PR with the fix and an explanation of what happened (a single-line fix): Fix cluster piece getter by nazar-pc · Pull Request #3345 · autonomys/subspace · GitHub
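For anyone trying to follow what this means in practice: in cluster mode, piece retrieval is layered, with pieces served from caches in the cluster when possible and fetched from the DSN otherwise. The sketch below is purely illustrative (hypothetical types and trait, not the actual subspace_farmer code or the change in the PR); it just shows that layering. When the cache layer has nothing to serve, every request silently takes the slow DSN path, which is why plotting slows down so much.

use std::collections::HashMap;

/// A retrieved piece (contents elided for the sketch).
#[derive(Clone)]
struct Piece;

/// Anything that can serve pieces by index.
trait PieceSource {
    fn get_piece(&self, piece_index: u64) -> Option<Piece>;
}

/// Layers cluster caches (fast path) over the DSN (slow fallback).
struct LayeredPieceGetter<Caches, Dsn> {
    cluster_caches: Caches,
    dsn: Dsn,
}

impl<Caches: PieceSource, Dsn: PieceSource> LayeredPieceGetter<Caches, Dsn> {
    fn get_piece(&self, piece_index: u64) -> Option<Piece> {
        // Fast path: a cache instance in the cluster already holds the piece.
        if let Some(piece) = self.cluster_caches.get_piece(piece_index) {
            return Some(piece);
        }
        // Slow path: fetch the piece from the DSN.
        self.dsn.get_piece(piece_index)
    }
}

/// Toy in-memory source used only to exercise the sketch.
struct InMemorySource(HashMap<u64, Piece>);

impl PieceSource for InMemorySource {
    fn get_piece(&self, piece_index: u64) -> Option<Piece> {
        self.0.get(&piece_index).cloned()
    }
}

fn main() {
    let caches = InMemorySource([(1, Piece)].into_iter().collect());
    let dsn = InMemorySource([(2, Piece)].into_iter().collect());
    let getter = LayeredPieceGetter { cluster_caches: caches, dsn };

    // Piece 1 is served by the cluster caches, piece 2 falls back to the DSN,
    // piece 3 is not available anywhere.
    assert!(getter.get_piece(1).is_some());
    assert!(getter.get_piece(2).is_some());
    assert!(getter.get_piece(3).is_none());
    println!("layered piece getter sketch ran as expected");
}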

This build is performing exceptionally well and appears to have resolved the issue. It was recently observed across six controllers and cluster groups that, during the initial startup of the cache and controller, the CPU would spike and remain pegged for several minutes. Bandwidth usage would also spike during this period.

Since running this build, and after the piece cache successfully synced from the DSN, this behavior no longer occurs on subsequent startups. It appears that the inability to sync correctly was also its root cause.

Deployed the test build. So far the controller appears to actually be downloading pieces.

2025-01-14T06:55:05.294133Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Synchronizing piece cache
2025-01-14T06:55:05.294413Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Identified last segment index last_segment_index=430
2025-01-14T06:55:05.307913Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Identified piece indices that should be cached count=110080
2025-01-14T06:55:05.314382Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Downloading pieces batch=0 num_pieces=256
2025-01-14T06:55:05.377041Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Downloading pieces batch=1 num_pieces=256
2025-01-14T06:55:05.440053Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Downloading pieces batch=2 num_pieces=256
2025-01-14T06:55:05.508122Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Downloading pieces batch=3 num_pieces=256

Also, apparently I have two accounts on this forum; I am also repost.

Shipped in Release mainnet-2025-jan-14 · autonomys/subspace · GitHub
