The cache component in cluster mode never downloads pieces into the cache
Controller logs
2025-01-08T06:30:10.651557Z INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://127.0.0.1:9944
2025-01-08T06:30:10.651567Z INFO async_nats: event: connected
2025-01-08T06:30:10.654774Z INFO subspace_farmer::node_client::caching_proxy_node_client: Downloading all segment headers from node...
2025-01-08T06:30:10.700594Z INFO subspace_farmer::node_client::caching_proxy_node_client: Downloaded all segment headers from node successfully
2025-01-08T06:30:10.702703Z INFO subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8 protocol_version=/subspace/2/66455a580aabff303720aa83adbe6c44502922251c03ba73686d5245da9e21bd
2025-01-08T06:30:10.703704Z INFO libp2p_swarm: local_peer_id=12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:11.249614Z INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9084]
2025-01-08T06:30:11.249638Z INFO actix_server::builder: starting 2 workers
2025-01-08T06:30:11.249948Z INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-01-08T06:30:11.250030Z INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9084", workers: 2, listening on: 0.0.0.0:9084
2025-01-08T06:30:11.250779Z INFO subspace_farmer::commands::shared::network: DSN listening on /ip4/127.0.0.1/tcp/30533/p2p/12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:11.260770Z INFO subspace_farmer::commands::shared::network: DSN listening on /ip6/::1/tcp/30533/p2p/12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:11.261610Z INFO subspace_farmer::commands::shared::network: DSN listening on /ip4/10.10.0.39/tcp/30533/p2p/12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:11.261664Z INFO subspace_farmer::commands::shared::network: DSN listening on /ip6/fe80::3824:48ff:fe54:6dc8/tcp/30533/p2p/12D3KooWDGYXHqkrGFW7RzheNag4sPMPL9HKBGMSiK9WnHWF7Cc8
2025-01-08T06:30:40.851817Z INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JH29PE51WTH1K4K2N48C3CKP
2025-01-08T06:31:10.852481Z INFO {cache_group=group1}: subspace_farmer::farmer_cache: Initializing piece cache
2025-01-08T06:33:15.143620Z INFO {cache_group=group1}: subspace_farmer::farmer_cache: Synchronizing piece cache
2025-01-08T06:33:15.268684Z INFO {cache_group=group1}: subspace_farmer::farmer_cache: Finished piece cache synchronization
Cache logs
2025-01-08T06:30:10.669149Z INFO async_nats: event: connected
2025-01-08T06:30:10.849683Z INFO subspace_farmer::cluster::cache: Created cache cache_id=01JH29PE51WTH1K4K2N48C3CKP max_num_elements=204773
2025-01-08T06:30:10.849852Z INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9085]
2025-01-08T06:30:10.849876Z INFO actix_server::builder: starting 2 workers
2025-01-08T06:30:10.850549Z INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-01-08T06:30:10.850598Z INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9085", workers: 2, listening on: 0.0.0.0:9085
I also have this issue. This computer has never run a controller or cache before, and there are no other controllers or caches in my cluster. I am using Docker and have set the appropriate permissions on the cache folder (nobody:nobody).
I am using the 2025-jan-03 image.
Most recently I can’t even get the synchronization to start. I’ve restarted the cache a few times in the hope of kick-starting it.
Controller logs
2025-01-12T23:47:54.620529Z INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JHE5FJNDD6ZCHDVH5ADD7SCF
2025-01-12T23:48:24.621806Z INFO {cache_group=default}: subspace_farmer::farmer_cache: Initializing piece cache
2025-01-12T23:48:24.621841Z DEBUG {cache_group=default}: subspace_farmer::farmer_cache: Collecting pieces that were in the cache before
2025-01-12T23:48:35.632110Z INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JHEEPPNDJ4X0BVBT5RC43PWD
2025-01-12T23:49:54.621015Z WARN subspace_farmer::cluster::controller::caches: Cache expired and removed, scheduling reinitialization cache_id=01JHE5FJNDD6ZCHDVH5ADD7SCF
2025-01-13T00:00:28.350692Z INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01JHEFCENW57ZQ4P4N5319E7HX
2025-01-13T00:01:54.620837Z WARN subspace_farmer::cluster::controller::caches: Cache expired and removed, scheduling reinitialization cache_id=01JHEEPPNDJ4X0BVBT5RC43PWD
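The discover/expire churn in these logs suggests the controller tracks caches through periodic identify broadcasts over NATS and drops any cache it has not heard from within a timeout, scheduling reinitialization on both events. A minimal sketch of that bookkeeping, with invented names rather than the actual subspace_farmer internals:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks when each cache last announced itself; both discovery and expiry
/// trigger a cache-group reinitialization, matching the log lines above.
/// Illustrative only, not the real subspace_farmer implementation.
struct KnownCaches {
    last_seen: HashMap<String, Instant>,
    expiry: Duration,
}

impl KnownCaches {
    /// Record an identify broadcast; returns true for a newly discovered cache.
    fn on_identify(&mut self, cache_id: String) -> bool {
        self.last_seen.insert(cache_id, Instant::now()).is_none()
    }

    /// Drop caches that missed their identify window, returning their ids so
    /// the caller can schedule reinitialization (the WARN lines above).
    fn expire_stale(&mut self) -> Vec<String> {
        let now = Instant::now();
        let expired: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.duration_since(**seen) > self.expiry)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &expired {
            self.last_seen.remove(id);
        }
        expired
    }
}
```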
Nothing happens after those WARN lines, and the plotter never starts. Previously it would at least log that it had initialized the piece cache, but now it does not even get that far. Here are the logs from that earlier run:
2025-01-12T13:34:07.404593Z INFO {cache_group=default}: subspace_farmer::farmer_cache: Initializing piece cache
2025-01-12T13:34:23.140207Z INFO {cache_group=default}: subspace_farmer::farmer_cache: Synchronizing piece cache
2025-01-12T13:34:23.170704Z INFO {cache_group=default}: subspace_farmer::farmer_cache: Finished piece cache synchronization
But again, regarding the logs above: there is no way the piece cache actually synchronized that fast, as this was the first start of the cache and controller and I was starting from nothing.
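For scale, a rough sanity check (a sketch assuming the roughly 1 MiB piece size Subspace uses and the max_num_elements=204773 reported in the cache log above):

```rust
fn main() {
    const MAX_PIECES: u64 = 204_773; // max_num_elements from the cache log above
    const PIECE_SIZE: u64 = 1 << 20; // roughly 1 MiB per piece (approximation)

    let total_bytes = MAX_PIECES * PIECE_SIZE;
    let total_gib = total_bytes as f64 / (1u64 << 30) as f64;
    let secs_at_1_gib_s = total_gib; // seconds to download at 1 GiB/s

    // ~200 GiB of pieces: even at a generous 1 GiB/s a full sync takes over
    // three minutes, yet the logs report "Finished piece cache
    // synchronization" roughly 30 ms after it started.
    println!("full cache ≈ {total_gib:.0} GiB, ≈ {secs_at_1_gib_s:.0} s at 1 GiB/s");
}
```

So the near-instant "Finished" line can only mean the sync step downloaded essentially nothing.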
EDIT: I deleted my piece_cache to force another synchronization and got thousands of lines of logs like this:
I checked the metrics and subspace_farmer_disk_piece_cache_capacity_used_Pieces is always 0.
And the GPU plotter and farmer are not working, so I am sure nothing is being cached.
Okay, piece getting turned out to be completely broken for cache sync purposes, meaning all downloads were going either through previously synced caches or through the DSN, which obviously impacts plotting speeds.
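To make the fallback being described concrete, a simplified sketch of the lookup order (all names here are illustrative stand-ins, not the actual subspace_farmer API):

```rust
// Illustrative stand-ins; the real types live in subspace_farmer / subspace_networking.
struct Piece;
struct ClusterCaches;
struct Dsn;

impl ClusterCaches {
    /// Ask previously synced cluster caches for the piece (stubbed out here).
    async fn get_piece(&self, _piece_index: u64) -> Option<Piece> {
        None
    }
}

impl Dsn {
    /// Retrieve the piece over the DSN (stubbed out here).
    async fn fetch_piece(&self, _piece_index: u64) -> Option<Piece> {
        Some(Piece)
    }
}

/// The order described above: previously synced caches first, DSN second.
async fn get_piece(piece_index: u64, caches: &ClusterCaches, dsn: &Dsn) -> Option<Piece> {
    if let Some(piece) = caches.get_piece(piece_index).await {
        // Fast path: another cache that managed to sync earlier holds the piece.
        return Some(piece);
    }
    // Slow path: fetch over the DSN. With cache sync broken, every request
    // ends up here, which is what eats bandwidth and plotting speed.
    dsn.fetch_piece(piece_index).await
}
```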
This build is performing exceptionally well and appears to have resolved the issue. Previously, across six controllers and cluster groups, the CPU would spike and stay pegged for several minutes during the initial startup of the cache and controller, and bandwidth usage would spike over the same period.
Since running this build, and after the piece cache successfully synced from the DSN, that behavior no longer occurs on subsequent startups. It appears the inability to sync correctly was also the root cause of that startup behavior.