Low-level plotting error "out of memory"

Issue Report

Farms are reporting an "out of memory" error during plotting. Restarting the plotter container fixes it.

Environment

  • Operating System: Ubuntu 22.04 running Docker.
  • Docker: Cluster setup 2 farms, 1 plotter, 1 node running mainnet-2025-jun-18
  • Hardware: Epyc 7282, 128GB Ram, RTX A4000 GPU
  • NVIDIA Drivers: NVIDIA-SMI 550.163.01, Driver Version: 550.163.01, CUDA Version: 12.4

Problem

I’ve seen this out of memory issue twice recently, but only since upgrading to mainnet-2025-jun-18. Nothing else has changed on the machine (except I took one small SSD out of a farm as I was having hardware issues).

The first time I saw this error, I restarted the farms and had the same issue; after restarting the server, the error went away.

The second time, I restarted the farms and the error stayed; restarting the plotter cleared it.

There are no obvious errors in the plotter, node, or controller logs as far as I can tell.

The machine had been turned off for a few days; when it was powered up, the error appeared within about an hour (I'm not sure how long the node sync took to catch up, so it could well be that hour). As you can see from the logs, it looks like it failed immediately on the first replots (I'll grab the other logs and attach them shortly).

Logs from one of the farmers showing the issue:

2025-07-02T17:08:40.944856Z  INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9086]
2025-07-02T17:08:40.944984Z  INFO actix_server::builder: starting 2 workers
2025-07-02T17:08:40.944963Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.945113Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-07-02T17:08:40.945141Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9086", workers: 2, listening on: 0.0.0.0:9086
2025-07-02T17:08:40.945389Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945420Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945452Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945466Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945511Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945520Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945536Z  INFO {farm_index=3}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945543Z  INFO {farm_index=3}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945566Z  INFO {farm_index=4}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945577Z  INFO {farm_index=4}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945604Z  INFO {farm_index=5}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945614Z  INFO {farm_index=5}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945633Z  INFO {farm_index=6}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945641Z  INFO {farm_index=6}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945665Z  INFO {farm_index=7}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945675Z  INFO {farm_index=7}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945694Z  INFO {farm_index=8}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945703Z  INFO {farm_index=8}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945722Z  INFO {farm_index=9}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945731Z  INFO {farm_index=9}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945754Z  INFO {farm_index=10}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945815Z  INFO {farm_index=10}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945850Z  INFO {farm_index=11}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945872Z  INFO {farm_index=11}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945908Z  INFO {farm_index=12}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945918Z  INFO {farm_index=12}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945940Z  INFO {farm_index=13}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945950Z  INFO {farm_index=13}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945972Z  INFO {farm_index=14}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945982Z  INFO {farm_index=14}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946004Z  INFO {farm_index=15}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946019Z  INFO {farm_index=15}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946041Z  INFO {farm_index=16}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946051Z  INFO {farm_index=16}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946069Z  INFO {farm_index=17}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946077Z  INFO {farm_index=17}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946105Z  INFO {farm_index=18}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946115Z  INFO {farm_index=18}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946133Z  INFO {farm_index=19}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946145Z  INFO {farm_index=19}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946168Z  INFO {farm_index=20}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946177Z  INFO {farm_index=20}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946197Z  INFO {farm_index=21}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946207Z  INFO {farm_index=21}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946228Z  INFO {farm_index=22}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946237Z  INFO {farm_index=22}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.956858Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.957953Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.957975Z  INFO {farm_index=3}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.958777Z  INFO {farm_index=4}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.959701Z  INFO {farm_index=5}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.960969Z  INFO {farm_index=6}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.961825Z  INFO {farm_index=7}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.963760Z  INFO {farm_index=8}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.963831Z  INFO {farm_index=9}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.964834Z  INFO {farm_index=10}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.965771Z  INFO {farm_index=11}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.966860Z  INFO {farm_index=12}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.969359Z  INFO {farm_index=13}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.969527Z  INFO {farm_index=14}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.972361Z  INFO {farm_index=15}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.972505Z  INFO {farm_index=16}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.974101Z  INFO {farm_index=17}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.975553Z  INFO {farm_index=18}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.981237Z  INFO {farm_index=19}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.983007Z  INFO {farm_index=20}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.996612Z  INFO {farm_index=21}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.999188Z  INFO {farm_index=22}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T18:08:38.732704Z  INFO {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Replotting sector (0.00% complete)
2025-07-02T18:08:38.733321Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector (33.33% complete)
2025-07-02T18:08:44.100199Z  WARN {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:44.304939Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:45.103074Z  INFO {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:08:45.306256Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:08:48.734105Z  INFO {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Replotting sector (0.00% complete)
2025-07-02T18:08:50.391146Z  WARN {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:50.532193Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:51.402863Z  INFO {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:08:51.534151Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:08:54.225439Z  WARN {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:57.208434Z  WARN {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:57.663965Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:09:19.579281Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:26.515983Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:09:27.206278Z  INFO {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:37.178545Z  WARN {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:09:37.325126Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:50.472990Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:09:51.454067Z  INFO {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:51.475101Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:51.568005Z  INFO {farm_index=5}:{sector_index=507}: subspace_farmer::single_disk_farm::plotting: Replotting sector (0.00% complete)
2025-07-02T18:10:02.510109Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:02.667602Z  WARN {farm_index=5}:{sector_index=507}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:02.799961Z  WARN {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:03.514910Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:10:16.633832Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:17.635951Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:10:23.979300Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:24.981183Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:10:38.348509Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:56.621916Z  INFO {farm_index=19}:{sector_index=332}: subspace_farmer::single_disk_farm::plotting: Replotting sector (0.00% complete)

Logs from the controller:

2025-07-02T17:08:25.223432Z  INFO async_nats: event: connected
2025-07-02T17:08:25.223512Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:25.829571Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:25.829585Z  INFO async_nats: event: connected
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:26.540331Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:26.540356Z  INFO async_nats: event: connected
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:27.679201Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:27.679224Z  INFO async_nats: event: connected
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:29.618029Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:29.618063Z  INFO async_nats: event: connected
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:33.115085Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:33.115097Z  INFO async_nats: event: connected
2025-07-02T17:08:33.118809Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloading all segment headers from node...
2025-07-02T17:08:33.122716Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloaded all segment headers from node successfully
2025-07-02T17:08:33.127614Z  INFO subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWE3BjpM7hiF3UpJCZpwkX6Y5oyYcxb4UG9cBWXZUMrYyg protocol_version=/subspace/2/66455a580aabff303720aa83adbe6c44502922251c03ba73686d5245da9e21bd
2025-07-02T17:08:33.130048Z  INFO libp2p_swarm: local_peer_id=12D3KooWE3BjpM7hiF3UpJCZpwkX6Y5oyYcxb4UG9cBWXZUMrYyg
2025-07-02T17:08:33.823510Z  INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9081]
2025-07-02T17:08:33.823527Z  INFO actix_server::builder: starting 2 workers
2025-07-02T17:08:33.823560Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-07-02T17:08:33.823632Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9081", workers: 2, listening on: 0.0.0.0:9081
2025-07-02T17:08:33.826330Z  INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization 
......
2025-07-02T17:08:45.252051Z  INFO subspace_farmer::cluster::controller::farms: Farm initialized successfully farmer_id=... farm_index=14 farm_id=...

....
2025-07-02T17:09:03.830243Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Initializing piece cache
2025-07-02T17:09:07.516301Z  INFO subspace_farmer::cluster::controller::farms: Farm initialized successfully farmer_id=... farm_index=27 farm_id=...

...
2025-07-02T17:10:16.429735Z  INFO subspace_networking::node_runner: dsn: actively using 4/83 known peers
2025-07-02T17:10:46.435431Z  INFO subspace_networking::node_runner: dsn: actively using 36/83 known peers
2025-07-02T17:11:16.442139Z  INFO subspace_networking::node_runner: dsn: actively using 8/85 known peers
2025-07-02T17:11:46.451160Z  INFO subspace_networking::node_runner: dsn: actively using 6/85 known peers
2025-07-02T17:12:16.458185Z  INFO subspace_networking::node_runner: dsn: actively using 1/85 known peers
...
2025-07-02T17:29:26.693542Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Synchronizing piece cache
2025-07-02T17:29:28.312565Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 0.00% complete
2025-07-02T17:29:42.122003Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 13.02% complete
2025-07-02T17:29:46.694358Z  INFO subspace_networking::node_runner: dsn: actively using 24/89 known peers
2025-07-02T17:30:00.530901Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 26.04% complete
2025-07-02T17:30:16.701382Z  INFO subspace_networking::node_runner: dsn: actively using 19/89 known peers
2025-07-02T17:30:21.484728Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 39.06% complete
2025-07-02T17:30:35.763171Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 52.08% complete
2025-07-02T17:30:45.047705Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 65.10% complete
2025-07-02T17:30:46.707605Z  INFO subspace_networking::node_runner: dsn: actively using 27/89 known peers
2025-07-02T17:30:54.034443Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 78.12% complete
2025-07-02T17:31:01.515872Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 91.15% complete
2025-07-02T17:31:16.713269Z  INFO subspace_networking::node_runner: dsn: actively using 18/89 known peers
2025-07-02T17:31:17.687861Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Finished piece cache synchronization
2025-07-02T17:31:46.721038Z  INFO subspace_networking::node_runner: dsn: actively using 4/89 known peers
2025-07-02T17:32:16.727946Z  INFO subspace_networking::node_runner: dsn: actively using 4/88 known peers
2025-07-02T17:32:46.735645Z  INFO subspace_networking::node_runner: dsn: actively using 4/88 known peers
2025-07-02T17:33:16.743052Z  INFO subspace_networking::node_runner: dsn: actively using 5/88 known peers
2025-07-02T17:33:46.817733Z  INFO subspace_networking::node_runner: dsn: actively using 10/88 known peers
2025-07-02T17:34:16.823724Z  INFO subspace_networking::node_runner: dsn: actively using 8/88 known peers
2025-07-02T17:34:46.832149Z  INFO subspace_networking::node_runner: dsn: actively using 9/88 known peers
...

Logs from the plotter (I'm struggling to get back to the exact same time window, but the logs shown should be from a period when the error was still happening):

2025-07-03T02:45:26.708951Z  INFO {public_key=ea3d0de628c6b4231fe773ea8951b7fdad52a5adeca7685eded337b0ede3b134 sector_index=258}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:45:27.482798Z  INFO {public_key=ace5ca39a806dbb3d00fc75647c5ebbf281b23c61270fe1f772ac59eec603919 sector_index=235}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:45:42.259352Z  INFO {public_key=d24d831bd36bfbe295b7edfd16c9e38f6133d3712824cc0d9c9dc9c7b2e0d93a sector_index=407}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:45:42.656502Z  INFO {public_key=1e221ed8afaf233d358aab5efd1c73718e230a2d45135efc14c23550e658cf28 sector_index=1607}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:45:42.681108Z  INFO {public_key=b6e13915aaeb7c271bee4af3f6679f28edb132cdc91e7da48a57db67d32ebe36 sector_index=408}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:45:51.929896Z  INFO {public_key=f27e1be7177c57ec89f13141c7e6c9388dd43e6b8be0ebceab4363e42a48287a sector_index=404}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:45:52.779346Z  INFO {public_key=1c5e476c5ea6ab4559d5758bebe889dfbfba51df1604d544e94603ba1c428d49 sector_index=2817}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:45:54.217942Z  INFO {public_key=ace5ca39a806dbb3d00fc75647c5ebbf281b23c61270fe1f772ac59eec603919 sector_index=235}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:45:54.838785Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=65}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:11.093416Z  INFO {public_key=1e221ed8afaf233d358aab5efd1c73718e230a2d45135efc14c23550e658cf28 sector_index=1607}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:12.008395Z  INFO {public_key=085d08fd58888bfb4cfc86ff85a7df06b952ecf7416bdf0175259e057c32785b sector_index=22}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:20.808482Z  INFO {public_key=1c5e476c5ea6ab4559d5758bebe889dfbfba51df1604d544e94603ba1c428d49 sector_index=2817}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:21.185439Z  INFO {public_key=40af7c036d67121c4eb07ab13d63df1632da5d51194c86000657cdccd6a89633 sector_index=997}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:23.078108Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=65}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:23.387349Z  INFO {public_key=5c0aa3e6fbe6611b3adda2cb878b39233fc9d83aa4f24bf0b6fbe1ad023a147b sector_index=473}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:41.033035Z  INFO {public_key=085d08fd58888bfb4cfc86ff85a7df06b952ecf7416bdf0175259e057c32785b sector_index=22}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:41.495607Z  INFO {public_key=2c8e60bea7332cf31168676dad72cd4d060940014d2c913115a3c7602319fb1e sector_index=1309}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:52.978136Z  INFO {public_key=40af7c036d67121c4eb07ab13d63df1632da5d51194c86000657cdccd6a89633 sector_index=997}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:53.537795Z  INFO {public_key=58e06d35c3b8a0a2430e2b87ef89b899ca5ef08a12387682646de799be19f27d sector_index=1423}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:54.134464Z  INFO {public_key=5c0aa3e6fbe6611b3adda2cb878b39233fc9d83aa4f24bf0b6fbe1ad023a147b sector_index=473}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:55.024386Z  INFO {public_key=f27e1be7177c57ec89f13141c7e6c9388dd43e6b8be0ebceab4363e42a48287a sector_index=404}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:12.122466Z  INFO {public_key=2c8e60bea7332cf31168676dad72cd4d060940014d2c913115a3c7602319fb1e sector_index=1309}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:12.404156Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=169}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:26.243744Z  INFO {public_key=f27e1be7177c57ec89f13141c7e6c9388dd43e6b8be0ebceab4363e42a48287a sector_index=404}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:26.355068Z  INFO {public_key=58e06d35c3b8a0a2430e2b87ef89b899ca5ef08a12387682646de799be19f27d sector_index=1423}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:26.606161Z  INFO {public_key=085d08fd58888bfb4cfc86ff85a7df06b952ecf7416bdf0175259e057c32785b sector_index=22}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:26.670721Z  INFO {public_key=f27e1be7177c57ec89f13141c7e6c9388dd43e6b8be0ebceab4363e42a48287a sector_index=1015}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:29.763426Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=169}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:30.786085Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=169}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:34.823193Z  INFO {public_key=085d08fd58888bfb4cfc86ff85a7df06b952ecf7416bdf0175259e057c32785b sector_index=22}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:35.077973Z  INFO {public_key=5c0aa3e6fbe6611b3adda2cb878b39233fc9d83aa4f24bf0b6fbe1ad023a147b sector_index=101}: subspace_farmer::cluster::plotter: Plot sector request

Hi! Thanks for reporting this.

A couple of things to start with. First, could you provide your start commands or, even better, your docker-compose.yml files if you have them? Second, is there any chance Docker is imposing some sort of container-level memory limit that's being hit?
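For reference, a quick way to check whether a memory limit is applied to a container (0 means no limit; substitute your plotter container's name):

```shell
# Show the memory limit configured on the container (0 = unlimited).
docker inspect <plotter-container-name> \
  --format 'Memory limit: {{.HostConfig.Memory}} bytes'

# One-shot snapshot of actual memory usage vs. limit for all running containers.
docker stats --no-stream
```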

Hi,

I don't have anything set up that would intentionally limit RAM; it's a standard install of Docker.

Just in case the memory error is in some way related to disk usage: I run most disks near max capacity (typically 10MB free). There's 50G free on the node/cache disk.

I've had nvidia-smi running on a 1s refresh for a while, and subspace-farmer looks stable at 888MiB (total GPU usage 969MiB / 16376MiB).
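Next time it happens I'll try to capture the moment of exhaustion rather than just watching it live; something like this (using nvidia-smi's query interface) logs GPU memory and per-process usage once per second to CSV, which can then be correlated with the farmer log timestamps:

```shell
# Log overall GPU memory once per second.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
  --format=csv -l 1 > gpu_mem.csv &

# Log per-process GPU memory usage once per second.
nvidia-smi --query-compute-apps=timestamp,pid,process_name,used_memory \
  --format=csv -l 1 > gpu_procs.csv &
```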

Docker compose files below.

Plotter:

services:
  farmer_plotter:
    container_name: autonomys_farmer_plotter
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    ports:
      - "9084:9084"
    restart: unless-stopped
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9084",
        "plotter"
      ]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.7  
        
networks:
  nats_cluster_network:
    external: true

Node:

services:
  autonomys_node:
    container_name: autonomys_node
    image: ghcr.io/autonomys/node:mainnet-2025-jun-18
    volumes:
      - /mnt/nvme2/autonomys/node-data:/var/subspace:rw
    ports:
      - "0.0.0.0:30333:30333/tcp"
      - "0.0.0.0:30433:30433/tcp"
      - "9080:9080"
      - "9944:9944"     
    restart: unless-stopped
    command:
      [
        "run",
        "--chain", "mainnet",
        "--base-path", "/var/subspace",
        "--sync", "full",
        "--listen-on", "/ip4/0.0.0.0/tcp/30333",
        "--dsn-listen-on", "/ip4/0.0.0.0/tcp/30433",
        "--rpc-cors", "all",
        "--rpc-methods", "unsafe",
        "--rpc-listen-on", "0.0.0.0:9944",
        "--prometheus-listen-on", "0.0.0.0:9080",
        "--farmer",
        "--name", "...removed..."
      ]
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.3
    healthcheck:
      timeout: 10s
      interval: 30s
      retries: 60

networks:
  nats_cluster_network:
    external: true

Farm #1:

services:     
  farmer:
    container_name: autonomys_farmer
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    volumes:
      - /autonomys/nvme1/farm/subspace1:/var/subspace1:rw
      - /autonomys/nvme1/farm/subspace1b:/var/subspace1b:rw
      - /autonomys/nvme1/farm/subspace1c:/var/subspace1c:rw
      - /autonomys/nvme1/farm/subspace1d:/var/subspace1d:rw
      - /autonomys/nvme1/farm/subspace1e:/var/subspace1e:rw
      - /mnt/nvme2/autonomys/farm/subspace2:/var/subspace2:rw
      - /mnt/nvme10/autonomys/farm/farm-10-01:/var/farm-10-01:rw
      - /mnt/nvme11/autonomys/farm/farm-11-01:/var/farm-11-01:rw
      - /mnt/nvme12/autonomys/farm/farm-12-01:/var/farm-12-01:rw
      - /mnt/nvme13/autonomys/farm/farm-13-01:/var/farm-13-01:rw
      - /mnt/nvme30/autonomys/farm/autonomys4:/var/farm-30-01:rw
      - /mnt/nvme31/autonomys/farm/autonomys3:/var/farm-31-01:rw
      - /mnt/nvme32/autonomys/farm/farm-32-01:/var/farm-32-01:rw
      - /mnt/nvme33/autonomys/farm/farm-33-01:/var/farm-33-01:rw
      - /mnt/nvme41/autonomys/farm/farm-41-01:/var/farm-41-01:rw
      - /mnt/nvme42/autonomys/farm/farm-42-01:/var/farm-42-01:rw
      - /mnt/nvme51/autonomys/farm/farm-51-01:/var/farm-51-01:rw
      - /mnt/nvme52/autonomys/farm/farm-52-01:/var/farm-52-01:rw
      - /mnt/Disk-0001/autonomys/farm/farm-0001-1:/var/farm-00-01:rw
      - /mnt/Disk-0002/autonomys/farm/farm-0002-1:/var/farm-00-02:rw
      - /mnt/Disk-0003/autonomys/farm/farm-0003-1:/var/farm-00-03:rw
      - /mnt/Disk-0004/autonomys/farm/farm-0004-1:/var/farm-00-04:rw

    restart: unless-stopped
    ports:
      - "9083:9083"
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9083",
        "farmer",
        "--reward-address",  ... removed ...,
        "path=/var/subspace1,size=1000G",
        "path=/var/subspace1b,size=1000G",
        "path=/var/subspace1c,size=1000G",
        "path=/var/subspace1d,size=500G",
        "path=/var/subspace2,size=3560G",
        "path=/var/farm-10-01,size=3937G",
        "path=/var/farm-11-01,size=3937G",
        "path=/var/farm-12-01,size=3937G",
        "path=/var/farm-13-01,size=3937G",
        "path=/var/farm-30-01,size=3937G",
        "path=/var/farm-31-01,size=3937G",
        "path=/var/farm-32-01,size=3937G",
        "path=/var/farm-33-01,size=3937G",
        "path=/var/farm-41-01,size=3937G",
        "path=/var/farm-42-01,size=3937G",
        "path=/var/farm-51-01,size=3937G",
        "path=/var/farm-52-01,size=3937G",
        "path=/var/farm-00-01,size=3770G",
        "path=/var/farm-00-02,size=3935G",
        "path=/var/farm-00-03,size=7935G",
        "path=/var/farm-00-04,size=1965G"
      ]
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.6

networks:
  nats_cluster_network:
    external: true

Farm #2 (See if you can spot my stupid moments with the typos…):

version: "3.7"

services:        
  autonomys_farmer_netapp_80:
    container_name: autonomys_farmer_netapp_80
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    volumes:
      - /mnt/Disk-8000/autonomys/farm/farm-8000-1:/var/farm-80-00:rw
      - /mnt/Disk-8001/autonnomy/farm/farm-8001-1:/var/farm-80-01:rw
      - /mnt/Disk-8002/autonnomy/farm/farm-8002-1:/var/farm-80-02:rw
      - /mnt/Disk-8003/autonnomy/farm/farm-8003-1:/var/farm-80-03:rw
      - /mnt/Disk-8004/autonnomy/farm/farm-8004-1:/var/farm-80-04:rw
      - /mnt/Disk-8005/autonnomy/farm/farm-8005-1:/var/farm-80-05:rw
      - /mnt/Disk-8006/autonnomy/farm/farm-8006-1:/var/farm-80-06:rw
      - /mnt/Disk-8007/autonomys/farm/farm-8007-1:/var/farm-80-07:rw
      - /mnt/Disk-8008/autonnomy/farm/farm-8008-1:/var/farm-80-08:rw
      - /mnt/Disk-8009/autonomys/farm/farm-8009-1:/var/farm-80-09:rw
      - /mnt/Disk-8010/autonnomy/farm/farm-8010-1:/var/farm-80-10:rw
      - /mnt/Disk-8011/autonnomy/farm/farm-8011-1:/var/farm-80-11:rw
      - /mnt/Disk-8012/autonnomy/farm/farm-8012-1:/var/farm-80-12:rw
      - /mnt/Disk-8013/autonnomy/farm/farm-8013-1:/var/farm-80-13:rw
      - /mnt/Disk-8014/autonomys/farm/farm-8014-1:/var/farm-80-14:rw
      - /mnt/Disk-8015/autonnomy/farm/farm-8015-1:/var/farm-80-15:rw
      - /mnt/Disk-8016/autonnomy/farm/farm-8016-1:/var/farm-80-16:rw
      - /mnt/Disk-8017/autonnomy/farm/farm-8017-1:/var/farm-80-17:rw
      - /mnt/Disk-8018/autonnomy/farm/farm-8018-1:/var/farm-80-18:rw
      - /mnt/Disk-8019/autonomys/farm/farm-8019-1:/var/farm-80-19:rw
      - /mnt/Disk-8020/autonnomy/farm/farm-8020-1:/var/farm-80-20:rw
      - /mnt/Disk-8021/autonomys/farm/farm-8021-1:/var/farm-80-21:rw
      - /mnt/Disk-8022/autonomys/farm/farm-8022-1:/var/farm-80-22:rw
      - /mnt/Disk-8023/autonomys/farm/farm-8023-1:/var/farm-80-23:rw

    restart: unless-stopped
    ports:
      - "9086:9086"
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9086",
        "farmer",
        "--reward-address", ... removed...,
        "path=/var/farm-80-00,size=787G",
        "path=/var/farm-80-01,size=787G",
        "path=/var/farm-80-02,size=787G",
        "path=/var/farm-80-03,size=787G",
        "path=/var/farm-80-04,size=787G",
        "path=/var/farm-80-05,size=787G",
        "path=/var/farm-80-06,size=787G",
        "path=/var/farm-80-07,size=787G",
        "path=/var/farm-80-08,size=787G",
        "path=/var/farm-80-09,size=787G",
        #"path=/var/farm-80-10,size=787G",
        "path=/var/farm-80-11,size=787G",
        "path=/var/farm-80-12,size=787G",
        "path=/var/farm-80-13,size=787G",
        "path=/var/farm-80-14,size=787G",
        "path=/var/farm-80-15,size=787G",
        "path=/var/farm-80-16,size=787G",
        "path=/var/farm-80-17,size=787G",
        "path=/var/farm-80-18,size=787G",
        "path=/var/farm-80-19,size=3779G",
        "path=/var/farm-80-20,size=787G",
        "path=/var/farm-80-21,size=3140G",
        "path=/var/farm-80-22,size=3149G",
        "path=/var/farm-80-23,size=1103G"
      ] 
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.11

networks:
  nats_cluster_network:
    external: true

Controller:

services:
  farmer_controller:
    container_name: autonomys_farmer_controller
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    volumes:
      - /mnt/nvme2/autonomys/cluster/controller:/controller:rw
    ports:
      - "9081:9081"
      - "0.0.0.0:30533:30533/tcp"
    restart: unless-stopped
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9081",
        "controller",
        "--base-path", "/controller",
        "--node-rpc-url", "ws://172.25.0.3:9944"
      ]
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.4

networks:
  nats_cluster_network:
    external: true

Cache:

services:
  farmer_cache:
    container_name: autonomys_farmer_cache
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    volumes:
      - /mnt/nvme2/autonomys/cluster/cache:/cache:rw
    restart: unless-stopped
    ports:
      - "9082:9082"
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9082",
        "cache",
        "path=/cache,size=300GB"
      ]
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.5


networks:
  nats_cluster_network:
    external: true

NATS (it’s set up as a cluster, but this was the only node running at the time):

version: '3.8'

services:
  nats:
    image: nats
    container_name: nats
    restart: unless-stopped
    ports:
      - "4222:4222"
      - "4248:4248"
      - "8222:8222"
    volumes:
      - /mnt/nvme2/nats/nats.config:/nats.config:ro
    command: [
      "-c", "/nats.config",
      "-cluster","nats://0.0.0.0:4248",
      "--cluster_name","autonomys-nats-cluster",
      "--http_port","8222",
      ]
    networks:
      cluster_network:
        ipv4_address: 172.25.0.2
        
networks:
  cluster_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.25.0.0/16

Nats config:

max_payload = 2MB

How is the network usage of the NATS server? Could it be network congestion?

In my own practical experience, it’s best to run the NATS server on the same machine as the farmers, so NATS traffic doesn’t take up LAN bandwidth.
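Since the NATS compose file above exposes the monitoring port (`--http_port 8222`), its `/varz` endpoint can answer the bandwidth question directly: it includes `in_bytes`/`out_bytes` counters, and sampling one twice gives a rough throughput figure. A sketch that avoids extra tooling by extracting the counter with `sed`:

```shell
# Extract a numeric field (e.g. the out_bytes counter) from the JSON
# that the NATS monitoring endpoint returns on /varz.
varz_field() {  # varz_field FIELD JSON
  printf '%s\n' "$2" | sed -n "s/.*\"$1\": *\([0-9]*\).*/\1/p"
}

# Sample the counter twice, 5s apart, for a rough bytes/sec figure.
# Port 8222 matches the --http_port in the NATS compose file above.
if command -v curl >/dev/null 2>&1; then
  b1=$(varz_field out_bytes "$(curl -sf http://localhost:8222/varz)")
  sleep 5
  b2=$(varz_field out_bytes "$(curl -sf http://localhost:8222/varz)")
  [ -n "$b1" ] && [ -n "$b2" ] && echo "NATS outbound: $(( (b2 - b1) / 5 )) bytes/sec"
fi
```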

The only thing I’ve spotted is that your plotter service is missing the runtime: nvidia entry, which should be at the same level as the deploy: key. Something like this:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  
              capabilities: [gpu]
    runtime: nvidia      
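As a sanity check before relying on `runtime: nvidia`, it's worth confirming the Docker daemon actually has the nvidia runtime registered (it comes from the NVIDIA Container Toolkit). A sketch; the helper just scans the runtimes listing:

```shell
# Report whether the nvidia runtime appears in the daemon's
# registered runtimes; `runtime: nvidia` in compose fails without it.
has_nvidia_runtime() {  # has_nvidia_runtime "RUNTIMES"
  case "$1" in
    *nvidia*) echo yes ;;
    *)        echo no  ;;
  esac
}

if command -v docker >/dev/null 2>&1; then
  has_nvidia_runtime "$(docker info --format '{{.Runtimes}}' 2>/dev/null)"
fi
```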

Also, have you tried with later driver versions?