Low-level plotting error "out of memory"

Issue Report

Farms are reporting an "out of memory" error during plotting. Restarting the plotter container fixes it.

Environment

  • Operating System: Ubuntu 22.04 running Docker.
  • Docker: Cluster setup 2 farms, 1 plotter, 1 node running mainnet-2025-jun-18
  • Hardware: Epyc 7282, 128GB Ram, RTX A4000 GPU
  • NVIDIA Drivers: NVIDIA-SMI 550.163.01, Driver Version: 550.163.01, CUDA Version: 12.4

Problem

I’ve seen this out of memory issue twice recently, but only since upgrading to mainnet-2025-jun-18. Nothing else has changed on the machine (except I took one small SSD out of a farm as I was having hardware issues).

The first time I saw this error, I restarted the farms and had the same issue; after restarting the server, the error went away.

The second time, I restarted the farms and the error stayed; restarting the plotter cleared it.

There are no obvious errors in the plotter, node, or controller logs as far as I can tell.

The machine had been turned off for a few days; when it was powered up, the error appeared within about an hour (I'm not sure how long the node sync took to catch up, so it could well be that hour). As you can see from the logs, it looks like it failed immediately on the first replots (I'll grab the other logs and attach them shortly).

Logs from one of the farmers showing the issue:

2025-07-02T17:08:40.944856Z  INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9086]
2025-07-02T17:08:40.944984Z  INFO actix_server::builder: starting 2 workers
2025-07-02T17:08:40.944963Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.945113Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-07-02T17:08:40.945141Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9086", workers: 2, listening on: 0.0.0.0:9086
2025-07-02T17:08:40.945389Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945420Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945452Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945466Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945511Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945520Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945536Z  INFO {farm_index=3}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945543Z  INFO {farm_index=3}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945566Z  INFO {farm_index=4}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945577Z  INFO {farm_index=4}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945604Z  INFO {farm_index=5}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945614Z  INFO {farm_index=5}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945633Z  INFO {farm_index=6}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945641Z  INFO {farm_index=6}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945665Z  INFO {farm_index=7}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945675Z  INFO {farm_index=7}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945694Z  INFO {farm_index=8}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945703Z  INFO {farm_index=8}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945722Z  INFO {farm_index=9}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945731Z  INFO {farm_index=9}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945754Z  INFO {farm_index=10}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945815Z  INFO {farm_index=10}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945850Z  INFO {farm_index=11}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945872Z  INFO {farm_index=11}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945908Z  INFO {farm_index=12}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945918Z  INFO {farm_index=12}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945940Z  INFO {farm_index=13}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945950Z  INFO {farm_index=13}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.945972Z  INFO {farm_index=14}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.945982Z  INFO {farm_index=14}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946004Z  INFO {farm_index=15}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946019Z  INFO {farm_index=15}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946041Z  INFO {farm_index=16}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946051Z  INFO {farm_index=16}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946069Z  INFO {farm_index=17}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946077Z  INFO {farm_index=17}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946105Z  INFO {farm_index=18}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946115Z  INFO {farm_index=18}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946133Z  INFO {farm_index=19}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946145Z  INFO {farm_index=19}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946168Z  INFO {farm_index=20}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946177Z  INFO {farm_index=20}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946197Z  INFO {farm_index=21}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946207Z  INFO {farm_index=21}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.946228Z  INFO {farm_index=22}: subspace_farmer::single_disk_farm::farming: Subscribing to slot info notifications
2025-07-02T17:08:40.946237Z  INFO {farm_index=22}: subspace_farmer::single_disk_farm::reward_signing: Subscribing to reward signing notifications
2025-07-02T17:08:40.956858Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.957953Z  INFO {farm_index=2}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.957975Z  INFO {farm_index=3}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.958777Z  INFO {farm_index=4}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.959701Z  INFO {farm_index=5}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.960969Z  INFO {farm_index=6}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.961825Z  INFO {farm_index=7}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.963760Z  INFO {farm_index=8}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.963831Z  INFO {farm_index=9}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.964834Z  INFO {farm_index=10}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.965771Z  INFO {farm_index=11}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.966860Z  INFO {farm_index=12}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.969359Z  INFO {farm_index=13}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.969527Z  INFO {farm_index=14}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.972361Z  INFO {farm_index=15}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.972505Z  INFO {farm_index=16}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.974101Z  INFO {farm_index=17}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.975553Z  INFO {farm_index=18}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.981237Z  INFO {farm_index=19}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.983007Z  INFO {farm_index=20}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.996612Z  INFO {farm_index=21}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T17:08:40.999188Z  INFO {farm_index=22}: subspace_farmer::single_disk_farm::plotting: Subscribing to archived segments
2025-07-02T18:08:38.732704Z  INFO {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Replotting sector (0.00% complete)
2025-07-02T18:08:38.733321Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector (33.33% complete)
2025-07-02T18:08:44.100199Z  WARN {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:44.304939Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:45.103074Z  INFO {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:08:45.306256Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:08:48.734105Z  INFO {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Replotting sector (0.00% complete)
2025-07-02T18:08:50.391146Z  WARN {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:50.532193Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:51.402863Z  INFO {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:08:51.534151Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:08:54.225439Z  WARN {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:57.208434Z  WARN {farm_index=8}:{sector_index=212}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:08:57.663965Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:09:19.579281Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:26.515983Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:09:27.206278Z  INFO {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:37.178545Z  WARN {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:09:37.325126Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:50.472990Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:09:51.454067Z  INFO {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:51.475101Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:09:51.568005Z  INFO {farm_index=5}:{sector_index=507}: subspace_farmer::single_disk_farm::plotting: Replotting sector (0.00% complete)
2025-07-02T18:10:02.510109Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:02.667602Z  WARN {farm_index=5}:{sector_index=507}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:02.799961Z  WARN {farm_index=6}:{sector_index=731}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:03.514910Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:10:16.633832Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:17.635951Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:10:23.979300Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:24.981183Z  INFO {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Replotting sector retry
2025-07-02T18:10:38.348509Z  WARN {farm_index=8}:{sector_index=703}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Failed to encode sector: Records encoder error: cudaMallocAsync(&d_ptr, sz, stream)@sppark-5de920c9eba024c7/b2a181e/util/gpu_t.cuh:73 failed: "out of memory"
2025-07-02T18:10:56.621916Z  INFO {farm_index=19}:{sector_index=332}: subspace_farmer::single_disk_farm::plotting: Replotting sector (0.00% complete)

Logs from the controller:

2025-07-02T17:08:25.223432Z  INFO async_nats: event: connected
2025-07-02T17:08:25.223512Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:25.829571Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:25.829585Z  INFO async_nats: event: connected
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:26.540331Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:26.540356Z  INFO async_nats: event: connected
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:27.679201Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:27.679224Z  INFO async_nats: event: connected
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:29.618029Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:29.618063Z  INFO async_nats: event: connected
Error: Failed to connect to node RPC: Error when opening the TCP socket: Connection refused (os error 111)
2025-07-02T17:08:33.115085Z  INFO subspace_farmer::commands::cluster::controller: Connecting to node RPC url=ws://172.25.0.3:9944
2025-07-02T17:08:33.115097Z  INFO async_nats: event: connected
2025-07-02T17:08:33.118809Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloading all segment headers from node...
2025-07-02T17:08:33.122716Z  INFO subspace_farmer::node_client::caching_proxy_node_client: Downloaded all segment headers from node successfully
2025-07-02T17:08:33.127614Z  INFO subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWE3BjpM7hiF3UpJCZpwkX6Y5oyYcxb4UG9cBWXZUMrYyg protocol_version=/subspace/2/66455a580aabff303720aa83adbe6c44502922251c03ba73686d5245da9e21bd
2025-07-02T17:08:33.130048Z  INFO libp2p_swarm: local_peer_id=12D3KooWE3BjpM7hiF3UpJCZpwkX6Y5oyYcxb4UG9cBWXZUMrYyg
2025-07-02T17:08:33.823510Z  INFO subspace_metrics: Metrics server started. endpoints=[0.0.0.0:9081]
2025-07-02T17:08:33.823527Z  INFO actix_server::builder: starting 2 workers
2025-07-02T17:08:33.823560Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-07-02T17:08:33.823632Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:9081", workers: 2, listening on: 0.0.0.0:9081
2025-07-02T17:08:33.826330Z  INFO subspace_farmer::cluster::controller::caches: New cache discovered, scheduling reinitialization 
......
2025-07-02T17:08:45.252051Z  INFO subspace_farmer::cluster::controller::farms: Farm initialized successfully farmer_id=... farm_index=14 farm_id=...

....
2025-07-02T17:09:03.830243Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Initializing piece cache
2025-07-02T17:09:07.516301Z  INFO subspace_farmer::cluster::controller::farms: Farm initialized successfully farmer_id=... farm_index=27 farm_id=...

...
2025-07-02T17:10:16.429735Z  INFO subspace_networking::node_runner: dsn: actively using 4/83 known peers
2025-07-02T17:10:46.435431Z  INFO subspace_networking::node_runner: dsn: actively using 36/83 known peers
2025-07-02T17:11:16.442139Z  INFO subspace_networking::node_runner: dsn: actively using 8/85 known peers
2025-07-02T17:11:46.451160Z  INFO subspace_networking::node_runner: dsn: actively using 6/85 known peers
2025-07-02T17:12:16.458185Z  INFO subspace_networking::node_runner: dsn: actively using 1/85 known peers
...
2025-07-02T17:29:26.693542Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Synchronizing piece cache
2025-07-02T17:29:28.312565Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 0.00% complete
2025-07-02T17:29:42.122003Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 13.02% complete
2025-07-02T17:29:46.694358Z  INFO subspace_networking::node_runner: dsn: actively using 24/89 known peers
2025-07-02T17:30:00.530901Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 26.04% complete
2025-07-02T17:30:16.701382Z  INFO subspace_networking::node_runner: dsn: actively using 19/89 known peers
2025-07-02T17:30:21.484728Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 39.06% complete
2025-07-02T17:30:35.763171Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 52.08% complete
2025-07-02T17:30:45.047705Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 65.10% complete
2025-07-02T17:30:46.707605Z  INFO subspace_networking::node_runner: dsn: actively using 27/89 known peers
2025-07-02T17:30:54.034443Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 78.12% complete
2025-07-02T17:31:01.515872Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Piece cache sync 91.15% complete
2025-07-02T17:31:16.713269Z  INFO subspace_networking::node_runner: dsn: actively using 18/89 known peers
2025-07-02T17:31:17.687861Z  INFO {cache_group=default}: subspace_farmer::farmer_cache: Finished piece cache synchronization
2025-07-02T17:31:46.721038Z  INFO subspace_networking::node_runner: dsn: actively using 4/89 known peers
2025-07-02T17:32:16.727946Z  INFO subspace_networking::node_runner: dsn: actively using 4/88 known peers
2025-07-02T17:32:46.735645Z  INFO subspace_networking::node_runner: dsn: actively using 4/88 known peers
2025-07-02T17:33:16.743052Z  INFO subspace_networking::node_runner: dsn: actively using 5/88 known peers
2025-07-02T17:33:46.817733Z  INFO subspace_networking::node_runner: dsn: actively using 10/88 known peers
2025-07-02T17:34:16.823724Z  INFO subspace_networking::node_runner: dsn: actively using 8/88 known peers
2025-07-02T17:34:46.832149Z  INFO subspace_networking::node_runner: dsn: actively using 9/88 known peers
...

Logs from the plotter (I'm struggling to get back to the exact same time window, but the logs shown should be from a period when the error was still happening):

2025-07-03T02:45:26.708951Z  INFO {public_key=ea3d0de628c6b4231fe773ea8951b7fdad52a5adeca7685eded337b0ede3b134 sector_index=258}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:45:27.482798Z  INFO {public_key=ace5ca39a806dbb3d00fc75647c5ebbf281b23c61270fe1f772ac59eec603919 sector_index=235}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:45:42.259352Z  INFO {public_key=d24d831bd36bfbe295b7edfd16c9e38f6133d3712824cc0d9c9dc9c7b2e0d93a sector_index=407}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:45:42.656502Z  INFO {public_key=1e221ed8afaf233d358aab5efd1c73718e230a2d45135efc14c23550e658cf28 sector_index=1607}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:45:42.681108Z  INFO {public_key=b6e13915aaeb7c271bee4af3f6679f28edb132cdc91e7da48a57db67d32ebe36 sector_index=408}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:45:51.929896Z  INFO {public_key=f27e1be7177c57ec89f13141c7e6c9388dd43e6b8be0ebceab4363e42a48287a sector_index=404}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:45:52.779346Z  INFO {public_key=1c5e476c5ea6ab4559d5758bebe889dfbfba51df1604d544e94603ba1c428d49 sector_index=2817}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:45:54.217942Z  INFO {public_key=ace5ca39a806dbb3d00fc75647c5ebbf281b23c61270fe1f772ac59eec603919 sector_index=235}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:45:54.838785Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=65}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:11.093416Z  INFO {public_key=1e221ed8afaf233d358aab5efd1c73718e230a2d45135efc14c23550e658cf28 sector_index=1607}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:12.008395Z  INFO {public_key=085d08fd58888bfb4cfc86ff85a7df06b952ecf7416bdf0175259e057c32785b sector_index=22}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:20.808482Z  INFO {public_key=1c5e476c5ea6ab4559d5758bebe889dfbfba51df1604d544e94603ba1c428d49 sector_index=2817}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:21.185439Z  INFO {public_key=40af7c036d67121c4eb07ab13d63df1632da5d51194c86000657cdccd6a89633 sector_index=997}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:23.078108Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=65}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:23.387349Z  INFO {public_key=5c0aa3e6fbe6611b3adda2cb878b39233fc9d83aa4f24bf0b6fbe1ad023a147b sector_index=473}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:41.033035Z  INFO {public_key=085d08fd58888bfb4cfc86ff85a7df06b952ecf7416bdf0175259e057c32785b sector_index=22}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:41.495607Z  INFO {public_key=2c8e60bea7332cf31168676dad72cd4d060940014d2c913115a3c7602319fb1e sector_index=1309}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:52.978136Z  INFO {public_key=40af7c036d67121c4eb07ab13d63df1632da5d51194c86000657cdccd6a89633 sector_index=997}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:53.537795Z  INFO {public_key=58e06d35c3b8a0a2430e2b87ef89b899ca5ef08a12387682646de799be19f27d sector_index=1423}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:46:54.134464Z  INFO {public_key=5c0aa3e6fbe6611b3adda2cb878b39233fc9d83aa4f24bf0b6fbe1ad023a147b sector_index=473}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:46:55.024386Z  INFO {public_key=f27e1be7177c57ec89f13141c7e6c9388dd43e6b8be0ebceab4363e42a48287a sector_index=404}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:12.122466Z  INFO {public_key=2c8e60bea7332cf31168676dad72cd4d060940014d2c913115a3c7602319fb1e sector_index=1309}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:12.404156Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=169}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:26.243744Z  INFO {public_key=f27e1be7177c57ec89f13141c7e6c9388dd43e6b8be0ebceab4363e42a48287a sector_index=404}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:26.355068Z  INFO {public_key=58e06d35c3b8a0a2430e2b87ef89b899ca5ef08a12387682646de799be19f27d sector_index=1423}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:26.606161Z  INFO {public_key=085d08fd58888bfb4cfc86ff85a7df06b952ecf7416bdf0175259e057c32785b sector_index=22}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:26.670721Z  INFO {public_key=f27e1be7177c57ec89f13141c7e6c9388dd43e6b8be0ebceab4363e42a48287a sector_index=1015}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:29.763426Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=169}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:30.786085Z  INFO {public_key=8cdc0c0999d13c194f76c8c48518011371283594db430fa52367df84608d0a19 sector_index=169}: subspace_farmer::cluster::plotter: Plot sector request
2025-07-03T02:47:34.823193Z  INFO {public_key=085d08fd58888bfb4cfc86ff85a7df06b952ecf7416bdf0175259e057c32785b sector_index=22}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2025-07-03T02:47:35.077973Z  INFO {public_key=5c0aa3e6fbe6611b3adda2cb878b39233fc9d83aa4f24bf0b6fbe1ad023a147b sector_index=101}: subspace_farmer::cluster::plotter: Plot sector request

Hi! Thanks for reporting this.

A couple of things to start with. First, could you provide your start commands or, even better, your docker-compose.yml files if you have them? Second, is there any chance Docker is imposing some sort of container-level memory limit that's being hit?
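For reference, a quick way to check whether a memory limit is applied to a container (0 means no limit; substitute your plotter container's name):

```shell
# Show the memory limit configured on the container (0 = unlimited).
docker inspect <plotter-container-name> \
  --format 'Memory limit: {{.HostConfig.Memory}} bytes'

# One-shot snapshot of actual memory usage vs. limit for all running containers.
docker stats --no-stream
```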

Hi,

I don't have anything set up that would intentionally limit RAM; it's a standard install of Docker.

Just in case the memory error is in some way related to disk usage: I run most disks near max capacity (typically 10MB free). There's 50G free on the node/cache disk.

I've had nvidia-smi running on a 1s refresh for a while, and subspace-farmer looks stable at 888MiB (total GPU usage 969MiB / 16376MiB).
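Next time it happens I'll try to capture the moment of exhaustion rather than just watching it live; something like this (using nvidia-smi's query interface) logs GPU memory and per-process usage once per second to CSV, which can then be correlated with the farmer log timestamps:

```shell
# Log overall GPU memory once per second.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
  --format=csv -l 1 > gpu_mem.csv &

# Log per-process GPU memory usage once per second.
nvidia-smi --query-compute-apps=timestamp,pid,process_name,used_memory \
  --format=csv -l 1 > gpu_procs.csv &
```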

Docker compose files below.

Plotter:

services:
  farmer_plotter:
    container_name: autonomys_farmer_plotter
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    ports:
      - "9084:9084"
    restart: unless-stopped
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9084",
        "plotter"
      ]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.7  
        
networks:
  nats_cluster_network:
    external: true

Node:

services:
  autonomys_node:
    container_name: autonomys_node
    image: ghcr.io/autonomys/node:mainnet-2025-jun-18
    volumes:
      - /mnt/nvme2/autonomys/node-data:/var/subspace:rw
    ports:
      - "0.0.0.0:30333:30333/tcp"
      - "0.0.0.0:30433:30433/tcp"
      - "9080:9080"
      - "9944:9944"     
    restart: unless-stopped
    command:
      [
        "run",
        "--chain", "mainnet",
        "--base-path", "/var/subspace",
        "--sync", "full",
        "--listen-on", "/ip4/0.0.0.0/tcp/30333",
        "--dsn-listen-on", "/ip4/0.0.0.0/tcp/30433",
        "--rpc-cors", "all",
        "--rpc-methods", "unsafe",
        "--rpc-listen-on", "0.0.0.0:9944",
        "--prometheus-listen-on", "0.0.0.0:9080",
        "--farmer",
        "--name", "...removed..."
      ]
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.3
    healthcheck:
      timeout: 10s
      interval: 30s
      retries: 60

networks:
  nats_cluster_network:
    external: true

Farm #1:

services:     
  farmer:
    container_name: autonomys_farmer
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    volumes:
      - /autonomys/nvme1/farm/subspace1:/var/subspace1:rw
      - /autonomys/nvme1/farm/subspace1b:/var/subspace1b:rw
      - /autonomys/nvme1/farm/subspace1c:/var/subspace1c:rw
      - /autonomys/nvme1/farm/subspace1d:/var/subspace1d:rw
      - /autonomys/nvme1/farm/subspace1e:/var/subspace1e:rw
      - /mnt/nvme2/autonomys/farm/subspace2:/var/subspace2:rw
      - /mnt/nvme10/autonomys/farm/farm-10-01:/var/farm-10-01:rw
      - /mnt/nvme11/autonomys/farm/farm-11-01:/var/farm-11-01:rw
      - /mnt/nvme12/autonomys/farm/farm-12-01:/var/farm-12-01:rw
      - /mnt/nvme13/autonomys/farm/farm-13-01:/var/farm-13-01:rw
      - /mnt/nvme30/autonomys/farm/autonomys4:/var/farm-30-01:rw
      - /mnt/nvme31/autonomys/farm/autonomys3:/var/farm-31-01:rw
      - /mnt/nvme32/autonomys/farm/farm-32-01:/var/farm-32-01:rw
      - /mnt/nvme33/autonomys/farm/farm-33-01:/var/farm-33-01:rw
      - /mnt/nvme41/autonomys/farm/farm-41-01:/var/farm-41-01:rw
      - /mnt/nvme42/autonomys/farm/farm-42-01:/var/farm-42-01:rw
      - /mnt/nvme51/autonomys/farm/farm-51-01:/var/farm-51-01:rw
      - /mnt/nvme52/autonomys/farm/farm-52-01:/var/farm-52-01:rw
      - /mnt/Disk-0001/autonomys/farm/farm-0001-1:/var/farm-00-01:rw
      - /mnt/Disk-0002/autonomys/farm/farm-0002-1:/var/farm-00-02:rw
      - /mnt/Disk-0003/autonomys/farm/farm-0003-1:/var/farm-00-03:rw
      - /mnt/Disk-0004/autonomys/farm/farm-0004-1:/var/farm-00-04:rw

    restart: unless-stopped
    ports:
      - "9083:9083"
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9083",
        "farmer",
        "--reward-address",  ... removed ...,
        "path=/var/subspace1,size=1000G",
        "path=/var/subspace1b,size=1000G",
        "path=/var/subspace1c,size=1000G",
        "path=/var/subspace1d,size=500G",
        "path=/var/subspace2,size=3560G",
        "path=/var/farm-10-01,size=3937G",
        "path=/var/farm-11-01,size=3937G",
        "path=/var/farm-12-01,size=3937G",
        "path=/var/farm-13-01,size=3937G",
        "path=/var/farm-30-01,size=3937G",
        "path=/var/farm-31-01,size=3937G",
        "path=/var/farm-32-01,size=3937G",
        "path=/var/farm-33-01,size=3937G",
        "path=/var/farm-41-01,size=3937G",
        "path=/var/farm-42-01,size=3937G",
        "path=/var/farm-51-01,size=3937G",
        "path=/var/farm-52-01,size=3937G",
        "path=/var/farm-00-01,size=3770G",
        "path=/var/farm-00-02,size=3935G",
        "path=/var/farm-00-03,size=7935G",
        "path=/var/farm-00-04,size=1965G"
      ]
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.6

networks:
  nats_cluster_network:
    external: true

Farm #2 (See if you can spot my stupid moments with the typos…):

version: "3.7"

services:        
  autonomys_farmer_netapp_80:
    container_name: autonomys_farmer_netapp_80
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    volumes:
      - /mnt/Disk-8000/autonomys/farm/farm-8000-1:/var/farm-80-00:rw
      - /mnt/Disk-8001/autonnomy/farm/farm-8001-1:/var/farm-80-01:rw
      - /mnt/Disk-8002/autonnomy/farm/farm-8002-1:/var/farm-80-02:rw
      - /mnt/Disk-8003/autonnomy/farm/farm-8003-1:/var/farm-80-03:rw
      - /mnt/Disk-8004/autonnomy/farm/farm-8004-1:/var/farm-80-04:rw
      - /mnt/Disk-8005/autonnomy/farm/farm-8005-1:/var/farm-80-05:rw
      - /mnt/Disk-8006/autonnomy/farm/farm-8006-1:/var/farm-80-06:rw
      - /mnt/Disk-8007/autonomys/farm/farm-8007-1:/var/farm-80-07:rw
      - /mnt/Disk-8008/autonnomy/farm/farm-8008-1:/var/farm-80-08:rw
      - /mnt/Disk-8009/autonomys/farm/farm-8009-1:/var/farm-80-09:rw
      - /mnt/Disk-8010/autonnomy/farm/farm-8010-1:/var/farm-80-10:rw
      - /mnt/Disk-8011/autonnomy/farm/farm-8011-1:/var/farm-80-11:rw
      - /mnt/Disk-8012/autonnomy/farm/farm-8012-1:/var/farm-80-12:rw
      - /mnt/Disk-8013/autonnomy/farm/farm-8013-1:/var/farm-80-13:rw
      - /mnt/Disk-8014/autonomys/farm/farm-8014-1:/var/farm-80-14:rw
      - /mnt/Disk-8015/autonnomy/farm/farm-8015-1:/var/farm-80-15:rw
      - /mnt/Disk-8016/autonnomy/farm/farm-8016-1:/var/farm-80-16:rw
      - /mnt/Disk-8017/autonnomy/farm/farm-8017-1:/var/farm-80-17:rw
      - /mnt/Disk-8018/autonnomy/farm/farm-8018-1:/var/farm-80-18:rw
      - /mnt/Disk-8019/autonomys/farm/farm-8019-1:/var/farm-80-19:rw
      - /mnt/Disk-8020/autonnomy/farm/farm-8020-1:/var/farm-80-20:rw
      - /mnt/Disk-8021/autonomys/farm/farm-8021-1:/var/farm-80-21:rw
      - /mnt/Disk-8022/autonomys/farm/farm-8022-1:/var/farm-80-22:rw
      - /mnt/Disk-8023/autonomys/farm/farm-8023-1:/var/farm-80-23:rw

    restart: unless-stopped
    ports:
      - "9086:9086"
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9086",
        "farmer",
        "--reward-address", ... removed...,
        "path=/var/farm-80-00,size=787G",
        "path=/var/farm-80-01,size=787G",
        "path=/var/farm-80-02,size=787G",
        "path=/var/farm-80-03,size=787G",
        "path=/var/farm-80-04,size=787G",
        "path=/var/farm-80-05,size=787G",
        "path=/var/farm-80-06,size=787G",
        "path=/var/farm-80-07,size=787G",
        "path=/var/farm-80-08,size=787G",
        "path=/var/farm-80-09,size=787G",
        #"path=/var/farm-80-10,size=787G",
        "path=/var/farm-80-11,size=787G",
        "path=/var/farm-80-12,size=787G",
        "path=/var/farm-80-13,size=787G",
        "path=/var/farm-80-14,size=787G",
        "path=/var/farm-80-15,size=787G",
        "path=/var/farm-80-16,size=787G",
        "path=/var/farm-80-17,size=787G",
        "path=/var/farm-80-18,size=787G",
        "path=/var/farm-80-19,size=3779G",
        "path=/var/farm-80-20,size=787G",
        "path=/var/farm-80-21,size=3140G",
        "path=/var/farm-80-22,size=3149G",
        "path=/var/farm-80-23,size=1103G"
      ] 
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.11

networks:
  nats_cluster_network:
    external: true

Controller:

services:
  farmer_controller:
    container_name: autonomys_farmer_controller
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    volumes:
      - /mnt/nvme2/autonomys/cluster/controller:/controller:rw
    ports:
      - "9081:9081"
      - "0.0.0.0:30533:30533/tcp"
    restart: unless-stopped
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9081",
        "controller",
        "--base-path", "/controller",
        "--node-rpc-url", "ws://172.25.0.3:9944"
      ]
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.4

networks:
  nats_cluster_network:
    external: true

Cache:

services:
  farmer_cache:
    container_name: autonomys_farmer_cache
    image: ghcr.io/autonomys/farmer:mainnet-2025-jun-18
    volumes:
      - /mnt/nvme2/autonomys/cluster/cache:/cache:rw
    restart: unless-stopped
    ports:
      - "9082:9082"
    command:
      [
        "cluster",
        "--nats-server", "nats://172.25.0.2:4222",
        "--prometheus-listen-on", "0.0.0.0:9082",
        "cache",
        "path=/cache,size=300GB"
      ]
    networks:
      nats_cluster_network:
        ipv4_address: 172.25.0.5


networks:
  nats_cluster_network:
    external: true

NATS (it’s set up as a cluster, but this was the only node running at the time):

version: '3.8'

services:
  nats:
    image: nats
    container_name: nats
    restart: unless-stopped
    ports:
      - "4222:4222"
      - "4248:4248"
      - "8222:8222"
    volumes:
      - /mnt/nvme2/nats/nats.config:/nats.config:ro
    command: [
      "-c", "/nats.config",
      "-cluster","nats://0.0.0.0:4248",
      "--cluster_name","autonomys-nats-cluster",
      "--http_port","8222",
      ]
    networks:
      cluster_network:
        ipv4_address: 172.25.0.2
        
networks:
  cluster_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.25.0.0/16

Nats config:

max_payload = 2MB

How is the network usage of the NATS server? Could it be network congestion?

In my own practical experience, it’s best to run the NATS server on the same machine as the farmers, so NATS traffic doesn’t take up LAN bandwidth.
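Since the NATS compose file above exposes the monitoring port (`--http_port 8222`), its `/varz` endpoint can answer the bandwidth question directly: it includes `in_bytes`/`out_bytes` counters, and sampling one twice gives a rough throughput figure. A sketch that avoids extra tooling by extracting the counter with `sed`:

```shell
# Extract a numeric field (e.g. the out_bytes counter) from the JSON
# that the NATS monitoring endpoint returns on /varz.
varz_field() {  # varz_field FIELD JSON
  printf '%s\n' "$2" | sed -n "s/.*\"$1\": *\([0-9]*\).*/\1/p"
}

# Sample the counter twice, 5s apart, for a rough bytes/sec figure.
# Port 8222 matches the --http_port in the NATS compose file above.
if command -v curl >/dev/null 2>&1; then
  b1=$(varz_field out_bytes "$(curl -sf http://localhost:8222/varz)")
  sleep 5
  b2=$(varz_field out_bytes "$(curl -sf http://localhost:8222/varz)")
  [ -n "$b1" ] && [ -n "$b2" ] && echo "NATS outbound: $(( (b2 - b1) / 5 )) bytes/sec"
fi
```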

The only thing I’ve spotted is that your plotter service is missing the runtime: nvidia entry, which should be at the same level as the deploy: key. Something like this:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  
              capabilities: [gpu]
    runtime: nvidia      
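As a sanity check before relying on `runtime: nvidia`, it's worth confirming the Docker daemon actually has the nvidia runtime registered (it comes from the NVIDIA Container Toolkit). A sketch; the helper just scans the runtimes listing:

```shell
# Report whether the nvidia runtime appears in the daemon's
# registered runtimes; `runtime: nvidia` in compose fails without it.
has_nvidia_runtime() {  # has_nvidia_runtime "RUNTIMES"
  case "$1" in
    *nvidia*) echo yes ;;
    *)        echo no  ;;
  esac
}

if command -v docker >/dev/null 2>&1; then
  has_nvidia_runtime "$(docker info --format '{{.Runtimes}}' 2>/dev/null)"
fi
```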

Also, have you tried with later driver versions?