NATS server invoked OOM killer

I am running some other processes on the same system as my farming cluster - a Threadripper with 512GB of RAM. Normally, the farming cluster and these other processes together use about 45GB of memory. But today, my other process was killed by the OOM killer (invoked by the nats server) for no apparent reason. Here’s what I pulled out of the system logs:

Jul 01 20:00:01 Sleipnir CRON[2540902]: pam_unix(cron:session): session closed for user root
Jul 01 20:02:15 Sleipnir kernel: nats-server invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Jul 01 20:02:15 Sleipnir kernel: CPU: 11 PID: 2361008 Comm: nats-server Tainted: P           OE      6.5.0-41-generic #41~22.04.2-Ubuntu
Jul 01 20:02:15 Sleipnir kernel: Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 0404 12/20/2023
Jul 01 20:02:15 Sleipnir kernel: Call Trace:
Jul 01 20:02:15 Sleipnir kernel:  <TASK>
Jul 01 20:02:15 Sleipnir kernel:  dump_stack_lvl+0x48/0x70
Jul 01 20:02:15 Sleipnir kernel:  dump_stack+0x10/0x20
Jul 01 20:02:15 Sleipnir kernel:  dump_header+0x50/0x290
Jul 01 20:02:15 Sleipnir kernel:  oom_kill_process+0x10d/0x1c0
Jul 01 20:02:15 Sleipnir kernel:  out_of_memory+0x103/0x350
Jul 01 20:02:15 Sleipnir kernel:  __alloc_pages_may_oom+0x112/0x1e0
Jul 01 20:02:15 Sleipnir kernel:  __alloc_pages_slowpath.constprop.0+0x46f/0x9a0
Jul 01 20:02:15 Sleipnir kernel:  __alloc_pages+0x31d/0x350
Jul 01 20:02:15 Sleipnir kernel:  alloc_pages+0x91/0x1a0
Jul 01 20:02:15 Sleipnir kernel:  folio_alloc+0x1d/0x60
Jul 01 20:02:15 Sleipnir kernel:  filemap_alloc_folio+0x31/0x40
Jul 01 20:02:15 Sleipnir kernel:  __filemap_get_folio+0xd8/0x230
Jul 01 20:02:15 Sleipnir kernel:  filemap_fault+0x454/0x750
Jul 01 20:02:15 Sleipnir kernel:  ? srso_alias_return_thunk+0x5/0x7f
Jul 01 20:02:15 Sleipnir kernel:  __do_fault+0x36/0x150
Jul 01 20:02:15 Sleipnir kernel:  do_read_fault+0x11d/0x170
Jul 01 20:02:15 Sleipnir kernel:  do_fault+0xf3/0x170
Jul 01 20:02:15 Sleipnir kernel:  handle_pte_fault+0x74/0x170
Jul 01 20:02:15 Sleipnir kernel:  __handle_mm_fault+0x65c/0x720
Jul 01 20:02:15 Sleipnir kernel:  handle_mm_fault+0x164/0x360
Jul 01 20:02:15 Sleipnir kernel:  do_user_addr_fault+0x212/0x6b0
Jul 01 20:02:15 Sleipnir kernel:  ? srso_alias_return_thunk+0x5/0x7f
Jul 01 20:02:15 Sleipnir kernel:  exc_page_fault+0x83/0x1b0
Jul 01 20:02:15 Sleipnir kernel:  asm_exc_page_fault+0x27/0x30
Jul 01 20:02:15 Sleipnir kernel: RIP: 0033:0x448fff
Jul 01 20:02:15 Sleipnir kernel: Code: Unable to access opcode bytes at 0x448fd5.
Jul 01 20:02:15 Sleipnir kernel: RSP: 002b:000000c00002df38 EFLAGS: 00010246
Jul 01 20:02:15 Sleipnir kernel: RAX: 0001ceb5b562d611 RBX: 000000c0000de008 RCX: 0001ceb5a8fab751
Jul 01 20:02:15 Sleipnir kernel: RDX: 000000002ff02211 RSI: 0000000000000000 RDI: 0000000000000000
Jul 01 20:02:15 Sleipnir kernel: RBP: 000000c00002df90 R08: 0000000000000000 R09: 0000000000000000
Jul 01 20:02:15 Sleipnir kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000c00002df18
Jul 01 20:02:15 Sleipnir kernel: R13: 000000c000249008 R14: 000000c000006540 R15: 003fffffffffffff
Jul 01 20:02:15 Sleipnir kernel:  </TASK>
Jul 01 20:02:15 Sleipnir kernel: Mem-Info:
Jul 01 20:02:15 Sleipnir kernel: active_anon:62514260 inactive_anon:61867808 isolated_anon:0
                                  active_file:7343 inactive_file:9250 isolated_file:0
                                  unevictable:8 dirty:1261 writeback:0
                                  slab_reclaimable:5026246 slab_unreclaimable:253451
                                  mapped:21069 shmem:22372 pagetables:276515
                                  sec_pagetables:0 bounce:0
                                  kernel_misc_reclaimable:0
                                  free:908756 free_pcp:0 free_cma:0
Jul 01 20:02:15 Sleipnir kernel: Node 0 active_anon:250057040kB inactive_anon:247471232kB active_file:29372kB inactive_file:37000kB unevictable:32kB isolated(anon):0kB isolated(file):0kB mapped:84276kB dirty:5044kB writeback:0kB shmem:89488kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0k>
Jul 01 20:02:15 Sleipnir kernel: Node 0 DMA free:11260kB boost:0kB min:0kB low:12kB high:24kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB loca>
Jul 01 20:02:15 Sleipnir kernel: lowmem_reserve[]: 0 1261 515004 515004 515004
Jul 01 20:02:15 Sleipnir kernel: Node 0 DMA32 free:1513096kB boost:0kB min:224kB low:1512kB high:2800kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1586344kB managed:1519240kB mlocked:0kB bounce:0kB fre>
Jul 01 20:02:15 Sleipnir kernel: lowmem_reserve[]: 0 0 513742 513742 513742
Jul 01 20:02:15 Sleipnir kernel: Node 0 Normal free:2110668kB boost:0kB min:91628kB low:617700kB high:1143772kB reserved_highatomic:2019328KB active_anon:250057040kB inactive_anon:247471232kB active_file:29372kB inactive_file:36496kB unevictable:32kB writepending:5044kB present:534736640kB >
Jul 01 20:02:15 Sleipnir kernel: lowmem_reserve[]: 0 0 0 0 0
Jul 01 20:02:15 Sleipnir kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (M) 2*4096kB (M) = 11260kB
Jul 01 20:02:15 Sleipnir kernel: Node 0 DMA32: 12*4kB (UM) 11*8kB (UM) 10*16kB (UM) 7*32kB (UM) 10*64kB (UM) 8*128kB (UM) 8*256kB (UM) 9*512kB (UM) 9*1024kB (UM) 12*2048kB (UM) 359*4096kB (M) = 1513096kB
Jul 01 20:02:15 Sleipnir kernel: Node 0 Normal: 21714*4kB (ME) 11631*8kB (UME) 8497*16kB (UME) 51660*32kB (UME) 2243*64kB (UME) 8*128kB (UM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2113808kB
Jul 01 20:02:15 Sleipnir kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jul 01 20:02:15 Sleipnir kernel: Node 0 hugepages_total=1000 hugepages_free=728 hugepages_surp=0 hugepages_size=2048kB
Jul 01 20:02:15 Sleipnir kernel: 58424 total pagecache pages
Jul 01 20:02:15 Sleipnir kernel: 0 pages in swap cache
Jul 01 20:02:15 Sleipnir kernel: Free swap  = 0kB
Jul 01 20:02:15 Sleipnir kernel: Total swap = 0kB
Jul 01 20:02:15 Sleipnir kernel: 134084745 pages RAM
Jul 01 20:02:15 Sleipnir kernel: 0 pages HighMem/MovableOnly
Jul 01 20:02:15 Sleipnir kernel: 2180892 pages reserved
Jul 01 20:02:15 Sleipnir kernel: 0 pages hwpoisoned
Jul 01 20:02:15 Sleipnir kernel: Tasks state (memory values in pages):
Jul 01 20:02:15 Sleipnir kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jul 01 20:02:15 Sleipnir kernel: [    966]     0   966    31747     2146   253952        0          -250 systemd-journal
Jul 01 20:02:15 Sleipnir kernel: [   1003]     0  1003     6874     1408    77824        0         -1000 systemd-udevd
Jul 01 20:02:15 Sleipnir kernel: [   1934]   101  1934     6451     2816    94208        0             0 systemd-resolve
Jul 01 20:02:15 Sleipnir kernel: [   1936]   103  1936    22380     1536    77824        0             0 systemd-timesyn
Jul 01 20:02:15 Sleipnir kernel: [   1949]     0  1949     1282      384    49152        0             0 blkmapd
Jul 01 20:02:15 Sleipnir kernel: [   1951]     0  1951      777      512    45056        0             0 rpc.idmapd
Jul 01 20:02:15 Sleipnir kernel: [   1953]     0  1953     1366      512    45056        0             0 nfsdcld
Jul 01 20:02:15 Sleipnir kernel: [   1958]     0  1958    60061     1536   106496        0             0 accounts-daemon
Jul 01 20:02:15 Sleipnir kernel: [   1959]     0  1959      704      384    45056        0             0 acpid
Jul 01 20:02:15 Sleipnir kernel: [   1964]     0  1964     2374      640    61440        0             0 cron
Jul 01 20:02:15 Sleipnir kernel: [   1965]   102  1965     2572     1024    65536        0          -900 dbus-daemon
Jul 01 20:02:15 Sleipnir kernel: [   1971]     0  1971    20998     1024    65536        0             0 irqbalance
Jul 01 20:02:15 Sleipnir kernel: [   1980]     0  1980    10288     4096   118784        0             0 networkd-dispat
Jul 01 20:02:15 Sleipnir kernel: [   1988]     0  1988    60865     2176   118784        0             0 polkitd
Jul 01 20:02:15 Sleipnir kernel: [   1991]   131  1991     3285      384    61440        0             0 nvidia-persiste
Jul 01 20:02:15 Sleipnir kernel: [   1992]   104  1992    55601     1024    81920        0             0 rsyslogd
Jul 01 20:02:15 Sleipnir kernel: [   1996]     0  1996     3131     1280    65536        0             0 smartd
Jul 01 20:02:15 Sleipnir kernel: [   2000]     0  2000    59151     1280    98304        0             0 switcheroo-cont
Jul 01 20:02:15 Sleipnir kernel: [   2003]     0  2003    12080     3210    98304        0             0 systemd-logind
Jul 01 20:02:15 Sleipnir kernel: [   2005]     0  2005    46268     2048   122880        0             0 touchegg
Jul 01 20:02:15 Sleipnir kernel: [   2007]     0  2007   117179     2944   143360        0             0 udisksd
Jul 01 20:02:15 Sleipnir kernel: [   2010]     0  2010     4126     1280    77824        0             0 wpa_supplicant
Jul 01 20:02:15 Sleipnir kernel: [   2013]     0  2013    25322      896    77824        0             0 zed
Jul 01 20:02:15 Sleipnir kernel: [   2052]     0  2052    79492     2304   118784        0             0 ModemManager
Jul 01 20:02:15 Sleipnir kernel: [   2055]     0  2055    60065     1408   106496        0             0 boltd
Jul 01 20:02:15 Sleipnir kernel: [   2157]     0  2157   102727     2560   167936        0             0 NetworkManager
Jul 01 20:02:15 Sleipnir kernel: [   2226]   132  2226     1353      640    53248        0             0 vnstatd
Jul 01 20:02:15 Sleipnir kernel: [   2227]     0  2227   764643     5589   466944        0          -999 containerd
Jul 01 20:02:15 Sleipnir kernel: [   2234]     0  2234    76698     1280    98304        0             0 lightdm
Jul 01 20:02:15 Sleipnir kernel: [   2239]   124  2239    61429     2432   118784        0             0 colord
Jul 01 20:02:15 Sleipnir kernel: [   2349]     0  2349  6389839    15468   393216        0             0 Xorg
Jul 01 20:02:15 Sleipnir kernel: [   2367]     0  2367     2194      512    49152        0             0 agetty
Jul 01 20:02:15 Sleipnir kernel: [   2479]   107  2479    38501      768    65536        0             0 rtkit-daemon
Jul 01 20:02:15 Sleipnir kernel: [   2528]     0  2528    60579     1408   106496        0             0 upowerd
Jul 01 20:02:15 Sleipnir kernel: [   2542]     0  2542    41232     1670    86016        0             0 lightdm
Jul 01 20:02:15 Sleipnir kernel: [   2715]   133  2715   437470    15207   733184        0             0 grafana
Jul 01 20:02:15 Sleipnir kernel: [   2718]   999  2718  1013398    27672  1105920        0             0 prometheus
Jul 01 20:02:15 Sleipnir kernel: [   2719]     0  2719     1716     1013    53248        0             0 rpc.mountd
Jul 01 20:02:15 Sleipnir kernel: [   2721]   109  2721     3273      768    65536        0             0 kerneloops
Jul 01 20:02:15 Sleipnir kernel: [   2725]   109  2725     3273      384    65536        0             0 kerneloops
Jul 01 20:02:15 Sleipnir kernel: [   3015]  1000  3015     4589     2304    81920        0             0 systemd
Jul 01 20:02:15 Sleipnir kernel: [   3016]  1000  3016    43124     1711   106496        0             0 (sd-pam)
Jul 01 20:02:15 Sleipnir kernel: [   3231]  1000  3231     9822     1152    73728        0             0 pipewire
Jul 01 20:02:15 Sleipnir kernel: [   3232]  1000  3232   155572     4396   172032        0             0 pulseaudio
Jul 01 20:02:15 Sleipnir kernel: [   3238]  1000  3238    60225     1414    94208        0             0 gnome-keyring-d
Jul 01 20:02:15 Sleipnir kernel: [   3241]  1000  3241     2425     1152    61440        0             0 dbus-daemon
Jul 01 20:02:15 Sleipnir kernel: [   3245]  1000  3245   101180     2304   147456        0             0 cinnamon-sessio
Jul 01 20:02:15 Sleipnir kernel: [   3548]  1000  3548    93764     3200   180224        0             0 csd-automount
Jul 01 20:02:15 Sleipnir kernel: [   3549]  1000  3549   149494     3456   208896        0             0 csd-color
Jul 01 20:02:15 Sleipnir kernel: [   3550]  1000  3550    56722     3456   163840        0             0 csd-clipboard
Jul 01 20:02:15 Sleipnir kernel: [   3552]  1000  3552    75205     3200   172032        0             0 csd-keyboard
Jul 01 20:02:15 Sleipnir kernel: [   3554]  1000  3554    75323     3200   172032        0             0 csd-wacom
Jul 01 20:02:15 Sleipnir kernel: [   3557]  1000  3557    77436     1408   106496        0             0 at-spi-bus-laun
Jul 01 20:02:15 Sleipnir kernel: [   3559]  1000  3559   108321     5857   208896        0             0 csd-background
Jul 01 20:02:15 Sleipnir kernel: [   3560]  1000  3560    57985     1152    86016        0             0 csd-a11y-settin
Jul 01 20:02:15 Sleipnir kernel: [   3566]  1000  3566   178665     3968   204800        0             0 csd-media-keys
Jul 01 20:02:15 Sleipnir kernel: [   3569]  1000  3569     2174      896    61440        0             0 dbus-daemon
Jul 01 20:02:15 Sleipnir kernel: [   3571]  1000  3571    75268     3200   172032        0             0 csd-housekeepin
Jul 01 20:02:15 Sleipnir kernel: [   3572]  1000  3572    60190     1408   102400        0             0 gvfsd
Jul 01 20:02:15 Sleipnir kernel: [   3582]  1000  3582    57630     1024    81920        0             0 csd-screensaver
Jul 01 20:02:15 Sleipnir kernel: [   3584]  1000  3584    57769     1152    77824        0             0 csd-settings-re
Jul 01 20:02:15 Sleipnir kernel: [   3589]  1000  3589    75584     3328   172032        0             0 csd-xsettings
Jul 01 20:02:15 Sleipnir kernel: [   3594]  1000  3594    61025     1536   114688        0             0 csd-print-notif
Jul 01 20:02:15 Sleipnir kernel: [   3595]  1000  3595   134911     4819   241664        0             0 csd-power
Jul 01 20:02:15 Sleipnir kernel: [   3611]  1000  3611    85590     2176   155648        0             0 csd-printer
Jul 01 20:02:15 Sleipnir kernel: [   3618]  1000  3618    39335     1024    69632        0             0 dconf-service
Jul 01 20:02:15 Sleipnir kernel: [   3625]  1000  3625    95224     1024   110592        0             0 gvfsd-fuse
Jul 01 20:02:15 Sleipnir kernel: [   3642]  1000  3642    40700     1408    81920        0             0 at-spi2-registr
Jul 01 20:02:15 Sleipnir kernel: [   3645]  1000  3645   232744     2048   184320        0             0 gvfs-udisks2-vo
Jul 01 20:02:15 Sleipnir kernel: [   3675]  1000  3675   116838     6400   229376        0             0 cinnamon-launch
Jul 01 20:02:15 Sleipnir kernel: [   3687]  1000  3687  1193685    55959  1433600        0             0 cinnamon
Jul 01 20:02:15 Sleipnir kernel: [   3697]  1000  3697    59147     1280    98304        0             0 gvfs-mtp-volume
Jul 01 20:02:15 Sleipnir kernel: [   3701]  1000  3701    78805     1408   118784        0             0 gvfs-afc-volume
Jul 01 20:02:15 Sleipnir kernel: [   3706]  1000  3706    59420     1408    98304        0             0 gvfs-gphoto2-vo
Jul 01 20:02:15 Sleipnir kernel: [   3710]  1000  3710    59162     1152    98304        0             0 gvfs-goa-volume
Jul 01 20:02:15 Sleipnir kernel: [   3714]  1000  3714   145516     4736   282624        0             0 goa-daemon
Jul 01 20:02:15 Sleipnir kernel: [   3721]  1000  3721    84601     2304   147456        0             0 goa-identity-se
Jul 01 20:02:15 Sleipnir kernel: [   3739]  1000  3739    40770     1024    90112        0             0 gvfsd-metadata
Jul 01 20:02:15 Sleipnir kernel: [   3742]  1000  3742   116277     4991   229376        0             0 xapp-sn-watcher
Jul 01 20:02:15 Sleipnir kernel: [   3760]  1000  3760    59279     1024    94208        0             0 agent
Jul 01 20:02:15 Sleipnir kernel: [   3762]  1000  3762   114849     4864   204800        0             0 polkit-gnome-au
Jul 01 20:02:15 Sleipnir kernel: [   3766]  1000  3766   103499     8832   258048        0             0 blueman-applet
Jul 01 20:02:15 Sleipnir kernel: [   3769]  1000  3769   215676    12009   360448        0             0 nemo-desktop
Jul 01 20:02:15 Sleipnir kernel: [   3772]  1000  3772   156902     6272   274432        0             0 nm-applet
Jul 01 20:02:15 Sleipnir kernel: [   3774]  1000  3774   186216     6656   380928        0             0 evolution-alarm
Jul 01 20:02:15 Sleipnir kernel: [   3778]  1000  3778    78683     4992   196608        0             0 cinnamon-killer
Jul 01 20:02:15 Sleipnir kernel: [   3809]  1000  3809   268088     3456   274432        0             0 evolution-sourc
Jul 01 20:02:15 Sleipnir kernel: [   3838]  1000  3838   231161     3840   282624        0             0 evolution-calen
Jul 01 20:02:15 Sleipnir kernel: [   3851]  1000  3851    11740     1280    86016        0             0 obexd
Jul 01 20:02:15 Sleipnir kernel: [   3859]  1000  3859   168129     3840   253952        0             0 evolution-addre
Jul 01 20:02:15 Sleipnir kernel: [   3887]  1000  3887    97512     1792   122880        0             0 gvfsd-trash
Jul 01 20:02:15 Sleipnir kernel: [   3960]  1000  3960   145087     7875   282624        0             0 gnome-terminal-
Jul 01 20:02:15 Sleipnir kernel: [   3981]  1000  3981     2852     1152    57344        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   4047]  1000  4047      723      384    45056        0             0 sh
Jul 01 20:02:15 Sleipnir kernel: [   4049]  1000  4049    58327     1280    86016        0             0 pxgsettings
Jul 01 20:02:15 Sleipnir kernel: [   4063]  1000  4063   208828    17494   434176        0             0 mintUpdate
Jul 01 20:02:15 Sleipnir kernel: [   4134]  1000  4134    14988     6016   159744        0             0 applet.py
Jul 01 20:02:15 Sleipnir kernel: [   4135]  1000  4135   204477    32000   512000        0             0 psensor
Jul 01 20:02:15 Sleipnir kernel: [   4170]  1000  4170   126129    12103   307200        0             0 mintreport-tray
Jul 01 20:02:15 Sleipnir kernel: [   4328]  1000  4328     1788      128    57344        0             0 epmd
Jul 01 20:02:15 Sleipnir kernel: [   5735]  1000  5735     2852     1152    57344        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   6231]  1000  6231     2493      768    61440        0             0 startnode
Jul 01 20:02:15 Sleipnir kernel: [   6232]  1000  6232 457787034   578645 61497344        0             0 subspace-node-u
Jul 01 20:02:15 Sleipnir kernel: [   6478]  1000  6478     2852     1152    65536        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   7080]  1000  7080     2101      512    57344        0             0 tail
Jul 01 20:02:15 Sleipnir kernel: [   7315]  1000  7315     2852     1152    61440        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   7352]  1000  7352     2493      768    61440        0             0 startcontroller
Jul 01 20:02:15 Sleipnir kernel: [   7353]  1000  7353  1976736   276351  6090752        0             0 subspace-farmer
Jul 01 20:02:15 Sleipnir kernel: [   7444]  1000  7444     2852     1152    65536        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   7596]  1000  7596     2101      512    57344        0             0 tail
Jul 01 20:02:15 Sleipnir kernel: [   7796]  1000  7796     2852     1152    69632        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   8048]  1000  8048     2852      769    61440        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   8049]  1000  8049  1401941     4754   815104        0             0 subspace-farmer
Jul 01 20:02:15 Sleipnir kernel: [   8204]  1000  8204     2852     1152    65536        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   8391]  1000  8391     2493      640    57344        0             0 startfarm
Jul 01 20:02:15 Sleipnir kernel: [   8392]  1000  8392 15985139  4393276 66285568        0             0 subspace-farmer
Jul 01 20:02:15 Sleipnir kernel: [   8753]  1000  8753     2852     1152    69632        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [   9014]  1000  9014     2101      512    57344        0             0 tail
Jul 01 20:02:15 Sleipnir kernel: [  12145]  1000 12145     2852     1152    61440        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [  13540]     0 13540    74594     2816   167936        0             0 packagekitd
Jul 01 20:02:15 Sleipnir kernel: [  34110]     0 34110   112654    16271   323584        0             0 fwupd
Jul 01 20:02:15 Sleipnir kernel: [ 228985]  1000 228985     2852     1152    65536        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [ 743518]  1000 743518   142157     2304   200704        0             0 xdg-desktop-por
Jul 01 20:02:15 Sleipnir kernel: [ 743588]  1000 743588   134226     1280   135168        0             0 xdg-document-po
Jul 01 20:02:15 Sleipnir kernel: [ 743591]  1000 743591    59038     1152    98304        0             0 xdg-permission-
Jul 01 20:02:15 Sleipnir kernel: [ 743597]  1000 743597      699      384    45056        0             0 fusermount3
Jul 01 20:02:15 Sleipnir kernel: [ 743600]  1000 743600    77619     1024   102400        0             0 xdg-desktop-por
Jul 01 20:02:15 Sleipnir kernel: [ 743670]  1000 743670      723      384    45056        0             0 sh
Jul 01 20:02:15 Sleipnir kernel: [ 743755]  1000 743755    58327     1408    86016        0             0 pxgsettings
Jul 01 20:02:15 Sleipnir kernel: [ 743759]  1000 743759    93922     3456   184320        0             0 xdg-desktop-por
Jul 01 20:02:15 Sleipnir kernel: [1061641]  1000 1061641     2852     1152    65536        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [1153515]  1000 1153515     2852     1152    65536        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [2173498]     0 2173498    19686     2176   135168        0             0 cupsd
Jul 01 20:02:15 Sleipnir kernel: [2173753]  1000 2173753     2493      768    65536        0             0 arstart
Jul 01 20:02:15 Sleipnir kernel: [2173754]  1000 2173754     2493      768    65536        0             0 bash
Jul 01 20:02:15 Sleipnir kernel: [2173755]  1000 2173755     2095      512    53248        0             0 tee
Jul 01 20:02:15 Sleipnir kernel: [2173763]  1000 2173763 137479665 118813073 978194432        0             0 beam.smp
Jul 01 20:02:15 Sleipnir kernel: [2174018]  1000 2174018      696      256    45056        0             0 erl_child_setup
Jul 01 20:02:15 Sleipnir kernel: [2174351]  1000 2174351      723      384    45056        0             0 sh
Jul 01 20:02:15 Sleipnir kernel: [2174352]  1000 2174352      662      256    40960        0             0 memsup
Jul 01 20:02:15 Sleipnir kernel: [2174353]  1000 2174353      695      256    49152        0             0 cpu_sup
Jul 01 20:02:15 Sleipnir kernel: [2174355]  1000 2174355      936      384    49152        0             0 inet_gethost
Jul 01 20:02:15 Sleipnir kernel: [2174356]  1000 2174356      942      384    49152        0             0 inet_gethost
Jul 01 20:02:15 Sleipnir kernel: [2174357]  1000 2174357      723      384    45056        0             0 sh
Jul 01 20:02:15 Sleipnir kernel: [2360384]     0 2360384     3858     1792    73728        0         -1000 sshd
Jul 01 20:02:15 Sleipnir kernel: [2360591]     0 2360591   847847     8536   585728        0          -500 dockerd
Jul 01 20:02:15 Sleipnir kernel: [2360823]     0 2360823   418098     1152   155648        0          -500 docker-proxy
Jul 01 20:02:15 Sleipnir kernel: [2360959]     0 2360959   309615     3200   114688        0          -998 containerd-shim
Jul 01 20:02:15 Sleipnir kernel: [2360978]     0 2360978   311330     4351   163840        0             0 nats-server
Jul 01 20:02:15 Sleipnir kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-4f932ac17f61271ab3f964ecd793b7d885ff3c06e568f01830d929c1a54a5164.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice>
Jul 01 20:02:15 Sleipnir kernel: Out of memory: Killed process 2173763 (beam.smp) total-vm:549918660kB, anon-rss:475218500kB, file-rss:5120kB, shmem-rss:28672kB, UID:1000 pgtables:955268kB oom_score_adj:0
Jul 01 20:02:15 Sleipnir systemd[1]: user@1000.service: A process of this unit has been killed by the OOM killer.
░░ Subject: A process of user@1000.service unit has been killed by the OOM killer.
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support

It seems unlikely that I could have come anywhere near actually running out of memory, given I had about 460GB of free RAM. I even have a Grafana dashboard monitoring that process’s memory usage, and it shows nothing unusual at the time of this event. Neither the farming cluster nor my other process has ever shown any sign of a memory leak. Any idea what could be going on here?

Additional information: From my research on how the Linux OOM killer operates, it would have been the nats-server that attempted to allocate more memory than was available, and “beam.smp” was targeted only because it was the process using the most memory at the time.
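To make that concrete: my understanding is that the killer ranks tasks by a badness score (roughly resident memory plus page tables, biased by oom_score_adj), and the current scores are exposed under /proc. A quick sketch for listing the top candidates on a live system - standard procfs, nothing specific to my setup:

# List the kernel's current top OOM-kill candidates by badness score.
# /proc/<pid>/oom_score is the value the killer compares between tasks;
# /proc/<pid>/oom_score_adj is the user-set bias (-1000 disables killing).
for p in /proc/[0-9]*; do
    score=$(cat "$p/oom_score" 2>/dev/null) || continue
    adj=$(cat "$p/oom_score_adj" 2>/dev/null)
    comm=$(cat "$p/comm" 2>/dev/null)
    printf '%s %s %s %s\n' "$score" "$adj" "${p#/proc/}" "$comm"
done | sort -rn | head -n 15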

So, while I have asked about this on the other process’s support forum, it seems likely that the fault lies in an erroneous request for an absurd amount of memory from the nats server itself (or, perhaps, from Docker).

For the record, I checked docker logs nats and it doesn’t show any output at the time of this event.

Probably a far stretch, but if you have noticed any hardware instability, it can result in an attempt to allocate some absurd amount of memory all of a sudden.

It is unlikely for NATS to use a lot of memory though; it tends to fire messages off ASAP rather than hold them in RAM. How many farms/plotters do you have connected to this NATS server, and what is the typical memory usage that you see for this NATS server?

Single farmer, about 120TB of plots; all plots and processes of the cluster are local to the Threadripper. Not running a plotter. Not sure what memory the NATS server uses by itself, but during normal operation my entire RAM usage generally maxes out at around 52GB (out of 512GB).

This system has been as stable as a rock. No instability of any other kind detected at all yet. And the memory is all ECC.

I meant you can check the processes and see how much RAM the NATS process is using right now.

Per pmap, the nats process is using 1.25GB.
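For reference, something along these lines is how I checked (a sketch - nats runs in Docker here, but its process is visible from the host, so I queried it directly):

# Rough memory check for the NATS server from the host side
# (the container's process is still visible to the host kernel).
pid=$(pgrep -x nats-server | head -n 1)
ps -o pid,rss,vsz,comm -p "$pid"   # RSS/VSZ in KiB
pmap -x "$pid" | tail -n 1         # total mapped / resident / dirty, in KiB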

That makes sense to me. I’d say let’s wait and see if it happens again.

What OS/kernel are you using, and what version of the Subspace software are you on? I just had the OOM killer kill a bunch of software, including the user session, on Ubuntu 24.04 with a 6.9.7 kernel on an AMD Threadripper 7970X (just 128G of memory though).

It killed a bunch of components, but none of them seem to use a particularly large amount of memory.

Farmer and node were not killed and still use a reasonable amount of memory. The printed Tasks state also looks fine; the node and farmer use about the same amount of memory after I re-logged in as they did in the kernel logs:

[12253.434313] Tasks state (memory values in pages):
[12253.434314] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[12253.434429] [   4000] 65534  4000 348993216   291630   290509      680       441 31412224        0             0 subspace-node
[12253.434442] [   5651] 65534  5651  9525541   741604   741604        0         0 14614528        0             0 subspace-farmer

Not running a farming cluster, but I am already running the current version of the main branch for testing purposes.

What is interesting is that many farming threads seem to invoke the OOM killer in my case - but not only farming threads - so I think it is a coincidence caused by the farmer having activity all the time :thinking::

[12253.434056] farming-4.14 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12253.442613] farming-0.5 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12253.443754] farming-4.13 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12253.444678] farming-4.13 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12253.446111] farming-3.13 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12253.447155] farming-1.18 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12253.448654] farming-0.15 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12258.241726] qbittorrent invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12258.242803] qbittorrent invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12258.243442] qbittorrent invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12263.520514] Socket Thread invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12263.942332] farming-5.23 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12263.944854] farming-5.31 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12263.945774] BHMgr Processor invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=167
[12263.946514] tokio-runtime-w invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12263.958199] farming-5.5 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[12263.959096] farming-5.5 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0

Memory usage just went through the roof all of a sudden (yes, I have monitoring on the desktop too):

[screenshot]

I’m wondering whether it is a bug introduced by some recent Ubuntu update, or whether the Subspace software somehow causes it. It is certainly an odd coincidence that I had just upgraded to the latest main revision (I was running a slightly older version of the farmer, and jun-18 before that). Also, both our systems are running Threadrippers.

Will keep monitoring the situation.

Linux Mint 21.3 (which rides on top of Ubuntu 22.04). Kernel is 6.5.0-41.

The problem has not recurred since I posted. But I have a feeling it happened before and I just didn’t notice (since my other processes autorestart).

When was the first time? If it is related to Subspace in any way, what release did you run back then? I understand it is hypothetical, but still.

I can’t be sure my OP wasn’t the first time. I just know something has previously caused my other processes to restart (because some counters were zeroed out upon the auto restart). Since all system logs get wiped on reboot, I can’t go back to check.
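For next time, making the journal persistent should keep the kernel messages across reboots. My understanding of the stock systemd way to do that (a sketch; with the default Storage=auto, creating the directory is all that is needed):

# Keep the systemd journal across reboots instead of in RAM (tmpfs).
# With the default Storage=auto, creating this directory is enough.
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald
# After the next incident, earlier boots can then be inspected with:
journalctl --list-boots
journalctl -k -b -1 | grep -iE 'oom|out of memory'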

Right now my original post is pretty much all the info I can provide. But I will certainly post again if it happens again.


So this is interesting. On another box, one that was only running the farmer connected to the remote controller, I just had tokio-runtime-w invoke the OOM killer. Pretty sure tokio-runtime is part of the farmer process, right? No plotter was running on any box at the time, by the way.

Do you want a stack dump like the one I provided last time? Let me know what, if anything, I can provide from this one.

Once again, monitoring the memory usage on the other process that was killed doesn’t show anything odd going on there.

I think it only shows tokio threads because they are the most active, since the farmer is doing something every single second, but I don’t think the farmer is the reason here.

Not sure how to debug such a thing; we need to figure out what is using all that memory. So far it looks like a bug in the kernel that the software is somehow triggering.

The only thing I can think of to suggest is perhaps running a Glances server with (very frequent) Prometheus scraping.
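Something along these lines is what I have in mind (just a sketch, not set up here yet - the Glances Prometheus export port and config details should be double-checked against the Glances docs):

# Export system and per-process metrics from Glances for Prometheus to scrape.
pip install glances prometheus_client   # the Prometheus exporter needs prometheus_client
glances --export prometheus &           # serves metrics over HTTP (port set in glances.conf)

# Then add a job like this under the existing scrape_configs: section of
# prometheus.yml, with a short interval so brief spikes are not missed:
#
#   - job_name: glances
#     scrape_interval: 5s
#     static_configs:
#       - targets: ['localhost:9091']   # adjust to the Glances export port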

Judging from the fact that I have not seen any of the processes using a substantial amount of memory, I suspect it is not an app, but the kernel (a driver, maybe) that is using that memory, in which case we won’t necessarily see it in the process list.
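Next time it happens, a snapshot like this might show whether the memory is being held by the kernel rather than by any process (standard procfs/slabtop tooling, nothing Subspace-specific - treat it as a sketch):

# Where is the memory, from the kernel's point of view?
grep -E 'MemFree|MemAvailable|Slab|SReclaimable|SUnreclaim|VmallocUsed|HugePages_|Committed_AS' /proc/meminfo
# Largest slab caches (kernel-side allocations not attributed to any process):
sudo slabtop -o | head -n 20
# Sum of per-process RSS, for comparison with the totals above (KiB):
ps -eo rss= | awk '{s+=$1} END {print s " KiB total RSS"}'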

I concur that I also haven’t spotted anything actually using excessive memory.

Forgot to mention - the last box I got the tokio kill on is running kernel 5.15. So it’s not a bleeding-edge-kernels-only thing.

It is always possible that something in Subspace software is indirectly causing it. So far I have no clue what or why it could be though.

Happened again, same box as last time (5.15 kernel), but this time…

Jul 07 16:14:12 Huginn rtkit-daemon[2896]: The canary thread is apparently starving. Taking action.
Jul 07 16:14:42 Huginn kernel: systemd-timesyn invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Jul 07 16:14:42 Huginn kernel: CPU: 18 PID: 2380 Comm: systemd-timesyn Tainted: P           OE     5.15.0-113-generic #123-Ubuntu
Jul 07 16:14:42 Huginn kernel: Hardware name: ASUS System Product Name/PRIME Z490-A, BIOS 2801 10/27/2023
Jul 07 16:14:42 Huginn kernel: Call Trace:
Jul 07 16:14:42 Huginn kernel:  <TASK>
Jul 07 16:14:42 Huginn kernel:  show_stack+0x52/0x5c
Jul 07 16:14:42 Huginn kernel:  dump_stack_lvl+0x4a/0x63
Jul 07 16:14:42 Huginn kernel:  dump_stack+0x10/0x16
Jul 07 16:14:42 Huginn kernel:  dump_header+0x53/0x228
Jul 07 16:14:42 Huginn kernel:  oom_kill_process.cold+0xb/0x10
Jul 07 16:14:42 Huginn kernel:  out_of_memory+0x106/0x2e0

Subspace is not directly cited this time.

Some details on rtkit here.

https://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/rtkit/precise/view/head:/README#L39

Interesting:

WHY: If processes that have real-time scheduling privileges enter a busy loop they can freeze the entire system. To make sure such run-away processes cannot do this, RLIMIT_RTTIME has been introduced. Being a per-process limit it is however easily circumvented by combining a fork bomb with a busy loop.

rtkit is not the reason here either I think.

This should probably be reported to the distribution (Mint/Ubuntu), maybe even to the kernel. I’m not sure what we could possibly do to cause this weird behavior.

What was the last potentially good build of the farmer on each machine before it started happening? We can try to narrow it down by testing the builds between the last good and the first bad one.
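If we go that route, a plain git bisect over the farmer repository is probably the least painful way to walk the range, assuming the problem can be reproduced on demand. A rough sketch (check.sh here is hypothetical - any script that reliably detects the blow-up would do):

# Bisect the farmer repository between a known-good and a known-bad revision.
git bisect start
git bisect bad HEAD                   # first build known to trigger the OOMs
git bisect good <last-good-revision>  # e.g. the commit behind the jun-18 build
# check.sh (hypothetical): build the farmer, run it long enough to reproduce,
# and exit non-zero if the OOM killer fires; git bisect drives the search.
git bisect run ./check.sh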