Cluster Farm plotter/farmer panic

It may be a pain to read. I use a lot of environment variables to make updating easy. This file is included in all of the scripts.

/mnt/sub/space/subspace.i:

NODEIP="192.168.4.21"
NATSIP="192.168.4.21"
NODERPC="9944"
CONTROLLERIP="192.168.4.21"
FARMPROMETHEUS="2222"
NODEEXEC="subspace-node-ubuntu-x86_64-skylake-gemini-3h-"
FARMEXEC="subspace-farmer-ubuntu-x86_64-skylake-gemini-3h-"
CURRVER="2024-jun-11"
DISKPARMS="record-chunks-mode=ConcurrentChunks"
REWARDADDRESS="blahblahblah"

/mnt/sub/space/startcontroller:

#!/bin/bash

. /mnt/sub/space/subspace.i

$PWD/$FARMEXEC$CURRVER cluster --nats-server nats://127.0.0.1:4222 \
controller \
--base-path /mnt/sub/space/controller \
--node-rpc-url ws://127.0.0.1:$NODERPC \
--listen-on /ip4/$CONTROLLERIP/tcp/30533 \
>> $PWD/controllerS-$CURRVER.log 2>&1

/mnt/sub/space/startcache:

. /mnt/sub/space/subspace.i

$PWD/$FARMEXEC$CURRVER cluster --nats-server nats://127.0.0.1:4222 \
cache \
path=$PWD/cache,size=200G

/mnt/sub/space/startplotter:

. /mnt/sub/space/subspace.i

./subspace-farmer cluster --nats-server nats://$NATSIP:4222 \
    plotter \
    --plotting-thread-pool-size 8 \
    >> $PWD/plotterS-$CURRVER.log 2>&1

Wait a minute. I just noticed that the last one, the plotter, is calling “subspace-farmer”. That must be the #344 build. Ugh.

Let me try fixing that.
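One way to keep a start script from silently picking up a different build is to derive the binary path from the shared variables and stop if that exact file is missing. This is only a minimal sketch: farmer_bin is a hypothetical helper name, and the commented usage lines simply reuse FARMEXEC, CURRVER and NATSIP from subspace.i above.

```bash
#!/bin/bash
# Sketch: build the farmer binary path from the shared variables and
# refuse to continue if that exact file is not executable.
# farmer_bin is a hypothetical helper, not part of Subspace.

farmer_bin() {
    local dir="$1" prefix="$2" version="$3"
    local bin="$dir/$prefix$version"
    if [ ! -x "$bin" ]; then
        echo "expected farmer build not found: $bin" >&2
        return 1
    fi
    echo "$bin"
}

# Possible use inside startplotter (assumes subspace.i has been sourced):
#   BIN=$(farmer_bin "$PWD" "$FARMEXEC" "$CURRVER") || exit 1
#   "$BIN" cluster --nats-server "nats://$NATSIP:4222" plotter ...
```

With a guard like this, a script either runs the build named by CURRVER or exits with a message, instead of falling back to whatever ./subspace-farmer happens to be.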

EDIT: Sigh, yeah, it looks like that was the issue. Sorry. Trying unusually named builds with my setup makes me prone to this sort of thing. I’ve solved it since by just renaming unusual builds to the current date as if they are a normal release, then letting my scripts run unaltered.
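The renaming workaround could look roughly like the following. It is a sketch under assumptions, not the exact commands used here: install_snapshot is a hypothetical helper, and the snapshot filename in the usage comment is just an example.

```bash
#!/bin/bash
# Sketch: copy a snapshot build into place under the release-style name,
# so start scripts that call "$FARMEXEC$CURRVER" pick it up unchanged.
# install_snapshot is a hypothetical helper, not part of Subspace.

install_snapshot() {
    local src="$1"       # path to the downloaded snapshot binary
    local dest_dir="$2"  # directory the start scripts run from
    local prefix="$3"    # e.g. the value of FARMEXEC
    local version="$4"   # e.g. the value of CURRVER

    if [ ! -f "$src" ]; then
        echo "snapshot binary not found: $src" >&2
        return 1
    fi

    local dest="$dest_dir/$prefix$version"
    # Print the final path only if both the copy and the chmod succeeded.
    cp -- "$src" "$dest" && chmod +x "$dest" && echo "$dest"
}

# Possible use (assumes subspace.i has been sourced and CURRVER updated):
#   install_snapshot ./subspace-farmer /mnt/sub/space "$FARMEXEC" "$CURRVER"
```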

Right, that is why I asked for full commands :slightly_smiling_face:
One more mystery resolved then :tada:

I should have been more careful and made sure it either works or doesn’t, rather than crashing that way, but I was lazy.

Yeah, good call on having me go over those. Hopefully with my new method for how I run snapshot builds it shouldn’t happen again.

One very nice improvement in this jun-11 version is I’m not getting occasional timeout errors in the plotter anymore, at least so far. Those were quite annoying.

The test build mentioned above fixes more of the causes of timeouts, though they are definitely rarer now. Appreciate all the testing, it helps to make the software better!

Quite welcome.

Off-topic tip: when running farmer-executable cluster plotter --help, the description of the plotting-cpu-cores option mentions a requirement to set --replotting-cpu-cores a certain way, but that option doesn’t appear to exist in cluster plotting (which makes sense).

I’m really impressed at how well jun-11 cluster is working. Very smooth, no errors, good performance and not heavily impacting other processes. Nice work.

(That said, I still haven’t enabled farming or plotting across the LAN. But locally, working great. I’ll start that when all my local disks are fully plotted and replotted. That may take a few days.)

I have the same problem on 2 of my 3 plotters.

How can I use this fix (snapshot build), given that I am using a Portainer / Docker Stacks config?

My current plotter config:

version: "3.8"
services:
  ss_plot_zen:
    image: Package farmer · GitHub
    command:
      [
        "cluster",
        "--nats-server", "nats_edge:4222",
        "plotter"
      ]
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == sub-zen
    environment:
      - TZ=Europe/Finland
    labels:
      js-subspace-plotter.name: "Subspace Plotter - Zen"
    networks:
      subspace_nwk:

Copy-paste typos: Fix cluster plotter docs by nazar-pc · Pull Request #2853 · subspace/subspace · GitHub

As mentioned in the solution above, you need to make sure to run the same release of the node and all farmer components. If you have mismatches, you may get these issues. Going forward we’ll try to avoid them though.
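A quick sanity check for matching releases might look like this. It is a hedged sketch only: check_versions and its name=version argument format are made up for illustration, and the version tags are whatever you record for each component yourself (nothing here queries the binaries).

```bash
#!/bin/bash
# Sketch: compare the version tag recorded for each cluster component
# and report any that differ from the first one given.
# check_versions and the name=version format are hypothetical.

check_versions() {
    local expected="" pair name version status=0
    for pair in "$@"; do
        name="${pair%%=*}"     # text before the first "="
        version="${pair#*=}"   # text after the first "="
        if [ -z "$expected" ]; then
            expected="$version"
        elif [ "$version" != "$expected" ]; then
            echo "mismatch: $name is $version, expected $expected" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible use:
#   check_versions node=2024-jun-11 controller=2024-jun-11 \
#       cache=2024-jun-11 plotter=2024-jun-11 farmer=2024-jun-11
```

The function returns non-zero if any component differs, so it can gate a start script.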

Server 1 is a Dell Precision 5820 (Intel® Xeon® Processor W-2000 Family) with DDR4 ECC memory.
Everything works fine on this PC (Ubuntu 24.04 LTS as a Proxmox VM): cache, controller, farmer, node and plotter.
I am using gemini-3h-2024-jun-11.
The problem is that this server has a pretty weak CPU compared to the other servers, but it has the most SSD capacity.

Server 2 is a Dell PowerEdge R720 (Xeon with DDR3 ECC).
None of the official builds work on it; farmer, cache, controller and plotter all crash on startup (there is no node on this server).
My own custom build does work on it, except that the plotter has this ‘Invalid Scalar’ error.

Server 3 is a Threadripper 1950X with DDR4 ECC memory. It runs only a plotter and a farmer (the controller and cache are used from Server 2 and the node from Server 1). The plotter has this ‘Invalid Scalar’ issue.

It is a bit of a challenge to use exactly the same versions on all servers, but I can continue testing. Otherwise I’ll put Servers 2 and 3 on hold.

When upgrading to jun-11 you have to use the same version everywhere, or you will definitely have issues.