Farmer stops signing rewards once it starts plotting

Hello, I have several farmers. Most work well, but some don’t, and the reasons aren’t obvious; it’s not simply a matter of one farmer having way more CPU, memory, bandwidth, or plots than another.

The system that stops signing rewards once it starts plotting has 16x 8 TB NVMe SSDs, 2x 64-core EPYC Milan CPUs, and 1 TB of DDR4-3200.

It signs rewards right after the farmers restart, but once plotting starts, no more rewards. I built the Sept 11 source myself and use the same binary on machines where everything works as expected. This runs CentOS 9 with 1 Gbps fiber internet. Judging by the firewall (FW) screenshot, overall internet load doesn’t seem to be the bottleneck for anything.

In the multitail output you see the stdout of three different farmers, because on the same machine I run 1 node and 3 farmers (on different ports). Each farmer has 5-6 plots of 7,500 GB each.

You can also see how “little” memory this actually uses. Plus, across the 16 NVMe drives, total reads are only around 400 MB/s, so I/O isn’t the obvious bottleneck either.
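For what it’s worth, that aggregate read figure can be sanity-checked without extra tools by sampling the kernel’s `/proc/diskstats` counters twice; this is a rough sketch, and the `nvme` device-name prefix is an assumption about how the drives enumerate on this box:

```shell
# Estimate aggregate NVMe read throughput from /proc/diskstats.
# Field 3 is the device name, field 6 is sectors read (512 bytes each).
r1=$(awk '$3 ~ /^nvme/ {s += $6} END {print s + 0}' /proc/diskstats)
sleep 5
r2=$(awk '$3 ~ /^nvme/ {s += $6} END {print s + 0}' /proc/diskstats)
echo "aggregate NVMe reads: $(( (r2 - r1) * 512 / 5 / 1024 / 1024 )) MB/s"
```

`iostat -xm 5` from the sysstat package gives the same numbers per device, which helps spot a single saturated drive hiding behind a low-looking aggregate.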

So I changed the farmer to farm only a single 1 TB plot; still, once it’s plotting, no more signing of rewards.

I also noticed that a single farmer process with one plot creates ~200 threads?! This is on a 24-core CPU; see the example below.

```
[root@node-12 ~]# ps -T -p $(pidof subspace-farmer) | wc -l
201
```

A 24-core CPU gives 48 hyperthreads, and there are:

48x tokio-runtime-w
48x cache-worker
48x farming-0.n
48x plotting-0

I don’t think this should simply scale up with the number of CPU cores.
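The per-name counts above can be generated directly rather than tallied by hand; a small sketch, assuming a single `subspace-farmer` process is running (the `sed` strips numeric suffixes like `farming-0.17` so those threads group together):

```shell
# Count a process's threads grouped by thread name, most common first.
pid="$(pidof subspace-farmer)"
ps -T -p "$pid" -o comm= \
  | sed 's/\.[0-9]*$//' \
  | sort | uniq -c | sort -rn
```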

PID SPID TTY TIME CMD
237408 237408 pts/3 00:00:05 subspace-farmer
237408 237409 pts/3 00:00:04 tokio-runtime-w
237408 237410 pts/3 00:00:01 tokio-runtime-w
237408 237411 pts/3 00:00:01 tokio-runtime-w
237408 237412 pts/3 00:00:01 tokio-runtime-w
237408 237413 pts/3 00:00:01 tokio-runtime-w
237408 237414 pts/3 00:00:01 tokio-runtime-w
237408 237415 pts/3 00:00:02 tokio-runtime-w
237408 237416 pts/3 00:00:01 tokio-runtime-w
237408 237417 pts/3 00:00:02 tokio-runtime-w
237408 237418 pts/3 00:00:02 tokio-runtime-w
237408 237419 pts/3 00:00:01 tokio-runtime-w
237408 237420 pts/3 00:00:03 tokio-runtime-w
237408 237421 pts/3 00:00:04 tokio-runtime-w
237408 237422 pts/3 00:00:04 tokio-runtime-w
237408 237423 pts/3 00:00:01 tokio-runtime-w
237408 237424 pts/3 00:00:01 tokio-runtime-w
237408 237425 pts/3 00:00:02 tokio-runtime-w
237408 237426 pts/3 00:00:02 tokio-runtime-w
237408 237427 pts/3 00:00:01 tokio-runtime-w
237408 237428 pts/3 00:00:01 tokio-runtime-w
237408 237429 pts/3 00:00:01 tokio-runtime-w
237408 237430 pts/3 00:00:01 tokio-runtime-w
237408 237431 pts/3 00:00:01 tokio-runtime-w
237408 237432 pts/3 00:00:01 tokio-runtime-w
237408 237433 pts/3 00:00:01 tokio-runtime-w
237408 237434 pts/3 00:00:02 tokio-runtime-w
237408 237435 pts/3 00:00:03 tokio-runtime-w
237408 237436 pts/3 00:00:02 tokio-runtime-w
237408 237437 pts/3 00:00:01 tokio-runtime-w
237408 237438 pts/3 00:00:01 tokio-runtime-w
237408 237439 pts/3 00:00:01 tokio-runtime-w
237408 237440 pts/3 00:00:03 tokio-runtime-w
237408 237441 pts/3 00:00:04 tokio-runtime-w
237408 237442 pts/3 00:00:04 tokio-runtime-w
237408 237443 pts/3 00:00:01 tokio-runtime-w
237408 237444 pts/3 00:00:01 tokio-runtime-w
237408 237445 pts/3 00:00:01 tokio-runtime-w
237408 237446 pts/3 00:00:01 tokio-runtime-w
237408 237447 pts/3 00:00:01 tokio-runtime-w
237408 237448 pts/3 00:00:03 tokio-runtime-w
237408 237449 pts/3 00:00:02 tokio-runtime-w
237408 237450 pts/3 00:00:01 tokio-runtime-w
237408 237451 pts/3 00:00:01 tokio-runtime-w
237408 237452 pts/3 00:00:01 tokio-runtime-w
237408 237453 pts/3 00:00:01 tokio-runtime-w
237408 237454 pts/3 00:00:01 tokio-runtime-w
237408 237455 pts/3 00:00:01 tokio-runtime-w
237408 237456 pts/3 00:00:02 tokio-runtime-w
237408 237457 pts/3 00:00:02 futures-timer
237408 237459 pts/3 00:00:08 cache-worker
237408 237460 pts/3 00:17:32 plotting-0
237408 237461 pts/3 00:00:00 farming-0
237408 237462 pts/3 00:00:22 reading-0
237408 237463 pts/3 00:05:13 cache-worker
237408 237464 pts/3 00:05:02 cache-worker
237408 237465 pts/3 00:04:52 cache-worker
237408 237466 pts/3 00:04:43 cache-worker
237408 237467 pts/3 00:04:38 cache-worker
237408 237468 pts/3 00:04:25 cache-worker
237408 237469 pts/3 00:04:25 cache-worker
237408 237470 pts/3 00:04:19 cache-worker
237408 237471 pts/3 00:04:18 cache-worker
237408 237472 pts/3 00:04:16 cache-worker
237408 237473 pts/3 00:04:16 cache-worker
237408 237474 pts/3 00:04:11 cache-worker
237408 237475 pts/3 00:04:12 cache-worker
237408 237476 pts/3 00:04:07 cache-worker
237408 237477 pts/3 00:04:06 cache-worker
237408 237478 pts/3 00:04:07 cache-worker
237408 237479 pts/3 00:04:02 cache-worker
237408 237480 pts/3 00:04:01 cache-worker
237408 237481 pts/3 00:03:59 cache-worker
237408 237482 pts/3 00:03:57 cache-worker
237408 237483 pts/3 00:03:56 cache-worker
237408 237484 pts/3 00:03:54 cache-worker
237408 237485 pts/3 00:03:53 cache-worker
237408 237486 pts/3 00:03:51 cache-worker
237408 237487 pts/3 00:03:50 cache-worker
237408 237488 pts/3 00:03:46 cache-worker
237408 237489 pts/3 00:03:46 cache-worker
237408 237490 pts/3 00:03:45 cache-worker
237408 237491 pts/3 00:03:42 cache-worker
237408 237492 pts/3 00:03:40 cache-worker
237408 237493 pts/3 00:03:41 cache-worker
237408 237494 pts/3 00:03:38 cache-worker
237408 237495 pts/3 00:03:37 cache-worker
237408 237496 pts/3 00:03:35 cache-worker
237408 237497 pts/3 00:03:34 cache-worker
237408 237498 pts/3 00:03:32 cache-worker
237408 237499 pts/3 00:03:28 cache-worker
237408 237500 pts/3 00:03:29 cache-worker
237408 237501 pts/3 00:03:27 cache-worker
237408 237502 pts/3 00:03:24 cache-worker
237408 237503 pts/3 00:03:25 cache-worker
237408 237504 pts/3 00:03:22 cache-worker
237408 237505 pts/3 00:03:21 cache-worker
237408 237506 pts/3 00:03:16 cache-worker
237408 237507 pts/3 00:03:17 cache-worker
237408 237508 pts/3 00:03:16 cache-worker
237408 237509 pts/3 00:03:13 cache-worker
237408 237510 pts/3 00:03:10 cache-worker
237408 237511 pts/3 00:00:00 farmer-farm
237408 237512 pts/3 00:01:33 farmer-networki
237408 237513 pts/3 00:00:26 farming-0.0
237408 237514 pts/3 00:00:25 farming-0.1
237408 237515 pts/3 00:00:25 farming-0.2
237408 237516 pts/3 00:00:25 farming-0.3
237408 237517 pts/3 00:00:25 farming-0.4
237408 237518 pts/3 00:00:25 farming-0.5
237408 237519 pts/3 00:00:25 farming-0.6
237408 237520 pts/3 00:00:25 farming-0.7
237408 237521 pts/3 00:00:25 farming-0.8
237408 237522 pts/3 00:00:25 farming-0.9
237408 237523 pts/3 00:00:25 farming-0.10
237408 237524 pts/3 00:00:25 farming-0.11
237408 237525 pts/3 00:00:25 farming-0.12
237408 237526 pts/3 00:00:24 farming-0.13
237408 237527 pts/3 00:00:24 farming-0.14
237408 237528 pts/3 00:00:24 farming-0.15
237408 237529 pts/3 00:00:25 farming-0.16
237408 237530 pts/3 00:00:25 farming-0.17
237408 237531 pts/3 00:00:24 farming-0.18
237408 237532 pts/3 00:00:24 farming-0.19
237408 237533 pts/3 00:00:24 farming-0.20
237408 237534 pts/3 00:00:24 farming-0.21
237408 237535 pts/3 00:00:24 farming-0.22
237408 237536 pts/3 00:00:24 farming-0.23
237408 237537 pts/3 00:00:24 farming-0.24
237408 237538 pts/3 00:00:24 farming-0.25
237408 237539 pts/3 00:00:24 farming-0.26
237408 237540 pts/3 00:00:24 farming-0.27
237408 237541 pts/3 00:00:24 farming-0.28
237408 237542 pts/3 00:00:24 farming-0.29
237408 237543 pts/3 00:00:24 farming-0.30
237408 237544 pts/3 00:00:24 farming-0.31
237408 237545 pts/3 00:00:24 farming-0.32
237408 237546 pts/3 00:00:24 farming-0.33
237408 237547 pts/3 00:00:24 farming-0.34
237408 237548 pts/3 00:00:24 farming-0.35
237408 237549 pts/3 00:00:24 farming-0.36
237408 237550 pts/3 00:00:24 farming-0.37
237408 237551 pts/3 00:00:24 farming-0.38
237408 237552 pts/3 00:00:24 farming-0.39
237408 237553 pts/3 00:00:24 farming-0.40
237408 237554 pts/3 00:00:24 farming-0.41
237408 237555 pts/3 00:00:24 farming-0.42
237408 237556 pts/3 00:00:23 farming-0.43
237408 237557 pts/3 00:00:23 farming-0.44
237408 237558 pts/3 00:00:23 farming-0.45
237408 237559 pts/3 00:00:23 farming-0.46
237408 237560 pts/3 00:00:23 farming-0.47
237408 237849 pts/3 00:00:22 plotting-0
237408 237850 pts/3 00:00:26 plotting-0
237408 237851 pts/3 00:00:26 plotting-0
237408 237852 pts/3 00:00:25 plotting-0
237408 237853 pts/3 00:00:23 plotting-0
237408 237854 pts/3 00:00:24 plotting-0
237408 237855 pts/3 00:00:24 plotting-0
237408 237856 pts/3 00:00:26 plotting-0
237408 237857 pts/3 00:00:24 plotting-0
237408 237858 pts/3 00:00:26 plotting-0
237408 237859 pts/3 00:00:27 plotting-0
237408 237860 pts/3 00:00:26 plotting-0
237408 237861 pts/3 00:00:23 plotting-0
237408 237862 pts/3 00:00:28 plotting-0
237408 237863 pts/3 00:00:26 plotting-0
237408 237864 pts/3 00:00:27 plotting-0
237408 237865 pts/3 00:00:25 plotting-0
237408 237866 pts/3 00:00:24 plotting-0
237408 237867 pts/3 00:00:24 plotting-0
237408 237868 pts/3 00:00:23 plotting-0
237408 237869 pts/3 00:00:26 plotting-0
237408 237870 pts/3 00:00:28 plotting-0
237408 237871 pts/3 00:00:23 plotting-0
237408 237872 pts/3 00:00:25 plotting-0
237408 237873 pts/3 00:00:26 plotting-0
237408 237874 pts/3 00:00:26 plotting-0
237408 237875 pts/3 00:00:25 plotting-0
237408 237876 pts/3 00:00:25 plotting-0
237408 237877 pts/3 00:00:27 plotting-0
237408 237878 pts/3 00:00:26 plotting-0
237408 237879 pts/3 00:00:27 plotting-0
237408 237880 pts/3 00:00:26 plotting-0
237408 237881 pts/3 00:00:25 plotting-0
237408 237882 pts/3 00:00:26 plotting-0
237408 237883 pts/3 00:00:21 plotting-0
237408 237884 pts/3 00:00:28 plotting-0
237408 237885 pts/3 00:00:25 plotting-0
237408 237886 pts/3 00:00:26 plotting-0
237408 237887 pts/3 00:00:25 plotting-0
237408 237888 pts/3 00:00:25 plotting-0
237408 237889 pts/3 00:00:24 plotting-0
237408 237890 pts/3 00:00:25 plotting-0
237408 237891 pts/3 00:00:25 plotting-0
237408 237892 pts/3 00:00:27 plotting-0
237408 237893 pts/3 00:00:25 plotting-0
237408 237894 pts/3 00:00:26 plotting-0
237408 237895 pts/3 00:00:27 plotting-0
237408 237896 pts/3 00:00:26 plotting-0
237408 280397 pts/3 00:00:00 plotting-0

My 2x 64-core system currently shows:

```
[root@dnode-4 ~]# ps -T -p $(pidof subspace-farmer) | wc -l
6719
```

Over 6,700 threads going…
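With thread counts in the thousands, it may be worth comparing against the kernel’s generic limits; these are standard Linux knobs, not anything farmer-specific:

```shell
cat /proc/sys/kernel/threads-max   # system-wide cap on total threads
ulimit -u                          # per-user process/thread limit for this shell
cat /proc/sys/kernel/pid_max       # PID space available for new threads
```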

I changed `const PARALLELISM_LEVEL: usize` from 20 down to 4; now I don’t see any plotting anymore, only voting. Going to set it to 256 and try again.

OK, it didn’t like the value 256… Going to try 64.

```
2023-09-14T02:42:49.477102Z INFO single_disk_farm{disk_farm_index=5}: subspace_farmer_components::segment_reconstruction: Recovering missing piece… missing_piece_index=7257
thread 'plotting-5' panicked at 'called Result::unwrap() on an Err value: "A list with power-of-two length expected"', /home/srv_subspace/.cargo/git/checkouts/rust-kzg-f871e5172d08c353/1058cc8/blst/src/recovery.rs:106:79
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
thread 'plotting-3' panicked at 'called Result::unwrap() on an Err value: "A list with power-of-two length expected"', /home/srv_subspace/.cargo/git/checkouts/rust-kzg-f871e5172d08c353/1058cc8/blst/src/recovery.rs:106:79
thread 'plotting-4' panicked at 'called Result::unwrap() on an Err value: "A list with power-of-two length expected"', /home/srv_subspace/.cargo/git/checkouts/rust-kzg-f871e5172d08c353/1058cc8/blst/src/recovery.rs:106:79
thread 'plotting-2' panicked at 'called Result::unwrap() on an Err value: "A list with power-of-two length expected"', /home/srv_subspace/.cargo/git/checkouts/rust-kzg-f871e5172d08c353/1058cc8/blst/src/recovery.rs:106:79
2023-09-14T02:46:08.649380Z INFO subspace_networking::node_runner: Added observed address as external: /ip4/192.168.1.1/tcp/31105
thread 'plotting-1' panicked at 'called Result::unwrap() on an Err value: "A list with power-of-two length expected"', /home/srv_subspace/.cargo/git/checkouts/rust-kzg-f871e5172d08c353/1058cc8/blst/src/recovery.rs:106:79
thread 'plotting-0' panicked at 'called Result::unwrap() on an Err value: "A list with power-of-two length expected"', /home/srv_subspace/.cargo/git/checkouts/rust-kzg-f871e5172d08c353/1058cc8/blst/src/recovery.rs:106:79
2023-09-14T02:47:50.302746Z INFO subspace_farmer::utils: Received SIGTERM, shutting down farmer…
thread 'farmer-farm' panicked at 'Panic if background thread panicked: Any { … }', crates/subspace-farmer/src/utils.rs:58:14
thread 'farmer-farm' panicked at 'future still here when dropping', /home/srv_subspace/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/stream/futures_unordered/abort.rs:11:5
stack backtrace:
   0: 0x5562571cc2fa - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h2145a6517487203d
   1: 0x556256df779c - core::fmt::write::he72229d1ad2a7226
   2: 0x55625719864d - std::io::Write::write_fmt::h41d1b94b06fd62c5
   3: 0x5562571cdf7f - std::sys_common::backtrace::print::h0c2a716a3c511332
   4: 0x5562571cd3be - std::panicking::panic_hook_with_disk_dump::{{closure}}::h424532024cf7d6dd
   5: 0x5562571ce6a6 - std::panicking::rust_panic_with_hook::h13f72c491a2dea91
   6: 0x5562571ce132 - std::panicking::begin_panic_handler::{{closure}}::hc0e76e3d6836e9e4
   7: 0x5562571ce096 - std::sys_common::backtrace::__rust_end_short_backtrace::hc8fec2d67f0244d7
   8: 0x5562571ce081 - rust_begin_unwind
   9: 0x556256a326f2 - core::panicking::panic_fmt::h9299ac304fd0176e
  10: 0x556256e459f6 - futures_util::stream::futures_unordered::abort::abort::he3fbc29066ba1f49
  11: 0x556256ce6ed9 - alloc::sync::Arc<T,A>::drop_slow::h4c1ff116113fda1d
  12: 0x556256c9f0b1 - core::ptr::drop_in_place<futures_util::stream::futures_unordered::ready_to_run_queue::ReadyToRunQueue<subspace_farmer::single_disk_farm::SingleDiskFarm::run::{{closure}}>>::h85f4c6a4867be0c2
  13: 0x556256ce7d6d - alloc::sync::Arc<T,A>::drop_slow::h81c620fde274d9a4
  14: 0x556256c9b0f9 - core::ptr::drop_in_place<futures_util::stream::futures_unordered::FuturesUnordered<subspace_farmer::single_disk_farm::SingleDiskFarm::run::{{closure}}>>::h9b4c05428ba3056c
  15: 0x556256c72e5d - std::sys_common::backtrace::__rust_begin_short_backtrace::ha2098a56a645fd4b
  16: 0x556256c811bc - core::ops::function::FnOnce::call_once{{vtable.shim}}::hbf4f9fa3fb6ce87e
  17: 0x5562571cf9f5 - std::sys::unix::thread::thread::new::thread_start::he8402b1f873c60a0
  18: 0x7f9ec929f802 - start_thread
  19: 0x7f9ec923f450 - __clone3
  20: 0x0 -
fatal runtime error: failed to initiate panic, error 3
./farmer.sh: line 30: 1750212 Aborted (core dumped) $cmd
```

You should use the latest release, which has fixes for farming during plotting, and you should configure one farm per physical disk, as the documentation and numerous forum and Discord threads suggest. It looks like you’re not following those recommendations right now.

Thanks for your reply. There is only one farm per disk on dnode-4d. The machine has 16 drives, and while searching for reasons why it isn’t working I split the 16 drives across three farmers, where each farmer still has 5-6 farms on dedicated drives.

Single farmer or multiple farmers shouldn’t really make a difference. Farmers run most of the stuff related to different farms in dedicated threads.

More threads are created than necessary (one per core for each farm); this will be resolved in the future in Make plot reads, at least for auditing, async · Issue #1946 · subspace/subspace · GitHub, which you can subscribe to for updates.

Plotting in general is an expensive process and may impact audits, so not much can be done there.

On this specific system, a single farming process cannot come close to exhausting the system’s resources, while three concurrent farmer processes get much closer. By observation, three farmer processes plotting on 5-6 drives each plot faster than a single process with 16 drives.

But that is not the issue: on this particular system, no matter the configuration, one farmer or three, as soon as plotting starts there is no more voting and no rewards.

Right now I’m trying to see if 13-2 will make a difference.

A single farmer should actually be better because it will have a bigger cache of unique pieces, so plotting will accelerate once the cache is populated. With separate farmers you will have disjoint caches, which will potentially make things a bit slower.

You can run things however you want, but from an implementation point of view it makes little sense to me.

I’d really like to stick to the original issue. One farmer or three, once the farmer starts plotting it no longer votes/signs. I changed from one farmer to three as a test to see if that makes a difference, but it doesn’t. While doing so I observed a couple more things.

I can easily change back to one farmer with 16 drives, and it still doesn’t vote/sign once it’s plotting.

What can I do to find out why it’s not voting/signing while plotting?

Ah, here’s something I forgot: on the 16-drive machine I cannot run 16x 7,500 GB plots; I get this error message:

```
Directory: /sub/plots/disk09/
Error: I/O error: Cannot allocate memory (os error 12)

Caused by:
    Cannot allocate memory (os error 12)
[srv_subspace@dnode-4 subspace]$
```

I’m not sure why that is; the system has 1 TB of RAM, and the process monitor doesn’t show the process running out of memory.
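One guess worth ruling out: on Linux, “Cannot allocate memory” (ENOMEM) from mmap can occur even with plenty of free RAM if a process exceeds the kernel’s `vm.max_map_count` limit on memory mappings, which many large memory-mapped plots could plausibly do. This is a hypothesis, not a confirmed cause:

```shell
# Per-process limit on memory mappings (the default is often 65530).
cat /proc/sys/vm/max_map_count
# Mappings the farmer currently holds (hypothetical check while it runs):
# wc -l "/proc/$(pidof subspace-farmer)/maps"
# If the farmer sits near the limit, raising it may help:
# sysctl -w vm.max_map_count=1048576
```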

That’s why I had originally split the 16 drives among 3 farmer processes.

I guess you’ll have to wait for plotting to finish. That many plots is quite heavy on the CPU and will impact farming. The next iteration of the protocol (likely Gemini 3g) will have a prediction window that will make this less likely.

This has already been discussed, but sadly on Discord, so I can’t link it here directly. Basically, there is some kind of limit that you reach at ~70 TiB of total space allocated for farming. This should also go away with Make plot reads, at least for auditing, async · Issue #1946 · subspace/subspace · GitHub in the next iteration of the protocol.

So I changed that system to run 2 farmer processes, each with 8x 8 TB drives and 8 farms. The split is needed because of the 70 TiB limit.

From previous log files I know each farmer had plotted about 14% of the 7,500 GB plot size. For each farmer I shrank 7 of the 8 plots down to 1 TB, which should make them fully plotted, and left one plot at the original 7,500 GB size.

What happened overnight: each 7,500 GB plot continues to plot, but the 1 TB plots on both farmers are all re-plotting, and there is no voting/signing.

So something seems to be going on with plotting and re-plotting that blocks the voting/signing?

If you see a lot of replotting, it means sectors were plotted a while ago and are likely no longer eligible for rewards until they are replotted. Also check CPU usage: if the machine is overwhelmed, there will not be enough CPU power left for farming until plotting/replotting is over.

The system is loaded, but I would not call it overloaded: 2x 64 cores and 1 TB of RAM.

It is still re-plotting; it feels as if it had to catch up on re-plotting. And the re-plotted sectors aren’t immediately available for farming.

88.4% CPU load can mean overloaded, and there are other bottlenecks as well, like memory bandwidth. So I’d at least wait for CPU usage to calm down. Plotting is designed to use a reasonable amount of RAM and writes to disk sequentially, but it is otherwise CPU-bound (unless piece retrieval is slow, in which case it’ll be network-bound).

I am not at all worried about 84% CPU load. But some things do come into play on this dual-EPYC system; maybe it would benefit from running on one NUMA node only. I will try that out overnight.

```
numactl --cpunodebind=0 --membind=0 farmer-bin1…
numactl --cpunodebind=1 --membind=1 farmer-bin2…
```

That will limit the amount of data moving between NUMA nodes (sockets).
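Before pinning, it’s worth confirming how many NUMA nodes the BIOS actually exposes (relevant for the NPS settings on EPYC); something like:

```shell
# NUMA node count and which CPUs belong to each node:
lscpu | grep -i '^NUMA'
# Nodes the kernel sees, straight from sysfs:
ls -d /sys/devices/system/node/node* 2>/dev/null
# numactl --hardware additionally shows free memory per node
```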

Yeah, that would be a reasonable tweak. Depending on the BIOS and the EPYC generation, even a single socket can have a NUMA configuration.

I think that’s the challenge with “chiplet” CPUs. Running one farmer per socket has improved the situation: each process is now plotting, re-plotting, and signing. If this really is a memory-bandwidth-bound problem, I should try setting NPS4. I may try that, but not now; I’ll keep watching it.

That would keep data on fewer memory controllers, instead of it traveling through the memory/IO die across many of them.