Farmer jun-11 sometimes stops farming after replotting starts

Ubuntu 22.04, 12 farmers on 1 node. Some of them stopped farming after replotting started, and SSD reads stopped as well. I uploaded a rar file because the log is big since debug is enabled.
On a 2nd PC I monitored SSD usage: when replotting started it paused for 2-5 sec but resumed. That one is Windows, but only 12 TB.

node startup:

./subspace-node-ubuntu-x86_64-skylake-gemini-3h-2024-jun-1 run --farmer --chain gemini-3h --name reb0rnkuc2 --base-path /root/subnode --rpc-listen-on 192.168.5.12:9944 --rpc-cors all --rpc-methods unsafe --prometheus-listen-on 192.168.5.12:11111

farmer:

screen -S subspace -L -Logfile farmerfull.log ./subspace-farmer-ubuntu-x86_64-skylake-gemini-3h-2024-jun-11 farm --node-rpc-url ws://192.168.5.12:9944 --listen-on /ip4/0.0.0.0/tcp/32535  --replotting-cpu-cores 0-7,16-23 --plotting-cpu-cores 0-7,16-23 --reward-address st9PB769iY44Jno7So5NU1wSfE6MMSv3xULnu1Nn2NWAG1Cm3 path=/sam1/sub,size=7990GB path=/sam2/sub,size=7990GB path=/king1/sub,size=1995GB path=/king1/sub2,size=1995GB path=/sam3/sub,size=7990GB path=/sam4/sub,size=7990GB path=/sam5/sub,size=7990GB path=/sam6/sub,size=7990GB path=/sam7/sub,size=7990GB path=/sam8/sub,size=7990GB path=/sam9/sub,size=7990GB path=/sam10/sub,size=7990GB path=/sam11/sub,size=7990GB path=/sam12/sub,size=7990GB path=/sam13/sub,size=7990GB path=/sam14/sub,size=7990GB path=/sam15/sub,size=7990GB path=/sam16/sub,size=7990GB --metrics-endpoints 192.168.5.17:18585

It could be too many farmers on one node. Have you tried spinning up a second node and splitting them 6 and 6?

I had 20 when I was plotting and had no issue like this. It all started with the update to jun-11; both the farmers and the node were updated.

I do know there were changes to the node as well as the farmer. I just thought it might be worth a shot.

What about node logs though?

UPD: I have logs from another user for now.

The reason here is likely the same as for the other user: you were not running a jun-11 node, and since the encoding of pieces changed in jun-11, the RPC connection broke in an interesting way.

Reported upstream here: Failed response decoding breaks RPC connection · Issue #1409 · paritytech/jsonrpsee · GitHub

@reb0rn and @vexr I’d like you to collect node and farmer (only the controller in case of a cluster) logs with RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace,soketto=trace,sc_consensus_subspace::slot_worker=debug.
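
For example, prefix the existing commands above with the environment variable (shown here for the farmer; the node command works the same way, and the rest of the flags stay unchanged):

RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace,soketto=trace,sc_consensus_subspace::slot_worker=debug ./subspace-farmer-ubuntu-x86_64-skylake-gemini-3h-2024-jun-11 farm --node-rpc-url ws://192.168.5.12:9944 … (rest of the farmer flags as above)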

We should have another segment header relatively soon, so it would be great to collect the corresponding logs then.


Also, if any of you can arrange remote access to one of the farmers stuck in this way, I’d be able to collect low-level information there, but I understand that may not be a viable option for you.

Replotting started in the log somewhere between 2:55 and 3:00. This time on the big farm not all SSDs stopped farming, but about half did. It's kinda random: on the rest of the farms some stopped completely and some kept farming.

The node and farmer logs are at debug level.

The files are big; I'm heading to sleep and not sure how to cut them down easily/quickly.
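
For what it's worth, a generic way to cut a time window out of a large log is a timestamp range match, a sketch only, with the patterns adjusted to the actual times around the replot:

sed -n '/02:50/,/03:10/p' farmerfull.log > farmer-replot-window.log   # keeps everything from the first 02:50 line up to the next 03:10 line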


Hm… it looks like you just set the log level to debug for everything rather than what I asked for specifically above.
This way it doesn’t contain a lot of the critical information I was looking for, and it also contains a lot of unnecessary logs on the node that I don’t care about.

It did give me a bit of additional information, but not enough to understand what happened. I really need either the log level I requested (@vexr maybe you happen to have it?) or remote access to the machine (I suspect there is a deadlock somewhere, but I do not see it, and it is not possible to find it without access to the glitched farmer process).

I presumed debug covered everything; I will reset to the specified level, but that log will only be here in ~12h.

Trace is an even more detailed level than debug, and setting it only for specific components keeps the logs manageable, hence the version suggested above.
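
For comparison, in RUST_LOG filter syntax a bare level applies to everything, while target=level directives raise verbosity only for specific components (a generic sketch, not full commands):

RUST_LOG=debug   # everything at debug: huge logs, and still no trace-level detail where it matters
RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace,soketto=trace,sc_consensus_subspace::slot_worker=debug   # info by default, trace/debug only for the components of interest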

New log with trace on the node and farmer; replotting started at about 01:01 AM in the log.
Farming stopped on most drives.

I left one farmer in the “stuck” state; if access is needed, I can arrange it.

Those logs were extremely useful, thank you!

While I don’t know exactly why some subscriptions fail and others remain functional, I think Faster piece getting via RPC by nazar-pc · Pull Request #2854 · subspace/subspace · GitHub should help significantly. Here is a test build with it and other improvements on top of jun-11: Snapshot build · subspace/subspace@c03d6ab · GitHub

Please try the test build and let me know if it works any better for you. The same log level will be helpful in case it happens again.

Updated; will post results when replotting starts.

I have the requested logging set, but have not been able to replicate yet. I will update to the new snapshot.

Replotting started. 11 farm PCs are OK, but on that one farm with 17, 4 SSDs dropped out, so it definitely helped. I'm not sure whether it is related, but I think some farms were still replotting before this replot started… will upload logs later.

Replotting started at 22:42.

Okay, so it helped but didn’t fix it, got it.

In case it is interesting to anyone, here is what I believe is happening.

In pre-jun-11 releases we limited the number of piece requests we make once a new segment is archived to 10. In jun-11 we upgraded the RPC library with some improvements and removed that limit completely as it seemed no longer necessary. That apparently works fine on smaller setups, but still causes issues on larger ones.

The test build above improved the performance of piece decoding on the receiving side, so the farmer can clear the queue of received pieces more quickly and let subscription messages through. That helped, but the fact that we send ~0.5G of JSON concurrently still seems to cause subscription issues.

Here is another test build that restores the old 10-pieces-at-a-time limit in addition to the optimization I did earlier; as a result, this should be even better than the pre-jun-11 release :crossed_fingers:: Snapshot build · subspace/subspace@2f5fa47 · GitHub

Let’s see how it behaves at the next segment, but I believe we have narrowed the issue down much better now. Of course it would have been better if subscriptions didn’t break on the RPC server in the first place, but that is less trivial and needs to be addressed upstream.
