Farmer jun-11 sometimes stops farming after replotting starts

Ubuntu 22.04, 12 farmers on 1 node. Some of them stopped farming after replotting started, and SSD reads stopped as well. I uploaded a rar file because the log is big since debug is enabled.
On a 2nd PC I monitored SSD usage: when replotting started it paused for 2-5 sec but resumed. That one is Windows, but only 12 TB.

node startup:

./subspace-node-ubuntu-x86_64-skylake-gemini-3h-2024-jun-1 run --farmer --chain gemini-3h --name reb0rnkuc2 --base-path /root/subnode --rpc-listen-on 192.168.5.12:9944 --rpc-cors all --rpc-methods unsafe --prometheus-listen-on 192.168.5.12:11111

farmer:

screen -S subspace -L -Logfile farmerfull.log ./subspace-farmer-ubuntu-x86_64-skylake-gemini-3h-2024-jun-11 farm --node-rpc-url ws://192.168.5.12:9944 --listen-on /ip4/0.0.0.0/tcp/32535  --replotting-cpu-cores 0-7,16-23 --plotting-cpu-cores 0-7,16-23 --reward-address st9PB769iY44Jno7So5NU1wSfE6MMSv3xULnu1Nn2NWAG1Cm3 path=/sam1/sub,size=7990GB path=/sam2/sub,size=7990GB path=/king1/sub,size=1995GB path=/king1/sub2,size=1995GB path=/sam3/sub,size=7990GB path=/sam4/sub,size=7990GB path=/sam5/sub,size=7990GB path=/sam6/sub,size=7990GB path=/sam7/sub,size=7990GB path=/sam8/sub,size=7990GB path=/sam9/sub,size=7990GB path=/sam10/sub,size=7990GB path=/sam11/sub,size=7990GB path=/sam12/sub,size=7990GB path=/sam13/sub,size=7990GB path=/sam14/sub,size=7990GB path=/sam15/sub,size=7990GB path=/sam16/sub,size=7990GB --metrics-endpoints 192.168.5.17:18585

It could be too many farmers on one node. Have you tried spinning up a second node and splitting them 6 and 6?

I had 20 when I was plotting and had no issue like this. It all started with the update to jun-11; both the farmers and the node were updated.

I do know there were changes to the node as well as the farmer. I just thought it might be worth a shot.

What about node logs though?

UPD: I have logs from another user for now.

The reason here is likely the same as for the other user: you were not running a jun-11 node, and since the encoding of pieces changed in jun-11, the RPC connection broke in an interesting way.

Reported upstream here: Failed response decoding breaks RPC connection · Issue #1409 · paritytech/jsonrpsee · GitHub

@reb0rn and @vexr I’d like you to collect node and farmer (only the controller in case of a cluster) logs with RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace,soketto=trace,sc_consensus_subspace::slot_worker=debug.
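
For example, prefix the existing commands above with the environment variable (shown here for the farmer; the node command works the same way, and the rest of the flags stay unchanged):

RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace,soketto=trace,sc_consensus_subspace::slot_worker=debug ./subspace-farmer-ubuntu-x86_64-skylake-gemini-3h-2024-jun-11 farm --node-rpc-url ws://192.168.5.12:9944 … (rest of the farmer flags as above)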

We should have another segment header relatively soon, so it would be great to collect the corresponding logs then.


Also, if any of you can arrange remote access to one of the farmers stuck in this way, I’d be able to collect low-level information there, but I understand that may not be a viable option for you.

Replotting started in the log somewhere between 2:55 and 3:00. This time on the big farm not all SSDs stopped farming, but about half did. It's kinda random: on the rest of the farms some stopped completely and some kept farming.

The node and farmer logs are at debug level.

The files are big; I'm heading to sleep and not sure how to cut them down easily/quickly.
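
For what it's worth, a generic way to cut a time window out of a large log is a timestamp range match, a sketch only, with the patterns adjusted to the actual times around the replot:

sed -n '/02:50/,/03:10/p' farmerfull.log > farmer-replot-window.log   # keeps everything from the first 02:50 line up to the next 03:10 line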


Hm… it looks like you just set the log level to debug for everything rather than what I asked for specifically above.
This way it doesn’t contain a lot of the critical information I was looking for, and it also contains a lot of unnecessary logs on the node that I don’t care about.

It did give me a bit of additional information, but not enough to understand what happened. I really need either the log level I requested (@vexr maybe you happen to have it?) or remote access to the machine (I suspect there is a deadlock somewhere, but I do not see it, and it is not possible to find it without access to the glitched farmer process).

I presumed debug covered everything; I will reset to the specified level, but that log will only be here in ~12h.

Trace is an even more detailed level than debug, and setting it only for specific components keeps the logs manageable, hence the version suggested above.
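
For comparison, in RUST_LOG filter syntax a bare level applies to everything, while target=level directives raise verbosity only for specific components (a generic sketch, not full commands):

RUST_LOG=debug   # everything at debug: huge logs, and still no trace-level detail where it matters
RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace,soketto=trace,sc_consensus_subspace::slot_worker=debug   # info by default, trace/debug only for the components of interest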

New log with trace on the node and farmer; replotting started at about 01:01 AM in the log.
Farming stopped on most drives.

I left one farmer in the “stuck” state; if access is needed, I can arrange it.

Those logs were extremely useful, thank you!

While I don’t know exactly why some subscriptions fail and others remain functional, I think Faster piece getting via RPC by nazar-pc · Pull Request #2854 · subspace/subspace · GitHub should help significantly. Here is a test build with it and other improvements on top of jun-11: Snapshot build · subspace/subspace@c03d6ab · GitHub

Please try the test build and let me know if it works any better for you. The same log level will be helpful in case it happens again.

Updated; will post results when replotting starts.

I have the requested logging set, but have not been able to replicate yet. I will update to the new snapshot.

Replotting started. 11 farm PCs are OK, but on that one farm with 17, 4 SSDs dropped out, so it definitely helped. I'm not sure whether it is related, but I think some farms were still replotting before this replot started… will upload logs later.

Replotting started at 22:42.

Okay, so it helped but didn’t fix it, got it.

In case it is interesting to anyone, here is what I believe is happening.

In pre-jun-11 releases we limited the number of piece requests we make once a new segment is archived to 10. In jun-11 we upgraded the RPC library with some improvements and removed that limit completely as it seemed no longer necessary. That apparently works fine on smaller setups, but still causes issues on larger ones.

The test build above improved the performance of piece decoding on the receiving side, so the farmer can clear the queue of received pieces more quickly and let subscription messages through. That helped, but the fact that we send ~0.5G of JSON concurrently still seems to cause subscription issues.

Here is another test build that restores the old 10-pieces-at-a-time limit in addition to the optimization I did earlier; as a result, this should be even better than the pre-jun-11 release :crossed_fingers:: Snapshot build · subspace/subspace@2f5fa47 · GitHub

Let’s see how it behaves at the next segment, but I believe we have narrowed the issue down much better now. Of course it would have been better if subscriptions didn’t break on the RPC server in the first place, but that is less trivial and needs to be addressed upstream.
