Ubuntu 22.04, 12 farmers on 1 node; some of them stopped farming after the replot started, and SSD reads stopped as well. I uploaded the .rar file since the log is big (debug logging is enabled).
On a 2nd PC I monitored SSD usage: when the replot started it paused for 2-5 sec but resumed. That is the Windows one, but just 12 TB.
node startup:
./subspace-node-ubuntu-x86_64-skylake-gemini-3h-2024-jun-1 run --farmer --chain gemini-3h --name reb0rnkuc2 --base-path /root/subnode --rpc-listen-on 192.168.5.12:9944 --rpc-cors all --rpc-methods unsafe --prometheus-listen-on 192.168.5.12:11111
The reason here is likely the same as for the other user: you were not running the jun-11 node, and since the encoding of pieces changed in jun-11, the RPC connection broke in an interesting way.
@reb0rn and @vexr I’d like you to collect node and farmer logs (only the controller in case of a cluster) with RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace,soketto=trace,sc_consensus_subspace::slot_worker=debug.
We should have another segment header relatively soon, so it would be great to collect the corresponding logs then.
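To be clear, that means launching with the environment variable set, something like this (a sketch based on the startup command above; adjust the binary name to the release you are actually running, and launch the farmer/controller the same way; `tee` and the log file name are just one way to capture the output):

```bash
# Requested log filter, applied to the same startup command as above;
# `tee` writes a copy of everything to node.log while still printing it.
RUST_LOG=info,subspace_farmer=trace,jsonrpsee=trace,soketto=trace,sc_consensus_subspace::slot_worker=debug \
./subspace-node-ubuntu-x86_64-skylake-gemini-3h-2024-jun-1 run --farmer --chain gemini-3h \
  --name reb0rnkuc2 --base-path /root/subnode --rpc-listen-on 192.168.5.12:9944 \
  --rpc-cors all --rpc-methods unsafe --prometheus-listen-on 192.168.5.12:11111 \
  2>&1 | tee node.log
```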
Also, if any of you can arrange remote access to one of the farmers stuck in this way, I’d be able to collect low-level information there, but I understand that may not be a viable option for you.
Replot started in the log somewhere between 2:55 and 3:00. This time on the big farm not all SSDs stopped farming, but half did; on the rest of the farms it’s kind of random, some stopped completely and some keep farming.
The node and farm logs are at debug level.
The files are big and I’m off to sleep, not sure how to cut them down easily/quickly.
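(For cutting a window out of a big log, one quick approach, assuming each line carries a timestamp; the exact patterns depend on the log format and the times you want:)

```bash
# Keep only the lines between the first occurrences of the two timestamps
# (hypothetical values bracketing the replot window), then compress for upload.
sed -n '/02:50:/,/03:10:/p' node.log > node-replot-window.log
gzip node-replot-window.log
```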
Hm… it looks like you just set the log level to debug for everything rather than to what I asked for specifically above.
This way it both misses a lot of the critical information I was looking for and contains a lot of unnecessary logs from the node that I don’t care about.
It did give me a bit of additional information, but not enough to understand what happened. I really need either the log level I requested (@vexr maybe you happen to have it?) or remote access to the machine (I suspect there is a deadlock somewhere, but I do not see it, and it is not possible to find without access to a glitched farmer process).
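If anyone wants to collect that low-level information themselves, the usual first step for a suspected deadlock is a backtrace of every thread in the stuck process. A sketch (assumes gdb is installed; the process name is a placeholder, substitute the actual farmer binary name or a literal PID):

```bash
# Dump backtraces of all threads of the stuck farmer process to a file.
# `pidof subspace-farmer` is a placeholder for however you find the PID.
gdb -p "$(pidof subspace-farmer)" -batch -ex 'thread apply all bt' > farmer-threads.txt
```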
Replot started. 11 farm PCs are OK, but on that one farm with 17 SSDs, 4 dropped out, so it definitely helped. I am not sure whether it is related, as I think some farms were still replotting before the replot started… will upload logs later.
In case it is interesting to anyone, here is what I believe is happening.
In pre-jun-11 releases we limited the number of piece requests made once a new segment is archived to 10 at a time. In jun-11 we upgraded the RPC library with some improvements and removed that limit completely as seemingly no longer necessary, which apparently works fine on smaller setups, but still causes issues on larger ones.
The above test build improved the performance of piece decoding on the receiving side, so that the farmer can clear the queue of received pieces more quickly and let subscription messages through. That helped, but the fact that we send ~0.5G of JSON concurrently still seems to cause subscription issues there.
Here is another test build that restores the old 10-pieces-at-a-time limit in addition to the optimization I did earlier; as a result, this should be even better than the pre-jun-11 release: Snapshot build · subspace/subspace@2f5fa47 · GitHub
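For the curious, the restored behavior is plain bounded concurrency on piece fetching. A minimal sketch of the idea (not the actual subspace code; `download_piece`, the types, and the payload size are stand-ins):

```rust
use futures::stream::{self, StreamExt};

type PieceIndex = u64;
type Piece = Vec<u8>;

// Stand-in for the real RPC piece request.
async fn download_piece(index: PieceIndex) -> Piece {
    let _ = index;
    vec![0u8; 1_048_576] // placeholder payload
}

// Fetch all pieces of a freshly archived segment, keeping at most
// 10 requests in flight so the RPC connection is never flooded.
async fn download_segment_pieces(indices: Vec<PieceIndex>) -> Vec<Piece> {
    stream::iter(indices)
        .map(download_piece)
        .buffer_unordered(10) // the restored pre-jun-11 limit
        .collect()
        .await
}
```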
Let’s see how it behaves at the next segment, but I believe we have narrowed the issue down much better now. Of course, it would have been better if subscriptions didn’t break on the RPC server in the first place, but that is less trivial and needs to be addressed upstream.