Probably this will help.
The Subspace node has a lot of errors with memory pages.
After I stopped the node:
mmap is not a solution; in fact, it was a problem we specifically moved away from in the past, one that prevented usage of large files on Windows at all. You can find discussions about this on the forum and GitHub.
The leak is on the farmer, not the node. And those are called "page faults" in English; they are not "errors" in the sense you might expect from the name. They are not necessarily an issue to resolve and are not relevant in this case.
In Windows is leaking memory when reading random chunks of a large file - Microsoft Q&A, I have discovered that it is kernel-initiated memory mapping (which I didn't ask for and specifically instructed Windows NOT to do) that is the problem here.
Here is an experimental build based on the upcoming release of Advanced CLI; please give it a try when you have time: Snapshot build · subspace/subspace@dddd321 · GitHub
It doesn't fix the issue globally, but it should bypass the issue for farming specifically, and hopefully that is good enough for now.
NOTE: After the first start it might seem to not plot anything for a while; that is expected, just let it finish.
I've tested on one of my PCs: CPU 13500, Win 11, 15.8 TiB in total, 64 GB RAM.
The result is amazing: total RAM usage is now only 11 GB, and it's been stable for more than 30 minutes.
Plotting is very slow for the first 2 sectors, but after that it's back to normal. Maybe slightly longer than the Feb 19 version, but I'm not really sure.
I will have to watch the rewards / missed rewards and will report back in a couple of hours.
I'm also testing on a second PC with Windows 10 and will report on that as well.
RAM usage is low, but the results from the 2 PCs are very bad in terms of missed rewards. On both of my PCs the miss rate is up to 60-70% after a 6-hour run. I'm frustrated to say it, but this change is a "no go".
As below, M0 and M4 are the 2 machines I've tested with the new release. M1, M2 and M3 have an almost 0% missed-reward rate.
M0 has missed 7 of 11 rewards. M4 has missed 4 of 6.
I have no idea what tool that screenshot is from (not familiar with it) or what all of the numbers mean (it would have been more helpful if you had posted it as text without truncating columns, etc.), but it looks like all of your farms are missing rewards, aren't they?
It would also be helpful to know with which arguments you run each farmer.
Sorry for that.
M0: running for 6 hours, missed rewards 7 out of 11.
M1: running for 6 days, missed rewards 2 out of 331.
M2: running for 21 hours, missed rewards 0 out of 46.
M3: running for 6 days, missed rewards 4 out of 299.
M4: running for almost 6 hours (this is the Win 10 machine I started testing later), missed rewards 4 out of 6.
As mentioned, M0 and M4 were used to test this build; M1, M2 and M3 have been running the Feb 19 release. The missed-reward rate on M1-M3 is acceptable to me; as you can see, it's just 1-2%.
This is the Subspace monitoring tool almost all of us are using right now; it's very helpful for monitoring node/farmer status, plot time, and missed rewards. I thought you were also using it.
What machines are those, which arguments did you use for the farmers, and what was the miss rate before the upgrade? Is the miss rate the same over time, or did you have a few misses and then it stopped missing after that?
The miss rate before the upgrade was 1 to 2%, as we can see on M1-M3. M0 to M3 have the same build: CPU 13500 and 64 GB RAM. I was running with record concurrency 6 on M0-M3, so I think that's why rewards are sometimes missed. M4 is the 10900 one, also 64 GB; I had no missed rewards on it with the Feb 19 release since I set record concurrency to only 1 (that CPU uses too much electricity and easily gets hot, so I set it to 1).
The miss rate with Feb 19 was random and rare.
The miss rate with the new build is around 2/3, so it's frequent. Wins and misses are mixed, with more misses, as the numbers show.
Can you post all the arguments you use on the farmer? That gives me much more information at once than multiple comments explaining it.
Also, it would be helpful if you could compare reward misses with Snapshot build · subspace/subspace@df919f9 · GitHub, which is the same as the test build above, but without the Windows-specific unbuffered I/O change.
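For context, here is a minimal Rust sketch of what the unbuffered I/O change refers to on Windows; this is an illustration under assumptions, not the farmer's actual code, and open_unbuffered is a made-up helper name. Opening a file with FILE_FLAG_NO_BUFFERING bypasses the kernel page cache, which is where the kernel-initiated memory mapping described above accumulates.

use std::fs::{File, OpenOptions};
use std::io;
use std::os::windows::fs::OpenOptionsExt;

// Windows flag from winbase.h: bypass the system page cache entirely.
// Reads must then be aligned to the volume sector size (typically 4096
// bytes) and sized in multiples of it.
const FILE_FLAG_NO_BUFFERING: u32 = 0x2000_0000;

// Hypothetical helper: open a plot file for direct, uncached reads.
fn open_unbuffered(path: &str) -> io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(FILE_FLAG_NO_BUFFERING)
        .open(path)
}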
Below is the full CLI, with the same options from M0 to M4. They have the same build. I also changed the farmer name for M0.
.\subspace-farmer3h_19Feb.exe farm path=C:\1,size=1600G path=E:\1,size=1900G path=F:\1,size=1900G path=H:\1,size=1900G path=I:\1,size=1900G path=J:\1,size=1900G path=K:\1,size=1900G path=L:\1,size=3900G --farm-during-initial-plotting true --in-connections 25 --out-connections 25 --pending-in-connections 25 --pending-out-connections 25 --node-rpc-url ws://192.168.2.205:9945 --metrics-endpoints 192.168.2.200:2222 --plotting-thread-pool-size 20 --replotting-thread-pool-size 20 --farming-thread-pool-size 20 --sector-downloading-concurrency 3 --sector-encoding-concurrency 1 --record-encoding-concurrency 6 --reward-address stxxxxxxx
I'd try running with defaults for all the plotting/replotting/farming/concurrency options; what you did can make things worse for rewards. Also, that build works slightly differently than feb-19. New builds are not the same as old versions and de-prioritize plotting threads to leave room for farming. For all the testing I'm asking for, please run defaults; don't mess with it.
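To illustrate the de-prioritization mentioned above, here is a sketch under the assumption that a rayon thread pool plus the thread_priority crate is used; this is not necessarily the farmer's exact code:

use rayon::ThreadPoolBuilder;
use thread_priority::{set_current_thread_priority, ThreadPriority};

// Build a plotting pool whose worker threads lower their own OS priority,
// so farming/auditing threads win CPU time under contention.
fn plotting_pool(threads: usize) -> rayon::ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(threads)
        // start_handler runs once on each worker thread as it starts.
        .start_handler(|_index| {
            let _ = set_current_thread_priority(ThreadPriority::Min);
        })
        .build()
        .expect("failed to build plotting thread pool")
}

With this setup the OS scheduler favors the farming threads whenever plotting and farming compete for cores, which is why manually pinning thread-pool sizes and concurrency can undo the intended balance.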
I'm running this build now. Let's wait for the results after a few hours. FYI, I ran with the same options as the CLI I shared above.
Please just use defaults for all testing purposes; otherwise the results are not necessarily representative of what they should be.
I've just re-run with all defaults. Will report after I sleep. Thanks.
subspace-farmer3h_4Mar.exe farm path=C:\1,size=1600G path=E:\1,size=1900G path=F:\1,size=1900G path=H:\1,size=1900G path=I:\1,size=1900G path=J:\1,size=1900G path=K:\1,size=1900G path=L:\1,size=3900G --farm-during-initial-plotting true --node-rpc-url ws://192.168.2.205:9945 --metrics-endpoints 192.168.2.200:2222 --reward-address stxxxxx
I've got no hits so far on either of the testing machines after an hour and a half, just misses: 3 on one machine and 2 on the other. RAM usage has decreased to acceptable levels, though. Will continue to monitor.
I am also testing the first build posted. RAM usage is OK and there were 2 misses at the start, but SSD usage/utilization is quite high at 60-90%, while the NVMe is at about 7%.
I am not sure if it is doing something and whether the usage will drop, but it was like 10x less with feb-19.
Plotting and farming are in progress with the default command line.
Will test the new build now.
Are you still plotting? Are you using defaults for concurrency and things like that? The full farmer command would be helpful as well.
Snapshot build · subspace/subspace@df919f9 · GitHub would be worth testing too, to narrow down the root cause.
Also, for those who are testing this, it would be helpful if you could run the auditing/proving benchmarks. From my testing they were a bit faster with the new code, but maybe that depends on the setup. There will be rayon/unbuffered (new) and rayon/regular (old) implementations in there.
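If it helps, these benchmarks can be run against an existing farm directory, for example like this (assuming the farmer's benchmark audit / benchmark prove subcommands and reusing a farm path from the commands above; adjust the executable name to your build):

.\subspace-farmer.exe benchmark audit C:\1
.\subspace-farmer.exe benchmark prove C:\1

Both the rayon/unbuffered and rayon/regular implementations should show up in the results, so the old and new code paths can be compared directly on the same hardware.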