There’s a significant issue I’ve observed with the May 15th release: it causes abnormally high RAM usage. To put it into perspective, I have around 83 TiB of plots on this machine, and when the issue occurred the farmer was nearing completion of the last 5 of the 33 plots. RAM usage with the May 6th and earlier releases was around 40 GB. With the May 15th release, however, usage was 100 GB after just 10 hours of farming and plotting, and 188 GB after 17 hours.
Also, I’d like to double-check that you actually went back to may-06 and confirmed the memory usage issue is gone there, rather than just recalling that it wasn’t a problem back then.
Yes, I ran the May 6th release before creating this thread to confirm there was no memory issue. I also switched back to May 15th to confirm that the memory issue I observed was reproducible. I will run the test release and share the results. Thanks.
I tried the custom build you shared, but I finished plotting around the same time I started testing it. For farming alone, I don’t see any difference between the April 25th, May 6th, and May 15th builds and the custom build; all of them used around 30 GB of RAM. Is there anything I can do to simulate the plotting process?
I see, interesting. So it must be somehow plotting-related and Windows-specific. I still see nothing that would indicate a regression from may-06 to may-15, though.
Depending on how much time it takes to consume a lot of RAM, you can decrease and then increase the size of the farm to create a few sectors that need to be plotted, just to try it out.
P.S. There is no apr-25 release just like there was no may-16 release (I have edited the original post).
I changed the plot sizes and plotted using the custom release that you shared, and it doesn’t have the memory issue.
Moreover, I updated Windows to the latest version and ran the May 15th release to rule out any Windows issues. The memory usage is still high even after the update.
Thanks for fixing the release dates. I have kept local copies of the releases I have tried, and some of them were labeled with their download dates.
Thanks for testing!
So I see only one change that might potentially impact this, which Snapshot build · subspace/subspace@4d4ccd5 · GitHub adds on top of the previous build. Please give it a try once it is built and let me know if the issue is still present. In the meantime, I’ll look into the changes included there to see what might be going on.
Also, if you can provide the exact full command you’re running the farmer with, as well as logs, that’d be great!
Same here. I only noticed this after the farmer printed something like “allocation of some bytes of memory failed” and exited. This happened on several machines, all Win11 with 64 GB of memory and no more than 60-70 TB of plots per machine. It never happened before; I had to revert back to the May 6th version.
So 647b363 introduced a simple change that allows truly concurrent piece cache reads in many cases. This should improve the performance of piece cache reads and, unless concurrency parameters were heavily customized, should not result in higher memory usage.
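To illustrate the general shape of that change (a rough sketch only, assuming a tokio-style async cache; `PieceCache` and `read_piece` here are placeholders, not the actual farmer types): reads that used to be serialized behind an exclusive lock can now be in flight concurrently, e.g. behind a shared read lock.

```rust
use std::sync::Arc;

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::sync::{Mutex, RwLock};

// Placeholder cache type; the real farmer piece cache internals are different.
struct PieceCache;

impl PieceCache {
    async fn read_piece(&self, piece_index: u64) -> Option<Vec<u8>> {
        // Stand-in for an actual disk read.
        let _ = piece_index;
        Some(vec![0u8; 1024 * 1024])
    }
}

// Before: an exclusive lock means piece cache reads are effectively serialized.
async fn read_pieces_serialized(
    cache: Arc<Mutex<PieceCache>>,
    piece_indices: Vec<u64>,
) -> Vec<Option<Vec<u8>>> {
    let mut pieces = Vec::with_capacity(piece_indices.len());
    for piece_index in piece_indices {
        let cache_guard = cache.lock().await;
        pieces.push(cache_guard.read_piece(piece_index).await);
    }
    pieces
}

// After: a shared read lock allows many reads to proceed at the same time.
async fn read_pieces_concurrently(
    cache: Arc<RwLock<PieceCache>>,
    piece_indices: Vec<u64>,
) -> Vec<Option<Vec<u8>>> {
    piece_indices
        .into_iter()
        .map(|piece_index| {
            let cache = Arc::clone(&cache);
            async move { cache.read().await.read_piece(piece_index).await }
        })
        .collect::<FuturesUnordered<_>>()
        .collect::<Vec<_>>()
        .await
}
```

In a sketch like this, more reads can be in flight at once, but each read’s buffer is short-lived, so by itself that shouldn’t add up to 100+ GB.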
Since this is only happening while plotting, it is reads that are causing issues, but I see no reason why this should be happening. Your CPU has 8 CCDs, so the farmer will plot up to 8 sectors concurrently, and there should be no way for that to use over 100 GB of RAM; I know we have even larger farmers on Linux that do not report such issues.
Can you set the environment variable RUST_LOG to info,subspace_farmer::utils::farmer_piece_cache=trace,subspace_farmer_components::plotting=trace, run one of the problematic builds, and upload the collected logs somewhere? That should give me a bit more detail about what your farmer is doing during plotting and why it might be using so much memory.
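For reference, that RUST_LOG value is a standard tracing filter string: a default info level plus trace-level output from the two modules of interest. A minimal sketch of how such a string is typically consumed, assuming a tracing_subscriber::EnvFilter-based setup (the farmer’s actual logging initialization may differ):

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Default everything to `info`, but emit `trace`-level events from the two
    // farmer modules relevant to piece cache reads and plotting.
    let filter = EnvFilter::new(
        "info,subspace_farmer::utils::farmer_piece_cache=trace,subspace_farmer_components::plotting=trace",
    );

    tracing_subscriber::fmt().with_env_filter(filter).init();

    tracing::info!("logging initialized with per-module trace directives");
}
```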
Right now I don’t think that particular first bad commit is wrong on its own, but it likely uncovered an issue elsewhere that wasn’t reproducible before (at least not easily).
Please try it with the environment variable RUST_LOG set to info,subspace_farmer::utils::farmer_piece_getter=trace,subspace_farmer_components::plotting=trace and share the logs like you did last time (note that the environment variable has a slightly different value this time).
So far I don’t see any other issues, but it is also hard to blame over 100 GB of RAM on memory allocator behavior alone, so there must be something going on somewhere.
I am happy to hear that the testing is getting us closer to finding the root cause of the leak. I ran 7f94a49 with the changed RUST_LOG value. This version also has the problem; I stopped it at 70 GB of RAM. The good version stabilizes around 40-50 GB on my setup. LOG
I checked the logs and didn’t see anything unusual. Given that it worked before when piece cache reads were blocking, I decided to try constraining piece-getting concurrency. Please give it a try: Snapshot build · subspace/subspace@5321620 · GitHub
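For context on what constraining piece-getting concurrency means in practice, here is a rough sketch (not the code from that snapshot build; `get_piece` is a placeholder) of bounding the number of in-flight piece requests with a semaphore:

```rust
use std::sync::Arc;

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::sync::Semaphore;

// Placeholder for the real piece getter; only here to make the sketch compile.
async fn get_piece(piece_index: u64) -> Option<Vec<u8>> {
    let _ = piece_index;
    Some(vec![0u8; 1024 * 1024])
}

/// Fetch pieces with at most `max_concurrent` requests actually running, so the
/// memory held by in-progress requests stays bounded instead of growing with demand.
async fn get_pieces_bounded(
    piece_indices: Vec<u64>,
    max_concurrent: usize,
) -> Vec<Option<Vec<u8>>> {
    let semaphore = Arc::new(Semaphore::new(max_concurrent));

    piece_indices
        .into_iter()
        .map(|piece_index| {
            let semaphore = Arc::clone(&semaphore);
            async move {
                // The permit is held for the duration of this piece request.
                let _permit = semaphore
                    .acquire_owned()
                    .await
                    .expect("semaphore is never closed");
                get_piece(piece_index).await
            }
        })
        .collect::<FuturesUnordered<_>>()
        .collect::<Vec<_>>()
        .await
}
```

If the memory growth is driven by too many piece requests (and their buffers) being alive at once during plotting, a limit along these lines should flatten it.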
So far it does look to me like allocator misbehavior of some sort, not that this helps you as a user.