High RAM usage for May 15th release

Issue Report

There’s a significant issue I’ve observed with the May 15th release: it causes abnormally high RAM usage. To put it into perspective, I have around 83TiB of plots on this machine. When the issue occurred, the farmer was nearing completion of the last 5 of the 33 plots. RAM usage with the May 6th and prior releases was around 40GB. With the May 15th release, however, after just 10 hours of farming and plotting the usage was 100GB; after 17 hours it was 188GB.

Environment

  • Windows 11 Pro for Workstations
  • Advanced CLI
  • AMD EPYC 9004-series, 64 cores
  • RAM at 4800MT/s

Thanks for reporting. I checked the changes between may-06 and may-15 and didn’t see anything obvious.

I triggered a test build for the code in between those two releases; can you check and tell me whether it works better or not? Snapshot build · subspace/subspace@2097fcb · GitHub

Also, I’d like to double-check that you actually went back to may-06 and confirmed the memory usage issue is gone there, rather than relying on what you think was the case back then.

Yes, I ran the May 6th release before creating this thread to confirm there was no memory issue. I also switched back to May 15th to confirm the memory issue I observed was reproducible. I will run the test release and share the results. Thanks

I tried the custom build you shared, but I finished plotting around the same time I started testing this build. For farming alone, I don’t see any difference between the April 25th, May 6th, May 15th builds, and the custom build. All the builds used around 30GB of RAM. Is there anything I can do to simulate the plotting process?

I see, interesting. So it must be somehow plotting-related and Windows-specific. I still see nothing that would indicate a regression from may-06 to may-15, though.

Depending on how long it takes to consume a lot of RAM, you can decrease and then increase the size of a farm to create a few sectors that need to be plotted, just to try it out.
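
For example, here is a rough sketch of what that could look like (the farm path, sizes, node URL and reward address below are placeholders; substitute your real values):

# Restart the farmer with one farm configured slightly smaller than it currently is, so it gets trimmed:
.\subspace-farmer.exe farm --node-rpc-url ws://127.0.0.1:9944 --reward-address <your-reward-address> path=D:\plot1,size=1.7TiB
# Then stop it and restart with the original size, so a few new sectors have to be plotted:
.\subspace-farmer.exe farm --node-rpc-url ws://127.0.0.1:9944 --reward-address <your-reward-address> path=D:\plot1,size=1.8TiB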

P.S. There is no apr-25 release just like there was no may-16 release (I have edited the original post).

I changed the plot sizes and plotted using the custom release that you shared, and it doesn’t have the memory issue.

Moreover, I updated my Windows to the latest version and ran the May 15th release to rule out any Windows issues. The memory usage is still high even after the update.

Thanks for fixing the release dates. I have kept local copies of releases I have tried, and some of them were indexed by the download dates.

Thanks for testing!
So the only change I see that might potentially impact this is what Snapshot build · subspace/subspace@4d4ccd5 · GitHub adds on top of the previous build. Please give it a try once it is built and let me know whether the issue is present. In the meantime I’ll look into the changes included there to see what might be going on.

Also, if you can provide the exact full command you’re running the farmer with, plus logs, that’d be great!

The farmer command is:

.\subspace-farmer.exe farm --node-rpc-url ws://192.168.10.105:9944 --metrics-endpoints 192.168.10.105:7879 --reward-address subspace:address path=H:\plot1,size=1.8TiB path=H:\plot2,size=1.8TiB path=G:\plot1,size=1.8TiB path=G:\plot2,size=1.8TiB path=I:\plot1,size=2.75TiB path=I:\plot2,size=2.75TiB path=I:\plot3,size=2.75TiB path=I:\plot4,size=2.75TiB path=I:\plot5,size=2.75TiB | Tee-Object -file subspace_farmer.log -append

I will test the new build tomorrow, and update.

Same here. I only noticed this after the farmer printed something like “allocation some bytes of memory failed” and exited. This happened on several machines, all Win11 with 64GB memory and no more than 60-70TB of plots per machine. It never happened before; I had to revert back to the May 6 version.

I just noticed there is a new release today, but it doesn’t mention the memory issue, so I suppose it’s not fixed, right?

This means it ran out of system memory.

Nothing was done intentionally to address it because I do not yet know the root cause.

Okay, two more test builds that I think will narrow it down to a single commit.

I think this should be last good commit: Snapshot build · subspace/subspace@eccc821 · GitHub
And this will be the first bad commit: Snapshot build · subspace/subspace@647b363 · GitHub

Can you confirm?

So here is the report for all the test releases:

2097fcb - good
4d4ccd5 - bad
eccc821 - good
647b363 - bad

So 647b363 introduced a simple change that allows truly concurrent piece cache reads in many cases. This should improve the performance of piece cache reads and, unless the concurrency parameters were heavily customized, should not result in higher memory usage.

Since this is only happening while plotting, it is the reads that are causing issues, but I see no reason why this should be happening. On your CPU there are 8 CCDs, so it will plot up to 8 sectors concurrently; there should be no way for that to use over 100G of RAM, and I know we have even larger farmers on Linux that do not report such issues.

Can you set the environment variable RUST_LOG to info,subspace_farmer::utils::farmer_piece_cache=trace,subspace_farmer_components::plotting=trace, run one of the problematic builds, and upload the collected logs somewhere? That should give me a bit more detail about what your farmer is doing during plotting and why it might use so much memory.
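
For reference, one way to set that for the current PowerShell session before launching the farmer (just a sketch; adapt it to however you normally start the farmer):

$env:RUST_LOG = "info,subspace_farmer::utils::farmer_piece_cache=trace,subspace_farmer_components::plotting=trace"
# then start the farmer as usual, e.g. .\subspace-farmer.exe farm ... | Tee-Object -file subspace_farmer.log -append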

I ran the 647b363 build with the environment variable set as you suggested. Please find the logs for the farmer and node here: logs

Right now I don’t think that particular first bad commit is wrong on its own, but it likely uncovered an issue elsewhere that wasn’t reproducible before (at least not easily).

After re-reading the code I found an edge case addressed in Improve `FarmerPieceGetter` by nazar-pc · Pull Request #2793 · subspace/subspace · GitHub. I don’t think it was actually possible to trigger, but just in case I initiated a test build with those changes anyway: Snapshot build · subspace/subspace@7f94a49 · GitHub

Please try it with the environment variable RUST_LOG set to info,subspace_farmer::utils::farmer_piece_getter=trace,subspace_farmer_components::plotting=trace and share logs like you did last time (note that the environment variable has a slightly different value this time).
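
As before, a PowerShell sketch for setting it (only the filter value differs from last time):

$env:RUST_LOG = "info,subspace_farmer::utils::farmer_piece_getter=trace,subspace_farmer_components::plotting=trace"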

So far I don’t see any other issues, but it is also hard to blame over 100G of RAM just on memory allocator behavior alone, so there must be something somewhere.

Thanks for all of the tests so far!

I am happy to hear that the testing is getting us closer to finding the root cause of the leak. I ran 7f94a49 with the changed value for RUST_LOG. This version also has the problem, and I stopped it at 70GB RAM. The good version stabilizes around 40-50GB for my setup. LOG

Okay, so it was indeed not triggered. I’ll keep looking and will post once I find something relevant.

I checked the logs and didn’t see anything unusual. Given that it worked before, when piece cache reads were blocking, I decided to try constraining piece-getting concurrency; please give it a try: Snapshot build · subspace/subspace@5321620 · GitHub

It does seem like allocator misbehavior of sorts to me so far, not that it helps you as a user.

The logs for the latest trial build are here: logs

This build also has the runaway memory problem. I am happy to try more builds if you have any other ideas.