Cluster Farm plotter/farmer panic

You’re welcome. It’s easy enough at this point - cache sync doesn’t seem to be burdening the CPU or WAN bandwidth much, and plotting is currently impossible, so no reason not to try it.

For the record, at the moment my entire subspace processing (NATS, controller, cache, farms) is confined to the single Threadripper box. I have other disks on other boxes I’d like to employ eventually, but for the time being I’m not using them as I am running and evaluating unrelated processes that depend on LAN bandwidth. I figured I could at least keep running everything local to that box until that’s done, anyway. The TR has about 60% of my roughly 225TB of plots.


I found a likely explanation in Make sure metadata for plotted sectors is written sequentially by nazar-pc · Pull Request #2847 · subspace/subspace · GitHub. The root issue was present in older releases, but it likely only became apparent with some optimizations that made things faster and more likely to race.

Assuming the issue was identified correctly, scrubbing should find and fix the broken sectors. Please run scrub on the farms that were potentially affected, or those sectors will keep triggering errors until they naturally expire much later.
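As a rough sketch, running it per farm directory looks something like this; the paths are placeholders, so double-check the exact syntax against scrub --help for your release:

subspace-farmer scrub /path/to/farm1 /path/to/farm2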

I triggered a build from the above PR that fixes the issue: Snapshot build · subspace/subspace@817d45b · GitHub

Bit confused. If the farms themselves have bad sectors, shouldn’t I be encountering errors during actual farming? Because I haven’t been.

The errors I’ve received have all been about pieces, and only during plotting. My piece cache rebuild is at 37.12%, by the way, which (as a percentage of 200TB) should have every piece covered soon; I’m hoping the rest of the cache rebuild will go faster after that.

Well, pieces can be read from the cache, and from plots too if the farmer has to. I found one edge case that could potentially result in errors like the ones you saw originally.

As to farming, it will not result in an error unless you win a reward on an affected sector, which may explain the behavior you’re seeing. Worth doing a scrub either way.

I would imagine that running scrub on approximately 130TB of plots would take considerably longer, and involve a lot more manual work, than this piece cache rebuild will.

I’m inclined to wait until I actually see an error occur on a won sector before taking such a drastic step. And please don’t take offense, but it’s still hard for me to trust scrub. Remember that it cost me something like 16TB of plots in the past, and didn’t produce any log messages telling me what it was doing while it did it. And even if I were willing to test it on one plot, in the cluster setup I’m not sure how to get the raw metrics data to see how many sectors have been expired, since I don’t know how to just dump the metrics data to a browser page. Every way I’ve tried browsing to the Prometheus IP and port, I get no result, even with /metrics added to the URL. If you can tell me how to get a browser dump of the metrics, I’d appreciate it.

Never mind on the browser metrics dump. Not sure why it didn’t work before but it’s working now.

EDIT: But, oddly, it’s only giving me “Plotted” and “Not Plotted” per farm. No “Expired” or “About to Expire”. Which seems somewhat impossible, as it’s got to be a good month since I did a replot on these drives.

(NotPlotted came about because when I switched to cluster, it reallocated the 1% cache on each farm to plotted space, and I’ve only managed to plot about half of that 1%. I guess it’s not going to expire stuff till that’s done.)
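(For anyone following along: a metrics dump is just a plain HTTP GET against whatever Prometheus listen address the controller or farmer was started with. The host and port below are placeholders, and the grep is only a convenience filter, since the exact sector metric names vary by release.)

curl -s http://<host>:<metrics-port>/metrics | grep -i sector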

I understand, fair enough, but I do believe scrub doesn’t have issues, and there is a --dry-run option now that will find and report problems without attempting to fix them.

I also want to add more options to it so you can scan just the metadata, or just a specific range of sectors, to make it much faster than it was.
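For example, a report-only pass on a single farm directory would look roughly like this (the path is a placeholder):

subspace-farmer scrub --dry-run /path/to/farm1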

That is expected while plotting is not fully complete; expired sectors will only start being detected after initial plotting is over.

Okay, that last bit tracks with what I surmised in my last edit. My understanding of how scrub works is that it expires the bad sectors so they get replotted. Would sectors expired by the scrub process show up in metrics even though initial plotting isn’t done? If they would, letting me validate for myself on a single test plot that it isn’t expiring more than the log shows or than I expect, then I’ll give it a try on one plot, and on more if that works out.

They will not show up unless initial plotting is done, no.

Alright, thanks. Then unless I see an error on a won sector during farming, I’ll just try finishing rebuilding the piece cache like we planned yesterday, then restart the plotter and hopefully finish initial plotting on at least some of the farms (most of those 4TB plots only have around 5-15 sectors left to do; it’s just slow when there are 29 plots involved).

If just rebuilding the cache is enough to get me fully working again, that should be a valuable testing result too.

Quick update: cache sync is at about 41% now, which means it should have over 80GB of stored pieces, which I thought was all of them, but it’s still taking 5 minutes for each 0.13%. The hoped-for speedup doesn’t seem to be happening, so don’t expect my next update for a couple of days.

At least it’ll provide a couple more days to see if just farming ever produces that error you predicted.

So it turned out it did actually speed up somewhat after the initial 41% or so, going from 5-8 minutes per 0.13% to about 1-3 minutes. Much better. It more or less finished syncing about an hour ago and has been sitting on “Syncing piece cache to the latest history size, this may pause block production if takes too long” since then. When I get some indication that that’s finished, I’ll give the plotter a try.

Here is a test build with the latest fixes for the edge cases I found, in case you want to give it a try: Snapshot build · subspace/subspace@46730d6 · GitHub

So that history size sync thing took about an hour and a half. I restarted the plotter.

Made no difference:

2024-06-14T20:06:06.188419Z  INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3581}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:06.191017Z  INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3582}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:06.191492Z  INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3583}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:06.191998Z  INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3584}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:06.192432Z  INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3585}: subspace_farmer::cluster::plotter: Plot sector request
thread 'plotting-3.4' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'plotting-1.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-2.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-1.3' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-0.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-2.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-0.3' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-0.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-0.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-1.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-1.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-2.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-2.3' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.3' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
2024-06-14T20:06:19.709555Z  INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3574}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:19.711392Z  INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3575}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:19.721938Z  INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3576}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:19.723607Z  INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3577}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:19.724594Z  INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3578}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:27.900106Z  INFO subspace_farmer::utils: Received SIGINT, shutting down farmer...

Will your new build fix this problem by itself, or is scrubbing the farms first still the only way?

Still had no errors just farming, btw.

EDIT: Just started a scrub dry-run of the farm that it said it was trying to plot when it threw those last errors. It’ll be a while.

You can try the latest test build, which supports partial scrubbing: Snapshot build · subspace/subspace@46730d6 · GitHub
Use --target metadata to check the things that I suspect were affected. It should take just a few seconds.
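Roughly like this, with the farm path taken from your own setup; adding --dry-run on top should also work if you don’t want it to write anything:

subspace-farmer scrub --target metadata /mnt/sub/space57/farm1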

Did this. Forgot to specify --dry-run for this one, whoops, so it would’ve actually scrubbed.

If it did anything, or found anything, it didn’t print it in the log.

I guess I’ll retry the --dry-run full scrub with this new version.

If there are no warnings or errors then it didn’t find any issues with metadata, which is puzzling :thinking:

Since the farmer hasn’t been having any issues, I presume that any issues in the farm are only in the last few sectors, like 3575+. Should get to that in the full scrub in around 15 minutes or so.

2024-06-14T20:53:11.939140Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm: Checked 3560/3581 sectors
2024-06-14T20:53:13.690929Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm: Checked 3570/3581 sectors
2024-06-14T20:53:14.519010Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm: Checked 3580/3581 sectors
2024-06-14T20:53:14.526821Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm: Checking cache file path=/mnt/sub/space57/farm1/piece_cache.bin
2024-06-14T20:53:14.527706Z ERROR {farm_index=0}: subspace_farmer::commands::scrub: Irrecoverable farm error occurred, your file system might need to be repaired or disk might need to be replaced path=/mnt/sub/space57/farm1 error=Cache file does not exist at /mnt/sub/space57/farm1/piece_cache.bin

I’ll presume that error is bogus, since it really shouldn’t be checking for a local cache file anymore anyway. It seems like all the sectors checked out though.

Quick question. I’ve been canceling the plotter job as soon as it starts throwing these panics, though it seems like it wants to move on and keep going, picking a new set of sectors to try.

Unless I’m mistaken, when a plotting job is interrupted, it does save some of the progress; “Collecting partially plotted pieces” or something like that might be the message in question. Is it possible that it’s throwing these errors based on previously stored, half-finished attempts to create these sectors? If I just let it run for a while so it can skip over and dump all those half-plotted sectors, might it then continue OK?

It should not print that if --target metadata is used, but otherwise it will check the cache, which indeed doesn’t exist in a cluster setup.

The plotter is completely in-memory. Once you stop it, it is gone, and so is everything it was working on.

Can you post the full commands you use to run the controller, cache, and plotter instances?
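For comparison, a single-box cluster usually looks something like the sketch below. The NATS address, paths, sizes and reward address are placeholders, and some flag names are written from memory, so they may differ slightly from your release:

# NATS server is assumed to be running separately (cluster mode needs its max_payload raised)
subspace-farmer cluster --nats-server nats://127.0.0.1:4222 controller --base-path /path/to/controller --node-rpc-url ws://127.0.0.1:9944
subspace-farmer cluster --nats-server nats://127.0.0.1:4222 cache path=/path/to/cache,size=200GiB
subspace-farmer cluster --nats-server nats://127.0.0.1:4222 plotter
subspace-farmer cluster --nats-server nats://127.0.0.1:4222 farmer --reward-address <REWARD_ADDRESS> path=/mnt/sub/space57/farm1,size=4TB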