You’re welcome. It’s easy enough at this point - cache sync doesn’t seem to be burdening the CPU or WAN bandwidth much, and plotting is currently impossible, so no reason not to try it.
For the record, at the moment my entire subspace processing (NATS, controller, cache, farms) is confined to the single Threadripper box. I have other disks on other boxes I’d like to employ eventually, but for the time being I’m not using them as I am running and evaluating unrelated processes that depend on LAN bandwidth. I figured I could at least keep running everything local to that box until that’s done, anyway. The TR has about 60% of my roughly 225TB of plots.
Assuming the issue was found correctly, scrubbing should find and fix broken sectors. Please run scrub on the farms that were potentially affected, or the broken sectors will keep triggering errors until they naturally expire much later.
Bit confused. If the farms themselves have bad sectors, shouldn’t I be encountering errors during actual farming? Because I haven’t been.
The errors I’ve received have all been about pieces, and only during plotting. My piece cache rebuild is at 37.12% btw, which (as a percent of 200TB) should have every piece covered soon, hoping the rest of the cache rebuild will go faster after that.
Well, pieces can be read from the cache, and from plots too if the farmer has to. I found one edge case that could potentially result in errors like the ones you saw originally.
As to farming, it will not result in an error unless you win a reward on an affected sector, which may explain the behavior you’re seeing. Worth doing a scrub either way.
I would imagine that running scrub on approximately 130TB of plots would take considerably longer, and a lot more manual work, than this piece cache rebuild will take.
I’m inclined to wait until I actually see an error occur on a won sector before taking such a drastic step. And please don’t take offense, but it’s still hard for me to trust scrub. Remember that it has cost me something like 16TB of plots in the past, and didn’t produce any log messages telling me so while it was doing it. And even if I were willing to test it on one plot in the cluster, I’m not sure how to get the raw metrics data to see how many sectors have been expired, since I don’t know how to just dump the metrics data to a browser page. Every way I’ve tried to browse to the Prometheus IP and port, I get no result, even after adding /metrics to the URL. If you can tell me how to get a browser dump of the metrics, I’d appreciate it.
Never mind on the browser metrics dump. Not sure why it didn’t work before but it’s working now.
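In case it helps anyone else, the same dump can be pulled from a shell instead of a browser; the address and port below are just placeholders for wherever the farmer’s metrics endpoint is listening:
# placeholders: substitute the IP/port the farmer's metrics endpoint listens on
curl http://<farmer-ip>:<metrics-port>/metrics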
EDIT: But, oddly, it’s only giving me “Plotted” and “Not Plotted” per farm. No “Expired” or “About to Expire”. Which seems somewhat impossible, as it’s got to be a good month since I did a replot on these drives.
(NotPlotted came about because when I switched to cluster, it reallocated the 1% cache on each farm to plotted space, and I’ve only managed to plot about half of that 1%. I guess it’s not going to expire stuff till that’s done.)
I understand, fair enough, but I do believe we don’t have any issues in scrub, and there is a --dry-run option now that will find and report issues without attempting to fix them.
I also want to add more options to it so you can scan just the metadata or just a specific range of sectors, to make it much faster than it was. See the sketch below for what a dry run looks like.
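For example, a dry run against a single farm would look roughly like this (the path is just an example pointing at one of your farm directories; adjust it, and the exact invocation may differ slightly between builds):
# dry run: report problems without modifying anything on the farm (example path)
subspace-farmer scrub /mnt/sub/space57/farm1 --dry-run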
That is only expected if plotting is not fully complete; it will start detecting expired sectors only after initial plotting is over.
Okay, that last bit tracks with what I surmised in my last edit. My understanding of how scrub works is that it expires the bad sectors so they get replotted. Would sectors expired by the scrub process show up in metrics even though initial plotting isn’t done? If they would, then I can validate for myself on a single test plot that it isn’t expiring more than the log shows or than I expect, and I’ll give it a try on one plot, and more if that works out.
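If they do show up, I figure I can just dump the metrics and filter for anything mentioning expiry, something like this (the address and port are placeholders for my farmer’s metrics endpoint, and the exact metric names may differ):
# placeholders: substitute the IP/port of the farmer's metrics endpoint
curl -s http://<farmer-ip>:<metrics-port>/metrics | grep -i expire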
Alright, thanks. Then unless I see an error on a won sector during farming, I’ll try just finishing rebuilding the piece cache like we planned yesterday, then restart the plotter and hopefully finish initial plotting on at least some of them (most of those 4TB plots only have like 5-15 sectors left to do; it’s just slow when there are 29 plots involved).
If just rebuilding the cache is enough to get me fully working again, that should be a valuable testing result too.
Quick update: Cache sync is at about 41% now, which means it should have over 80GB of stored pieces, which I thought was all of 'em, but it’s still taking 5 minutes to do each 0.13%. The hoped for speedup doesn’t seem to be happening. So don’t expect my next update for a couple days.
At least it’ll provide a couple more days to see if just farming ever produces that error you predicted.
So it turned out it did actually speed up somewhat after the initial 41% or so, going from 5-8 minutes per 0.13% to about 1-3 minutes. Much better. It sorta finished syncing about an hour ago and has been sitting on “Syncing piece cache to the latest history size, this may pause block production if takes too long” since then. When I get some sort of indication that that’s finished, I’ll give the plotter a try.
So that history size sync thing took about an hour and a half. I restarted the plotter.
Made no difference:
2024-06-14T20:06:06.188419Z INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3581}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:06.191017Z INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3582}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:06.191492Z INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3583}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:06.191998Z INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3584}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:06.192432Z INFO {public_key=b0e2e4f70fa69651c81dcb48801cf29524321edc57a0d4172b1a86428336dc3b sector_index=3585}: subspace_farmer::cluster::plotter: Plot sector request
thread 'plotting-3.4' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'plotting-1.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-2.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-1.3' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-0.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-2.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-0.3' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-0.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-0.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-1.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-1.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-2.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-2.3' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.1' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.0' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.3' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
thread 'plotting-3.2' panicked at /home/ubuntu/actions-runner/_work/subspace/subspace/crates/subspace-farmer-components/src/plotting.rs:526:44:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
2024-06-14T20:06:19.709555Z INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3574}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:19.711392Z INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3575}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:19.721938Z INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3576}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:19.723607Z INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3577}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:19.724594Z INFO {public_key=a25ba08f3ec742931ad31fa19ae8afcca70e77e6f70596273562f6c58e78e307 sector_index=3578}: subspace_farmer::cluster::plotter: Plot sector request
2024-06-14T20:06:27.900106Z INFO subspace_farmer::utils: Received SIGINT, shutting down farmer...
Will your new build fix this problem by itself, or is scrubbing the farms first still the only way?
Still had no errors just farming, btw.
EDIT: Just started a scrub dry-run of the farm that it said it was trying to plot when it threw those last errors. It’ll be a while.
You can try the latest test build, which supports partial scrubbing: Snapshot build · subspace/subspace@46730d6 · GitHub
Use --target metadata to check the things that I suspect were affected. Should take just a few seconds.
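Something along these lines should work (the path is just an example pointing at one of your farms; adjust as needed):
# metadata-only scrub of a single farm; this should finish in seconds (example path)
subspace-farmer scrub /mnt/sub/space57/farm1 --target metadata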
Since the farmer hasn’t been having any issues, I presume that any issues in the farm are only in the last few sectors, like 3575+. Should get to that in the full scrub in around 15 minutes or so.
2024-06-14T20:53:11.939140Z INFO {farm_index=0}: subspace_farmer::single_disk_farm: Checked 3560/3581 sectors
2024-06-14T20:53:13.690929Z INFO {farm_index=0}: subspace_farmer::single_disk_farm: Checked 3570/3581 sectors
2024-06-14T20:53:14.519010Z INFO {farm_index=0}: subspace_farmer::single_disk_farm: Checked 3580/3581 sectors
2024-06-14T20:53:14.526821Z INFO {farm_index=0}: subspace_farmer::single_disk_farm: Checking cache file path=/mnt/sub/space57/farm1/piece_cache.bin
2024-06-14T20:53:14.527706Z ERROR {farm_index=0}: subspace_farmer::commands::scrub: Irrecoverable farm error occurred, your file system might need to be repaired or disk might need to be replaced path=/mnt/sub/space57/farm1 error=Cache file does not exist at /mnt/sub/space57/farm1/piece_cache.bin
I’ll presume that error is bogus, since it really shouldn’t be checking for a local cache file anymore anyway. It seems like all the sectors checked out though.
Quick question. I’ve been canceling the plotter job as soon as it starts throwing these panics, though it seems like it wants to move on and keep going, picking a new set of sectors to try.
Unless I’m mistaken, when a plotting job is interrupted, it does save some of its progress; “Collecting partially plotted pieces” or something like that might be the message in question. Is it possible that it’s throwing these errors based on previously saved, half-finished attempts at these sectors? If I just let it run for a while so it can skip over and discard all those half-plotted sectors, might it then continue okay?