I’ve discovered an issue with the farmer cluster: one of my devices showed a plotting progress of 25%, but when I terminated the process and restarted it, the plotting progress started from 11%. I’ve observed this regression in plotting progress on three devices already.
From the following logs, it can be seen that the progress for some directories is correct, while the progress for others still starts from 11%.
2024-06-12T06:19:00.023146Z INFO {farm_index=6}:{sector_index=844}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.78% complete)
2024-06-12T06:19:00.023179Z INFO {farm_index=9}:{sector_index=1797}: subspace_farmer::single_disk_farm::plotting: Plotting sector (25.08% complete)
2024-06-12T06:19:11.930200Z INFO {farm_index=9}:{sector_index=1798}: subspace_farmer::single_disk_farm::plotting: Plotting sector (25.10% complete)
2024-06-12T06:19:17.057047Z INFO {farm_index=8}:{sector_index=836}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.67% complete)
2024-06-12T06:19:20.532444Z INFO {farm_index=8}:{sector_index=837}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.68% complete)
2024-06-12T06:19:20.532491Z INFO {farm_index=9}:{sector_index=1799}: subspace_farmer::single_disk_farm::plotting: Plotting sector (25.11% complete)
2024-06-12T06:19:21.007924Z INFO {farm_index=4}:{sector_index=826}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.53% complete)
2024-06-12T06:19:21.770713Z INFO {farm_index=4}:{sector_index=827}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.54% complete)
2024-06-12T06:19:32.795749Z INFO {farm_index=6}:{sector_index=845}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.80% complete)
2024-06-12T06:19:34.491403Z INFO {farm_index=6}:{sector_index=846}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.81% complete)
2024-06-12T06:19:36.100887Z INFO {farm_index=6}:{sector_index=847}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.82% complete)
2024-06-12T06:19:36.210342Z INFO {farm_index=6}:{sector_index=848}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.84% complete)
2024-06-12T06:19:36.302947Z INFO {farm_index=6}:{sector_index=849}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.85% complete)
2024-06-12T06:19:40.789982Z INFO {farm_index=6}:{sector_index=850}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.86% complete)
2024-06-12T06:19:46.113586Z INFO {farm_index=6}:{sector_index=851}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.88% complete)
2024-06-12T06:19:47.563037Z INFO {farm_index=6}:{sector_index=852}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.89% complete)
2024-06-12T06:19:47.800029Z INFO {farm_index=6}:{sector_index=853}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.91% complete)
2024-06-12T06:19:48.628995Z INFO {farm_index=6}:{sector_index=854}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.92% complete)
2024-06-12T06:20:04.700004Z INFO {farm_index=3}:{sector_index=1719}: subspace_farmer::single_disk_farm::plotting: Plotting sector (23.99% complete)
2024-06-12T06:20:11.049328Z INFO {farm_index=3}:{sector_index=1720}: subspace_farmer::single_disk_farm::plotting: Plotting sector (24.01% complete)
2024-06-12T06:20:30.394135Z INFO {farm_index=11}:{sector_index=901}: subspace_farmer::single_disk_farm::plotting: Plotting sector (12.58% complete)
2024-06-12T06:20:39.720970Z INFO {farm_index=3}:{sector_index=1721}: subspace_farmer::single_disk_farm::plotting: Plotting sector (24.02% complete)
2024-06-12T06:20:42.659259Z INFO {farm_index=3}:{sector_index=1722}: subspace_farmer::single_disk_farm::plotting: Plotting sector (24.04% complete)
2024-06-12T06:20:43.338458Z INFO {farm_index=6}:{sector_index=855}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.93% complete)
2024-06-12T06:20:46.530493Z INFO {farm_index=6}:{sector_index=856}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.95% complete)
2024-06-12T06:20:46.581016Z INFO {farm_index=3}:{sector_index=1723}: subspace_farmer::single_disk_farm::plotting: Plotting sector (24.05% complete)
2024-06-12T06:20:49.157432Z INFO {farm_index=6}:{sector_index=857}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.96% complete)
2024-06-12T06:20:50.985764Z INFO {farm_index=6}:{sector_index=858}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.98% complete)
2024-06-12T06:21:32.231964Z INFO {farm_index=5}:{sector_index=834}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.64% complete)
2024-06-12T06:21:32.250923Z INFO {farm_index=5}:{sector_index=835}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.66% complete)
2024-06-12T06:21:32.926954Z INFO {farm_index=5}:{sector_index=836}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.67% complete)
2024-06-12T06:21:40.592100Z INFO {farm_index=5}:{sector_index=837}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.68% complete)
2024-06-12T06:21:42.299103Z INFO {farm_index=5}:{sector_index=838}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.70% complete)
2024-06-12T06:21:42.321624Z INFO {farm_index=5}:{sector_index=839}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.71% complete)
2024-06-12T06:21:46.532787Z INFO {farm_index=6}:{sector_index=859}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.99% complete)
2024-06-12T06:21:46.935228Z INFO {farm_index=5}:{sector_index=840}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.73% complete)
2024-06-12T06:21:47.234652Z INFO {farm_index=6}:{sector_index=860}: subspace_farmer::single_disk_farm::plotting: Plotting sector (12.00% complete)
2024-06-12T06:22:19.819054Z INFO {farm_index=7}:{sector_index=1887}: subspace_farmer::single_disk_farm::plotting: Plotting sector (26.34% complete)
2024-06-12T06:22:19.835337Z INFO {farm_index=7}:{sector_index=1888}: subspace_farmer::single_disk_farm::plotting: Plotting sector (26.35% complete)
2024-06-12T06:22:19.853795Z INFO {farm_index=7}:{sector_index=1889}: subspace_farmer::single_disk_farm::plotting: Plotting sector (26.37% complete)
2024-06-12T06:22:19.885186Z INFO {farm_index=7}:{sector_index=1890}: subspace_farmer::single_disk_farm::plotting: Plotting sector (26.38% complete)
2024-06-12T06:22:25.708600Z INFO {farm_index=7}:{sector_index=1891}: subspace_farmer::single_disk_farm::plotting: Plotting sector (26.40% complete)
2024-06-12T06:22:28.180803Z INFO {farm_index=7}:{sector_index=1892}: subspace_farmer::single_disk_farm::plotting: Plotting sector (26.41% complete)
2024-06-12T06:22:34.929596Z INFO {farm_index=6}:{sector_index=861}: subspace_farmer::single_disk_farm::plotting: Plotting sector (12.02% complete)
2024-06-12T06:23:44.043431Z INFO {farm_index=5}:{sector_index=841}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.74% complete)
2024-06-12T06:23:44.054940Z INFO {farm_index=5}:{sector_index=842}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.75% complete)
2024-06-12T06:24:18.089181Z INFO {farm_index=7}:{sector_index=1893}: subspace_farmer::single_disk_farm::plotting: Plotting sector (26.42% complete)
2024-06-12T06:25:44.544506Z INFO {farm_index=8}:{sector_index=838}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.70% complete)
2024-06-12T06:25:45.638903Z INFO {farm_index=8}:{sector_index=839}: subspace_farmer::single_disk_farm::plotting: Plotting sector (11.71% complete)
2024-06-12T06:25:52.401306Z INFO {farm_index=11}:{sector_index=902}: subspace_farmer::single_disk_farm::plotting: Plotting sector (12.59% complete)
2024-06-12T06:25:52.415988Z INFO {farm_index=11}:{sector_index=903}: subspace_farmer::single_disk_farm::plotting: Plotting sector (12.60% complete)
2024-06-12T06:25:52.431115Z INFO {farm_index=11}:{sector_index=904}: subspace_farmer::single_disk_farm::plotting: Plotting sector (12.62% complete)
2024-06-12T06:25:53.289754Z INFO {farm_index=11}:{sector_index=905}: subspace_farmer::single_disk_farm::plotting: Plotting sector (12.63% complete)
2024-06-12T06:25:53.311782Z INFO {farm_index=11}:{sector_index=906}: subspace_farmer::single_disk_farm::plotting: Plotting sector (12.65% complete)
2024-06-12T06:25:54.760099Z INFO {farm_index=9}:{sector_index=1800}: subspace_farmer::single_disk_farm::plotting: Plotting sector (25.13% complete)
The plotting progress must be correct. You have different farms and each of them has their own progress. Some 11%, some 25%. Order in which sectors are plotted across different farms is not deterministic, so you may see different output each time you start the software.
This looks like a networking issue where your app lost connection to NATS server for some reason. It reconnected later, but the progress of whatever it was doing will be reset in this case.
This likely means you’re putting too much load on one of the components. Hard to say which one from the logs though. Can you tell me more about the setup? How many things you’re running, what hardware, networking situation, etc.?
Each farmer is plot at the same time. I can confirm that the progress of each farmer is around 25% before the process ends. However, after restarting the process, the progress of some farmers becomes 11%. Suppose there are 6 14T farmers on a device, and 6 farmers start plot at the same time. After a few days, the progress of these 6 farmers will be between 11% and 25%. Is there such a big difference?
As already said, the process is not deterministic, so it is possible that some farms will make more progress than others.
That is A LOT of plotters for a poor single controller and cache instance. The code is not magic and may not be able to process that many at the same time, receiving side is simply overwhelmed.
I’ll create an experimental build that should scale both controller and cache much further, will post a link once I have it. The expectation was that it wouldn’t be necessary, but I don’t think huge setups will work as is unfortunately.
Either way you should consider running more than one node and controller+cache pair in such setup for performance and redundancy. And maybe even clustered NATS.
I have about 200 plotters, but not all of them are turned on yet. If all of them are turned on, do you recommend dividing these 200 plotters into multiple clusters? If they are to be combined into a cluster, how many controllers, caches, and NATS components are recommended to run?
You don’t have to have multiple clusters. Single cluster can have multiple controllers, caches, etc. I do not have numbers for you as I do not have hardware to experiment with such setups. This is something we expect community to try out and share with each other.
It will run multiple instances of internal service on all components of farming cluster, which should allow for higher capacity to receive requests internally. By default it run 32 copies, but can be customized with CLI option.
Today, I tested this version and noticed many frequent warn-level logs. This time, I restarted the process and chose to retain the historical logs. The issue of the P plot progress regression reoccurred. According to the following logs, it can be seen that for the same farmer, the progress before shutdown was significantly greater than the progress after restarting. This happened with multiple farmers. Please take this issue seriously.
I think you might have networking issue because warnings seem to happen after this each time:
2024-06-12T18:04:24.031918Z INFO async_nats: event: disconnected
2024-06-12T18:04:24.048468Z INFO async_nats: event: connected
As to progress being lost, it is possible to lose a bit of progress due to how plotting works: plotting will happen for up to 50 sectors at a time (controlled via --sector-encoding-concurrency, but at the same time only sectors that are plotted sequentially are actually available for farming and will be seeing as plotted on restart. So if you have plotted sectors 1, 10 and 11, without sectors 2-9 being plotted on restart you’ll see that just sector 1 is plotted because there is no information stored about sectors further away.
I will take a look at improving ordering such that you’re more likely to plot sectors sequentially, but you should still look into why farmer looses connection to NATS server.
That is definitely very odd, I don’t have a good explanation for it right now I’m araid, I don’t see how it can happen in the code yet.
Could you set environment variable RUST_LOG=info,subspace_farmer=debug and post logs from that? It will print more information about what is happening.
Are you sure you were already running jun-11 at that point? That specific situation was fixed in jun-11 and should no longer happen with a single NATS connection.
Yes, before the process was stopped, it was running the gemini-3h-2024-jun-11 version. I will continue testing after setting the environment variables tomorrow and then provide the logs.
Thanks for reporting and patience with this. I understand it must be frustrating this much plotting progress.
I recommend to restart it with mentioned environment variable and maybe not waiting as long this time since it seems to have skipped quite a lot of sectors, so should be reproducible much quicker.
When I set the environment variables today and restarted the process to reproduce the progress loss issue, I found that the progress was normal. This is very strange…