Farming cluster

I see, thanks for clarifying.

Farming requires a continuous stream of data. Not much, but it needs to happen all the time and will grow with the amount of space audited. That is in theory; in practice, things like NFS add some overhead and are not necessarily tuned for low latency and huge numbers of tiny reads, which likely causes issues, but it might be possible to resolve that with some tuning.

What I mean by bandwidth usage during plotting is that 1G sectors will need to be downloaded in the form of pieces in order to start plotting and then uploaded to the farmer for storage once the sector is created. With it being 1G in size, that will take a few seconds on a 1 Gbps connection, assuming everything is local. If you imagine a future GPU plotter that can produce sectors even quicker than CPUs can today, you might start having delays in plotting utilization caused by networking.
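As a rough back-of-envelope illustration (my own numbers, assuming a 1 GiB sector and ignoring protocol overhead):

1 GiB ≈ 8.6 Gbit
8.6 Gbit / 1 Gbps ≈ 8-9 s to download the pieces
8.6 Gbit / 1 Gbps ≈ 8-9 s to upload the finished sector

So a plotter that finishes sectors much faster than that could start to be limited by a 1 Gbps link unless transfers overlap with encoding.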

At the same time, farming only needs to receive a small notification and, occasionally, when a solution is found, to send the solution back. This requires very little bandwidth.

The requirements described in the document are relative to each other; if you have a 2.5G LAN or faster, it may not be an issue in practice. All of this will need to be tested once built, of course.

Thanks for the detailed response.

Is Samba an actively better sharing method for the planned cluster implementation? I've slightly preferred NFS to Samba in recent years for most sharing purposes on my Linux boxes, but I can switch things back to Samba for Subspace if it would improve things significantly.

No idea honestly, and it is out of the scope of this particular topic.

A very early (but supposedly functional) version of the farming cluster is building: Snapshot build · subspace/subspace@1f61f5b · GitHub

It is not final, and breaking changes to the network layer are expected (so if you decide to run an early version, you'll have to stop all instances and start the new version rather than upgrading one by one).

There are docs in the CLI that should be sufficient to get started, but I'll also provide short examples here.

A nats.io server is required for this to work; running it in Docker is recommended, but it can also be started directly on a regular machine. Cluster configurations are also supported, but you'll have to read their docs on how to set that up.

NATS should be started with a config file containing the following:

max_payload = 2MB

Simply save it into a nats.config file and start the NATS server with nats-server -c nats.config; with Docker it will look something like this:

docker run \
    --name nats \
    --restart unless-stopped \
    --publish 4222:4222 \
    --volume ./nats.config:/nats.config:ro \
    nats -c /nats.config
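For reference, if you do want NATS high availability, a minimal clustered nats.config might look roughly like the sketch below (the cluster name, ports and peer address are placeholders of mine; consult the NATS docs for the authoritative syntax):

max_payload = 2MB

cluster {
  name: "subspace-nats"
  listen: 0.0.0.0:6222
  routes: [
    nats-route://OTHER_NATS_IP:6222
  ]
}

Each NATS server in the cluster would use the same file, with routes pointing at its peers.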

Now you have to start 4 components (as separate instances for now; it will be possible to start several of them in the same app in the future).

Controller (will create a few small files for networking purposes):

subspace-farmer cluster --nats-server nats://IP:4222 \
    controller \
        --base-path /path/to/controller-dir \
        --node-rpc-url ws://IP:9944

Cache (supports multiple disks just like farmer):

subspace-farmer cluster --nats-server nats://IP:4222 \
    cache \
        path=/path/to/cache,size=SIZE
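Since the cache takes disk specifications the same way the farmer does, multiple disks can simply be listed one after another (paths and sizes below are placeholders):

subspace-farmer cluster --nats-server nats://IP:4222 \
    cache \
        path=/path/to/cache-1,size=SIZE \
        path=/path/to/cache-2,size=SIZE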

Plotter (stateless):

subspace-farmer cluster --nats-server nats://IP:4222 \
    plotter

Farmer (supports multiple disks like usual):

subspace-farmer cluster --nats-server nats://IP:4222 \
    farmer \
        --reward-address REWARD_ADDRESS \
        path=/path/to/farm,size=SIZE

Note that all instances can be on different machines, but they need to point to the same NATS server/cluster.

Most of the familiar farmer CLI options are still available, but they are spread across the various subcommands accordingly; use --help to discover them.

The on-disk format of everything is compatible with the regular farmer; hypothetically, you can point all instances to the same directory and it will continue to work just fine.

Expect :dragon:, :bug:, :fireworks: and all kinds of :exploding_head: issues for now, though I would appreciate carefully composed bug reports with logs, etc.

More polish and better documentation are to come; I just wanted to share early progress with the community and get early feedback.

Please keep the number of messages in this thread to a minimum. Think carefully about whether your message contributes something substantially valuable to the discussion, and for casual chatting, post your message on Discord instead.

I may remove messages that do not follow the ^ policy.


So the PC where we run the NATS server will be the single point of failure in the cluster farming model? If the answer is YES, do you have a plan to eliminate it in future development?

In my humble opinion, I would expect the following as ideal:

  1. No NATS server
  2. More than one controller and cache in the network, which is where all farmers can have resiliency.

Thank you.

As already mentioned above, NATS supports clustering. Read their docs on how to achieve high availability. You can specify more than one NATS endpoint by repeating the --nats-server argument (though auto-discovery of other NATS servers in the cluster should also happen automatically).
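For example, a controller pointed at a two-node NATS cluster would simply repeat the flag (IPs are placeholders):

subspace-farmer cluster \
    --nats-server nats://IP1:4222 \
    --nats-server nats://IP2:4222 \
    controller \
        --base-path /path/to/controller-dir \
        --node-rpc-url ws://IP:9944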

You can have multiple controllers as well, but read the CLI docs, because you'll have to assign a dedicated group of caches to each controller to manage. Caches can't be shared between controllers.
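As a hypothetical sketch of that setup (I am assuming the grouping flag is called --cache-group here; check --help for the actual name in this build), each controller and its dedicated caches would share a group identifier:

subspace-farmer cluster --nats-server nats://IP:4222 \
    controller \
        --base-path /path/to/controller-a \
        --node-rpc-url ws://IP:9944 \
        --cache-group group-a

subspace-farmer cluster --nats-server nats://IP:4222 \
    cache \
        --cache-group group-a \
        path=/path/to/cache-a,size=SIZE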

One more question: do we have any additional docs besides your description in the post above? If yes, can you please share a link? I've tried to look at Advanced CLI | Farm from Anywhere, but nothing about cluster farming is there.

I would appreciate any insight into the issues listed below. Roughly 82TB over 48 plots.

Issues:

Controller: When running the controller and the cache is not set large enough, errors continuously flood the screen. Implementing better error handling for this scenario would prevent the constant display of "not enough space" errors. It would be beneficial if the ideal cache size could be queried from the network and shared in the output. Would implementing a maximum cache size setting and allowing the size of piece_cache.bin to be dynamic make sense?

Farming: System CPU utilization is very low (<15%). Not sure if it's due to the other errors or not.

Errors:

Controller: Can't preallocate plot file, probably not enough space on disk errors arise when using existing plot paths that contain piece_cache.bin files. It would be nice to have old piece cache files automatically removed if unnecessary.

Controller: buffer of stream grows beyond limit - Lots of these errors throughout the piece cache sync process.

Controller: Error while reading piece from cache, might be a disk corruption error=request timed out: deadline has elapsed farm_index=0 key=Key(b"\x90\xb2\xce\x05\x08\x16{\0\0\0\0\0\0") offset=31213 - Lots of these errors before Farm initialized successfully starts for each plot. I tried putting the cache on two separate disks, as well as a 700GiB ramdisk.

Plotter: ERROR subspace_farmer_components::segment_reconstruction: Recovering missing piece failed. missing_piece_index=49315 received_pieces=0 required_pieces_number=128 numerous errors like this.

Questions:

Does clustering support the Prometheus metrics server? If so, which clustering subcommand does it belong to? It does not work under farmer.

Logs:
https://ssi.ssc.farm/logs/controller.log
https://ssi.ssc.farm/logs/farmer.log
https://ssi.ssc.farm/logs/plotter.log
https://ssi.ssc.farm/logs/cache.log

There is nothing anywhere, just the CLI docs. This is an early version for which there isn't even a pull request open yet. You expect too much at this point :slightly_smiling_face:

I don't think the controller cares about cache size at all. I'll check the provided logs.

Can do that, but not sure it is very helpful honestly.

It is dynamic in the sense that you can add and remove caches at any time and resize them at will.

Farming is not supposed to be CPU-intensive.

Not sure I want to remove it, since it will take time to sync again in case you want to switch back, and you can technically use the same directory for both farm and cache. But yes, it will result in those errors if you have maxed out the drive previously and now specify the same farm size again (which no longer includes cache capacity in the cluster version).
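To illustrate with made-up numbers:

regular farmer:  size=2T  ->  the plot and the piece cache both fit inside that 2T allocation
cluster farmer:  size=2T  ->  2T of plot only; an old piece_cache.bin left on the same drive no longer fits

So on a previously full drive you either remove the old cache file or reduce the farm size accordingly.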

You can enable the metrics endpoint with an option after subspace-farmer cluster, though most of the metrics are TODO. I think only farmer metrics are currently present. Wait for a separate announcement about metrics.
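For completeness, the option I mean is the Prometheus listen address passed right after cluster, before the subcommand; I believe the flag is --prometheus-listen-on, but verify with --help since this is an early build (the address and port below are placeholders):

subspace-farmer cluster \
    --nats-server nats://IP:4222 \
    --prometheus-listen-on 0.0.0.0:9081 \
    farmer \
        --reward-address REWARD_ADDRESS \
        path=/path/to/farm,size=SIZE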

@vexr so did it work in the end once you fixed farm sizes or not?

I just noticed I had not captured that in the logs. It was with an initial piece cache size of 100GiB. I can recreate it if that is helpful. The out-of-space errors did not return when I tried 200GiB or 700GiB, but the other errors I shared continued.

Once I removed the piece cache files to allow it to start, it had extra sectors to plot. That was what I meant by "low farming system CPU utilization"; I should have said plotting.

It never progressed with plotting sectors.

Edit: It is my assumption that the cache size only needs to be the size of the archived blockchain history. That is why I initially set it at 100GiB. Is that assumption correct?

I have a 10 TB SSD drive on one of my servers, and 10 other servers available.
If I use one farmer node to plot this disk, it will take days.

Can I use a Subspace cluster to shorten this process to one day?
Can a Subspace cluster use all the CPUs to plot one big SSD quickly?

The plotter is the one using CPU in this case, but if you have removed the piece cache and not synced a new one, it may take a while before you see the first sectors being plotted.

UPD: I see the controller was overwhelmed with requests for pieces; that is something I'm planning to improve soon.

Yes.

Please use Discord for such questions. Also, they are already answered at the beginning of this topic; please read it in full before asking questions.

The piece cache was completely synced at 2024-05-08T05:47:48.298972 and I started the farmer after that at 2024-05-08T05:47:51.58019Z.

I am not sure if the Error while reading piece from cache, might be a disk corruption request timed out: deadline has elapsed errors are what caused it to not progress. Those errors in controller.log started at 2024-05-08T05:48:24.534677.

I am able to farm with those drives on the may-06 build without issue.

Let me know if you need any other tests or logs to assist.

New experimental build: Snapshot build · subspace/subspace@84c1680 · GitHub

Feature-wise almost the same, but should work a bit better.

@vexr try this one, please.


System Specs

Dual Socket 7742 128 cores (256 threads)
Cache: 200GiB
Plots: 1.9TB * 32
10G link between the NATS server and the rest of the cluster (cache, controller, farmer, and plotter on a single system).

Errors

  • During Test 3, Failed to send error progress update error=send failed because receiver is gone in the plotter, followed by continuous errors stating Farm errored and stopped farm_index=# error=Low-level plotting error: Timed out without ping from plotter in the farmer. This did not seem to recover, and I stopped Test 3.

  • Previously reported buffer of stream grows beyond limit errors in controller during piece cache sync.

Observations

plotter

  • Test 1. Plotting uses roughly 50% of system CPU resources with the same sector encoding concurrency setting (16) as may-06. may-06 uses most CPU resources (~95%).
  • Test 2. Doubling sector encoding concurrency to 32 (Same as L3 cache groups) lowers the resources further to about 30%.
  • Test 3. Setting sector encoding concurrency (16) and record encoding concurrency (16) uses about 40% of CPU resources.

Other

subspace-farmer cluster plotter --help

  • --record-encoding-concurrency Should be "How many records farmer…"
  • --plotting-thread-priority. What are the option(s)? Only min is defined as the default option.
  • Minor inconsistencies in the use of periods at the end of some options.

Questions

  • Is there still work to be done to maximize system resources with plotter, or are there settings I should look at changing manually?
  • plotter - Is there a way to increase the Plot sector request count? I can't seem to get it to more than 8 at a time. I wonder if this is what is causing the lack of full resource utilization?

Logs

cache
controller
farmer
plotter


Both the plotter and the farmer have concurrency options. The farmer by default will not attempt to plot more than 8 sectors at a time to limit peak memory usage, even if there are more plotting resources in the cluster, so by increasing concurrency on the plotter you're leaving resources idle there.

This is exactly the reason you see this:

This is something that will happen sooner or later; the farmer is unfortunately not quite ready for it and will exit the farm. Something to improve in future builds.

My misunderstanding of the workflow led me to remove all concurrency settings from the farmer and focus on the plotter. With this change, I am seeing the CPU usage expected, and the plotting times are in line with where they should be!

Am I correct in assuming that plotter resources should be set to the available resources of the host running the plotter(s), and that the farmer role is geared more towards directing what resources should be used for its specified plot(s)?

The overall resources for the farmer process seem minimal (aside from higher RAM requirements with multiple concurrent plots). Aside from system memory and possibly limited networking, in what other scenario(s) would it make sense to limit concurrency on the farmer?

It may be helpful to outline the new roles a bit to aid the mental model of how the clustering workflow operates. cache and controller are pretty self-explanatory, but explaining the differences between farmer and plotter would be helpful.

I will start rolling this out to other hosts and report anything new that pops up.

Let me know if there are any specific test(s) I can help with further.

Your work on this can be described as nothing short of amazing! :fire: :fire:

There is a description linked in the very first post of this thread. A slightly expanded description can be found in the various relevant modules if you look on GitHub; see the files under crates/subspace-farmer/src at Comparing main...farming-cluster · subspace/subspace · GitHub

Thanks, and I appreciate high-quality feedback with specs and logs; it helps a lot! I have things to work on next.


Slightly improved version: Snapshot build · subspace/subspace@31a98a9 · GitHub

The most important change is that it will not crash the whole farm when plotting of a sector fails; it'll simply restart plotting of that sector until it succeeds. This will also extend to non-cluster mode for those who have unstable computers (though they'll still want to make them stable).

There is some cluster networking tweaking that should help.

Hello, I found that the farmer did not start plotting during testing. Here are the logs from the controller, plotter, and farmer.

controller:

2024-05-11T08:20:17.809119Z  INFO subspace_farmer::commands::cluster::controller::farms: Farm initialized successfully farm_index=3 farm_id=01HXKBSPNC03EYKXGY6ZBWP7PY
2024-05-11T08:20:17.809600Z  INFO subspace_farmer::commands::cluster::controller::farms: Farm initialized successfully farm_index=4 farm_id=01HXKBSPNBWSSZBSDYX9CFHZQP
2024-05-11T08:20:17.810006Z  INFO subspace_farmer::commands::cluster::controller::farms: Farm initialized successfully farm_index=5 farm_id=01HXKBSPNBWFZYHX3HG064677K
2024-05-11T08:21:25.966089Z  INFO subspace_farmer::farmer_cache: Synchronizing piece cache
2024-05-11T08:21:26.075180Z  INFO subspace_farmer::farmer_cache: Finished piece cache synchronization
2024-05-11T08:46:53.188274Z  WARN subspace_farmer::commands::cluster::controller::caches: Cache expired and removed, scheduling reinitialization cache_id=01HXKBS2Z5RETFGAPY4CTFPDE4
2024-05-11T08:46:53.188423Z  INFO subspace_farmer::farmer_cache: Initializing piece cache
2024-05-11T08:46:56.237947Z  INFO subspace_farmer::commands::cluster::controller::caches: New cache discovered, scheduling reinitialization cache_id=01HXKBS2Z5RETFGAPY4CTFPDE4
2024-05-11T08:47:00.696917Z  INFO subspace_farmer::farmer_cache: Synchronizing piece cache
2024-05-11T08:47:00.736406Z  INFO subspace_farmer::farmer_cache: Finished piece cache synchronization
2024-05-11T09:07:43.353913Z ERROR yamux::connection: 9c22ade8/11: buffer of stream grows beyond limit    
2024-05-11T09:07:43.357664Z ERROR yamux::connection: 9c22ade8/11: buffer of stream grows beyond limit    

plotter:

2024-05-11T08:49:52.320719Z  WARN subspace_farmer::plotter::cpu: Failed to send error progress update error=send failed because receiver is gone
2024-05-11T08:50:01.115311Z  INFO {public_key=746343717a55e4153717cd2bcdb72423a9ede143166c00be6037d70caacc7754 sector_index=0}: subspace_farmer::cluster::plotter: Plot sector request
2024-05-11T08:50:02.241213Z  INFO {public_key=90b7bdce98d71e16160aa5e46f3d5be622269d8a3a87f9d1828b95ef336dff01 sector_index=1}: subspace_farmer::cluster::plotter: Plot sector request
2024-05-11T08:50:03.685099Z  INFO {public_key=56be6e780c928d1863232ed0b4a5b1d5531fd9e9971d84dacc0de7e4a4aa6469 sector_index=0}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2024-05-11T08:50:29.396831Z  INFO {public_key=cc5a7edc111fdb9dbd534af430e659837e94e3b2c5961d860cedeab6e9a45172 sector_index=0}: subspace_farmer::cluster::plotter: Plot sector request
2024-05-11T08:50:36.829995Z  INFO {public_key=a845eeaa652edc0a1fce9302e0b0c164944be017b6173206d2309f1d5ffe8f47 sector_index=0}: subspace_farmer::cluster::plotter: Finished plotting sector successfully
2024-05-11T08:51:29.399281Z  WARN {public_key=cc5a7edc111fdb9dbd534af430e659837e94e3b2c5961d860cedeab6e9a45172 sector_index=0}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-05-11T08:51:29.399394Z  WARN {public_key=cc5a7edc111fdb9dbd534af430e659837e94e3b2c5961d860cedeab6e9a45172 sector_index=0}: subspace_farmer::cluster::plotter: Response sending ended early
2024-05-11T09:10:00.366223Z  WARN subspace_farmer::plotter::cpu: Failed to send error progress update error=send failed because receiver is gone
2024-05-11T09:10:08.904061Z  INFO {public_key=90b7bdce98d71e16160aa5e46f3d5be622269d8a3a87f9d1828b95ef336dff01 sector_index=2}: subspace_farmer::cluster::plotter: Plot sector request

farmer:

2024-05-11T08:20:17.769959Z  INFO {farm_index=3}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.00% complete) sector_index=0
2024-05-11T08:20:17.770027Z  INFO {farm_index=4}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.00% complete) sector_index=0
2024-05-11T08:20:17.770291Z  INFO {farm_index=0}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.00% complete) sector_index=0
2024-05-11T08:46:29.697587Z  INFO {farm_index=1}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.00% complete) sector_index=0
2024-05-11T08:46:39.699146Z  WARN {farm_index=1}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s sector_index=0 error=Low-level plotting error: Timed out without ping from plotter
2024-05-11T08:50:01.064426Z  INFO {farm_index=5}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.00% complete) sector_index=0
2024-05-11T08:50:02.190799Z  INFO {farm_index=3}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.10% complete) sector_index=1
2024-05-11T08:50:29.328608Z  INFO {farm_index=1}:{public_key=cc5a7edc111fdb9dbd534af430e659837e94e3b2c5961d860cedeab6e9a45172 sector_index=1}: subspace_farmer::single_disk_farm::plotting: Plotting sector retry sector_index=0
2024-05-11T08:50:39.330302Z  WARN {farm_index=1}:{public_key=cc5a7edc111fdb9dbd534af430e659837e94e3b2c5961d860cedeab6e9a45172 sector_index=1}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s sector_index=0 error=Low-level plotting error: Timed out without ping from plotter