Farming cluster

We had an issue, originally created in response to "Can set up farm to connect to specified farm on internal network?", that suggested a hierarchical architecture for connecting multiple farmers together.

After thinking about it more, I realized we should instead decompose the previously monolithic farmer into multiple separate components, which would allow larger farming setups to scale better.

The result of that is Farming cluster; curious what the community thinks about this.

P.S. Running the farmer as a single component, as it works today, will remain the supported and default setup for most users. Space Acres may or may not support this architecture; that is still TBD.


As a farmer, I like the separation and flexibility around resource allocation and utilization. As a decentralization advocate, I dislike enabling more scale in farmers in general, but perhaps decentralization is a pipe-dream once we add tokenomics and money and economies of scale take over. In any case, I think it is better to enable it than to be out-competed by alternatives.

One thought on the architecture: does it also make sense to split out retrieval into its own Retriever, so that the part responsible for retrieving pieces from plotted sectors on request is separate? It seems to be a different concern from that of farming.

Not sure how much it helps. On one end we're trying to make things easy to use for small farmers so they don't join farming pools, and on the other we're making things flexible and efficient for large farmers so they can remain autonomous and not join a farming pool either.

If we don't provide such tools, others will, and likely in a less decentralization-friendly way. There is no question of whether large and small farmers exist; the only question is what software they run and how they run it. So it would be nice if the reference and official default software, built with the network's goals in mind, is what farmers use.

Plotted pieces are physically located within sectors. If sectors are managed by the farmer role in the described architecture, then retrieval will come from farmers one way or the other, so I'm not sure how and why it would have to be separate.


The idea is brilliant. However, my concern is that while we want to make way for more scalability, do we introduce more complexity, such that we'll have to spend time fixing the complications that we introduced?

I'd like to avoid that. So, in order to balance things, I'd like to combine node/controller/cache into one, and farmer/plotter into one. I believe the integrated roles will have no communication issues, and thus troubleshooting and fixing things will be simpler.

I don't think I fully understand how you came to those conclusions. If networking will already be involved between components once the split happens, how do you quantify "no communication issues" and "troubleshooting will be simpler"?

I'll give only one example. If you have separate roles, someone may have the node, controller, and cache on 3 PCs. In this case, the whole farm stops operating when any of the 3 PCs is down, so the risk is roughly 3x.

And if we take combinations of the 5 roles, how many setups can we have?

This is unnecessarily complicated to me. The original idea of Chia was that people could farm it with a Raspberry Pi, but in the end everyone built powerful PCs to handle hundreds or thousands of TB. So I don't think we have to go further into separate roles just so someone can use a low-CPU build for a particular role. People who can afford hundreds of TB of SSD to farm Subspace want a simple setup, rather than the complication of using a 'low-CPU PC'.

First of all, you don't have to run them on different machines: you can still run all components on one machine and then some on other machines as well, and maybe the CLI will even allow enabling multiple components in the same process for convenience.

Second, I think what you're getting at more broadly is redundancy, and the plan is to eventually allow multiple controllers, each connected to its own dedicated node, so you can have redundancy and failover in the cluster when something is being upgraded or goes down for a different reason.

Can you clarify what you mean here?

Regular farming mode as it is today, where everything is vertically integrated, is not going anywhere. I envision clustering as a separate way of farming that those who are interested can opt into if they see value in it, but no one will be forced and this will not be the recommended default farming method.

I've seen that you've tried to avoid complication so you and the team can focus on your still-incomplete to-do list.

To me, the biggest takeaway from this 'farming cluster' is the cache role, from which all farmers can get the piece cache (blockchain history) locally. So any disk change on existing farmers, or firing up brand-new farmers, will be quick and instant.

My suggestion is just good enough for this purpose.

No matter whether these roles are run on the same machine or different ones, as long as they're separate instances, there will be more risk. Then people who have issues will come for help, and this will distract you and the team from developing other stuff.

I see there are other areas to develop/optimize, especially the GUI app. For this one, I propose a config.txt template that the user can modify to get the full set of options the CLI has. I can see that other blockchains have an update link available in the GUI so the user can just click that link to upgrade as soon as there is a new release; do we have that yet?

That is one; another popular challenge is network usage from having more than one public P2P port open.

It doesn't install an update automatically, but it does show that there is a new version and opens the page when you click on it. But this is not the topic of the current discussion.

I like it. Even as a small farmer, I think it makes sense to offer a complete set of options for large-scale farming. Subspace is doing well compared to other "mining"-type projects in how accessible and convenient it is, helping small operators accomplish what they want to accomplish.

Providing massive farmers with the same level of support is a good look for the network, showing that there are opportunities for a wide range of operators. This will help with word of mouth and enable more enterprise-level investment.

I think most participants can benefit from different options on how to deploy their own setup: the simplest monolithic instance, replaceable components, or a large-scale cluster solution.


Some questions from @Qwinn on Discord:

may I ask how many different executables this new method would entail, or if still 2, then how many processes would have to be launched using different parameters on those executables? It seems like it would increase the number, and therefore the complexity of both initial setup and the applying of updates (modifying more scripts to the new version #s, etc).

As mentioned in the Notion document, I expect the same farmer executable will gain a cluster subcommand like this:

subspace-farmer cluster <ROLE> <role-specific args...>

Depending on complexity, it might be possible to run multiple roles at once:

subspace-farmer cluster \
    <ROLE1> <role-specific args...> \
    -- \
    <ROLE2> <role-specific args...> \
    -- \
    <ROLE3> <role-specific args...>
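For illustration only, here is a rough sketch of what that could look like spread across a few machines. The role names (controller, cache, plotter, farmer) are the ones from this proposal; the arguments are placeholders rather than final CLI flags, and how the components find each other is intentionally omitted since that part of the design is not settled yet:

    # machine 1: controller + cache (args are placeholders, not final CLI)
    subspace-farmer cluster controller <controller-specific args...>
    subspace-farmer cluster cache <cache path/size args...>

    # machine 2: plotter, e.g. the box with the fastest CPU/GPU
    subspace-farmer cluster plotter <plotter-specific args...>

    # machine 3: farmer pointing at its local disks
    subspace-farmer cluster farmer <farm path/size args...>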

To be honest, if there is an advantage to the farmer in this seemingly more complex setup versus the prior idea of Supervisor Farmer, which seemed simpler to set up and maintain (and that should also provide the same scaling advantages that the cluster method would, at least as far as I can see), I'm not yet seeing it. Perfectly willing to be convinced tho.

That is valid feedback; if existing users can't be convinced, it will be even harder to explain to other users.

I think the biggest advantage of this cluster architecture is that you can scale and upgrade different components independently.

For example, you might want to plot as fast as possible with GPUs (when available), but there is no need to keep a large stock of GPUs on mainnet just to maintain already plotted farms; you may even rent a GPU server temporarily if your Internet connection allows you to download the plotted sectors.

When plotting/replotting does happen, you will not have a situation where some machines are still plotting while others are idling; you'll be able to utilize all the resources you have.

And in the future with multiple controllers you’ll be able to have redundancy/failover to reduce downtime when you upgrade software.

Lastly, the supervisor was originally supposed to point to remote farms (similar to how it points to local disks), which, while easier to implement, meant that the supervisor would need to restart every time a new farm was added. With the cluster architecture, the different components are supposed to discover each other, so you can add and remove farms as you need them without restarting other components.
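To make that concrete with the same placeholder syntax as above (the farmer role name is from the proposal; everything else is illustrative): adding or retiring a disk would just mean starting or stopping one more farmer process, while the controller, cache and plotter processes that are already running stay untouched:

    # a new machine/disk joins the cluster: start one more farmer,
    # nothing else needs to be restarted
    subspace-farmer cluster farmer <path/size of the new farm...>

    # retiring that disk later: stop this one process and the rest
    # of the cluster keeps running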

Sounds fine as long as it is possible to have multiple farmer and plotter instances.
Generally there is a question of how n Farmers and m Plotters are managed, i.e. will there be a farmer pool and a plotter pool (managed by the controller?), or is there a direct farmer <-> plotter setup?

I also think this setup can easily be manipulated/changed/edited to create a pool, which might be a concern.

All farmers will be able to use all plotters in the cluster as needed, they are not mapped to each other explicitly.

It is impossible to block the existence of pools from a technical point of view, but we can try to make them pointless from a UX/practical point of view.

Imagine you had a choice between running a single independent farmer on each local machine or joining a farming pool that provides some clustering software. From that standpoint what we’re doing is moving away from farming pools, not closer to them. At least this is the way I see it.

Indeed. Unless there is some kind of workable DID solution.


While I prefer the separation of roles you originally outlined, this “advantage” seems to be an implementation choice that could be handled either way - not sure this is exclusive to the new proposed architecture. The supervisor approach could probably be developed to support a runtime command to register/remove a remote.

You’re correct, it is possible to implement either way. The original idea with supervisor was based on the assumption of making remote farms similar to local farms.

Thinking about this again. For the record, since I was initially skeptical and didn't see the advantages of this method over the farming supervisor: I was eventually sold on it.

One of the things that sold me was the notion that there’d be a central plotter that could do all the plotting work for multiple farms, and never go idle as long as any of the farms had remaining room.

The idea that this would work implies that the problems people have had "farming over the network" to remote disks were caused by excess network traffic from the farming side of the current structure, and not the plotting side; that is, that just plotting to remote drives, without farming, wouldn't cause the problems that attempts to farm/plot remotely previously/currently seemed to entail. Would that be correct?

(For the record, I only ever tried to plot remotely once, to a simple SATA 1TB SSD. It worked well enough, but bandwidth usage seemed high and I got the impression that if the disk had been bigger, it would've been a problem. I'm also gleaning that it's problematic from posts by other people indicating higher misses with that setup, etc.)

I mainly ask for confirmation because a section of the Notion document implies that plotting network usage is higher than farming, which I found surprising. Still, even if plotting and farming are equally intensive on the network, splitting off the farming would still reduce that load by half, and I can see that mostly eliminating the issues.

My brain starts boiling when I try to understand what you write here. Can you maybe re-phrase it somehow?

Er, I'll try. Previously - a few versions ago - people who reported attempting to farm and plot over a network connection had difficulty doing so: it consumed a substantial amount of the bandwidth a 1G or 2.5G local network could handle. I myself noticed some lag when trying to do anything else on my 2.5G network, on either of the two boxes involved, while trying to plot and farm a small 1TB disk over the LAN.

My question was whether plotting alone, rather than plotting and farming, over a 1G/2.5G LAN connection would consume sufficiently fewer resources so as not to pose the issues it used to. You seemed to suggest in your Notion document that plotting could consume more network bandwidth than farming, so it seems a fair question to ask.

If it’s still not clear what I’m asking, never mind.

EDIT: Example, right here.

Yes, I get that's farming, not plotting, but if plotting consumes even more network resources than farming does… it's fair to ask whether it could both create LAN lag and decrease plotting speed too.