Motivation
NUMA support was recently introduced in the farmer, which makes it easier to utilize CPU architectures with non-uniform memory access. Before NUMA support, farmers with such CPUs had to manually assign one farmer instance to each group of cores in the same NUMA node. The new feature benefits only server-grade CPUs such as Epyc, Threadripper, and Xeon, since those platforms usually expose the CPU's NUMA topology to the OS.
However, although consumer CPUs usually do not offer software-visible NUMA support because consumer-grade mainboards lack this feature, many recent CPUs behave like NUMA systems at the hardware level. This is especially prominent in AMD Ryzen CPUs, as the following examples show.
Examples
AMD Ryzen 9 3950X
This CPU consists of two CCDs with 8 physical cores each, where each group of 4 cores within a CCD shares the same L3 cache. This results in relatively low latency within those 4 cores and relatively high latency to the remaining 12 cores of the CPU.
AMD Ryzen 9 7950X
Like the 3950X, this CPU consists of two CCDs with 8 physical cores each, but here all 8 cores of a CCD share the same L3 cache. The latency between the 8 cores within one CCD is therefore relatively low, while the latency to the 8 cores of the other CCD is relatively high.
Note
Both CPUs support simultaneous multithreading (SMT, AMD's counterpart to Hyper-Threading) and expose two logical cores per physical core with an offset of 16. This means that, for example, logical cores 1 and 17 belong to the same physical core, and so on. This is the reason for the low latency between these logical cores.
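The numbering scheme described in this note can be sketched as a small helper. This is only an illustration: the function name `describe_core` and its parameters are hypothetical, and it assumes the 1-based numbering used in this post with a fixed SMT offset equal to the number of physical cores. Operating systems usually number logical CPUs from 0, and the real layout depends on the OS and firmware.

```python
def describe_core(logical, physical_cores=16, cores_per_ccd=8, cores_per_l3=8):
    """Map a 1-based logical core number to (physical core, CCD, L3 group).

    Assumes the SMT sibling of physical core n is logical core
    n + physical_cores, as described in the note above. All results are 1-based.
    """
    if not 1 <= logical <= 2 * physical_cores:
        raise ValueError(f"logical core {logical} is out of range")
    # Both SMT siblings fold onto the same physical core number.
    physical = (logical - 1) % physical_cores + 1
    ccd = (physical - 1) // cores_per_ccd + 1
    l3_group = (physical - 1) // cores_per_l3 + 1
    return physical, ccd, l3_group


# 7950X: one L3 cache per CCD (8 cores per L3 group).
print(describe_core(1), describe_core(17), describe_core(25))

# 3950X: two L3 groups per CCD (4 cores per L3 group).
print(describe_core(5, cores_per_l3=4), describe_core(21, cores_per_l3=4))
```

With these assumptions, logical cores that map to the same L3 group are the ones that should end up in the same low-latency group.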
Implications
Tests have shown that the farmer primarily benefits from low inter-core latency among the assigned cores. Currently, there is no way to tell the farmer which cores to use per plotting instance if the BIOS does not expose the NUMA topology to the OS. Therefore, farming on such consumer-grade CPUs cannot easily be tuned for maximum performance. As on server-grade CPUs before NUMA support, the user has to start several farmer instances and manually assign them to low-latency groups of CPU cores.
With NUMA support, server-grade CPUs now have an advantage over consumer-grade CPUs that exhibit NUMA behavior but lack explicit NUMA support in the BIOS. In my opinion, this goes against the idea of decentralization, where home farmers should not be at a disadvantage compared to server hardware.
High Level Implementation Proposal
I suggest allowing the user to specify which cores each concurrent plotting instance may use. For the 7950X, this could be provided in the following form:
--sector-encoding-concurrency-groups [1-8,17-24], [9-16,25-32]
This would tell the farmer that it may run two concurrent plotting instances, with the first pinned to cores 1-8 + 17-24 and the second pinned to cores 9-16 + 25-32. Each of these two groups of logical cores has low inter-core latency. This would be the optimal setting for the 7950X.
An issue raised by a user on Discord, where the farmer on a 7950X performs optimally with exactly 4 concurrent plotters, could be overcome this way. For example, one would be able to define 3 concurrent plotting instances with optimal performance:
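To make the proposed syntax concrete, here is a minimal sketch of how such a value could be parsed into one core list per plotting instance. It is not the farmer's actual implementation: the function name `parse_core_groups` is hypothetical, and the rule that a core may appear in only one group is my assumption.

```python
import re


def parse_core_groups(spec):
    """Parse '[1-8,17-24], [9-16,25-32]' into a list of sorted core lists."""
    groups = []
    seen = set()
    # Each bracketed section describes one plotting instance.
    for body in re.findall(r"\[([^\[\]]*)\]", spec):
        cores = set()
        for part in body.split(","):
            part = part.strip()
            if not part:
                continue
            first, sep, last = part.partition("-")
            start = int(first)
            end = int(last) if sep else start
            if start < 1 or end < start:
                raise ValueError(f"invalid core range: {part!r}")
            cores.update(range(start, end + 1))
        if not cores:
            raise ValueError("empty core group")
        # Assumption: a core may belong to only one group.
        overlap = cores & seen
        if overlap:
            raise ValueError(f"cores used in more than one group: {sorted(overlap)}")
        seen |= cores
        groups.append(sorted(cores))
    if not groups:
        raise ValueError("no core groups found")
    return groups


for index, cores in enumerate(parse_core_groups("[1-8,17-24], [9-16,25-32]"), start=1):
    print(f"plotting instance {index}: {len(cores)} logical cores, {cores[0]}..{cores[-1]}")
```

The number of groups doubles as the plotting concurrency, so a separate concurrency option would not be needed when this one is given.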
--sector-encoding-concurrency-groups [1-8,17-24], [9-16], [25-32]
Here, the first instance would use 16 logical cores from the first CCD, and the second and third plotting instances would each use 8 logical cores from the second CCD, with optimal inter-core latency.
The optimal core assignment for the 3950X would look like this:
--sector-encoding-concurrency-groups [1-4,17-20], [5-8,21-24], [9-12,25-28], [13-16,29-32]
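The group lists in this proposal follow directly from the cache layout, so they could also be generated instead of typed by hand. The following is a rough sketch under the same numbering assumptions as above (1-based logical cores, SMT siblings offset by the number of physical cores). The helper name `l3_groups` is hypothetical.

```python
def l3_groups(physical_cores, cores_per_l3):
    """Build the option value with one group per L3 cache domain.

    Each group holds the physical cores sharing an L3 cache plus their
    SMT siblings, which are assumed to be offset by physical_cores.
    """
    if cores_per_l3 < 1 or physical_cores % cores_per_l3 != 0:
        raise ValueError("physical_cores must be a positive multiple of cores_per_l3")
    groups = []
    for start in range(1, physical_cores + 1, cores_per_l3):
        end = start + cores_per_l3 - 1
        groups.append(
            f"[{start}-{end},{start + physical_cores}-{end + physical_cores}]"
        )
    return ", ".join(groups)


# 3950X: 16 physical cores, 4 cores per L3 cache.
print("--sector-encoding-concurrency-groups", l3_groups(16, 4))
# 7950X: 16 physical cores, 8 cores per L3 cache.
print("--sector-encoding-concurrency-groups", l3_groups(16, 8))
```

Uneven splits, such as a group that contains only the SMT siblings of one CCD, would still have to be written manually.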
Farmers with consumer CPUs like those in the two examples above would be able to optimize plotting speed either by measuring the inter-core latency or by simply copying the core-assignment lists of other farmers with the same CPU, without having to rely on NUMA support in the BIOS.