Is this NUMA test result real?

  1. CPU AMD Epyc 7302 x2, RAM 16G DDR4 x16 (meaning motherboard has two Epyc 7302 processors installed and 16 memory modules, 16G each)
  2. 5m10s per sector (one sector is encoded at a time by default)
  3. 6m0s per sector, 4 sectors at a time (meaning number of downloaded and encoded sectors was manually increased)
  4. 4m30s per sector, 8 sectors at a time (meaning 8 NUMA nodes)
  5. 2m90s per sector, 8 sectors at a time (meaning 8 NUMA no

Looks plausible, though the impact of the NUMA-aware memory allocator seems huge. 7002-series Epyc processors use an I/O die, so I don’t think there should be a significant difference between accessing any of the memory channels. You do have two physical sockets though, and maybe crossing from one socket to another is very costly on the 7002 Epyc platform.

If after repeated tests this is confirmed, we might make NUMA-aware memory allocator the default because negative impact on other platforms is limited and benefit here is massive.

The optimization in the new version is still not as fast as running multiple instances of the software.

Your numbers say the opposite though :thinking:

This is the test data you gave.

I mean that in your first message the version of the farmer with NUMA support is faster than the version without NUMA support (even when configured to plot 4 sectors at a time). Why are you saying it is not as fast as running multiple instances?

I have a server with only two NUMA nodes. Running the test version, the speed is 5m-6m*2, but I can only run one instance. When I run two, the speed drops to over 10 minutes.

Without the NUMA version, I can run four instances, each running stably at 7m*1.

EPYC7302*2 is your CPU

Hm… the whole point of the new version is to utilize CPU fully, you shouldn’t need more than one instance because it’ll be less efficient, which is exactly what you see. Running multiple instances was a workaround for not supporting NUMA that is no longer necessary.

Ah, sorry for confusion. Those were just examples, they are made up numbers and just provided for illustration purposes to show how to submit test results.

My CPU has many cores, but there are only two NUMA nodes. I expect the ideal speed for my CPU to be 7m-8m*4. However, the actual speed is 5m-6m*2.

Why?

Isn’t this not a good thing?

Assuming I have 4 SSDs, the speed when running multiple instances of the software is 1SSD 7m-8m x1 x4. Using the new version and opening only one instance of the software, the speed is 4SSD 5m-6m x2 x1.
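To make the comparison above concrete, here is a rough sketch of the arithmetic. It reads "x1 x4" as 1 sector at a time in each of 4 instances and "x2 x1" as 2 sectors at a time in 1 instance, and it uses the midpoints of the quoted ranges (7.5 and 5.5 minutes). Both the reading and the midpoints are assumptions, and `sectors_per_hour` is a hypothetical helper, not part of the farmer.

```python
def sectors_per_hour(minutes_per_sector: float,
                     concurrent_sectors: int,
                     instances: int = 1) -> float:
    """Aggregate throughput, assuming each instance finishes
    `concurrent_sectors` sectors every `minutes_per_sector` minutes."""
    return 60.0 / minutes_per_sector * concurrent_sectors * instances


# Midpoints of the ranges quoted in the post (assumed, not measured).
multi = sectors_per_hour(7.5, concurrent_sectors=1, instances=4)
single = sectors_per_hour(5.5, concurrent_sectors=2, instances=1)

print(f"4 instances, 1 sector each: {multi:.1f} sectors/hour")
print(f"1 instance, 2 sectors:      {single:.1f} sectors/hour")
```

Under these assumptions the four separate instances come out ahead in total sectors per hour even though each individual sector takes longer, which seems to be the point being made here.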

Did you customize any CLI options related to thread pool size or number of encoded sectors in new version? They will interfere with the intended behavior.

Overall it is possible that on some CPUs it will still be non-ideal in some configurations, for example in case of your 8272CL it is simply not the most optimal CPU due to just 2 NUMA nodes and such a massive number of cores in each that many algorithms will not take full advantage of it.

You should still be able to benefit by running two instances instead of 4. In worst case you’ll just run 4 instances like before. As long as performance doesn’t regress I think it is a win because NUMA support is clearly better than previous default.

BTW, with the new version threads are pinned to cores, so if you set encoding concurrency to 4, you should get very good CPU utilization while also avoiding crossing NUMA nodes with just one farmer.

When I start a software process, I use the default parameters.

I am trying to start two software processes. Can I start two processes with these parameters?

--sector-downloading-concurrency 4 
--sector-encoding-concurrency 4
--farming-thread-pool-size 10
--plotting-thread-pool-size 16
--replotting-thread-pool-size 8

so?

If you want to have 4 farms plotted at the same time with one instance of the farmer, I would recommend specifying a single option:

--sector-encoding-concurrency 4

Farmer should be able to calculate all other options automatically in an optimal way.

This will result in half of each CPU being dedicated to plotting of a single sector, replotting will be configured to 1/4 of CPU cores and downloading concurrency will be set to the optimal value of 5. If you want overlap between sectors in the plotting process, you might also add --plotting-thread-pool-size 52, and each NUMA node will be processing 2 farms at the same time, but there will still be no NUMA node crossing.
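The split described above can be sketched as simple arithmetic. This is only an illustration, not the farmer's actual logic: `plotting_layout` is a hypothetical helper, and the figure of 52 logical CPUs per NUMA node is an assumption inferred from the thread pool sizes suggested in this thread.

```python
def plotting_layout(numa_nodes: int,
                    threads_per_node: int,
                    encoding_concurrency: int) -> tuple[int, int]:
    """Return (sectors per NUMA node, threads per sector) when sectors
    are spread evenly over nodes and never cross a node boundary."""
    # Each node handles an equal share of the concurrently encoded sectors.
    sectors_per_node = max(1, encoding_concurrency // numa_nodes)
    # Without overlap, a node's threads are divided among its sectors.
    threads_per_sector = threads_per_node // sectors_per_node
    return sectors_per_node, threads_per_sector


# 2 NUMA nodes, 52 logical CPUs each (assumed), 4 sectors encoded at once.
print(plotting_layout(numa_nodes=2, threads_per_node=52, encoding_concurrency=4))
```

With these inputs each node handles 2 sectors and each sector gets 26 threads, i.e. half of a CPU. Raising the pool size to 52 would let both sectors on a node share all of that node's threads instead.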

I’m fairly certain it will be more efficient than running multiple farmer instances, especially if you’re not pinning them to NUMA nodes.
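For anyone who does keep running several instances, pinning each one to a NUMA node on Linux could look roughly like the sketch below. It is not how the farmer pins its threads. `parse_cpulist` and `pin_to_numa_node` are hypothetical helpers, and the sketch only restricts CPU affinity; it does not bind memory allocations to the node.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as "0-3,8,10-11" into CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node: int) -> set[int]:
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


print(sorted(parse_cpulist("0-3,8,10-11")))
```

From a shell, `numactl --cpunodebind=0 --membind=0` placed in front of the command covers both CPU and memory binding for that process.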

I’ll try the parameters you recommended

So ~4m45s per sector, not very fast for such a system. You can set the number of plotting threads to 104 to achieve the same result as running multiple separate farmers. This is up to you to experiment and share the findings.

more slowly,

--farming-thread-pool-size 10
--plotting-thread-pool-size 16
--replotting-thread-pool-size 8

I usually use these parameters to start four instances of the software

2024-01-07T15:25:39.261781Z  INFO subspace_farmer::commands::farm: NUMA system detected numa_nodes=2
2024-01-07T15:25:39.261794Z  WARN subspace_farmer::commands::farm: Too few disk farms, CPU will not be utilized fully during plotting, same number of farms as NUMA nodes or more is recommended numa_nodes=2 farms_count=4

why is that?