Comparison of Single vs. Multiple GPU Instances

As requested, I have completed the comparison of performance metrics between single and multiple instances. The analysis includes plotting time, GPU power usage, GPU utilization (%), and GPU memory usage. I’ve also included nvtop graphs to highlight any significant differences. I hope this helps in understanding how to improve single-instance performance so that it better matches that of multiple instances.

Performance Benchmarks

These performance benchmarks were conducted using two RTX 4090 GPUs with the following approach:

  • Snapshot build #383
  • A total of nine drives were used to prevent them from becoming a bottleneck.
  • After adding each plotter to the cluster, I allowed at least one minute of plotting to reduce the effect of the initial ramp-up period.
  • GPU statistics were collected from each GPU and averaged over a 300-second period, with data captured every five seconds (a sketch of this collection approach follows this list).
  • Sector times were calculated based on successful sectors plotted during the same 300-second period, aligning with the GPU statistics.
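
For reference, the per-GPU statistics above could be collected with a small polling script along these lines. This is a minimal sketch, not the exact script used for these benchmarks:

# Minimal sketch of the monitoring approach described above (not the exact
# script used for these benchmarks): poll nvidia-smi every 5 seconds for
# 300 seconds and report max/avg utilization, memory and power per GPU.
import subprocess
import time

INTERVAL_S = 5
DURATION_S = 300

samples = []  # (gpu_index, util_pct, mem_mib, power_w)
end = time.time() + DURATION_S
while time.time() < end:
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu,memory.used,power.draw",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    for line in out.strip().splitlines():
        idx, util, mem, power = (field.strip() for field in line.split(","))
        samples.append((int(idx), float(util), float(mem), float(power)))
    time.sleep(INTERVAL_S)

# Aggregate max/avg per GPU, matching the "Monitoring Results" blocks below.
for gpu in sorted({s[0] for s in samples}):
    rows = [s for s in samples if s[0] == gpu]
    utils, mems, powers = [r[1] for r in rows], [r[2] for r in rows], [r[3] for r in rows]
    print(f"GPU {gpu}:")
    print(f"  Utilization:  Max: {max(utils):.0f}%, Avg: {sum(utils) / len(utils):.2f}%")
    print(f"  Memory Usage: Max: {max(mems):.0f}MB, Avg: {sum(mems) / len(mems):.2f}MB")
    print(f"  Power Draw:   Max: {max(powers):.2f}W, Avg: {sum(powers) / len(powers):.2f}W")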

Observations

  • The sector times are somewhat skewed by failed sectors, which waste processing time (see the log excerpt below). Unfortunately, this issue has been ongoing and has been reported previously; it makes replicating ideal conditions difficult.
  • Gaps in GPU utilization were resolved by running more than four concurrent plotters.
2024-09-15T16:12:45.525621Z  WARN {farm_index=3}:{sector_index=37}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-09-15T16:12:45.540373Z  WARN {farm_index=3}:{sector_index=31}: subspace_farmer::cluster::nats_client: Received unexpected response stream index, aborting stream actual_index=16 expected_index=15 message_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress response_subject=stream-response.01J7V767AZR6MEJKYSTQAZFBXS
2024-09-15T16:12:45.540633Z  WARN {farm_index=3}:{sector_index=31}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-09-15T16:12:45.655387Z  INFO {farm_index=0}:{sector_index=19}: subspace_farmer::single_disk_farm::plotting: Plotting sector retry

Thoughts

  • The results were rerun with a concurrency of six because the initial numbers appeared inconsistent. This explains the presence of duplicate results for that run.
  • A concurrency of eight yielded the best results yesterday, with a sector time of 2.7 seconds. It appears that for these two GPUs, a concurrency of seven or eight is the optimal setting.

Results

========================================
Monitoring Results - 2024-09-15 10:23:11
Number of Plotter Instances: 1
Plotter Performance: Sectors: 35 | Avg/min: 7.00 | Avg time: 8.58 sec | TiB/day: 9.83

GPU 0:
  Utilization:  Max: 93%, Avg: 72.65%
  Memory Usage: Max: 1940MB, Avg: 1939.65MB
  Power Draw:   Max: 211.31W, Avg: 180.69W

GPU 1:
  Utilization:  Max: 91%, Avg: 69.41%
  Memory Usage: Max: 1940MB, Avg: 1939.51MB
  Power Draw:   Max: 233.84W, Avg: 189.48W
========================================

========================================
Monitoring Results - 2024-09-15 08:55:36
Number of Plotter Instances: 2
Plotter Performance: Sectors: 51 | Avg/min: 10.20 | Avg time: 5.88 sec | TiB/day: 14.12

GPU 0:
  Utilization:  Max: 100%, Avg: 57.55%
  Memory Usage: Max: 3098MB, Avg: 3096.27MB
  Power Draw:   Max: 202.28W, Avg: 119.89W

GPU 1:
  Utilization:  Max: 100%, Avg: 82.89%
  Memory Usage: Max: 3098MB, Avg: 3096.79MB
  Power Draw:   Max: 243.57W, Avg: 213.80W
========================================

========================================
Monitoring Results - 2024-09-15 09:03:35
Number of Plotter Instances: 3
Plotter Performance: Sectors: 83 | Avg/min: 16.60 | Avg time: 3.61 sec | TiB/day: 23.00

GPU 0:
  Utilization:  Max: 100%, Avg: 47.44%
  Memory Usage: Max: 4255MB, Avg: 4250.44MB
  Power Draw:   Max: 204.78W, Avg: 124.94W

GPU 1:
  Utilization:  Max: 100%, Avg: 51.44%
  Memory Usage: Max: 4255MB, Avg: 4250.48MB
  Power Draw:   Max: 241.26W, Avg: 159.30W
========================================

========================================
Monitoring Results - 2024-09-15 09:10:41
Number of Plotter Instances: 4
Plotter Performance: Sectors: 68 | Avg/min: 13.60 | Avg time: 4.41 sec | TiB/day: 18.83

GPU 0:
  Utilization:  Max: 100%, Avg: 70.15%
  Memory Usage: Max: 5412MB, Avg: 5065.62MB
  Power Draw:   Max: 208.46W, Avg: 152.19W

GPU 1:
  Utilization:  Max: 100%, Avg: 62.29%
  Memory Usage: Max: 5412MB, Avg: 5407.48MB
  Power Draw:   Max: 243.09W, Avg: 174.67W
========================================

========================================
Monitoring Results - 2024-09-15 09:18:39
Number of Plotter Instances: 5
Plotter Performance: Sectors: 69 | Avg/min: 13.80 | Avg time: 4.34 sec | TiB/day: 19.13

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 6570MB, Avg: 6536.79MB
  Power Draw:   Max: 190.84W, Avg: 187.16W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 6570MB, Avg: 6569.75MB
  Power Draw:   Max: 245.54W, Avg: 244.37W
========================================

========================================
Monitoring Results - 2024-09-15 09:28:23
Number of Plotter Instances: 6
Plotter Performance: Sectors: 46 | Avg/min: 9.20 | Avg time: 6.52 sec | TiB/day: 12.73

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 7727MB, Avg: 7725.86MB
  Power Draw:   Max: 194.98W, Avg: 189.99W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 7729MB, Avg: 7727.51MB
  Power Draw:   Max: 245.05W, Avg: 240.14W
========================================

========================================
Monitoring Results - 2024-09-15 09:39:44
Number of Plotter Instances: 6
Plotter Performance: Sectors: 69 | Avg/min: 13.80 | Avg time: 4.34 sec | TiB/day: 19.13

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 7729MB, Avg: 7729.00MB
  Power Draw:   Max: 190.69W, Avg: 188.27W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 7729MB, Avg: 7728.86MB
  Power Draw:   Max: 245.90W, Avg: 244.38W
========================================

========================================
Monitoring Results - 2024-09-15 09:47:11
Number of Plotter Instances: 7
Plotter Performance: Sectors: 102 | Avg/min: 20.40 | Avg time: 2.94 sec | TiB/day: 28.24

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 8886MB, Avg: 8885.75MB
  Power Draw:   Max: 195.37W, Avg: 193.15W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 8886MB, Avg: 8885.79MB
  Power Draw:   Max: 240.46W, Avg: 238.56W
========================================

========================================
Monitoring Results - 2024-09-15 09:56:44
Number of Plotter Instances: 8
Plotter Performance: Sectors: 86 | Avg/min: 17.20 | Avg time: 3.48 sec | TiB/day: 23.86

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 10044MB, Avg: 10043.86MB
  Power Draw:   Max: 198.86W, Avg: 196.50W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 10044MB, Avg: 10043.24MB
  Power Draw:   Max: 238.93W, Avg: 234.99W
========================================

results.log

Screencaps from nvtop

screencaps.zip

Both the average GPU utilization and the nvtop graphs indicate that the GPU simply doesn’t have a sector ready to encode, which might be caused by the default parameters being insufficient to retrieve sectors fast enough.

On Discord I recommended trying to increase --piece-getter-concurrency on the plotter. Since multiple plotters are able to download pieces fast enough, the bottleneck is likely latency rather than bandwidth or the controller’s total capacity to respond to requests.

It’s also worth looking at the sector downloading metrics; they should give you an idea of how quickly sectors are being received by the plotter.

Increasing the --piece-getter-concurrency from 64 to 128 had no noticeable effect on GPU utilization or speed for a single instance of the plotter component.

Interestingly, running the plotter with lower piece-getter concurrency yielded slightly better performance. The best results were achieved with a concurrency of 8 or 16, where the plotter peaked at 2.06 seconds per sector (145 sectors over 5 minutes) with 2 RTX 4090 GPUs (four plotter instances).

To rule out potential bottlenecks, I moved the entire cache to RAM, but this did not affect performance. I also tested running three complete piece caches in RAM along with three cache components and three controllers, which also did not improve single-instance concurrency results.

Additionally, experimenting with --service-instances 128 did not produce any noticeable impact.

These tests were conducted using two NATS servers in a cluster with 40G networking, and the plotter and farmers were connected via 100G and 40G networking, respectively.

These results are from one of the four plotter components. I can provide all four if that would be more helpful.

subspace_farmer_plotter_sector_downloading_time_seconds_sum{kind="gpu-cuda"} 11858.992920180997
subspace_farmer_plotter_sector_downloading_time_seconds_count{kind="gpu-cuda"} 413
subspace_farmer_plotter_sector_encoding_time_seconds_sum{kind="gpu-cuda"} 6553.638764383001
subspace_farmer_plotter_sector_encoding_time_seconds_count{kind="gpu-cuda"} 396
subspace_farmer_plotter_sector_plotting_time_seconds_sum{kind="gpu-cuda"} 46334.27845267801
subspace_farmer_plotter_sector_plotting_time_seconds_count{kind="gpu-cuda"} 396
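
Since those counters come in sum/count pairs, the per-sector averages fall out directly. A quick sketch of the arithmetic, using the values above:

# Per-sector averages derived from the sum/count counter pairs above.
metrics = {
    "downloading": (11858.992920180997, 413),
    "encoding": (6553.638764383001, 396),
    "plotting": (46334.27845267801, 396),
}
for stage, (total_seconds, sectors) in metrics.items():
    print(f"avg {stage} time: {total_seconds / sectors:.1f} s/sector")
# avg downloading time: 28.7 s/sector
# avg encoding time: 16.5 s/sector
# avg plotting time: 117.0 s/sector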

The default is 32 for the plotter and 128 for the controller; I’m not sure which one you actually increased, since neither defaults to 64.

This option is present on the cache, controller, and farmer; which one did you change? Overall it defaults to 32 and should be sufficient in most cases.

11859 / 413 = 28.7 seconds to download a sector. This is the bottleneck, not the actual compute. If we can fix the downloading speed, multiple instances will not be necessary.

I increased the plotter setting from the default of 32 to 64, and then to 128. Would it be more effective if both the controller and the plotter were configured with the same values? My thinking was that a single plotter is serving multiple farmers.

I only changed this value on the controller.

Since I’m running four plotters, the effective download rate is likely about four times that (roughly 28.7 / 4 ≈ 7.2 seconds per sector in aggregate). However, that is still slower than the peak plotting speeds, so it makes sense that this is the bottleneck.

I’ll continue investigating and keep an eye out for anything unusual or anything that may help. If I can provide any other results to help with troubleshooting, let me know.

This property has a slightly different meaning for the plotter and the controller. For the controller, it means data-downloading concurrency when the controller needs it, which is mostly during piece cache sync and when responding to incoming requests.

The plotter works differently: it first makes a small request to the controller asking whether the piece it wants is cached, and a positive result contains information about the cache instance; the plotter then sends a second request directly to that cache instance to actually pull the piece from it.

Assuming the piece cache is large enough to store everything, the two do not overlap directly, and increasing the plotter’s parameter should be sufficient.
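
Purely to illustrate that two-step flow, here is a hypothetical sketch of what the plotter does per piece. The NATS subjects, payloads, and helper names below are invented for illustration and are not the actual subspace-farmer API:

# Hypothetical illustration of the two-step piece retrieval described above.
# Subject names and payload formats are invented; they are NOT the actual
# subspace-farmer NATS protocol.
import asyncio
from typing import Optional

import nats  # nats-py client, used purely for illustration


async def fetch_piece(piece_index: int) -> Optional[bytes]:
    nc = await nats.connect("nats://127.0.0.1:4222")
    try:
        # Step 1: small request to the controller: "which cache instance
        # (if any) holds this piece?"
        reply = await nc.request(
            "example.controller.find_piece",  # hypothetical subject
            str(piece_index).encode(),
            timeout=1.0,
        )
        cache_instance = reply.data.decode()
        if not cache_instance:
            return None  # piece is not cached

        # Step 2: pull the piece directly from that cache instance, bypassing
        # the controller. Increasing the plotter's piece-getter concurrency
        # means more of these round trips in flight at once, which is why
        # latency matters more than raw bandwidth.
        piece = await nc.request(
            f"example.cache.{cache_instance}.get_piece",  # hypothetical subject
            str(piece_index).encode(),
            timeout=5.0,
        )
        return piece.data
    finally:
        await nc.close()


if __name__ == "__main__":
    asyncio.run(fetch_piece(42))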

This may help answer more requests, but in practice it is unlikely to make a difference given how large the default value already is.

I think looking more closely at the cache instance might be the next logical step, especially per-core CPU usage, to spot any bottlenecks there. The fact that multiple plotters are capable of pulling more data suggests it should not be the bottleneck, but something in the chain is, and we just need to find it.

The way pieces are downloaded, both from the cache and from the DSN, has changed in Taurus. I’m wondering whether you still see issues as severe as before, or whether GPU utilization when running a single instance has improved at all (it should be close to a non-cluster setup now).

The info in this post is very helpful, and with the Nov 28 release I am seeing similar results. It can be challenging to make the best use of GPUs across multiple hosts when each configuration differs in some ways. I just wanted to provide some feedback on what I have seen so far.

Hardware/Config:
Main Host: RomeD8-2T board, EPYC 7773X CPU (64 cores / 128 threads), 256GB memory, host OS on an onboard NVMe 4 drive, 1600W PSU, 20Gb networking via (2) 10G links with link aggregation on the switch. For testing, (2) 15.36TB U.2 drives were used, each connected to an onboard OCuLink connector. The only other drive in use was a 4TB M.2 NVMe that was already plotted and farming. OS: Ubuntu 24.04 Server.

(2) RTX 4090 GPUs (in a 4U rack; a tight fit, but it works)

Containers: NATS, Node, Cluster Controller, Cache, Farmer, Plotter

At one time I had one 4090 in another host, but I moved it to the main server to try to eliminate the networking variable.

Test #1: Single GPU, Single Plotter, default config. It will fully utilize a single GPU with a single container instance.

Test #2: Dual GPU, Single Plotter, default config. I cannot get solid usage on both GPUs; both show waves of activity, and the first GPU no longer sees the solid use it had in Test #1.

Test #3: Dual GPU, two plotter containers dedicated to each GPU (four plotters running on the same host). Performance is better, with increased usage on both GPUs, but they are still not utilized as fully as they could be. I had --cuda-sector-downloading-concurrency 12 set on the plotters, which did not seem to help, as everything is running on a single host.

Test #4: Same as the test above, with one difference: on the farmer container I set --max-plotting-sectors-per-farm 24. That change uses an additional 40GB of memory, which I am fine with if I get the performance out of it. This is the only way I can get 90-100% usage from both GPUs simultaneously. There are still drops in usage, but each only lasts a few seconds and they are not as frequent as without this setting in place.

Sorry, I don’t have a Grafana screenshot from the runs above where the 15.36TB drives were plotted.

Multi-Host Plotting: Same Dual GPU host as above.
Host 2: Lenovo P620 Threadripper Pro workstation, 128GB memory, 10Gb network, Ubuntu 24.04 Server, RTX 3060, (2) PCIe NVMe cards with (4) 4TB NVMe drives each.
Containers: Farmer, Plotter, Cache. I am just using the cache container in its standard config, no cache group. These connect to the Node, NATS, and Controller on Host 1.

In this test I am plotting (4) 4TB NVMe drives located in Host 2 at the same time. I do not see any difference in usage on the 4090s. At most I am using 5Gbps of network bandwidth. The 3060 in Host 2 has no issue and is at 100% usage at all times. For the time being, I am satisfied with the performance I am getting by using the additional memory and with my current plotting setup.

I have not found anything preventing me from deploying a working cluster solution, but I wanted to provide feedback on my specific setup. If there are any config suggestions, please let me know and I can test them. If you want logs, I can collect them if I know what/where, etc. I know the data I am providing is not nearly as comprehensive as what vexr has posted :wink:

I think an amazing job was done with the architecture of the platform in general, so props to you guys.

The key thing that would help a lot is metrics endpoint data as text, rather than screenshots, from both the plotter and the farmer, to see what they both think is going on here.
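
For example, something along these lines could dump each component’s metrics endpoint to a text file (the addresses below are assumptions; substitute whatever each component’s metrics/Prometheus listen option is actually set to):

# Save the plotter and farmer metrics pages as plain text files.
# The endpoints below are assumptions; use your actual metrics addresses.
from urllib.request import urlopen

endpoints = {
    "plotter": "http://127.0.0.1:9081/metrics",  # assumed address
    "farmer": "http://127.0.0.1:9082/metrics",   # assumed address
}
for name, url in endpoints.items():
    with urlopen(url, timeout=5) as resp:
        text = resp.read().decode()
    with open(f"{name}-metrics.txt", "w") as f:
        f.write(text)
    print(f"saved {name} metrics ({len(text.splitlines())} lines)")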

I’ll try to reproduce this myself, but it is always nice to collect information from the system where this behavior is already happening.

I have a theory about why this might be the case, and Snapshot build · autonomys/subspace@8689b56 · GitHub should help with that.
It would be great if either of you could give it a try.