Comparison of Single vs. Multiple GPU Instances

As requested, I have completed the comparison of performance metrics between single and multiple instances. The analysis covers plotting time, GPU power usage, GPU utilization (%), and GPU memory usage. I’ve also included nvtop graphs to highlight any significant differences. I hope this helps in understanding how to improve single-instance performance so that it better matches that of multiple instances.

Performance Benchmarks

These performance benchmarks were conducted using two RTX 4090 GPUs with the following approach:

  • Snapshot build #383
  • A total of nine drives were used so that the drives would not become a bottleneck.
  • After adding each plotter to the cluster, I allowed at least one minute of plotting to reduce the effect of the initial ramp-up period.
  • GPU statistics were collected from each GPU and averaged over a 300-second window, with data captured every five seconds (see the sampling sketch after this list).
  • Sector times were calculated from the sectors successfully plotted during the same 300-second window, aligning with the GPU statistics.
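
The exact collection script is not included here, but a minimal sketch of the sampling approach described above (5-second samples over a 300-second window, per GPU, via nvidia-smi) could look like the following; the structure and output formatting are illustrative only:

# Illustrative sketch of the GPU sampling approach: query each GPU every five
# seconds for 300 seconds via nvidia-smi, then report max/avg per GPU.
import subprocess, time
from statistics import mean

SAMPLE_INTERVAL = 5      # seconds between samples
WINDOW = 300             # total sampling window in seconds

samples = {}             # GPU index -> list of (util %, mem MiB, power W)
for _ in range(WINDOW // SAMPLE_INTERVAL):
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    for line in out.strip().splitlines():
        idx, util, mem, power = [v.strip() for v in line.split(",")]
        samples.setdefault(idx, []).append((float(util), float(mem), float(power)))
    time.sleep(SAMPLE_INTERVAL)

for idx, rows in sorted(samples.items()):
    utils, mems, powers = zip(*rows)
    print(f"GPU {idx}:")
    print(f"  Utilization:  Max: {max(utils):.0f}%, Avg: {mean(utils):.2f}%")
    print(f"  Memory Usage: Max: {max(mems):.0f}MB, Avg: {mean(mems):.2f}MB")
    print(f"  Power Draw:   Max: {max(powers):.2f}W, Avg: {mean(powers):.2f}W")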

Observations

  • The sector times are somewhat skewed by failed sectors, which waste processing time. Unfortunately, this issue has been ongoing and has been reported in the past; it makes replicating ideal conditions difficult.
  • Gaps in GPU utilization were resolved by running more than four concurrent plotters.
2024-09-15T16:12:45.525621Z  WARN {farm_index=3}:{sector_index=37}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-09-15T16:12:45.540373Z  WARN {farm_index=3}:{sector_index=31}: subspace_farmer::cluster::nats_client: Received unexpected response stream index, aborting stream actual_index=16 expected_index=15 message_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress response_subject=stream-response.01J7V767AZR6MEJKYSTQAZFBXS
2024-09-15T16:12:45.540633Z  WARN {farm_index=3}:{sector_index=31}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Plotting progress stream ended before plotting finished
2024-09-15T16:12:45.655387Z  INFO {farm_index=0}:{sector_index=19}: subspace_farmer::single_disk_farm::plotting: Plotting sector retry

Thoughts

  • The results were rerun with a concurrency of six because the initial numbers appeared inconsistent. This explains the presence of duplicate results for that run.
  • A concurrency of eight yielded the best results yesterday, with a sector time of 2.7 seconds. It appears that for these two GPUs, a concurrency of seven or eight is the optimal setting.

Results

========================================
Monitoring Results - 2024-09-15 10:23:11
Number of Plotter Instances: 1
Plotter Performance: Sectors: 35 | Avg/min: 7.00 | Avg time: 8.58 sec | TiB/day: 9.83

GPU 0:
  Utilization:  Max: 93%, Avg: 72.65%
  Memory Usage: Max: 1940MB, Avg: 1939.65MB
  Power Draw:   Max: 211.31W, Avg: 180.69W

GPU 1:
  Utilization:  Max: 91%, Avg: 69.41%
  Memory Usage: Max: 1940MB, Avg: 1939.51MB
  Power Draw:   Max: 233.84W, Avg: 189.48W
========================================

========================================
Monitoring Results - 2024-09-15 08:55:36
Number of Plotter Instances: 2
Plotter Performance: Sectors: 51 | Avg/min: 10.20 | Avg time: 5.88 sec | TiB/day: 14.12

GPU 0:
  Utilization:  Max: 100%, Avg: 57.55%
  Memory Usage: Max: 3098MB, Avg: 3096.27MB
  Power Draw:   Max: 202.28W, Avg: 119.89W

GPU 1:
  Utilization:  Max: 100%, Avg: 82.89%
  Memory Usage: Max: 3098MB, Avg: 3096.79MB
  Power Draw:   Max: 243.57W, Avg: 213.80W
========================================

========================================
Monitoring Results - 2024-09-15 09:03:35
Number of Plotter Instances: 3
Plotter Performance: Sectors: 83 | Avg/min: 16.60 | Avg time: 3.61 sec | TiB/day: 23.00

GPU 0:
  Utilization:  Max: 100%, Avg: 47.44%
  Memory Usage: Max: 4255MB, Avg: 4250.44MB
  Power Draw:   Max: 204.78W, Avg: 124.94W

GPU 1:
  Utilization:  Max: 100%, Avg: 51.44%
  Memory Usage: Max: 4255MB, Avg: 4250.48MB
  Power Draw:   Max: 241.26W, Avg: 159.30W
========================================

========================================
Monitoring Results - 2024-09-15 09:10:41
Number of Plotter Instances: 4
Plotter Performance: Sectors: 68 | Avg/min: 13.60 | Avg time: 4.41 sec | TiB/day: 18.83

GPU 0:
  Utilization:  Max: 100%, Avg: 70.15%
  Memory Usage: Max: 5412MB, Avg: 5065.62MB
  Power Draw:   Max: 208.46W, Avg: 152.19W

GPU 1:
  Utilization:  Max: 100%, Avg: 62.29%
  Memory Usage: Max: 5412MB, Avg: 5407.48MB
  Power Draw:   Max: 243.09W, Avg: 174.67W
========================================

========================================
Monitoring Results - 2024-09-15 09:18:39
Number of Plotter Instances: 5
Plotter Performance: Sectors: 69 | Avg/min: 13.80 | Avg time: 4.34 sec | TiB/day: 19.13

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 6570MB, Avg: 6536.79MB
  Power Draw:   Max: 190.84W, Avg: 187.16W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 6570MB, Avg: 6569.75MB
  Power Draw:   Max: 245.54W, Avg: 244.37W
========================================

========================================
Monitoring Results - 2024-09-15 09:28:23
Number of Plotter Instances: 6
Plotter Performance: Sectors: 46 | Avg/min: 9.20 | Avg time: 6.52 sec | TiB/day: 12.73

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 7727MB, Avg: 7725.86MB
  Power Draw:   Max: 194.98W, Avg: 189.99W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 7729MB, Avg: 7727.51MB
  Power Draw:   Max: 245.05W, Avg: 240.14W
========================================

========================================
Monitoring Results - 2024-09-15 09:39:44
Number of Plotter Instances: 6
Plotter Performance: Sectors: 69 | Avg/min: 13.80 | Avg time: 4.34 sec | TiB/day: 19.13

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 7729MB, Avg: 7729.00MB
  Power Draw:   Max: 190.69W, Avg: 188.27W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 7729MB, Avg: 7728.86MB
  Power Draw:   Max: 245.90W, Avg: 244.38W
========================================

========================================
Monitoring Results - 2024-09-15 09:47:11
Number of Plotter Instances: 7
Plotter Performance: Sectors: 102 | Avg/min: 20.40 | Avg time: 2.94 sec | TiB/day: 28.24

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 8886MB, Avg: 8885.75MB
  Power Draw:   Max: 195.37W, Avg: 193.15W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 8886MB, Avg: 8885.79MB
  Power Draw:   Max: 240.46W, Avg: 238.56W
========================================

========================================
Monitoring Results - 2024-09-15 09:56:44
Number of Plotter Instances: 8
Plotter Performance: Sectors: 86 | Avg/min: 17.20 | Avg time: 3.48 sec | TiB/day: 23.86

GPU 0:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 10044MB, Avg: 10043.86MB
  Power Draw:   Max: 198.86W, Avg: 196.50W

GPU 1:
  Utilization:  Max: 100%, Avg: 100.00%
  Memory Usage: Max: 10044MB, Avg: 10043.24MB
  Power Draw:   Max: 238.93W, Avg: 234.99W
========================================


results.log

Screencaps from nvtop

screencaps.zip


Both the average GPU utilization and the nvtop graphs indicate that the GPU simply doesn’t have a sector ready for it to encode, which might be caused by the default parameters being insufficient to retrieve sectors fast enough.

I recommended on Discord trying to increase --piece-getter-concurrency on the plotter. Since multiple plotters are able to download pieces fast enough, the bottleneck is likely latency rather than bandwidth or the controller’s total capacity to respond to requests.
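
To make that intuition concrete, here is a toy model of why more in-flight requests can hide per-request latency (illustrative only, with an assumed 50 ms round trip per piece; this is not the farmer’s actual code):

# Toy model: if each piece request spends most of its time on a fixed round
# trip, sustained throughput scales roughly with the number of requests kept
# in flight (throughput ~= concurrency / latency), until bandwidth or the
# serving side becomes the limit.
def pieces_per_second(concurrency: int, latency_s: float) -> float:
    return concurrency / latency_s

for c in (8, 32, 64, 128):
    # 50 ms assumed round trip per piece request, purely for illustration
    print(f"concurrency={c:3d} -> ~{pieces_per_second(c, 0.05):.0f} pieces/s")

In practice the reachable rate also depends on the rest of the chain (controller, cache, disks), so this is only an idealized upper bound.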


It is also worth looking at the sector downloading metrics; they should give you an idea of how fast sectors are being received by the plotter.

Increasing the --piece-getter-concurrency from 64 to 128 had no noticeable effect on GPU utilization or speed for a single instance of the plotter component.

Interestingly, running the plotter with a lower piece-getter concurrency yielded slightly better performance. The best results were achieved with a concurrency of 8 or 16, where the plotter peaked at an average of roughly 2.06 seconds per sector (145 sectors over 5 minutes) with two RTX 4090 GPUs and four plotter instances.

To rule out potential bottlenecks, I moved the entire cache to RAM, but this did not affect performance. I also tested running three complete piece caches in RAM along with three cache components and three controllers, which also did not improve the single-instance results.

Additionally, experimenting with --service-instances 128 did not produce any noticeable impact.

These tests were conducted using two NATS servers in a cluster with 40G networking, and the plotter and farmers were connected via 100G and 40G networking, respectively.

These results are from one of the four plotter components. I can provide all four if that would be more helpful.

subspace_farmer_plotter_sector_downloading_time_seconds_sum{kind="gpu-cuda"} 11858.992920180997
subspace_farmer_plotter_sector_downloading_time_seconds_count{kind="gpu-cuda"} 413
subspace_farmer_plotter_sector_encoding_time_seconds_sum{kind="gpu-cuda"} 6553.638764383001
subspace_farmer_plotter_sector_encoding_time_seconds_count{kind="gpu-cuda"} 396
subspace_farmer_plotter_sector_plotting_time_seconds_sum{kind="gpu-cuda"} 46334.27845267801
subspace_farmer_plotter_sector_plotting_time_seconds_count{kind="gpu-cuda"} 396
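
For reference, the average per-sector times can be read straight off these counters by dividing each *_sum by the matching *_count. A small illustrative script (assuming the metric text above is saved to plotter_metrics.txt, a hypothetical filename) could do it like this:

# Illustrative helper: parse the Prometheus-style counters above and print
# average per-sector times as sum / count for each stage.
import re

metrics = {}
with open("plotter_metrics.txt") as f:          # assumed filename
    for line in f:
        m = re.match(r'(\w+)\{kind="gpu-cuda"\} ([\d.]+)', line)
        if m:
            metrics[m.group(1)] = float(m.group(2))

for stage in ("downloading", "encoding", "plotting"):
    total = metrics[f"subspace_farmer_plotter_sector_{stage}_time_seconds_sum"]
    count = metrics[f"subspace_farmer_plotter_sector_{stage}_time_seconds_count"]
    print(f"{stage:12s}: {total / count:.1f} s per sector on average")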

The default is 32 for the plotter and 128 for the controller; I’m not sure which one you actually increased, since neither defaults to 64.

This option is present on the cache, controller, and farmer; which one did you change? It defaults to 32 and should be sufficient in most cases.

11859 / 413 ≈ 28.7 seconds to download a sector. This is the bottleneck, not the actual compute. If we can fix the downloading speed, multiple instances will not be necessary.

I increased the plotter setting from the default of 32 to 64, and then to 128. Would it be more effective if both the controller and plotter were configured with the same values? My reasoning was that I’m running a single plotter serving multiple farmers.

I only changed this value on the controller.

Since I’m running four plotters, the aggregate download rate is likely about four times that, i.e. roughly 28.7 s ÷ 4 ≈ 7.2 s per sector across the cluster. However, that is still slower than the peak plotting speeds, so it makes sense that this is the bottleneck.

I’ll continue investigating and keep an eye out for anything unusual or anything that may help. If I can provide any other results to help with troubleshooting, let me know.

This property has a slightly different meaning for the plotter and the controller. For the controller it means data-downloading concurrency when the controller itself needs data, which is mostly during piece cache sync and when responding to incoming requests.

The plotter works differently: it first makes a small request to the controller asking whether the piece it wants is cached; a positive result contains information about the cache instance that holds it. The plotter then sends a second request directly to that cache instance to actually pull the piece from it.

Assuming the piece cache is large enough to store everything, the two do not overlap directly, and increasing the plotter’s parameter should be sufficient.
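
Roughly, that lookup-then-fetch flow per piece looks like the sketch below. This is illustrative pseudocode only; the function and field names are invented and do not correspond to the actual subspace-farmer APIs:

# Illustrative pseudocode of the two-step retrieval described above; the names
# here are made up and do not match the real crate APIs.
async def fetch_piece(piece_index, controller, caches):
    # Step 1: small request to the controller - "who has this piece cached?"
    location = await controller.find_cached_piece(piece_index)
    if location is None:
        return None                      # piece is not cached anywhere
    # Step 2: pull the piece directly from the cache instance named in the reply
    cache = caches[location.cache_id]
    return await cache.read_piece(location.offset)

Each piece therefore involves two network round trips, which is the kind of per-piece latency that the plotter-side concurrency is meant to hide.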

This may help to answer more requests, but in practice it is unlikely to make a difference given how large the default value already is.

I think looking closer at the cache instance might be the next logical step, especially per-core CPU usage, to spot any bottlenecks there. That said, the fact that multiple plotters are capable of pulling more data suggests the cache should not be the bottleneck. But something in the chain is, and we just need to find it.