Slow Plotting with GPU

Issue Report

Environment

SERVER 1
Components: Node (on 1TB SSD), Plotter, Cache (110GiB on 1TB SSD), Controller, NATS (all Docker)
Hardware: 10G NIC connected to 10G switch. RTX 3070. i7 12700KF
OS: Ubuntu 24.04 server
GPU Drivers: 560

SERVER 2
Components: Plotter, Controller, Cache (group1) (all Docker)
Hardware: 10G NIC connected to 10G switch. 2 x RTX 4090. Threadripper 3975WX
OS: Ubuntu 24.04 server
GPU Drivers: 560

SERVER 3
Components: Farmer (Docker)
Hardware: 2.5G NIC connected to 10G switch @ 2.5G. ARM CPU. 4 x 4TB NVMe drives
OS: Debian GNU/Linux 12 (bookworm)

Problem

I am seeing very slow plotting with the RTX 3070. I am unsure whether this is related to the network or to my setup; until recently I had seen very fast plot times with the GPU. There are no errors on the Plotter or Farmer. The network is not saturated, though I do see periodic spikes when a plot is in progress. Resources on both servers remain low (less than 20% CPU and RAM utilization).

I am happy to attach any relevant logs or screenshots. Here is a snippet of my plotting logs showing the times:

2024-10-09T14:00:42.862646Z  INFO {farm_index=3}:{sector_index=10}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.28% complete)
2024-10-09T14:00:13.124579Z  INFO {farm_index=1}:{sector_index=9}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.25% complete)
2024-10-09T14:00:12.171474Z  INFO {farm_index=1}:{sector_index=8}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.22% complete)
2024-10-09T14:00:11.912956Z  INFO {farm_index=1}:{sector_index=7}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.19% complete)
2024-10-09T14:00:11.683526Z  INFO {farm_index=1}:{sector_index=6}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.17% complete)
2024-10-09T14:00:11.396150Z  INFO {farm_index=1}:{sector_index=5}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.14% complete)
2024-10-09T13:57:24.187192Z  INFO {farm_index=0}:{sector_index=11}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.31% complete)
2024-10-09T13:57:06.113683Z  INFO {farm_index=0}:{sector_index=10}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.28% complete)
2024-10-09T13:54:50.512231Z  INFO {farm_index=0}:{sector_index=9}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.25% complete)
2024-10-09T13:54:48.515731Z  INFO {farm_index=0}:{sector_index=8}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.22% complete)
2024-10-09T13:54:41.028735Z  INFO {farm_index=0}:{sector_index=7}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.19% complete)
2024-10-09T13:54:40.760937Z  INFO {farm_index=0}:{sector_index=6}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.17% complete)
2024-10-09T13:54:40.758386Z  INFO {farm_index=0}:{sector_index=5}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.14% complete)
2024-10-09T13:52:22.453069Z  INFO {farm_index=0}:{sector_index=4}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.11% complete)
2024-10-09T13:52:07.213289Z  INFO {farm_index=0}:{sector_index=3}: subspace_farmer::single_disk_farm::plotting: Plotting sector (0.08% complete)

I have the same issue; in the cluster version, GPU utilization is unstable and low.

The first step would be to pull metrics on the plotter and see what it is spending the most time doing.

I will pull Prometheus metrics soon; I do not have time right now. I do have node_exporter up, but I assume you want the other (application-level) metrics.

Plotter metrics (screenshots at 1-hour and 12-hour resolution)

Farmer metrics (screenshots at 1-hour and 12-hour resolution)

Farmer disk metrics (screenshot)

Application metrics contain a breakdown of what the plotter is spending time on.

For example, it might be slow to download pieces, slow to encode, or slow to send the sector back. Similarly, you can pull plotting metrics from the farmer’s point of view, which is a different perspective and can also provide additional insights.
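
As a rough sketch of what that looks like in practice (Python; the address is an assumption — substitute whatever host/port you exposed the plotter's Prometheus metrics on, and the substrings are just a filter, not official metric names):

import urllib.request

# Assumed address: replace with the host/port where the plotter exposes
# Prometheus metrics (the same idea applies to the farmer's /metrics endpoint).
METRICS_URL = "http://192.168.1.10:9081/metrics"

with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
    text = resp.read().decode("utf-8")

# Keep only lines that look related to plotting time: piece downloads,
# encoding, and writing/sending sectors. This is a plain substring filter.
for line in text.splitlines():
    if any(key in line for key in ("plot", "sector", "download", "encod")):
        print(line)

Comparing the output of two scrapes a few minutes apart should show which stage is accumulating the most time.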

The long download time appears to be the issue; can you advise what might cause this?

This is what some other folks were observing as well; see, for example, Comparison of Single vs. Multiple GPU Instances.

The workaround there was to run multiple instances to absorb these delays. I have not yet invested much time into figuring out what is going on there, but one day I will.

Am I understanding correctly that the Cache located on the Plotter is supposed to prevent these sectors from needing to be downloaded? If that is the case, it seems like the Cache is not being used at all.

Is it worth trying without a cluster to see if that is the problem? If I can only plot a sector every 2.5 minutes while others are getting 3-second sectors, that would be greatly disappointing.

That would be a reasonable assumption; you can check both controller and cache metrics to see what is going on with the cache.

The controller should print logs when it sees and loses a cache instance.
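
A quick way to sanity-check that (again Python, with placeholder addresses — point them at whatever you configured for the controller and cache metrics; the substrings are guesses, not official metric names):

import urllib.request

# Placeholder addresses for the controller and cache /metrics endpoints;
# replace with the ones actually exposed in your cluster setup.
ENDPOINTS = {
    "controller": "http://192.168.1.10:9082/metrics",
    "cache": "http://192.168.1.10:9083/metrics",
}

for name, url in ENDPOINTS.items():
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8")
    print(f"=== {name} ===")
    # Substring filter only; anything suggesting cache hits, misses, or piece
    # retrievals helps show whether the cache is actually being used.
    for line in text.splitlines():
        lower = line.lower()
        if "cache" in lower and any(k in lower for k in ("hit", "miss", "piece")):
            print(line)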

The existing Grafana dashboard does not work for the cluster, so I am rebuilding it. Here is the plotting and cache info for SERVER 1:

There are no logs for the controller or cache that would indicate any misses or issues.

I meant simply the text output of the /metrics endpoint, not screenshots.
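
In other words, something as simple as the following (Python; the URL is a placeholder for whichever component's metrics address you configured) saves the raw exposition text so it can be attached to the thread:

import urllib.request

# Placeholder: point at the /metrics endpoint of the component to share
# (plotter, farmer, controller, or cache).
METRICS_URL = "http://192.168.1.10:9081/metrics"

with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
    data = resp.read()

# Write the plain-text Prometheus output to a file suitable for attaching.
with open("plotter_metrics.txt", "wb") as f:
    f.write(data)
print(f"Saved {len(data)} bytes to plotter_metrics.txt")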

I think this should be addressed in Release taurus-2024-oct-24 · autonomys/subspace · GitHub; it would be nice to get some feedback.

Beautiful dashboards, @repost!
Will you please share them?
I would be VERY much obliged :raised_hands: :pray: :metal: