Potential table creation rules change for the farmer

I have been experimenting with various things (again) and discovered a change that improves performance for CPU-based table creation by ~10%, possibly even more (all other things being equal).

The observation is that Y values are sometimes actually duplicated, which forces the implementation to track the count of those duplicates and then loop over them during matching. It would be much more efficient to have a simple yes/no flag instead.
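To make the shape of the change concrete, here is a minimal sketch of the two representations (type and field names are hypothetical, not the actual subspace code):

```rust
/// Hypothetical current shape: a full duplicate count per Y, which the
/// matching code has to loop over.
struct CountedY {
    y: u32,
    duplicate_count: u8,
}

/// Hypothetical proposed shape: a single flag, so matching becomes a branch
/// instead of a loop (counts of 2+ are simply not represented).
struct FlaggedY {
    y: u32,
    has_duplicate: bool,
}
```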

From a consensus point of view this makes no difference, but for the farmer it means a breaking change to the on-disk farm format when decoding data. So if such a change is introduced, both versions should be supported, at least temporarily.
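A minimal sketch of tolerating both encodings during the transition (the version tag and byte layout are assumptions, not the real farm format):

```rust
/// Hypothetical version tag for the on-disk farm format; the real format and
/// its versioning mechanism may look entirely different.
enum FarmFormatVersion {
    CountBased, // old: per-entry duplicate count
    FlagBased,  // new: single has-duplicate bit per entry
}

/// Decode an entry's duplicate metadata byte under either version.
fn decode_has_duplicate(version: FarmFormatVersion, meta: u8) -> bool {
    match version {
        // The old format stores a count; any non-zero count collapses to
        // "has a duplicate" under the new rules.
        FarmFormatVersion::CountBased => meta > 0,
        // The new format stores the flag directly.
        FarmFormatVersion::FlagBased => meta == 1,
    }
}
```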

This should be especially helpful on GPUs, where loops like the duplicate-count one above are very costly.

For context, here is what the distribution of counts looks roughly like:

Count   Occurrences   Occurrences in %
Any      11,313,485   100%
0        11,138,124   98.44%
1           173,970   1.54%
2             1,375   0.01%
3                16   0.0001%

Essentially, each successive count is about two orders of magnitude less likely. I have not seen counts of 7+ in my testing, but I assume they can happen with some small probability too.

Only handling 0 and 1 means tables will be a little bit smaller (we'll ignore ~1.5% of potentially valid entries in every table). But since we already have parity backup on the farmer, and not all of those entries would have matches anyway, the probability of the farmer not having encoded pieces (having chunks that can't participate in rewards) remains 0%, just as before.

Any thoughts about this?

With local optimizations I was able to get these results with the current implementation:

chia/table/parallel/8x  time:   [687.73 ms 701.49 ms 716.28 ms]
                        thrpt:  [11.169  elem/s 11.404  elem/s 11.632  elem/s]

And by only tracking a single duplicate entry, as described above:

chia/table/parallel/8x  time:   [655.95 ms 664.92 ms 674.66 ms]
                        thrpt:  [11.858  elem/s 12.031  elem/s 12.196  elem/s]

This was done on an AMD Threadripper 7970X CPU, where I restricted the benchmark to 8C16T on a single CCX, so it should be roughly in the same ballpark as an 8C16T Ryzen 7700X CPU.


UPD: With more improvements it went down even further:

chia/table/parallel/8x  time:   [612.71 ms 618.05 ms 624.05 ms]
                        thrpt:  [12.820  elem/s 12.944  elem/s 13.057  elem/s]

Exploring further optimizations: since we don't technically need any particular Y in the table, and each bucket is small during matching, SIMD binary search might work really well without rmap creation.
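As a rough illustration of the direction (a branchless linear scan rather than a true SIMD binary search, and not the actual implementation), membership in a small sorted bucket can be tested without building an rmap; a loop like this auto-vectorizes well and maps naturally to GPU execution:

```rust
/// Sketch only: branchless membership test for `target` in a small, sorted
/// bucket of Y values. Counting the elements below the target is a
/// data-parallel reduction (SIMD-friendly), and the final equality check
/// replaces the rmap lookup.
fn bucket_contains(bucket: &[u32], target: u32) -> bool {
    // For a sorted bucket this count is exactly the lower-bound index.
    let lower_bound = bucket.iter().filter(|&&y| y < target).count();
    bucket.get(lower_bound).copied() == Some(target)
}
```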

This is especially applicable to GPUs, where such things should be even faster. I have a very good feeling about the performance potential here.


I shared my findings, with massive improvements over the numbers above, in separate articles.

For context, current main of https://github.com/autonomys/subspace:

chia/table/single       time:   [1.0668 s 1.0756 s 1.0862 s]
chia/table/parallel/8x  time:   [905.01 ms 919.11 ms 933.34 ms]

Possible compatible backport:

chia/table/parallel/8x  time:   [677.60 ms 684.25 ms 692.06 ms]
                        thrpt:  [11.560  elem/s 11.692  elem/s 11.806  elem/s]

Current state as of the articles above, with breaking changes:

chia/table/single/1x    time:   [747.82 ms 756.94 ms 768.09 ms]
                        thrpt:  [1.3019  elem/s 1.3211  elem/s 1.3372  elem/s]
chia/table/parallel/8x  time:   [529.15 ms 534.42 ms 540.15 ms]
                        thrpt:  [14.811  elem/s 14.969  elem/s 15.119  elem/s]

This is all on a single CCX of an AMD Threadripper 7970X CPU (roughly an 8C16T AMD Ryzen 7700X CPU).


Two more updates with further changes:

Not fully optimized, but the last update of the old API looked like this:

chia/table/single/1x    time:   [638.21 ms 642.97 ms 649.77 ms]
                        thrpt:  [1.5390  elem/s 1.5553  elem/s 1.5669  elem/s]
chia/table/parallel/8x  time:   [426.94 ms 431.49 ms 436.70 ms]
                        thrpt:  [18.319  elem/s 18.540  elem/s 18.738  elem/s]

And then switching to a different (not yet fully optimized) API that produces just the proofs the protocol needs led to further performance improvements (note that this benchmark includes proof generation, not just tables):

Before:
chia/proofs/single/1x   time:   [728.10 ms 735.58 ms 744.74 ms]
                        thrpt:  [1.3427  elem/s 1.3595  elem/s 1.3734  elem/s]
chia/proofs/parallel/8x time:   [600.14 ms 604.93 ms 609.18 ms]
                        thrpt:  [13.132  elem/s 13.225  elem/s 13.330  elem/s]
After:
chia/proofs/single/1x   time:   [710.02 ms 713.94 ms 718.19 ms]
                        thrpt:  [1.3924  elem/s 1.4007  elem/s 1.4084  elem/s]
chia/proofs/parallel/8x time:   [567.42 ms 574.66 ms 581.51 ms]
                        thrpt:  [13.757  elem/s 13.921  elem/s 14.099  elem/s]

The new code is not yet fully parallelized when collecting proofs, and it does not yet leverage SIMD for Blake3 in several places; addressing both should bring further performance improvements, probably down to the previous cost of table creation alone, but now for the whole pipeline.
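The Blake3 part is mostly a matter of feeding work to the right APIs. As a sketch (illustrative types, assuming the blake3 and rayon crates, not the actual subspace code), hashing many independent chunks parallelizes trivially, while each individual blake3::hash call already uses the crate's SIMD kernels internally:

```rust
use rayon::prelude::*;

/// Sketch only: hash many independent chunks in parallel. rayon provides the
/// thread-level parallelism the proof-collection path currently lacks;
/// blake3 handles SIMD within each hash.
fn hash_chunks(chunks: &[Vec<u8>]) -> Vec<blake3::Hash> {
    chunks.par_iter().map(|chunk| blake3::hash(chunk)).collect()
}
```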

I'm now able to create a sector in ~30 seconds on an AMD Threadripper 7970X CPU (32C64T), which according to Plotting Speed Results is slower only than the massive AMD Epyc 9654 (96C192T), and will likely beat it once optimized further (of course, the Epyc will win again once it runs this version of the code too).


The GPU version is complete and working, confirmed to run successfully on the following hardware:

  • AMD RX 7600 XT (dGPU, RADV, Linux)
  • Nvidia RTX 2080 Ti (dGPU, Nvidia proprietary, Linux)
  • Apple M1 (iGPU)
  • Raspberry Pi 5 (iGPU)
  • Software emulation on CPU with LLVMpipe (Lavapipe)

I have no reason to think it won't work on a LOT of other hardware as well. For example, the following devices should technically be supported, since they support Vulkan 1.2 or newer (a minimal version check is sketched after this list):

  • Nvidia GTX 6xx and newer (Kepler+ on Linux and Windows)
  • AMD Radeon HD 77xx and newer (GCN 1.0+ on Linux and Windows)
  • Intel Skylake iGPU and newer (Linux and Windows)
  • all kinds of ARM SBCs and even iGPUs in mobile phones
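For anyone who wants to check a specific machine, a minimal probe with the ash crate (a standalone sketch, not part of the project) could look like this:

```rust
use ash::vk;

/// Sketch only: list physical devices and whether each reports Vulkan 1.2+.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let entry = unsafe { ash::Entry::load()? };
    let instance =
        unsafe { entry.create_instance(&vk::InstanceCreateInfo::default(), None)? };
    for device in unsafe { instance.enumerate_physical_devices()? } {
        let props = unsafe { instance.get_physical_device_properties(device) };
        // `device_name` is a fixed-size C string buffer.
        let name = unsafe { std::ffi::CStr::from_ptr(props.device_name.as_ptr()) };
        println!(
            "{}: Vulkan {}.{} (1.2+: {})",
            name.to_string_lossy(),
            vk::api_version_major(props.api_version),
            vk::api_version_minor(props.api_version),
            props.api_version >= vk::API_VERSION_1_2,
        );
    }
    unsafe { instance.destroy_instance(None) };
    Ok(())
}
```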

Performance is a different story of course.

Right now my Vulkan implementation is a few times slower than the CUDA version, but it should not be difficult to catch up with and surpass it. I only barely got it working and haven't yet managed to get any of the GPU profilers to work properly to see why it is slow.
