I have been experimenting with various things (again) and discovered a change that improves performance for CPU-based table creation by ~10%, possibly even more (all other things being equal).
The observation is that Y values are sometimes duplicated, which forces the implementation to track the count of these duplicates here:
And then loop over them here:
But it would have been much more efficient to have a simple yes/no flag instead.
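Here is a minimal sketch of the idea in Rust; the type and field names are hypothetical, not the actual table code:

```rust
// Hypothetical sketch of the change; none of these names are the actual
// farmer/table types.

/// Before: a slot tracks how many times the same Y occurred, so matching
/// has to loop over `count` stored positions.
struct SlotWithCount {
    start_position: u32,
    count: u32,
}

/// After: duplicates are ignored, so a yes/no flag is enough and the
/// per-slot loop disappears.
struct SlotWithFlag {
    start_position: u32,
    occupied: bool,
}

fn main() {
    let with_count = SlotWithCount { start_position: 7, count: 2 };
    // The old approach has to iterate `count` times per slot.
    for offset in 0..with_count.count {
        let _position = with_count.start_position + offset;
    }

    let with_flag = SlotWithFlag { start_position: 7, occupied: true };
    // The new approach is a single (potentially branchless) check.
    if with_flag.occupied {
        let _position = with_flag.start_position;
    }
}
```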
From a consensus point of view this makes no difference, but for the farmer it means a breaking change to the on-disk farm format when decoding data. So if such a change is introduced, both versions should be supported, at least temporarily.
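Purely as an illustration of supporting both encodings during the transition (all names here are made up, not the actual farm format types):

```rust
// Hypothetical sketch only; these names do not exist in the actual codebase.
// The idea is that the farmer keeps decoding both the old count-based
// encoding and the new flag-based encoding during a transition period.

enum FarmFormatVersion {
    /// Existing on-disk format that stores duplicate counts.
    V0Counts,
    /// New on-disk format that stores a single yes/no flag.
    V1Flags,
}

fn decode_entries(version: FarmFormatVersion, bytes: &[u8]) -> Vec<u8> {
    match version {
        FarmFormatVersion::V0Counts => decode_with_counts(bytes),
        FarmFormatVersion::V1Flags => decode_with_flags(bytes),
    }
}

// Placeholder decoders standing in for the real logic.
fn decode_with_counts(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

fn decode_with_flags(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

fn main() {
    let _old = decode_entries(FarmFormatVersion::V0Counts, &[1, 2, 3]);
    let _new = decode_entries(FarmFormatVersion::V1Flags, &[1, 2, 3]);
}
```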
This should be especially helpful on GPUs, where loops like the one above are very costly.
For context, here is what the distribution of counts looks roughly like:
| Count | Occurrences | Occurrences in % |
| --- | --- | --- |
| Any | 11313485 | 100% |
| 0 | 11138124 | 98.44% |
| 1 | 173970 | 1.54% |
| 2 | 1375 | 0.01% |
| 3 | 16 | 0.0001% |
Essentially, each successive count is about two orders of magnitude less likely. I have not seen counts of 7+ in my testing, but I assume they can happen with some small probability too.
Only handling counts 0 and 1 means tables will be slightly smaller (we’ll ignore ~1.5% of potentially valid entries in every table), but since we already have parity backup on the farmer, and not all of those entries would have matches anyway, the probability of the farmer not having encoded pieces (having chunks that can’t participate in rewards) remains 0%, just like before.
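As a back-of-the-envelope check, the share of entries with a non-zero count in the table above works out to about 1.55%, which lines up with the ~1.5% figure (a quick sketch, not part of the codebase):

```rust
// Back-of-the-envelope check of the ~1.5% figure using the measured
// distribution from the table above; purely illustrative.
fn main() {
    let total: u64 = 11_313_485;
    // Entries whose Y value has a non-zero duplicate count (counts 1, 2 and 3).
    let duplicated: u64 = 173_970 + 1_375 + 16;
    let percent = 100.0 * duplicated as f64 / total as f64;
    // Prints roughly 1.55%.
    println!("{percent:.2}% of entries have a duplicated Y");
}
```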
Exploring further optimizations: since we don’t technically need any particular Y in the table, and each bucket is small during matching, a SIMD binary search might work really well without rmap creation.
And this is especially applicable to GPUs, where such things should be even faster. I have a very good feeling about the performance potential here.
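Here is a minimal sketch of that bucket-search idea, assuming buckets are already sorted by Y; the helper name is made up, and it shows a plain branchless membership scan (which the compiler can auto-vectorize) rather than a true SIMD binary search, but it illustrates why an rmap may not be needed:

```rust
/// Hypothetical helper, not the actual implementation. Because buckets are
/// small and already sorted by Y, and we only need to know whether *some*
/// matching Y exists (not which entry), a branchless scan over the whole
/// bucket vectorizes well and avoids building an rmap entirely.
fn bucket_contains_y(sorted_bucket_ys: &[u32], target_y: u32) -> bool {
    sorted_bucket_ys
        .iter()
        .fold(false, |found, &y| found | (y == target_y))
}

fn main() {
    // Tiny usage example with a made-up bucket.
    let bucket = [3u32, 17, 42, 99];
    assert!(bucket_contains_y(&bucket, 42));
    assert!(!bucket_contains_y(&bucket, 43));
}
```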
Not fully optimized, but after the last update of the old API the benchmarks looked like this:
```
chia/table/single/1x    time:   [638.21 ms 642.97 ms 649.77 ms]
                        thrpt:  [1.5390 elem/s 1.5553 elem/s 1.5669 elem/s]
chia/table/parallel/8x  time:   [426.94 ms 431.49 ms 436.70 ms]
                        thrpt:  [18.319 elem/s 18.540 elem/s 18.738 elem/s]
```
And then switching to a different (not fully optimized) API that produces just the proofs the protocol needs led to further performance improvements (note that the benchmark here includes proofs, not just tables):
Before:

```
chia/proofs/single/1x    time:   [728.10 ms 735.58 ms 744.74 ms]
                         thrpt:  [1.3427 elem/s 1.3595 elem/s 1.3734 elem/s]
chia/proofs/parallel/8x  time:   [600.14 ms 604.93 ms 609.18 ms]
                         thrpt:  [13.132 elem/s 13.225 elem/s 13.330 elem/s]
```

After:

```
chia/proofs/single/1x    time:   [710.02 ms 713.94 ms 718.19 ms]
                         thrpt:  [1.3924 elem/s 1.4007 elem/s 1.4084 elem/s]
chia/proofs/parallel/8x  time:   [567.42 ms 574.66 ms 581.51 ms]
                         thrpt:  [13.757 elem/s 13.921 elem/s 14.099 elem/s]
```
The new code is not yet fully parallelized when collecting proofs and does not yet leverage SIMD for Blake3 in multiple places, so further performance improvements are coming, probably bringing the whole pipeline down to the level that table creation alone took before.
I’m now able to create a sector in ~30 seconds on an AMD Threadripper 7970X CPU (32C/64T), which according to Plotting Speed Results is only slower than the massive AMD Epyc 9654 (96C/192T), and will likely beat it once optimized further (of course, the Epyc will win again once it runs this version of the code too).
The GPU version is complete and working; it is confirmed to run successfully on the following hardware:
- AMD RX 7600 XT (dGPU, RADV, Linux)
- Nvidia RTX 2080 Ti (dGPU, Nvidia proprietary driver, Linux)
- Apple M1 (iGPU)
- Raspberry Pi 5 (iGPU)
- Software emulation on CPU with LLVMpipe (Lavapipe)
I have no reason to think it won’t work on A LOT of other hardware as well; for example, the following devices should technically be supported since they support Vulkan 1.2 or better:
- Nvidia GTX 6xx and newer (Kepler+ on Linux and Windows)
- AMD Radeon HD 77xx and newer (GCN 1.0+ on Linux and Windows)
- Intel Skylake iGPU and newer (Linux and Windows)
- all kinds of ARM SBCs and even iGPUs in mobile phones
Performance is a different story of course.
Right now my Vulkan implementation is a few times slower than the CUDA version, but it should not be difficult to catch up and surpass it. I barely got it to work and still haven’t managed to get any of the GPU profilers to work properly to see why it is slow.