Potential table creation rules change for the farmer

I have been experimenting with various things (again) and discovered a change that improves performance for CPU-based table creation by ~10%, possibly even more (all other things being equal).

The observation is that Y values are sometimes actually duplicated, which forces the implementation to track the count of these duplicates here:

And then to loop over them here:

But it would be much more efficient to have a simple yes/no flag instead.
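A minimal sketch of the idea (the `CountedY`/`FlaggedY` types and their fields are hypothetical illustrations of the two layouts, not the actual subspace data structures):

```rust
/// Count-based layout: decoding must loop `duplicate_count` times.
struct CountedY {
    y: u32,
    duplicate_count: u8,
}

/// Flag-based layout: decoding needs a single branch instead of a loop.
struct FlaggedY {
    y: u32,
    has_duplicate: bool,
}

fn to_flagged(entry: CountedY) -> FlaggedY {
    FlaggedY {
        y: entry.y,
        // Counts of 2+ collapse into "has a duplicate"; the rare extra
        // duplicates are simply dropped.
        has_duplicate: entry.duplicate_count > 0,
    }
}

fn main() {
    let entry = CountedY { y: 42, duplicate_count: 2 };
    let flagged = to_flagged(entry);
    assert!(flagged.has_duplicate && flagged.y == 42);
    println!("flag-based entry decoded");
}
```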

From a consensus point of view this makes no difference, but for the farmer it means a breaking change to the on-disk farm format when decoding data. So if such a change is introduced, both versions should be supported, at least temporarily.
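Dual-format support during the transition could look roughly like this (a hedged sketch: the version tag and raw layouts here are hypothetical, not the real on-disk farm format):

```rust
enum FarmFormat {
    /// Old format: a per-Y duplicate count.
    V1,
    /// New format: a single per-Y duplicate flag.
    V2,
}

/// Decode the number of duplicates for one entry under either format.
fn decode_duplicates(format: &FarmFormat, raw: u8) -> u8 {
    match format {
        FarmFormat::V1 => raw,     // the raw byte is the count itself
        FarmFormat::V2 => raw & 1, // only the yes/no bit is meaningful
    }
}

fn main() {
    // Old farms decode as before; new farms cap duplicates at one.
    assert_eq!(decode_duplicates(&FarmFormat::V1, 3), 3);
    assert_eq!(decode_duplicates(&FarmFormat::V2, 1), 1);
    println!("both farm format versions decoded");
}
```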

This should be especially helpful for GPUs, where loops like this are very costly.

For context, here is what the distribution of counts looks roughly like:

| Count | Occurrences | Occurrences in % |
|------:|------------:|-----------------:|
| Any   | 11,313,485  | 100%             |
| 0     | 11,138,124  | 98.44%           |
| 1     | 173,970     | 1.54%            |
| 2     | 1,375       | 0.01%            |
| 3     | 16          | 0.0001%          |

Essentially, every next count is about two orders of magnitude less likely to happen. I have not seen counts of 7+ in my testing, but I guess they may happen with some probability too.
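Plugging the table above into a quick check confirms the observation (the successive ratios come out between roughly 64x and 127x):

```rust
// Quick check of the distribution above: how much rarer each successive
// count is, using the occurrence numbers from the table.

fn successive_ratios(occurrences: &[u64]) -> Vec<f64> {
    occurrences
        .windows(2)
        .map(|pair| pair[0] as f64 / pair[1] as f64)
        .collect()
}

fn main() {
    // Occurrences for counts 0, 1, 2 and 3 from the table.
    let occurrences = [11_138_124_u64, 173_970, 1_375, 16];
    for (count, ratio) in successive_ratios(&occurrences).iter().enumerate() {
        println!("count {} is ~{:.0}x rarer than count {}", count + 1, ratio, count);
    }
}
```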

Only handling 0 and 1 means tables will be a little bit smaller (we’ll ignore ~1.5% of potentially valid entries in every table). But since we already have parity backup on the farmer, and not all of those entries would have matches anyway, the probability of the farmer not having encoded pieces (having chunks that can’t participate in rewards) remains 0%, just like before.

Any thoughts about this?

With local optimizations I was able to get these results with the current implementation:

```
chia/table/parallel/8x  time:   [687.73 ms 701.49 ms 716.28 ms]
                        thrpt:  [11.169  elem/s 11.404  elem/s 11.632  elem/s]
```

And by only tracking a single entry as described above:

```
chia/table/parallel/8x  time:   [655.95 ms 664.92 ms 674.66 ms]
                        thrpt:  [11.858  elem/s 12.031  elem/s 12.196  elem/s]
```

This was done on an AMD Threadripper 7970X CPU, where I sliced off 8C16T on a single CCX, so it should be roughly in the same ballpark as an 8C16T Ryzen 7700X CPU.

UPD: With more improvements it went down even further:

```
chia/table/parallel/8x  time:   [612.71 ms 618.05 ms 624.05 ms]
                        thrpt:  [12.820  elem/s 12.944  elem/s 13.057  elem/s]
```

Exploring further optimizations: since we don’t technically need any particular Y in the table, and each bucket is small during matching, a SIMD binary search might work really well without rmap creation.
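A rough sketch of that search direction, assuming buckets are small sorted slices of Ys (this is a generic branchless binary search, not the actual subspace implementation; with no unpredictable branches it pipelines and vectorizes well on CPU and maps cleanly onto GPU threads):

```rust
/// Index of the candidate slot for `target` in the sorted `bucket`.
fn branchless_search(bucket: &[u32], target: u32) -> usize {
    let mut base = 0usize;
    let mut len = bucket.len();
    while len > 1 {
        let half = len / 2;
        // Conditional add instead of a branch: the predicate becomes 0 or 1.
        base += usize::from(bucket[base + half - 1] < target) * half;
        len -= half;
    }
    base
}

/// Membership test: we don't need any particular Y, just whether one matches.
fn bucket_contains(bucket: &[u32], target: u32) -> bool {
    bucket.get(branchless_search(bucket, target)) == Some(&target)
}

fn main() {
    let bucket = [3_u32, 9, 14, 27, 31];
    assert!(bucket_contains(&bucket, 14));
    assert!(!bucket_contains(&bucket, 15));
    println!("bucket search works");
}
```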

And this is especially applicable to GPUs, where such things should be even faster. I have a very good feeling about the performance potential here.


I shared my findings, with massive improvements over the above numbers, here:

For context, current main of https://github.com/autonomys/subspace:

```
chia/table/single       time:   [1.0668 s 1.0756 s 1.0862 s]
chia/table/parallel/8x  time:   [905.01 ms 919.11 ms 933.34 ms]
```

Possible compatible backport:

```
chia/table/parallel/8x  time:   [677.60 ms 684.25 ms 692.06 ms]
                        thrpt:  [11.560  elem/s 11.692  elem/s 11.806  elem/s]
```

Current state as of the above articles, with breaking changes:

```
chia/table/single/1x    time:   [747.82 ms 756.94 ms 768.09 ms]
                        thrpt:  [1.3019  elem/s 1.3211  elem/s 1.3372  elem/s]
chia/table/parallel/8x  time:   [529.15 ms 534.42 ms 540.15 ms]
                        thrpt:  [14.811  elem/s 14.969  elem/s 15.119  elem/s]
```

This is all on a single CCX of an AMD Threadripper 7970X CPU (roughly comparable to an 8C16T AMD Ryzen 7700X CPU).
