I have been experimenting with various things (again) and discovered a change that improves performance for CPU-based table creation by ~10%, possibly even more (all other things being equal).
The observation is that Y values are sometimes duplicated, which forces the implementation to track the count of these duplicates here:
And then loop over them here:
But it would have been much more efficient to have a simple yes/no flag instead.
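Here is a minimal sketch of the idea in Rust; the type and field names are hypothetical, not the actual table code:

```rust
// Hypothetical sketch of the change; none of these names are the actual
// farmer/table types.

/// Before: a slot tracks how many times the same Y occurred, so matching
/// has to loop over `count` stored positions.
struct SlotWithCount {
    start_position: u32,
    count: u32,
}

/// After: duplicates are ignored, so a yes/no flag is enough and the
/// per-slot loop disappears.
struct SlotWithFlag {
    start_position: u32,
    occupied: bool,
}

fn main() {
    let with_count = SlotWithCount { start_position: 7, count: 2 };
    // The old approach has to iterate `count` times per slot.
    for offset in 0..with_count.count {
        let _position = with_count.start_position + offset;
    }

    let with_flag = SlotWithFlag { start_position: 7, occupied: true };
    // The new approach is a single (potentially branchless) check.
    if with_flag.occupied {
        let _position = with_flag.start_position;
    }
}
```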
From a consensus point of view this makes no difference, but for the farmer it means a breaking change to the on-disk farm format when decoding data. So if such a change is introduced, both versions should be supported, at least temporarily.
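Purely as an illustration of supporting both encodings during the transition (all names here are made up, not the actual farm format types):

```rust
// Hypothetical sketch only; these names do not exist in the actual codebase.
// The idea is that the farmer keeps decoding both the old count-based
// encoding and the new flag-based encoding during a transition period.

enum FarmFormatVersion {
    /// Existing on-disk format that stores duplicate counts.
    V0Counts,
    /// New on-disk format that stores a single yes/no flag.
    V1Flags,
}

fn decode_entries(version: FarmFormatVersion, bytes: &[u8]) -> Vec<u8> {
    match version {
        FarmFormatVersion::V0Counts => decode_with_counts(bytes),
        FarmFormatVersion::V1Flags => decode_with_flags(bytes),
    }
}

// Placeholder decoders standing in for the real logic.
fn decode_with_counts(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

fn decode_with_flags(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

fn main() {
    let _old = decode_entries(FarmFormatVersion::V0Counts, &[1, 2, 3]);
    let _new = decode_entries(FarmFormatVersion::V1Flags, &[1, 2, 3]);
}
```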
This should be especially helpful on GPUs, where loops like the one above are very costly.
For context, here is what the distribution of counts looks roughly like:
| Count | Occurrences | Occurrences in % |
| --- | --- | --- |
| Any | 11313485 | 100% |
| 0 | 11138124 | 98.44% |
| 1 | 173970 | 1.54% |
| 2 | 1375 | 0.01% |
| 3 | 16 | 0.0001% |
Essentially, each successive count is about two orders of magnitude less likely. I have not seen counts of 7+ in my testing, but I assume they can happen with some small probability too.
Only handling counts 0 and 1 means tables will be slightly smaller (we’ll ignore ~1.5% of potentially valid entries in every table), but since we already have parity backup on the farmer, and not all of those entries would have matches anyway, the probability of the farmer not having encoded pieces (having chunks that can’t participate in rewards) remains 0%, just like before.
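As a back-of-the-envelope check, the share of entries with a non-zero count in the table above works out to about 1.55%, which lines up with the ~1.5% figure (a quick sketch, not part of the codebase):

```rust
// Back-of-the-envelope check of the ~1.5% figure using the measured
// distribution from the table above; purely illustrative.
fn main() {
    let total: u64 = 11_313_485;
    // Entries whose Y value has a non-zero duplicate count (counts 1, 2 and 3).
    let duplicated: u64 = 173_970 + 1_375 + 16;
    let percent = 100.0 * duplicated as f64 / total as f64;
    // Prints roughly 1.55%.
    println!("{percent:.2}% of entries have a duplicated Y");
}
```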
Exploring further optimizations: since we don’t technically need any particular Y in the table, and each bucket is small during matching, a SIMD binary search might work really well without rmap creation.
And this is especially applicable to GPUs, where such things should be even faster. I have a very good feeling about the performance potential here.
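Here is a minimal sketch of that bucket-search idea, assuming buckets are already sorted by Y; the helper name is made up, and it shows a plain branchless membership scan (which the compiler can auto-vectorize) rather than a true SIMD binary search, but it illustrates why an rmap may not be needed:

```rust
/// Hypothetical helper, not the actual implementation. Because buckets are
/// small and already sorted by Y, and we only need to know whether *some*
/// matching Y exists (not which entry), a branchless scan over the whole
/// bucket vectorizes well and avoids building an rmap entirely.
fn bucket_contains_y(sorted_bucket_ys: &[u32], target_y: u32) -> bool {
    sorted_bucket_ys
        .iter()
        .fold(false, |found, &y| found | (y == target_y))
}

fn main() {
    // Tiny usage example with a made-up bucket.
    let bucket = [3u32, 17, 42, 99];
    assert!(bucket_contains_y(&bucket, 42));
    assert!(!bucket_contains_y(&bucket, 43));
}
```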
Not fully optimized, but after the last update of the old API the benchmarks looked like this:
```
chia/table/single/1x    time:   [638.21 ms 642.97 ms 649.77 ms]
                        thrpt:  [1.5390 elem/s 1.5553 elem/s 1.5669 elem/s]
chia/table/parallel/8x  time:   [426.94 ms 431.49 ms 436.70 ms]
                        thrpt:  [18.319 elem/s 18.540 elem/s 18.738 elem/s]
```
And then switching to a different (not fully optimized) API that produces just the proofs the protocol needs led to further performance improvements (note that the benchmark here includes proofs, not just tables):
Before:

```
chia/proofs/single/1x    time:   [728.10 ms 735.58 ms 744.74 ms]
                         thrpt:  [1.3427 elem/s 1.3595 elem/s 1.3734 elem/s]
chia/proofs/parallel/8x  time:   [600.14 ms 604.93 ms 609.18 ms]
                         thrpt:  [13.132 elem/s 13.225 elem/s 13.330 elem/s]
```

After:

```
chia/proofs/single/1x    time:   [710.02 ms 713.94 ms 718.19 ms]
                         thrpt:  [1.3924 elem/s 1.4007 elem/s 1.4084 elem/s]
chia/proofs/parallel/8x  time:   [567.42 ms 574.66 ms 581.51 ms]
                         thrpt:  [13.757 elem/s 13.921 elem/s 14.099 elem/s]
```
The new code is not yet fully parallelized when collecting proofs and does not yet leverage SIMD for Blake3 in multiple places, so further performance improvements are coming, probably bringing the whole pipeline down to the level that table creation alone took before.
I’m now able to create a sector in ~30 seconds on an AMD Threadripper 7970X CPU (32C/64T), which according to Plotting Speed Results is only slower than the massive AMD Epyc 9654 (96C/192T), and will likely beat it once optimized further (of course, the Epyc will win again once it runs this version of the code too).
The GPU version is complete and working; it is confirmed to run successfully on the following hardware:
- AMD RX 7600 XT (dGPU, RADV, Linux)
- Nvidia RTX 2080 Ti (dGPU, Nvidia proprietary driver, Linux)
- Apple M1 (iGPU)
- Raspberry Pi 5 (iGPU)
- Software emulation on CPU with LLVMpipe (Lavapipe)
I have no reason to think it won’t work on A LOT of other hardware as well; for example, the following devices should technically be supported since they support Vulkan 1.2 or better:
- Nvidia GTX 6xx and newer (Kepler+ on Linux and Windows)
- AMD Radeon HD 77xx and newer (GCN 1.0+ on Linux and Windows)
- Intel Skylake iGPU and newer (Linux and Windows)
- all kinds of ARM SBCs and even iGPUs in mobile phones
Performance is a different story of course.
Right now my Vulkan implementation is a few times slower than the CUDA version, but it should not be difficult to catch up and surpass it. I barely got it to work and still haven’t managed to get any of the GPU profilers to work properly to see why it is slow.