Proof of space performance improvements by sharing tables between pieces

Benchmarks with Linux perf stats, like the ones in Faster chiapos (part 3) #1685 for k22, collected here for k20 through k23.

k20

All cores (13900K):

chia/table/parallel     time:   [100.33 ms 101.38 ms 102.31 ms]

4P cores 5.5 GHz:

chia/table/parallel     time:   [1.4368 s 1.4439 s 1.4510 s]

4E cores 4.3 GHz:

chia/table/parallel     time:   [529.83 ms 536.08 ms 542.19 ms]
perf (all cores 13900K)
      14 256,91 msec task-clock:u                     #   14,257 CPUs utilized
              0      context-switches:u               #    0,000 /sec
              0      cpu-migrations:u                 #    0,000 /sec
             54      page-faults:u                    #    3,788 /sec
 71 912 482 535      cpu_core/cycles/u                #    5,044 G/sec                       (45,88%)
 57 923 375 807      cpu_atom/cycles/u                #    4,063 G/sec                       (55,01%)
136 309 783 589      cpu_core/instructions/u          #    9,561 G/sec                       (45,88%)
109 822 003 041      cpu_atom/instructions/u          #    7,703 G/sec                       (55,01%)
 23 445 336 256      cpu_core/branches/u              #    1,644 G/sec                       (45,88%)
 18 912 831 887      cpu_atom/branches/u              #    1,327 G/sec                       (55,01%)
    241 471 587      cpu_core/branch-misses/u         #   16,937 M/sec                       (45,88%)
    186 722 727      cpu_atom/branch-misses/u         #   13,097 M/sec                       (55,01%)
285 992 250 112      cpu_core/slots:u/                #   20,060 G/sec                       (45,88%)
124 379 684 456      cpu_core/topdown-retiring/u      #     43,0% Retiring                   (45,88%)
 33 795 603 678      cpu_core/topdown-bad-spec/u      #     11,7% Bad Speculation            (45,88%)
 31 543 016 663      cpu_core/topdown-fe-bound/u      #     10,9% Frontend Bound             (45,88%)
 99 631 237 645      cpu_core/topdown-be-bound/u      #     34,4% Backend Bound              (45,88%)
  2 871 734 329      cpu_core/topdown-heavy-ops/u     #      1,0% Heavy Operations          #     42,0% Light Operations           (45,88%)
 32 492 570 282      cpu_core/topdown-br-mispredict/u #     11,2% Branch Mispredict         #      0,5% Machine Clears             (45,88%)
 21 608 097 918      cpu_core/topdown-fetch-lat/u     #      7,5% Fetch Latency             #      3,4% Fetch Bandwidth            (45,88%)
 55 502 231 135      cpu_core/topdown-mem-bound/u     #     19,2% Memory Bound              #     15,3% Core Bound                 (45,88%)
     69 058 932      cpu_core/cache-misses:u/                                                (49,21%)
     85 642 280      cpu_atom/cache-misses:u/                                                (51,05%)

k21

All cores (13900K):

chia/table/parallel     time:   [275.92 ms 279.50 ms 283.09 ms]

4P cores 5.5 GHz:

chia/table/parallel     time:   [710.71 ms 717.20 ms 723.59 ms]

4E cores 4.3 GHz:

chia/table/parallel     time:   [1.1262 s 1.1347 s 1.1438 s]
perf (all cores 13900K)
      5 212,34 msec task-clock:u                     #    5,212 CPUs utilized
             0      context-switches:u               #    0,000 /sec
             0      cpu-migrations:u                 #    0,000 /sec
       163 528      page-faults:u                    #   31,373 K/sec
26 193 507 946      cpu_core/cycles/u                #    5,025 G/sec                       (49,16%)
20 559 065 164      cpu_atom/cycles/u                #    3,944 G/sec                       (53,01%)
44 899 330 740      cpu_core/instructions/u          #    8,614 G/sec                       (49,16%)
36 178 677 910      cpu_atom/instructions/u          #    6,941 G/sec                       (53,01%)
 7 700 223 400      cpu_core/branches/u              #    1,477 G/sec                       (49,16%)
 6 232 549 211      cpu_atom/branches/u              #    1,196 G/sec                       (53,01%)
    79 699 486      cpu_core/branch-misses/u         #   15,291 M/sec                       (49,16%)
    64 036 973      cpu_atom/branch-misses/u         #   12,286 M/sec                       (53,01%)
93 996 633 443      cpu_core/slots:u/                #   18,033 G/sec                       (49,16%)
37 947 855 990      cpu_core/topdown-retiring/u      #     39,4% Retiring                   (49,16%)
 9 900 532 328      cpu_core/topdown-bad-spec/u      #     10,3% Bad Speculation            (49,16%)
10 788 528 684      cpu_core/topdown-fe-bound/u      #     11,2% Frontend Bound             (49,16%)
37 568 653 415      cpu_core/topdown-be-bound/u      #     39,1% Backend Bound              (49,16%)
 1 556 733 990      cpu_core/topdown-heavy-ops/u     #      1,6% Heavy Operations          #     37,8% Light Operations           (49,16%)
 9 387 364 492      cpu_core/topdown-br-mispredict/u #      9,8% Branch Mispredict         #      0,5% Machine Clears             (49,16%)
 7 727 908 801      cpu_core/topdown-fetch-lat/u     #      8,0% Fetch Latency             #      3,2% Fetch Bandwidth            (49,16%)
24 931 598 258      cpu_core/topdown-mem-bound/u     #     25,9% Memory Bound              #     13,1% Core Bound                 (49,16%)
    81 999 429      cpu_core/cache-misses:u/                                                (48,09%)
    48 107 253      cpu_atom/cache-misses:u/                                                (52,29%)

k22

All cores (13900K):

chia/table/parallel     time:   [587.62 ms 600.98 ms 614.84 ms]

4P cores 5.5 GHz:

chia/table/parallel     time:   [1.4257 s 1.4339 s 1.4440 s]

4E cores 4.3 GHz:

chia/table/parallel     time:   [2.2677 s 2.2795 s 2.2927 s]
perf (all cores 13900K)
      12 782,93 msec task-clock:u                     #   12,783 CPUs utilized
              0      context-switches:u               #    0,000 /sec
              0      cpu-migrations:u                 #    0,000 /sec
        397 837      page-faults:u                    #   31,123 K/sec
 66 098 770 509      cpu_core/cycles/u                #    5,171 G/sec                       (47,90%)
 51 291 361 638      cpu_atom/cycles/u                #    4,012 G/sec                       (52,50%)
112 191 683 640      cpu_core/instructions/u          #    8,777 G/sec                       (47,90%)
 90 742 909 221      cpu_atom/instructions/u          #    7,099 G/sec                       (52,50%)
 19 502 494 839      cpu_core/branches/u              #    1,526 G/sec                       (47,90%)
 15 551 768 595      cpu_atom/branches/u              #    1,217 G/sec                       (52,50%)
    199 930 102      cpu_core/branch-misses/u         #   15,640 M/sec                       (47,90%)
    155 375 492      cpu_atom/branch-misses/u         #   12,155 M/sec                       (52,50%)
234 077 388 158      cpu_core/slots:u/                #   18,312 G/sec                       (47,90%)
 84 722 306 151      cpu_core/topdown-retiring/u      #     35,4% Retiring                   (47,90%)
 21 153 511 076      cpu_core/topdown-bad-spec/u      #      8,8% Bad Speculation            (47,90%)
 24 601 969 208      cpu_core/topdown-fe-bound/u      #     10,3% Frontend Bound             (47,90%)
108 622 759 602      cpu_core/topdown-be-bound/u      #     45,4% Backend Bound              (47,90%)
  2 824 956 565      cpu_core/topdown-heavy-ops/u     #      1,2% Heavy Operations          #     34,3% Light Operations           (47,90%)
 20 390 220 573      cpu_core/topdown-br-mispredict/u #      8,5% Branch Mispredict         #      0,3% Machine Clears             (47,90%)
 17 754 016 020      cpu_core/topdown-fetch-lat/u     #      7,4% Fetch Latency             #      2,9% Fetch Bandwidth            (47,90%)
 81 042 977 150      cpu_core/topdown-mem-bound/u     #     33,9% Memory Bound              #     11,5% Core Bound                 (47,90%)
     74 300 211      cpu_core/cache-misses:u/                                                (48,21%)
     91 901 717      cpu_atom/cache-misses:u/                                                (52,27%)

k23

All cores (13900K):

chia/table/parallel     time:   [1.2740 s 1.3021 s 1.3288 s]

4P cores 5.5 GHz:

chia/table/parallel     time:   [2.8655 s 2.8809 s 2.8980 s]

4E cores 4.3 GHz:

chia/table/parallel     time:   [4.6215 s 4.6398 s 4.6566 s]
perf (all cores 13900K)
      14 044,82 msec task-clock:u                     #   14,045 CPUs utilized
              0      context-switches:u               #    0,000 /sec
              0      cpu-migrations:u                 #    0,000 /sec
        416 937      page-faults:u                    #   29,686 K/sec
 73 361 499 480      cpu_core/cycles/u                #    5,223 G/sec                       (49,23%)
 56 782 103 595      cpu_atom/cycles/u                #    4,043 G/sec                       (51,12%)
122 933 494 607      cpu_core/instructions/u          #    8,753 G/sec                       (49,23%)
 98 193 000 946      cpu_atom/instructions/u          #    6,991 G/sec                       (51,12%)
 21 269 405 157      cpu_core/branches/u              #    1,514 G/sec                       (49,23%)
 16 966 085 419      cpu_atom/branches/u              #    1,208 G/sec                       (51,12%)
    200 824 045      cpu_core/branch-misses/u         #   14,299 M/sec                       (49,23%)
    153 984 388      cpu_atom/branch-misses/u         #   10,964 M/sec                       (51,12%)
255 569 995 454      cpu_core/slots:u/                #   18,197 G/sec                       (49,23%)
 91 444 633 486      cpu_core/topdown-retiring/u      #     35,5% Retiring                   (49,23%)
 20 594 474 412      cpu_core/topdown-bad-spec/u      #      8,0% Bad Speculation            (49,23%)
 24 504 168 087      cpu_core/topdown-fe-bound/u      #      9,5% Frontend Bound             (49,23%)
121 288 246 751      cpu_core/topdown-be-bound/u      #     47,0% Backend Bound              (49,23%)
  2 883 892 148      cpu_core/topdown-heavy-ops/u     #      1,1% Heavy Operations          #     34,3% Light Operations           (49,23%)
 19 784 338 153      cpu_core/topdown-br-mispredict/u #      7,7% Branch Mispredict         #      0,3% Machine Clears             (49,23%)
 18 042 026 974      cpu_core/topdown-fetch-lat/u     #      7,0% Fetch Latency             #      2,5% Fetch Bandwidth            (49,23%)
 92 328 605 223      cpu_core/topdown-mem-bound/u     #     35,8% Memory Bound              #     11,2% Core Bound                 (49,23%)
    103 688 009      cpu_core/cache-misses:u/                                                (49,05%)
     90 148 097      cpu_atom/cache-misses:u/                                                (51,43%)

We know that with an increase of k by 1 the number of entries in a table doubles, so one k23 table has the same number of entries as 2 k22, 4 k21, or 8 k20 tables respectively. Note also that with each increase in k, every entry grows in size by 1 bit.
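As a rough size model (an approximation, not the exact chiapos layout): if a table at parameter k holds 2^k entries of about k bits each, then size(k) ≈ 2^k * k bits, so size(k+1)/size(k) = 2 * (k+1)/k ≈ 2.1 at k = 20.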
Now we take the real data provided by Nazar above:

k20:  101.38 ms  19.2% memory bound
k21:  279.50 ms  25.9% memory bound
k22:  600.98 ms  33.9% memory bound
k23: 1302.1 ms   35.8% memory bound

We can plot it (blue: time in ms to build the table; orange: time spent memory bound; green: time spent memory bound if it merely doubled with each increase of k):


We can see that with each k+1 we spend more than double the time and more than double the memory-bound time compared to k. In other words, we are still more memory bound if we encode 2 pieces with one k+1 table than with 2 k tables. In fact, computing one k23 table takes 13x the total time and 24x the memory-bound time of k20 instead of the theoretical 8 = 2^23/2^20, and a k22 table takes 6x the total time and 9x the memory-bound time of k20 instead of the theoretical 4 = 2^22/2^20.
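To make those factors explicit (memory-bound time = total time * memory-bound fraction from the perf runs above):

k20:  101.38 ms * 0.192 ≈  19.5 ms
k21:  279.50 ms * 0.259 ≈  72.4 ms
k22:  600.98 ms * 0.339 ≈ 203.7 ms
k23: 1302.1 ms  * 0.358 ≈ 466.2 ms

which gives 1302.1/101.38 ≈ 13x total time and 466.2/19.5 ≈ 24x memory-bound time for k23 vs k20.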
@Chen_Feng_2023 how would this affect the space-time tradeoff presented in the paper?

The memory-bound time increase is exponential: computing a k+1 table spends ~e times as long waiting for the memory subsystem as computing a k table.
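From the memory-bound times worked out above, the consecutive ratios are 72.4/19.5 ≈ 3.7, 203.7/72.4 ≈ 2.8 and 466.2/203.7 ≈ 2.3, with a geometric mean of (466.2/19.5)^{1/3} ≈ 2.9, i.e. roughly e.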


In terms of total time, the increment is closer to 2.3x with every increase of k.
Our previous theoretical tradeoff was 2^{k*q} (where q = 7 is the number of tables) and we chose k = 22; with the true base being 2.3 instead of 2, we can achieve the same tradeoff with 2.3^{k*7} at k = 19.
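Working that out: we need 2.3^{7*k'} ≥ 2^{7*22}, i.e. k' ≥ 22 * log(2)/log(2.3) ≈ 22 * 0.832 ≈ 18.3, so k' = 19 is the smallest k that preserves the previous tradeoff.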

Time and memory-bound percentage are platform-dependent. Maybe if we had an infinite amount of parallelism and a perfect implementation it would be linear, so I’m not sure we should over-index on this. If I had more CPU cores, DDR5 memory (I have DDR4) or more memory channels, the results could be different; this is just a data sample of 1.

BTW can you also plot 4P cores and 4E cores? They have the same access to system memory, but are artificially limited in terms of compute (L1 cache is similarly limited; I think all L2 should be accessible, though indirectly; and they should still have access to all of the L3 cache without changes).

For 4P cores and 4E cores the relation is more linear, closer to 2.1.
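Numerically, from the timings above (taking the k21-k23 points for the 4P cores): 0.717 s → 1.434 s → 2.881 s, i.e. ratios of ≈ 2.00 and 2.01; for the 4E cores across k20-k23: 0.536 s → 1.135 s → 2.280 s → 4.640 s, i.e. ratios of ≈ 2.12, 2.01 and 2.04.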


On my machine (Apple M1 Pro with 10 cores: 8 performance and 2 efficiency, and LPDDR5 memory) the picture looks similar to yours, with the true base becoming 2.2 around k22 and higher:

We have revisited the reasoning and adversary assumptions we had in the Dilithium paper (from which the recommendation of k22 comes) and concluded that k20 is also fine. I’ll leave @Chen_Feng_2023 to explain better why.
Now, if we take the relation to be close to linear, one k22 table is worth 4 * k20 tables, so using a k22 table to mask 4 pieces does not give an attacker any advantage in on-demand plotting. Moreover, it actually makes an attacker work more: now they can’t choose to compute just 1, 2 or 3 k20 tables; they have to put in all the work to compute a full k22 table to get 4 pieces.
That said, we agree that we can use each k22 table to encode 4 pieces.
This would bring plotting time down 4x.
A 1 TiB plot would take 1.7 days with all your cores, 3.5 days on my machine, 4 days on your 4P cores and 6.5 days on 4E cores. Still not 24 hours, but much better UX.
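To illustrate what encoding several pieces per table could look like in the plotter loop, here is a minimal Rust sketch; every name in it (Table, generate, mask_piece, encode_pieces, PIECES_PER_TABLE) is a hypothetical stand-in, not the actual subspace-farmer or ab-proof-of-space API:

    // Hypothetical sketch only: amortizing one expensive proof-of-space
    // table over several pieces instead of building one table per piece.
    const PIECES_PER_TABLE: usize = 4;

    struct Table; // stands in for a k22 proof-of-space table

    impl Table {
        fn generate(seed: &[u8; 32]) -> Self {
            // The expensive step (~600 ms at k22 on all 13900K cores per
            // the benchmarks above), now done once per 4 pieces.
            let _ = seed;
            Table
        }

        fn mask_piece(&self, piece: &mut [u8], offset: usize) {
            // Encode one piece with proofs drawn from this table; a real
            // implementation would derive distinct proofs per `offset`.
            let _ = (piece, offset);
        }
    }

    fn encode_pieces(base_seed: &[u8; 32], pieces: &mut [Vec<u8>]) {
        for (chunk_index, chunk) in pieces.chunks_mut(PIECES_PER_TABLE).enumerate() {
            // A real plotter would derive a distinct seed per table from
            // `base_seed` and `chunk_index`; one seed is used here only to
            // keep the sketch short.
            let _ = chunk_index;
            let table = Table::generate(base_seed);
            for (offset, piece) in chunk.iter_mut().enumerate() {
                table.mask_piece(piece, offset);
            }
        }
    }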

I see. k22 will still mean that data retrieval from archival storage is very costly, but I guess we’ll have to live with it. Can you update the remaining issue in the Dilithium milestone on GitHub so we can get it done and closed?

I have written this into the spec in blue; please take a look. Also in Verification.

To provide some additional insight on this topic, let us revisit the security foundations for Chia and Subspace.

Loosely speaking, the foundation for Chia can be stated as: Given massive honest storage space \Omega, it is hard for the adversary to “match” the space \Omega (so that the adversary can do double-spending) in terms of the space-time tradeoff even with near-optimal strategies such as Hellman’s attacks.

This foundation applies to Subspace as discussed in our research paper. But more importantly, Subspace enjoys an additional security foundation: Given massive honest storage space \Omega, it is hard for the adversary to “match” the space \Omega (so that the adversary can do double-spending) in terms of the space-time-IO tradeoff under various strategies. In other words, the required IO for many attacks is too demanding for large \Omega.

Next, let us use on-demand plotting as a particular example of an attacking strategy. As explained by Dariia, the IO of k22 is about four times the IO of k20. So we can mask 4 pieces with k22, which does not give an adversary conducting on-demand plotting any advantage in terms of required IO. In other words, k22 masking 4 pieces is as good as k20 masking 1 piece against on-demand plotting.
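Spelling out the per-piece cost: IO(k22)/4 ≈ (4 * IO(k20))/4 = IO(k20) per masked piece, so the adversary’s on-demand plotting IO per piece is unchanged.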

After further consideration, we have noticed that k22 may be too slow for a farmer with a slower CPU to verify PoT, audit, produce a solution and build a block, and for that block to propagate through the network, all within our target 6-second block time. Furthermore, reusing tables complicates the implementation, so we have decided to revert to using a single table per piece, but with k increased to 20 from the current 17.