As in the Linux perf output, like you have in Faster chiapos (part 3) #1685 for k22:
k20
All cores (13900K):
chia/table/parallel time: [100.33 ms 101.38 ms 102.31 ms]
4P cores 5.5 GHz:
chia/table/parallel time: [1.4368 s 1.4439 s 1.4510 s]
4E cores 4.3 GHz:
chia/table/parallel time: [529.83 ms 536.08 ms 542.19 ms]
perf (all cores 13900K)
14 256,91 msec task-clock:u # 14,257 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
54 page-faults:u # 3,788 /sec
71 912 482 535 cpu_core/cycles/u # 5,044 G/sec (45,88%)
57 923 375 807 cpu_atom/cycles/u # 4,063 G/sec (55,01%)
136 309 783 589 cpu_core/instructions/u # 9,561 G/sec (45,88%)
109 822 003 041 cpu_atom/instructions/u # 7,703 G/sec (55,01%)
23 445 336 256 cpu_core/branches/u # 1,644 G/sec (45,88%)
18 912 831 887 cpu_atom/branches/u # 1,327 G/sec (55,01%)
241 471 587 cpu_core/branch-misses/u # 16,937 M/sec (45,88%)
186 722 727 cpu_atom/branch-misses/u # 13,097 M/sec (55,01%)
285 992 250 112 cpu_core/slots:u/ # 20,060 G/sec (45,88%)
124 379 684 456 cpu_core/topdown-retiring/u # 43,0% Retiring (45,88%)
33 795 603 678 cpu_core/topdown-bad-spec/u # 11,7% Bad Speculation (45,88%)
31 543 016 663 cpu_core/topdown-fe-bound/u # 10,9% Frontend Bound (45,88%)
99 631 237 645 cpu_core/topdown-be-bound/u # 34,4% Backend Bound (45,88%)
2 871 734 329 cpu_core/topdown-heavy-ops/u # 1,0% Heavy Operations # 42,0% Light Operations (45,88%)
32 492 570 282 cpu_core/topdown-br-mispredict/u # 11,2% Branch Mispredict # 0,5% Machine Clears (45,88%)
21 608 097 918 cpu_core/topdown-fetch-lat/u # 7,5% Fetch Latency # 3,4% Fetch Bandwidth (45,88%)
55 502 231 135 cpu_core/topdown-mem-bound/u # 19,2% Memory Bound # 15,3% Core Bound (45,88%)
69 058 932 cpu_core/cache-misses:u/ (49,21%)
85 642 280 cpu_atom/cache-misses:u/ (51,05%)
k21
All cores (13900K):
chia/table/parallel time: [275.92 ms 279.50 ms 283.09 ms]
4P cores 5.5 GHz:
chia/table/parallel time: [710.71 ms 717.20 ms 723.59 ms]
4E cores 4.3 GHz:
chia/table/parallel time: [1.1262 s 1.1347 s 1.1438 s]
perf (all cores 13900K)
5 212,34 msec task-clock:u # 5,212 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
163 528 page-faults:u # 31,373 K/sec
26 193 507 946 cpu_core/cycles/u # 5,025 G/sec (49,16%)
20 559 065 164 cpu_atom/cycles/u # 3,944 G/sec (53,01%)
44 899 330 740 cpu_core/instructions/u # 8,614 G/sec (49,16%)
36 178 677 910 cpu_atom/instructions/u # 6,941 G/sec (53,01%)
7 700 223 400 cpu_core/branches/u # 1,477 G/sec (49,16%)
6 232 549 211 cpu_atom/branches/u # 1,196 G/sec (53,01%)
79 699 486 cpu_core/branch-misses/u # 15,291 M/sec (49,16%)
64 036 973 cpu_atom/branch-misses/u # 12,286 M/sec (53,01%)
93 996 633 443 cpu_core/slots:u/ # 18,033 G/sec (49,16%)
37 947 855 990 cpu_core/topdown-retiring/u # 39,4% Retiring (49,16%)
9 900 532 328 cpu_core/topdown-bad-spec/u # 10,3% Bad Speculation (49,16%)
10 788 528 684 cpu_core/topdown-fe-bound/u # 11,2% Frontend Bound (49,16%)
37 568 653 415 cpu_core/topdown-be-bound/u # 39,1% Backend Bound (49,16%)
1 556 733 990 cpu_core/topdown-heavy-ops/u # 1,6% Heavy Operations # 37,8% Light Operations (49,16%)
9 387 364 492 cpu_core/topdown-br-mispredict/u # 9,8% Branch Mispredict # 0,5% Machine Clears (49,16%)
7 727 908 801 cpu_core/topdown-fetch-lat/u # 8,0% Fetch Latency # 3,2% Fetch Bandwidth (49,16%)
24 931 598 258 cpu_core/topdown-mem-bound/u # 25,9% Memory Bound # 13,1% Core Bound (49,16%)
81 999 429 cpu_core/cache-misses:u/ (48,09%)
48 107 253 cpu_atom/cache-misses:u/ (52,29%)
k22
All cores (13900K):
chia/table/parallel time: [587.62 ms 600.98 ms 614.84 ms]
4P cores 5.5 GHz:
chia/table/parallel time: [1.4257 s 1.4339 s 1.4440 s]
4E cores 4.3 GHz:
chia/table/parallel time: [2.2677 s 2.2795 s 2.2927 s]
perf (all cores 13900K)
12 782,93 msec task-clock:u # 12,783 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
397 837 page-faults:u # 31,123 K/sec
66 098 770 509 cpu_core/cycles/u # 5,171 G/sec (47,90%)
51 291 361 638 cpu_atom/cycles/u # 4,012 G/sec (52,50%)
112 191 683 640 cpu_core/instructions/u # 8,777 G/sec (47,90%)
90 742 909 221 cpu_atom/instructions/u # 7,099 G/sec (52,50%)
19 502 494 839 cpu_core/branches/u # 1,526 G/sec (47,90%)
15 551 768 595 cpu_atom/branches/u # 1,217 G/sec (52,50%)
199 930 102 cpu_core/branch-misses/u # 15,640 M/sec (47,90%)
155 375 492 cpu_atom/branch-misses/u # 12,155 M/sec (52,50%)
234 077 388 158 cpu_core/slots:u/ # 18,312 G/sec (47,90%)
84 722 306 151 cpu_core/topdown-retiring/u # 35,4% Retiring (47,90%)
21 153 511 076 cpu_core/topdown-bad-spec/u # 8,8% Bad Speculation (47,90%)
24 601 969 208 cpu_core/topdown-fe-bound/u # 10,3% Frontend Bound (47,90%)
108 622 759 602 cpu_core/topdown-be-bound/u # 45,4% Backend Bound (47,90%)
2 824 956 565 cpu_core/topdown-heavy-ops/u # 1,2% Heavy Operations # 34,3% Light Operations (47,90%)
20 390 220 573 cpu_core/topdown-br-mispredict/u # 8,5% Branch Mispredict # 0,3% Machine Clears (47,90%)
17 754 016 020 cpu_core/topdown-fetch-lat/u # 7,4% Fetch Latency # 2,9% Fetch Bandwidth (47,90%)
81 042 977 150 cpu_core/topdown-mem-bound/u # 33,9% Memory Bound # 11,5% Core Bound (47,90%)
74 300 211 cpu_core/cache-misses:u/ (48,21%)
91 901 717 cpu_atom/cache-misses:u/ (52,27%)
k23
All cores (13900K):
chia/table/parallel time: [1.2740 s 1.3021 s 1.3288 s]
4P cores 5.5 GHz:
chia/table/parallel time: [2.8655 s 2.8809 s 2.8980 s]
4E cores 4.3 GHz:
chia/table/parallel time: [4.6215 s 4.6398 s 4.6566 s]
perf (all cores 13900K)
14 044,82 msec task-clock:u # 14,045 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
416 937 page-faults:u # 29,686 K/sec
73 361 499 480 cpu_core/cycles/u # 5,223 G/sec (49,23%)
56 782 103 595 cpu_atom/cycles/u # 4,043 G/sec (51,12%)
122 933 494 607 cpu_core/instructions/u # 8,753 G/sec (49,23%)
98 193 000 946 cpu_atom/instructions/u # 6,991 G/sec (51,12%)
21 269 405 157 cpu_core/branches/u # 1,514 G/sec (49,23%)
16 966 085 419 cpu_atom/branches/u # 1,208 G/sec (51,12%)
200 824 045 cpu_core/branch-misses/u # 14,299 M/sec (49,23%)
153 984 388 cpu_atom/branch-misses/u # 10,964 M/sec (51,12%)
255 569 995 454 cpu_core/slots:u/ # 18,197 G/sec (49,23%)
91 444 633 486 cpu_core/topdown-retiring/u # 35,5% Retiring (49,23%)
20 594 474 412 cpu_core/topdown-bad-spec/u # 8,0% Bad Speculation (49,23%)
24 504 168 087 cpu_core/topdown-fe-bound/u # 9,5% Frontend Bound (49,23%)
121 288 246 751 cpu_core/topdown-be-bound/u # 47,0% Backend Bound (49,23%)
2 883 892 148 cpu_core/topdown-heavy-ops/u # 1,1% Heavy Operations # 34,3% Light Operations (49,23%)
19 784 338 153 cpu_core/topdown-br-mispredict/u # 7,7% Branch Mispredict # 0,3% Machine Clears (49,23%)
18 042 026 974 cpu_core/topdown-fetch-lat/u # 7,0% Fetch Latency # 2,5% Fetch Bandwidth (49,23%)
92 328 605 223 cpu_core/topdown-mem-bound/u # 35,8% Memory Bound # 11,2% Core Bound (49,23%)
103 688 009 cpu_core/cache-misses:u/ (49,05%)
90 148 097 cpu_atom/cache-misses:u/ (51,43%)
We know that with an increase of k by 1 the number of entries in a table doubles. So one k23 table has the same number of entries as 2*k22, 4*k21 and 8*k20 tables respectively. Note also that with each increase in k, every entry grows in size by 1 bit.
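To make the table-size scaling concrete, here is the arithmetic as a worked equation (assuming an entry is roughly k bits, per the 1-bit-per-k growth just noted):

entries(k) = 2^k, so entries(23)/entries(20) = 2^3 = 8
size(k) ≈ k * 2^k bits, so size(23)/size(20) ≈ (23 * 2^23)/(20 * 2^20) = 9.2

That is, raw table size grows slightly faster than the 2x-per-step entry count.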
Now we take the real data provided by Nazar above:
k20: 101.38 ms 19.2% memory bound
k21: 279.50 ms 25.9% memory bound
k22: 600.98 ms 33.9% memory bound
k23: 1302.1 ms 35.8% memory bound
We can plot it (blue - time in ms to build the table, orange - time spent memory bound, green - time spent memory bound if it doubled with each k increase):
We can see that with each k+1 we spend more than double the time and more than double the memory bound time compared to k. In other words, we are still more memory bound if we encode 2 pieces with one k+1 table than with 2 k tables. In fact, computing one k23 table we spend 13x the time in total and 24x the time being memory bound compared to k20, instead of the theoretical 8 = 2^23/2^20; k22 spends 6x the time in total and 9x the time being memory bound compared to k20, instead of the theoretical 4 = 2^22/2^20.
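For reference, a minimal Rust sketch with those medians and memory-bound percentages hard-coded (exact ratios vary slightly with how the perf percentages are rounded):

// Criterion medians (all cores, 13900K) paired with perf "Memory Bound" percentages.
fn main() {
    // (k, median table build time in ms, memory-bound fraction)
    let data = [
        (20u32, 101.38_f64, 0.192_f64),
        (21, 279.50, 0.259),
        (22, 600.98, 0.339),
        (23, 1302.1, 0.358),
    ];
    let (_, base_total, base_frac) = data[0];
    let base_mem = base_total * base_frac; // ~19.5 ms memory bound at k20
    for (k, total, frac) in data {
        let mem = total * frac; // estimated time spent memory bound
        println!(
            "k{k}: {:.1}x total, {:.1}x memory bound vs k20 (entry count: {}x)",
            total / base_total,
            mem / base_mem,
            1u64 << (k - 20)
        );
    }
}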
@Chen_Feng_2023 how would this affect the space-time tradeoff presented in the paper?
The memory bound time increase is exponential: a computation of a k+1 table spends ~e times more time waiting for the memory subsystem than a k table computation. In terms of total time, the increment is closer to 2.3x with every increase of k. Our previous theoretical tradeoff was 2^{k*q} (where q=7 is the number of tables) and we chose k=22; with the true base being 2.3 instead of 2, we can achieve the same tradeoff 2.3^{k*7} with k=19.
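A quick worked check of that substitution, in the same notation:

2.3^{7*k'} >= 2^{7*22}  =>  k' >= 22 * ln(2)/ln(2.3) ≈ 22 * 0.693/0.833 ≈ 18.3  =>  k' = 19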
Time and memory bound percentage are platform-dependent. Maybe if we had an infinite amount of parallelism and a perfect implementation it would be linear, so I’m not sure we should overindex on this. If I had more CPU cores, DDR5 memory (I have DDR4) or more memory channels, the results could be different; this is just a data sample of one.
BTW can you also plot 4P cores and 4E cores? They have the same system memory access, but are artificially limited in terms of compute (L1 cache should be similarly limited, I think all L2 should be accessible, but indirectly, and they should still have access to all L3 cache without changes).
For 4P cores and 4E cores the relation is closer to linear, with a per-step factor of about 2.1.
On my machine (Apple M1 Pro, 10 cores: 8 performance and 2 efficiency, LPDDR5 memory) the picture looks similar to yours, with the true base becoming 2.2 around k22 and higher:
We have revisited the reasoning and adversary assumptions we had in the Dilithium paper (from which the recommendation of k22 comes) and concluded that k20 is also fine. I’ll leave @Chen_Feng_2023 to explain better why.
Now if we take the relation to be close to linear, one k22 table is worth 4*k20 tables, so using a k22 table to mask 4 pieces does not give an attacker any advantage in on-demand plotting. Moreover, it actually makes an attacker work more: now they can’t choose to compute only 1, 2 or 3 k20 tables; they have to put in all the work to compute a full k22 table to get 4 pieces.
That said, we agree that we can use each k22 table to encode 4 pieces. This would bring plotting time down 4x. A 1TiB plot will take 1.7 days with all your cores, 3.5 days on my machine, 4 days on your 4P cores and 6.5 days on 4E cores. Still not 24 hours, but much better UX.
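As a rough sanity check of those figures (the piece size is an assumption here, not stated above; taking 1 MiB pieces and the all-cores k22 median of ~601 ms per table):

(2^20 pieces / 4 pieces per table) * 0.601 s per table ≈ 1.6 * 10^5 s ≈ 1.8 days

which lands near the quoted 1.7 days for all cores; the other figures scale the same way with the per-table times above.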
I see. k22 will still mean that data retrieval from archival storage is very costly, but I guess we’ll have to live with it. Can you update the remaining issue in the Dilithium milestone on GitHub so we can get it done and closed?
I have written this into the spec in blue; please take a look. It is also reflected in Verification.
To provide some additional insight on this topic, let us revisit the security foundations for Chia and Subspace.
Loosely speaking, the foundation for Chia can be stated as: Given massive honest storage space \Omega, it is hard for the adversary to “match” the space \Omega (so that the adversary can do double-spending) in terms of the space-time tradeoff even with near-optimal strategies such as Hellman’s attacks.
This foundation applies to Subspace as discussed in our research paper. But more importantly, Subspace enjoys an additional security foundation: Given massive honest storage space \Omega, it is hard for the adversary to “match” the space \Omega (so that the adversary can do double-spending) in terms of the space-time-IO tradeoff under various strategies. In other words, the required IO for many attacks is too demanding for large \Omega.
Next, let us use on-demand plotting as a particular example of attacking strategies. As explained by Dariia, the IO of k22 is about four times the IO of k20. So we can mask 4 pieces with k22, which does not give the adversary conducting on-demand plotting any advantage in terms of required IO. In other words, k22 with masking 4 pieces is as good as k20 with masking 1 piece for on-demand plotting.
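Put as a worked equation, using Dariia's measured ~4x factor between k22 and k20:

IO(k22) / 4 pieces ≈ (4 * IO(k20)) / 4 = IO(k20) per piece

so the per-piece IO cost of on-demand plotting is unchanged by the larger table.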
After further consideration, we have noticed that k22 may be too slow for a farmer with a slower CPU to verify PoT, audit, produce a solution, build a block and have that block propagate through the network, all within our target block time of 6 seconds. Furthermore, reusing tables complicates the implementation, so we have decided to revert to using a single table per piece, but that table will be increased to k20 instead of the current k17.