Still not working quite correctly, but the re-implementation of chiapos in Rust is faster.
Specifically, ~175ms for k17, a few ms faster than the first phase in C++ chiapos, and we don't do the other phases at all. k16 goes down to ~91ms.
Perf says it is quite heavily memory-bound, 20-24% typically (it would probably be worse with larger k, I think):
4.003866159 1 000,43 msec task-clock:u # 1,000 CPUs utilized
4.003866159 0 context-switches:u # 0,000 /sec
4.003866159 0 cpu-migrations:u # 0,000 /sec
4.003866159 38 984 page-faults:u # 38,967 K/sec
4.003866159 5 296 564 666 cpu_core/cycles/u # 5,294 G/sec
4.003866159 <not counted> cpu_atom/cycles/u (0,00%)
4.003866159 14 431 722 077 cpu_core/instructions/u # 14,426 G/sec
4.003866159 <not counted> cpu_atom/instructions/u (0,00%)
4.003866159 1 985 520 012 cpu_core/branches/u # 1,985 G/sec
4.003866159 <not counted> cpu_atom/branches/u (0,00%)
4.003866159 20 956 461 cpu_core/branch-misses/u # 20,948 M/sec
4.003866159 <not counted> cpu_atom/branch-misses/u (0,00%)
4.003866159 31 753 616 184 cpu_core/slots:u/ # 31,740 G/sec
4.003866159 13 324 066 398 cpu_core/topdown-retiring/u # 42,0% Retiring
4.003866159 2 864 051 655 cpu_core/topdown-bad-spec/u # 9,0% Bad Speculation
4.003866159 996 191 880 cpu_core/topdown-fe-bound/u # 3,1% Frontend Bound
4.003866159 14 569 306 249 cpu_core/topdown-be-bound/u # 45,9% Backend Bound
4.003866159 249 047 970 cpu_core/topdown-heavy-ops/u # 0,8% Heavy Operations # 41,2% Light Operations
4.003866159 2 864 051 655 cpu_core/topdown-br-mispredict/u # 9,0% Branch Mispredict # 0,0% Machine Clears
4.003866159 498 095 940 cpu_core/topdown-fetch-lat/u # 1,6% Fetch Latency # 1,6% Fetch Bandwidth
4.003866159 7 595 963 087 cpu_core/topdown-mem-bound/u # 23,9% Memory Bound # 22,0% Core Bound
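As a sanity check on the numbers above: as far as I can tell, the percentages in the right-hand column are just each topdown counter divided by slots. A minimal sketch in Rust, where the function and constant names are my own and not anything from perf or chiapos:

```rust
/// Share of pipeline slots attributed to one top-down category, in percent.
fn topdown_percent(category_slots: u64, total_slots: u64) -> f64 {
    category_slots as f64 / total_slots as f64 * 100.0
}

/// Instructions retired per core cycle.
fn ipc(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

// Counter values copied from the perf output above.
const SLOTS: u64 = 31_753_616_184;
const MEM_BOUND: u64 = 7_595_963_087;
const BE_BOUND: u64 = 14_569_306_249;
const INSTRUCTIONS: u64 = 14_431_722_077;
const CYCLES: u64 = 5_296_564_666;
```

`topdown_percent(MEM_BOUND, SLOTS)` comes out at about 23.9, matching the "Memory Bound" figure, and `ipc(INSTRUCTIONS, CYCLES)` is roughly 2.7 instructions per cycle. The 22,0% "Core Bound" figure appears to be backend bound minus memory bound (45,9 - 23,9).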
The code could still benefit from some SIMD instructions for bit shifts, but the benefit looks marginal: the shifts operate on small, cache-friendly data structures, so there shouldn't be much room left there.
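To give an idea of the kind of operation meant here, this is a rough scalar sketch of a bit shift over a small byte buffer. It is a hypothetical helper written for illustration, not code from chiapos or from the Rust re-implementation, and the big-endian layout is an assumption:

```rust
/// Shift a small big-endian bit string left by `shift` bits (0..8), in place.
/// Hypothetical helper for illustration only.
fn shift_left_bits(bytes: &mut [u8], shift: u32) {
    assert!(shift < 8, "only sub-byte shifts are handled in this sketch");
    if shift == 0 {
        return;
    }
    // Bits pushed out of the top of a byte become the low bits of the
    // byte before it, so walk from the last byte to the first.
    let mut carry = 0u8;
    for byte in bytes.iter_mut().rev() {
        let next_carry = *byte >> (8 - shift);
        *byte = (*byte << shift) | carry;
        carry = next_carry;
    }
}
```

For example, shifting `[0x01, 0x80]` left by one bit gives `[0x03, 0x00]`. With buffers of only a handful of bytes, a loop like this stays in registers and L1 cache, which is why SIMD is unlikely to buy much.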
Parallelism only hurts with such narrow timings and a large number of elements being processed, at least for the k that we have.