Calculating fraction of missing pieces

I am writing to discuss the fraction of missing pieces under various scenarios. We have explored two primary approaches:

  1. With a fixed number of pieces (Static).
  2. When the number of pieces increases over time (Dynamic).

\textbf{First:}
In the first experiment, we assume a total of 1,000,000 pieces and 100,000 farmers, where each farmer can store 1,000 pieces. Given this setup, the replication factor is 100, resulting in no missing pieces.

However, when we reduce the number of farmers to 10,000, the replication factor drops to 10. Consequently, we observed that the fraction of missing pieces becomes 116/1,000,000.
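For reference, here is a minimal sketch of how this static experiment can be reproduced (the parameter names are illustrative, and sampling with replacement is assumed for simplicity; the actual code is linked at the end of this post):

```python
import numpy as np

# Static scenario as a balls-and-bins experiment: every plotted piece is an
# independent uniform draw from the set of n_pieces "bins". Whether each
# farmer's 1,000 pieces are drawn with or without replacement makes a
# negligible difference at this scale.
rng = np.random.default_rng(0)

n_pieces = 1_000_000   # total pieces in history
n_farmers = 10_000     # 10,000 farmers -> replication factor 10
per_farmer = 1_000     # pieces stored by each farmer

draws = rng.integers(0, n_pieces, size=n_farmers * per_farmer)
counts = np.bincount(draws, minlength=n_pieces)
missing = np.count_nonzero(counts == 0)
print(missing, missing / n_pieces)  # on the order of e^{-10} of all pieces
```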

\textbf{Second:}
For the dynamic scenario where \( n \) increases progressively, we follow this scheme: from a current height n_0, the next height is n_0 plus a uniformly random value in the range [min, multiplier * n_0]. In this setup, every farmer joins the system at n = 1000. The objective is to devise a strategy that reduces the likelihood of data loss.
We assume n_max = 1,000,000 and min = 100.

Each farmer repeatedly computes the next height from its current one until n_max is reached.
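A minimal sketch of this height-update rule (integer uniform draws and treating the "latest height" as the last value reached before n_max are my assumptions; the exact rule is in the linked code):

```python
import numpy as np

rng = np.random.default_rng(0)

def latest_height(n0, n_max=1_000_000, minimum=100, multiplier=4):
    """Repeatedly apply n -> n + Uniform[minimum, multiplier * n] and return
    the last height reached before n_max."""
    n = n0
    while True:
        step = rng.integers(minimum, max(multiplier * n, minimum) + 1)
        if n + step >= n_max:
            return n
        n += step

# Every farmer joins at height 1000 in this basic dynamic scenario.
heights = [latest_height(1_000) for _ in range(100_000)]
```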

Multiplier = 4

With a multiplier set to 4, and assuming #farmers = 100,000 with each capable of storing 1000 pieces, the fraction of missing pieces comes to 9876/1,000,000.

The figure below is a histogram of the latest height of farmers.

When the number of farmers is reduced to 10,000, the fraction becomes 87794/1,000,000. This is a logical outcome given the decrease in the replication factor from 100 to 10.

Multiplier = 2

Setting the multiplier to 2 essentially gives each farmer a broader range of pieces to select from. This setting appears to be both more realistic and more practical.

For #farmers = 100,000, the fraction of missing pieces is 0.0064%. You can see which pieces were not selected by any farmer.

However, with a reduced count of #farmers = 10,000, the fraction of missing pieces rises to 0.0583%.

In conclusion, the results suggest that a multiplier of 2 is preferable. In a subsequent discussion, I will delve into potential modifications to the selection rule to further mitigate data loss.

You can also find the code here:


Thanks @saeid!

For the static case, it is a standard balls-and-bins problem, so the fraction of missing pieces is about \( e^{-r} \), where \( r \) is the replication factor. For the dynamic case with multiplier = 4, the fraction of missing pieces is about \( 1/r \). For the dynamic case with multiplier = 2, it is about \( 1/(2r) \).
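For completeness, the balls-and-bins estimate follows from the probability that a given piece is never chosen when \( rn \) pieces are plotted uniformly at random out of \( n \):

\( \Pr[\text{piece missing}] = \left(1 - \frac{1}{n}\right)^{rn} \approx e^{-r}. \)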

That is, our current sector expiry policy, together with uniform piece selection, leads to a relatively high fraction of missing pieces. This issue can be mitigated in two ways: (1) using better parameters for our sector expiry policy; (2) modifying our uniform piece selection. Saeid will report more on this tomorrow.

To mitigate the aforementioned issue of a high fraction of missing pieces, we designed the following four scenarios. The replication factor in these scenarios is 10.

Scenario 1

The maximum number of pieces is 2,000,000. We have two stages:

  1. 10,000 farmers join at heights drawn uniformly at random from the range [0, 1M].
  2. Another 10,000 farmers join at heights drawn uniformly at random from the range [1M, 2M].

Then, using the height-update formula n_0 + Uniform[min, multiplier * n_0], we calculated the latest height of each farmer, and each farmer then selects 1,000 pieces.
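Here is a minimal sketch of Scenario 1 under stated assumptions (each farmer selects its pieces uniformly from [0, latest height), and piece selection is simplified to sampling with replacement); the exact rule and the multiplier values used are in the linked repo:

```python
import numpy as np

rng = np.random.default_rng(0)

N_MAX = 2_000_000      # maximum number of pieces in Scenario 1
PER_FARMER = 1_000     # pieces stored by each farmer
MINIMUM = 100
MULTIPLIER = 2         # can be varied, e.g. 2 or 4

def latest_height(n0):
    # Apply n -> n + Uniform[MINIMUM, MULTIPLIER * n] until N_MAX is reached
    # and return the last height before N_MAX.
    n = n0
    while True:
        step = rng.integers(MINIMUM, max(MULTIPLIER * n, MINIMUM) + 1)
        if n + step >= N_MAX:
            return n
        n += step

# Stage 1: 10,000 farmers join at heights uniform in [0, 1M].
# Stage 2: another 10,000 farmers join at heights uniform in [1M, 2M].
joins = np.concatenate([
    rng.integers(0, 1_000_000, size=10_000),
    rng.integers(1_000_000, 2_000_000, size=10_000),
])

counts = np.zeros(N_MAX, dtype=np.int64)
for n0 in joins:
    h = latest_height(int(n0))
    picks = rng.integers(0, h, size=PER_FARMER)  # assumption: uniform over [0, latest height)
    np.add.at(counts, picks, 1)                  # handles repeated indices correctly

print("missing pieces:", np.count_nonzero(counts == 0))
```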


Scenario 2

In this scenario, we assumed that history is not important. The setup is similar to Scenario 1, but the first 10,000 farmers join the system at a constant height, say 1,000.


Scenario 3

In this scenario, we went a step further. The maximum number of pieces is 3,000,000, and the setup is like Scenario 1, except that a third batch of farmers joins at heights drawn uniformly at random from the range [2M, 3M].
We set 11 checkpoints in [2M, 3M] to track the number of missing pieces at each stage. These checkpoints are at heights {2M, 2.1M, 2.2M, ..., 3M}.


Scenario 4

We assumed that after a specific height, all of the farmers, from oldest to newest, re-select their pieces from scratch. For example, after a height of 3,000,000, all 30,000 farmers select their pieces from the range [0, 3M].
Because of its overhead, this scenario is not efficient; however, it gives us clearer insight.

You can see the results of these scenarios with different values of multiplier:

For Scenario 3, here is the outcome at each checkpoint when multiplier = 2.

CheckPoint 0: 3201
CheckPoint 1: 6224
CheckPoint 2: 9400
CheckPoint 3: 12421
CheckPoint 4: 15716
CheckPoint 5: 19814
CheckPoint 6: 26393
CheckPoint 7: 38224
CheckPoint 8: 60043
CheckPoint 9: 100999
CheckPoint 10: 176230

Also, you can find the code for all of the scenarios in this repo:

Related to this, we would like to choose the minimum required replication factor protocol parameter that ensures none of the pieces are ever missing.
Let \( n \) be the number of pieces in history and \( r \) be the replication factor; then \( m = rn \) is the number of plotted pieces.
Assuming all pieces are chosen for plotting i.i.d., this can be modeled via a multinomial distribution with \( m \) trials and \( n \) categories. Each piece is chosen each time with probability \( p = 1/n \), and the expected number of times each piece is chosen is \( m/n = r \).
The variance is \( mp(1-p) = r(1 - 1/n) \), and the standard deviation is the square root of the variance. For a large \( n \), which in our case is on the order of thousands to millions, \( \sigma \approx \sqrt{r} \).
Now, instead of approximating the number of missing pieces, let's approximate the number of pieces replicated fewer than 10 times for different values of \( r \), which we can do by measuring the tail of a normal distribution with mean \( \mu = r \) and standard deviation \( \sigma = \sqrt{r} \).
The z-score for a piece being selected fewer than 10 times is \( Z = \frac{10 - \mu}{\sigma} = \frac{10 - r}{\sqrt{r}} \), and the corresponding probability is given by the normal CDF \( \Phi(Z) \).
With \( r = 25 \): \( \Phi(Z) = 0.0013499 \), which is quite high: about 1,349 pieces per terabyte of history.
With \( r = 50 \): \( \Phi(Z) = 7.70863 \times 10^{-9} \), which is much better: in 100 terabytes of history, only about 1 piece will be replicated fewer than 10 times.
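For reference, a short Python check of these tail probabilities using the normal CDF (the threshold of 10 replications and the SciPy call are just for illustration):

```python
from math import sqrt
from scipy.stats import norm

def frac_below_threshold(r, threshold=10):
    """Normal approximation: probability that a piece is replicated fewer
    than `threshold` times, given mean r and standard deviation sqrt(r)."""
    return norm.cdf((threshold - r) / sqrt(r))

for r in (25, 50):
    print(r, frac_below_threshold(r))
# r=25 -> ~1.35e-3, r=50 -> ~7.7e-9, matching the figures above
```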

To put this into the perspective of a real network: currently, on the Gemini 3g testnet, farmers have pledged 6.2 PiB of storage. For the real replication factor to approach 50, the history size would need to grow above 120 TiB. For comparison, the whole of Ethereum is < 20 TiB.
Currently, our real replication factor is > 300,000.

Very good! Here is my suggestion: we can use the Poisson approximation instead of the Gaussian approximation. Then, the probability that a piece is replicated fewer than 10 times is given by \( \sum_{k = 0}^{9} \frac{e^{-r} r^k}{k!} \). I believe the final results will be very similar to yours.

Using the Poisson approximation you suggest, \( r = 25 \) yields 0.000221477, i.e., about 221 pieces may be replicated fewer than 10 times per terabyte of history, and with \( r = 50 \) it is \( 1.25961 \times 10^{-12} \), which is even lower than my estimate.

In percentages, \( r = 25 \) means 99.9779% of pieces will be replicated 10 or more times across the plots if the space pledged is close to being saturated. The actual replication factor will be higher, since this estimate does not account for the farmer piece cache or any excess storage pledged to the network.
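For reference, a short Python check of the Poisson tail suggested above (equivalently, scipy.stats.poisson.cdf(9, r)):

```python
from math import exp, factorial

def poisson_frac_below_threshold(r, threshold=10):
    """Poisson approximation: probability that a piece is replicated fewer
    than `threshold` times when the expected replication is r."""
    return sum(exp(-r) * r**k / factorial(k) for k in range(threshold))

for r in (25, 50):
    print(r, poisson_frac_below_threshold(r))
# r=25 -> ~2.21e-4, r=50 -> ~1.26e-12, matching the numbers above
```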