Calculating fraction of missing pieces

I am writing to discuss the fraction of missing pieces under various scenarios. We have explored two primary approaches:

  1. With a fixed number of pieces (Static).
  2. When the number of pieces increases over time (Dynamic).

\textbf{First:}
In the first experiment, we assume a total of 1,000,000 pieces and 100,000 farmers, where each farmer can store 1,000 pieces. Given this setup, the replication factor is 100, resulting in no missing pieces.

However, when we reduce the number of farmers to 10,000, the replication factor drops to 10. Consequently, we observed that the fraction of missing pieces becomes 116/1,000,000.
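For reference, here is a minimal sketch of how this static experiment can be reproduced (the parameter names are illustrative, and sampling with replacement is assumed for simplicity; the actual code is linked at the end of this post):

```python
import numpy as np

# Static scenario as a balls-and-bins experiment: every plotted piece is an
# independent uniform draw from the set of n_pieces "bins". Whether each
# farmer's 1,000 pieces are drawn with or without replacement makes a
# negligible difference at this scale.
rng = np.random.default_rng(0)

n_pieces = 1_000_000   # total pieces in history
n_farmers = 10_000     # 10,000 farmers -> replication factor 10
per_farmer = 1_000     # pieces stored by each farmer

draws = rng.integers(0, n_pieces, size=n_farmers * per_farmer)
counts = np.bincount(draws, minlength=n_pieces)
missing = np.count_nonzero(counts == 0)
print(missing, missing / n_pieces)  # on the order of e^{-10} of all pieces
```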

\textbf{Second:}
For the dynamic scenario where \( n \) increases progressively, we follow this scheme: from a current height n_0, the next height is n_0 plus a uniformly random value in the range [min, multiplier * n_0]. In this setup, every farmer joins the system at n = 1000. The objective is to devise a strategy that reduces the likelihood of data loss.
We assume n_max = 1,000,000 and min = 100.

Each farmer repeatedly computes the next height from its current one until n_max is reached.
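A minimal sketch of this height-update rule (integer uniform draws and treating the "latest height" as the last value reached before n_max are my assumptions; the exact rule is in the linked code):

```python
import numpy as np

rng = np.random.default_rng(0)

def latest_height(n0, n_max=1_000_000, minimum=100, multiplier=4):
    """Repeatedly apply n -> n + Uniform[minimum, multiplier * n] and return
    the last height reached before n_max."""
    n = n0
    while True:
        step = rng.integers(minimum, max(multiplier * n, minimum) + 1)
        if n + step >= n_max:
            return n
        n += step

# Every farmer joins at height 1000 in this basic dynamic scenario.
heights = [latest_height(1_000) for _ in range(100_000)]
```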

Multiplier = 4

With a multiplier set to 4, and assuming #farmers = 100,000 with each capable of storing 1000 pieces, the fraction of missing pieces comes to 9876/1,000,000.

The figure below is a histogram of the latest height of farmers.

When the number of farmers is reduced to 10,000, the fraction becomes 87794/1,000,000. This is a logical outcome given the decrease in the replication factor from 100 to 10.

Multiplier = 2

Setting the multiplier to 2 essentially gives each farmer a broader range of pieces to select from. This setting appears to be both more realistic and more practical.

For #farmers = 100,000, the fraction of missing pieces is 0.0064%. You can see which pieces were not selected by any farmer.

However, with a reduced count of #farmers = 10,000, the fraction of missing pieces rises to 0.0583%.

In conclusion, the results suggest that a multiplier of 2 is preferable. In a subsequent discussion, I will delve into potential modifications to the selection rule to further mitigate data loss.

You can also find the code here:


Thanks @saeid!

For the static case, it is a standard balls-and-bins problem, so the fraction of missing pieces is about \( e^{-r} \), where \( r \) is the replication factor. For the dynamic case with multiplier = 4, the fraction of missing pieces is about \( 1/r \). For the dynamic case with multiplier = 2, it is about \( 1/(2r) \).
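For completeness, the balls-and-bins estimate follows from the probability that a given piece is never chosen when \( rn \) pieces are plotted uniformly at random out of \( n \):

\( \Pr[\text{piece missing}] = \left(1 - \frac{1}{n}\right)^{rn} \approx e^{-r}. \)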

That is, our current sector expiry policy, together with uniform piece selection, leads to a relatively high fraction of missing pieces. This issue can be mitigated in two ways: (1) using better parameters for our sector expiry policy; (2) modifying our uniform piece selection. Saeid will report more on this tomorrow.

To mitigate the aforementioned issue of a high fraction of missing pieces, we designed the following four scenarios. The replication factor in these scenarios is 10.

Scenario 1

The maximum number of pieces is 2,000,000. We have two stages:

  1. 10,000 farmers join at heights drawn uniformly at random from the range [0, 1M].
  2. Another 10,000 farmers join at heights drawn uniformly at random from the range [1M, 2M].

Then, using the height-update formula n_0 + Uniform[min, multiplier * n_0], we calculated the latest height of each farmer, and each farmer then selects 1,000 pieces.
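Here is a minimal sketch of Scenario 1 under stated assumptions (each farmer selects its pieces uniformly from [0, latest height), and piece selection is simplified to sampling with replacement); the exact rule and the multiplier values used are in the linked repo:

```python
import numpy as np

rng = np.random.default_rng(0)

N_MAX = 2_000_000      # maximum number of pieces in Scenario 1
PER_FARMER = 1_000     # pieces stored by each farmer
MINIMUM = 100
MULTIPLIER = 2         # can be varied, e.g. 2 or 4

def latest_height(n0):
    # Apply n -> n + Uniform[MINIMUM, MULTIPLIER * n] until N_MAX is reached
    # and return the last height before N_MAX.
    n = n0
    while True:
        step = rng.integers(MINIMUM, max(MULTIPLIER * n, MINIMUM) + 1)
        if n + step >= N_MAX:
            return n
        n += step

# Stage 1: 10,000 farmers join at heights uniform in [0, 1M].
# Stage 2: another 10,000 farmers join at heights uniform in [1M, 2M].
joins = np.concatenate([
    rng.integers(0, 1_000_000, size=10_000),
    rng.integers(1_000_000, 2_000_000, size=10_000),
])

counts = np.zeros(N_MAX, dtype=np.int64)
for n0 in joins:
    h = latest_height(int(n0))
    picks = rng.integers(0, h, size=PER_FARMER)  # assumption: uniform over [0, latest height)
    np.add.at(counts, picks, 1)                  # handles repeated indices correctly

print("missing pieces:", np.count_nonzero(counts == 0))
```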


Scenario 2

In this scenario, we assumed that history is not important. The setup is similar to Scenario 1, but the first 10,000 farmers join the system at a constant height, say 1,000.


Scenario 3

In this scenario, we went a step further. The maximum number of pieces is 3,000,000, and the setup is like Scenario 1, except that a third batch of farmers joins at heights drawn uniformly at random from the range [2M, 3M].
We set 11 checkpoints in [2M, 3M] to track the number of missing pieces at each stage. These checkpoints are at heights {2M, 2.1M, 2.2M, ..., 3M}.


Scenario 4

We assumed that after a specific height, all of the farmers, from oldest to newest, re-select their pieces from scratch. For example, after a height of 3,000,000, all 30,000 farmers select their pieces from the range [0, 3M].
Because of its overhead, this scenario is not efficient; however, it gives us clearer insight.

You can see the results of these scenarios with different values of multiplier:

For Scenario 3, here is the outcome at each checkpoint when multiplier = 2.

CheckPoint 0: 3201
CheckPoint 1: 6224
CheckPoint 2: 9400
CheckPoint 3: 12421
CheckPoint 4: 15716
CheckPoint 5: 19814
CheckPoint 6: 26393
CheckPoint 7: 38224
CheckPoint 8: 60043
CheckPoint 9: 100999
CheckPoint 10: 176230

Also, you can find the code for all of the scenarios in this repo:

Related to this, we would like to choose the minimum required replication factor protocol parameter that ensures none of the pieces are ever missing.
Let \( n \) be the number of pieces in history and \( r \) be the replication factor; then \( m = rn \) is the number of plotted pieces.
Assuming all pieces are chosen for plotting i.i.d., this can be modeled via a multinomial distribution with \( m \) trials and \( n \) categories. Each piece is chosen each time with probability \( p = 1/n \), and the expected number of times each piece is chosen is \( m/n = r \).
The variance is \( mp(1-p) = r(1 - 1/n) \), and the standard deviation is the square root of the variance. For a large \( n \), which in our case is on the order of thousands to millions, \( \sigma \approx \sqrt{r} \).
Now, instead of approximating the number of missing pieces, let's approximate the number of pieces replicated fewer than 10 times for different values of \( r \), which we can do by measuring the tail of a normal distribution with mean \( \mu = r \) and standard deviation \( \sigma = \sqrt{r} \).
The z-score for a piece being selected fewer than 10 times is \( Z = \frac{10 - \mu}{\sigma} = \frac{10 - r}{\sqrt{r}} \), and the corresponding probability is given by the normal CDF \( \Phi(Z) \).
With \( r = 25 \): \( \Phi(Z) = 0.0013499 \), which is quite high: about 1,349 pieces per terabyte of history.
With \( r = 50 \): \( \Phi(Z) = 7.70863 \times 10^{-9} \), which is much better: in 100 terabytes of history, only about 1 piece will be replicated fewer than 10 times.
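For reference, a short Python check of these tail probabilities using the normal CDF (the threshold of 10 replications and the SciPy call are just for illustration):

```python
from math import sqrt
from scipy.stats import norm

def frac_below_threshold(r, threshold=10):
    """Normal approximation: probability that a piece is replicated fewer
    than `threshold` times, given mean r and standard deviation sqrt(r)."""
    return norm.cdf((threshold - r) / sqrt(r))

for r in (25, 50):
    print(r, frac_below_threshold(r))
# r=25 -> ~1.35e-3, r=50 -> ~7.7e-9, matching the figures above
```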

To put this into the perspective of a real network: currently, on the Gemini 3g testnet, farmers have pledged 6.2 PiB of storage. For the real replication factor to approach 50, the history size would need to grow above 120 TiB. For comparison, the whole of Ethereum is < 20 TiB.
Currently, our real replication factor is > 300,000.

Very good! Here is my suggestion: we can use the Poisson approximation instead of the Gaussian approximation. Then, the probability that a piece is replicated fewer than 10 times is given by \( \sum_{k = 0}^{9} \frac{e^{-r} r^k}{k!} \). I believe the final results will be very similar to yours.

Using the Poisson approximation you suggest, \( r = 25 \) yields 0.000221477, i.e., about 221 pieces may be replicated fewer than 10 times per terabyte of history, and with \( r = 50 \) it is \( 1.25961 \times 10^{-12} \), which is even lower than my estimate.

In percentages, \( r = 25 \) means 99.9779% of pieces will be replicated 10 or more times across the plots if the space pledged is close to being saturated. The actual replication factor will be higher, since this estimate does not account for the farmer piece cache or any excess storage pledged to the network.
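For reference, a short Python check of the Poisson tail suggested above (equivalently, scipy.stats.poisson.cdf(9, r)):

```python
from math import exp, factorial

def poisson_frac_below_threshold(r, threshold=10):
    """Poisson approximation: probability that a piece is replicated fewer
    than `threshold` times when the expected replication is r."""
    return sum(exp(-r) * r**k / factorial(k) for k in range(threshold))

for r in (25, 50):
    print(r, poisson_frac_below_threshold(r))
# r=25 -> ~2.21e-4, r=50 -> ~1.26e-12, matching the numbers above
```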