I have a number of questions about the storage and retrieval aspects of Subspace.
As I understand it, anyone can put anything (object/file/etc.) to the network and will get back a content-addressed hash, and anyone can also get the object by that content ID. I understand there is no fee implementation in the current test network, so some of these questions may be misinformed …
Search & Retrieval Fees
Presumably, when calling put, you will also pay a fee - whatever the current network-calculated storage fee may be. I don’t see anything regarding fees on retrieval, and the storage pricing mechanisms described all seem to be focused on storage, which may be a risk. Both find and get may be called in a variety of scenarios to DoS the network unless a fee is charged for retrieval. So, question one: is a retrieval fee contemplated, and how would it work relative to the storage cost calculation mechanism? If there are no retrieval fees, how are the various DoS vectors thwarted (e.g. calling millions of instances of get on very large files, searching for large files, etc.)?
Storage Redundancy
If two separate users put the same content (which would resolve to the same content address), do both users pay the fee? Is the second told that the content is already stored, with no fee charged? Since content is stored within Subspace blocks, perhaps the content actually gets duplicated?
Retrieval Efficiency
Won’t retrieval be considerably slower than in all of the other distributed storage mechanisms? I understand the content is stored in <4k chunks that are wrapped in a Subspace block, so there is presumably non-negligible overhead in the envelope, and each segment must be retrieved, unwrapped, and then assembled … all distributed options have to do this, but I expect the ratio of useful data to envelope overhead to be substantially worse in Subspace. Is this correct, and is there an expectation regarding retrieval performance differences inherent to this approach?
Farmer Transaction Verification
My understanding is that farmers simply validate that the transaction is signed and that the account can cover the fee. What happens to the fee for transactions that are invalid? If it is not assessed, does that open an attack vector: publishing many signed but invalid transactions?
A lot of these questions seem to originate from the assumption that Subspace is an object storage protocol. It is not; it is just a blockchain that is designed to actually scale. As such, you put whatever data you want into regular blockchain transactions, and it becomes retrievable later.
This has not been an issue in the many other networks deployed in the real world. Literally any blockchain that supports syncing from genesis by anyone for free essentially has this issue.
You don’t pay a fee for storing some object; you just pay a fee for the transaction to be accepted by the network. If it happens to contain an object that was already submitted before, well, then you just wasted some credits storing it a second time.
You can query the network before deciding whether to submit a transaction or not.
The description isn’t quite correct in terms of how it is actually structured, but yes, right now data is essentially chunked into 4k pieces and you assemble objects from those. There are some exciting improvements coming in the future that will improve this to the point of not really being comparable, but that is about as much as I can share right now.
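To make the chunk-and-assemble flow concrete, here is a minimal sketch. The piece size, function names, and layout are illustrative assumptions for this post, not the actual Subspace implementation (which wraps pieces in additional structure):

```typescript
// Illustrative only: split an object into fixed-size pieces and
// reassemble it. Real clients fetch pieces from the network and
// verify them against the archival history.
const PIECE_SIZE = 4096; // assumed 4k piece size

function chunk(data: Uint8Array): Uint8Array[] {
  const pieces: Uint8Array[] = [];
  for (let offset = 0; offset < data.length; offset += PIECE_SIZE) {
    pieces.push(data.slice(offset, offset + PIECE_SIZE));
  }
  return pieces;
}

function assemble(pieces: Uint8Array[], objectLength: number): Uint8Array {
  const out = new Uint8Array(objectLength);
  let offset = 0;
  for (const piece of pieces) {
    // Copy only as many bytes as the object still needs.
    const take = Math.min(piece.length, objectLength - offset);
    out.set(piece.slice(0, take), offset);
    offset += take;
  }
  return out;
}
```

A 10,000-byte object would span three pieces here, with the final piece only partially useful data, which is exactly the envelope-overhead concern raised above.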
Perhaps I am misunderstanding something, but I think there are fundamental differences, and these comparisons may not hold. Most comparable blockchains expect each full node to store the history, so this scenario does not exist other than during bootstrap, and there doesn’t seem to be an incentive to simply query other nodes for data.
The key comparisons that do have this storage feature - Filecoin, Storj, and Sia - all seem to have mechanisms for retrieval (or bandwidth) fees.
There is no obvious or significant incentive to randomly or repeatedly download a set of ETH blocks. (And the scenarios that come to mind, such as light clients, have evolved a retrieval market to support them through Infura, etc.) There may be many and varied reasons and incentives to do so with Subspace - regardless of whether it is an object storage protocol by design or just happens to function like one.
And if I understand this correctly, it will be stored a second time by the network as well, primarily because the containing block(s) are unique, even though the content address is the same. Is that correct?
And if so, does it matter when searching that many duplicates may be found? Does the network simply return the first object found? Are there any unique risks to duplicate storage at large scale?
Tying back to the retrieval cost thread, this would seem to imply a good practice of search-before-write - and when search/read is free (performance costs excepted), that is another use case that may drive additional retrieval volume.
My understanding is that farmers simply validate that the fee can be covered and that the transaction is signed. I am asking whether the fee is paid if the transaction turns out to be invalid when the executors actually process it.
I would assume the fee would be charged to prevent DoS via invalid transactions; however, I think I read or heard something indicating the fee would not be paid. (Sorry, I cannot find the reference right now.) I am asking whether invalid transactions still pay the fee to the executor that processes the transaction.
BitTorrent and Arweave, off the top of my head, have no such mechanisms and have worked just fine for many years. As long as retrievability does not burden farmers too much, we assume it will work just fine, and the mentioned systems have years of production use to prove that.
It will certainly be stored twice, yes.
The network doesn’t return objects. When something gets into the archival history, an object mapping is created. That object mapping says where in the global history the object is located; it is up to the client to retrieve it by fetching the correct pieces and assembling the object from them. If an object is stored twice, it just means there are two perfectly valid offsets at which you can retrieve the same exact object.
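A sketch of what such a mapping might look like. The field and function names here are hypothetical, chosen just to illustrate the "two valid offsets for one content hash" point; they are not the actual Subspace data structures:

```typescript
// Hypothetical shape: a mapping locates an object within the
// global archival history by piece index and byte offset.
interface ObjectMapping {
  pieceIndex: number; // which piece of the archival history
  offset: number;     // byte offset of the object within that piece
}

// One content hash can map to multiple locations when the same
// object was submitted (and therefore archived) more than once.
const mappings = new Map<string, ObjectMapping[]>();

function recordMapping(hash: string, m: ObjectMapping): void {
  const list = mappings.get(hash) ?? [];
  list.push(m);
  mappings.set(hash, list);
}
```

Under this model, a duplicate submission does not confuse retrieval: a client can follow either mapping and end up with byte-identical data, since the content hash pins down the result.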
You just need to check whether an object mapping exists for it; retrieving the whole object is not needed. Such a lookup is just a DHT lookup, similar to a lookup in BitTorrent’s Mainline DHT, where queries are, again, served on a tit-for-tat basis and don’t have other explicit incentives.
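In other words, search-before-write only touches mapping metadata, never object bytes. A toy sketch of that existence check, using an in-memory `Map` as a stand-in for the DHT (the key scheme and names are assumptions for illustration, not the real protocol):

```typescript
// Stand-in for the network DHT; real lookups would be remote
// Kademlia-style queries, not a local Map.
const dht = new Map<string, unknown>();

function mappingKey(hash: string): string {
  return `mapping:${hash}`; // illustrative key scheme
}

async function objectExists(hash: string): Promise<boolean> {
  // Resolving the mapping returns only location metadata,
  // never the object bytes, so this check is cheap.
  return dht.has(mappingKey(hash));
}
```

So the cost asymmetry works in the client’s favor here: the existence check is a small metadata query, while the expensive part (piece retrieval and assembly) only happens when the client actually wants the data.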
In the newer model, executors validate transactions and make sure the fees are paid. If they perform incorrect validation, they are punished for misbehavior.
I think this makes sense and perhaps eliminates most of the risk. It’s not a free ride for a client to request data - they have to do the assembly. I thought it was less involved, judging from the js api signatures.
While BitTorrent doesn’t have the same economic incentives, Arweave presumably does - so that’s a fair point. I guess we’ll see.
Thank you Nazar for taking the time to respond to these and clarify - it has been helpful.
The API tries to hide the complexity behind a nice-to-use interface and actually uses the farmer’s RPC right now, but it will switch to retrieving individual pieces once the DSN is exposed.