Replies: 30 comments
-
For option 1, why is the local index removed? Is there a local index for the sealed copy that is separate from the unsealed copy and this operation would just remove the local index for the unsealed copy?
-
One general observation is that the data lifecycle is bifurcating into sealed and unsealed data, which adds complexity. The more we can maintain the same state across sealed and unsealed data, the simpler it will be to reason about these systems. In that vein, I lean towards option (1) with the assumption that SPs only remove unsealed copies when also removing sealed copies. This may imply expiration and removal of sectors should also require the removal of unsealed copies, and the time before the unsealed copy gets deleted is insignificant in the grand scheme of things. Why would an SP hold on to unsealed copies of data that they are no longer storing on-chain?

I'm not clear how the repair functionality works - is it to re-index unsealed copies, or to find unsealed copies that are indexed? If this is possible, another idea is to trigger a repair from a failed retrieval attempt. But again, I'm not sure how the SP would know whether or not they have the unsealed copy of the requested CID if they can't find it in their local index.
-
When a client queries the SP to ask for the data, the SP
If the unsealed copy has been removed, it's not possible to serve the deal's data. So we should return a "not found" response to queries at step 1.
-
I think the answer depends on our expectations of unsealing. The system was designed with the expectation that unsealing would soon become cheap and fast, however that hasn't happened in practice. Today it takes multiple hours to unseal data. If we assume that's not going to change soon, then in practical terms retrievability is unrelated to sealed data. So for the purposes of retrieval, what matters is the unsealed copy, and keeping the index in sync with the unsealed copy.
They may have separate contractual obligations for retrieval vs storage, and they may earn money from retrieval (e.g. serving popular data). I agree that this is more complicated conceptually, so we need to decide if it's worth the additional complexity to allow those use cases.
It is to re-index unsealed copies. There may not be an index for an unsealed copy of data if
Note also that indexes may become corrupted (we've seen this happen in practice with dagstore indexes) so we want a way to detect and repair them.
-
Exactly this. Option 2 is the thing that muddies the water here, and why I advocate for just sticking with option 1 until we have a very clear story with good UX around handling retrieval for sealed data. Option 1 makes the logic pretty straightforward: "For all unsealed data, keep the local indexes up to date, and for all unsealed copies marked for index announcement, keep network indexers up to date." We don't need to care about sealed data at all for indexing purposes. It doesn't matter if the sealed sector is expired/removed/etc.; we only ever care about the unsealed data, and once it's gone we clean up the PieceDirectory (soon replacing the dagstore).
To help clarify this:
+1. We need to assume corruption is going to happen; we're dealing with a lot of data. Automating a "scan and repair" approach could help us mitigate failures and reduce some operational overhead for SPs.
-
If the sealed data is not tracked in the Network Indexer and not in the local index:
-
I'm going to assume that by CID you mean the actual content of the cid for these (correct me if that's wrong). For both of these, I'll start by addressing the sealed data problem. If there is sealed data that a client wants back, there are two options for getting it.
The client should ideally know the PieceCID containing their content. This can be looked up on chain to find the SP, and then the flow from the previous scenario can occur. If for some reason the client does not know the PieceCID their content is in, there's no reasonable way to find the data. The other option, as described in option 2 in the issue description, would be:
-
This strikes me as a problem we should be thinking through. For example, if I know a content CID is stored with an SP for archival or backup purposes, I might not know the Piece CID but would still want to find it, have it unsealed and retrieve it. Rather than deleting references to sealed data, shouldn't we consider keeping CID to Piece mappings and noting whether there is a sealed copy and unsealed copy?
-
This is option 2 mentioned in the description. It has better UX properties for clients, but definitely adds some complexities. Let's walk through the lifecycle of what that would look like in more depth. Quickly though to clarify one thing (which we should ensure #689 covers):
The Lifecycle of Indexing w/ retained indexes for sealed data

🚧 Indexing a new Deal

This is mostly the same as option 1 in the description but there are some additional things we need to think about.
🚧 Deletion of an unsealed copy
🚧 A sector is unsealed
🚧 Expiration of a sector
🚧 Removal of a sector

If the sector is removed but an unsealed copy is still available, we need to determine the expected behavior here. As @dirkmc mentioned above, storage and retrieval could be negotiated separately, in which case we should keep indexes around as long as there is still an unsealed copy. So the logic would be:
🚧 Repairing Indexes

Same as option 1 above; however, the integrity of indexes for sealed-only copies could not be checked without unsealing the data. Assuming appropriate replication/backup of the PieceDirectory once it's released, this should be a minimal risk.
-
fwiw I'm fine with either option from a retrieval perspective as long as I can count on either of these things being true (which map to the 2 options): (1) having some basic assurance (not rock solid of course) that the SPs the indexer tells me have a CID have an unsealed copy, or (2) that I can count on the sealed/unsealed flag in the indexer metadata.

I suspect, though, that if we go with option 1, it's going to be harder to put sealed data back into the indexers at a future date once we have sealed->unsealed request/retrieval UX sorted out, because by that time the assumption will be baked into the indexer data that everything in it is unsealed, so we have a compounded UX problem of ensuring that requests to the indexer API or responses from it account for sealed status. However, if we go with option 2 today, then that flagging is in place from the start.

Maybe we should have a cursory discussion about unsealed request/retrieval UX to cover that base? Maybe it ends up being a simple matter that we can put on a roadmap and know it's not going to be a massive job (e.g. a new endpoint that takes a retrieval-like proposal, that I can ping with my unsealing request, and that returns a status code and a % of unsealed progress, so I can ping it repeatedly to see how it's going).
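To make that polling idea concrete, here is a minimal sketch of such an endpoint. Everything in it is hypothetical: the path, the JSON shape, and the `lookupStatus` stand-in are assumptions for illustration, not an existing Boost API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// UnsealStatus is a hypothetical response body for an unseal-progress poll.
type UnsealStatus struct {
	PieceCID string  `json:"pieceCid"`
	State    string  `json:"state"`    // e.g. "queued", "unsealing", "complete"
	Progress float64 `json:"progress"` // 0.0 - 1.0
}

// lookupStatus stands in for the SP-side unseal scheduler; hardcoded here.
func lookupStatus(pieceCID string) (UnsealStatus, bool) {
	if pieceCID == "" {
		return UnsealStatus{}, false
	}
	return UnsealStatus{PieceCID: pieceCID, State: "unsealing", Progress: 0.42}, true
}

// handler answers GET /unseal-status?piece=<PieceCID> with JSON progress,
// so a client can poll repeatedly to watch the unseal advance.
func handler(w http.ResponseWriter, r *http.Request) {
	st, ok := lookupStatus(r.URL.Query().Get("piece"))
	if !ok {
		http.Error(w, "piece not found", http.StatusNotFound)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(st)
}

func main() {
	http.HandleFunc("/unseal-status", handler)
	fmt.Println("sketch only; a real SP would serve this")
	// http.ListenAndServe(":8080", nil) // left commented so the sketch terminates
}
```

A client would simply re-issue the GET until `state` is `"complete"`, then fall back to the normal retrieval flow.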
-
Based on discussions this week I think option 2 is likely to win out, as it does have the better usability properties. There is a good chunk of integration work we need to do here between Lotus and Boost to get better visibility into Sealed and Unsealed state changes to manage the indexes well. We need it regardless for the unsealed data, but will also need to account for sealing state changes for option 2.
This shouldn't be the case. For option 1, the indexer only has the notion of unsealed data. If the unsealed copy goes away, we just remove that whole deal set from the indexers. For option 1 you don't need to filter on retrieval, because all indexes are unsealed. You do need to filter in option 2, because they may be sealed or unsealed. The only bit that should be "harder" is that all the indexes have to be re-ingested for that deal, instead of just being able to update the metadata for the deal.
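For illustration, the client-side filtering that option 2 would require might look like this sketch. The `ProviderResult` shape is invented; real IPNI results carry protocol-specific metadata bytes rather than a plain boolean.

```go
package main

import "fmt"

// ProviderResult is a simplified stand-in for one indexer query result.
type ProviderResult struct {
	Provider string // SP identity, e.g. a miner address
	Unsealed bool   // under option 2, metadata says whether an unsealed copy exists
}

// retrievableNow keeps only providers that can serve the data without an unseal.
// Under option 1 this filter is unnecessary: everything indexed is unsealed.
func retrievableNow(results []ProviderResult) []ProviderResult {
	var out []ProviderResult
	for _, r := range results {
		if r.Unsealed {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rs := []ProviderResult{{"f01234", true}, {"f05678", false}}
	fmt.Println(len(retrievableNow(rs))) // 1
}
```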
I think it's a good call; while it's top of mind, I'll open a discussion for it.
-
Started an open discussion in #1027. Not a priority, but wanted to get some thoughts written down while it's fresh in my mind.
-
Some data points to consider.
We definitely store all the piece CIDs of the CAR files involved. On a more granular level, since a CAR file can contain files (chunks) of multiple "client filenames", we also store the CID of each of these chunks to allow for partial retrievals.
So, Boost does not have to be the source of truth on where to find individual sealed CIDs. The software being built on top of Boost is not expecting that and maps the user file to a Piece.
-
My point was that if we choose option 1 now, and then later fix up the UX issues and decide we need sealed-only data put back into the indexers, it's going to be harder to expand the scope of indexer data later (and therefore change query patterns for users) than it is to keep it expanded today but carefully flagged for user consumption (option 2). Maybe not though; maybe by the time we get to that, the number of users of the indexers is still quite small and it's a trivial change. We could also make queries more explicit, like having the endpoint identifier say "unsealed" today so it's clear that's all you're getting, so any future addition has to introduce a new identifier if you want all the data.
-
Some thoughts on my side: for clients that are storing directly with SPs (aka not going through some onboarding tooling like Estuary, web3.storage, etc.), it appears they implicitly trust the SPs to help them track where all their data is / which pieces they are in. It's unclear to me how sophisticated the tracking tools are for SPs, as the variance in SP sophistication can be large. For unannounced data sets, they likely have their own systems in place already.

When looking at this from a broader lens, I think it makes sense to have option 2 to discover sealed data as well, but I struggle a bit to figure out a real use case for this. The cases I'm thinking about are super broad, such as FVM / compute over data / other longer-term developments in which additional use cases come up, and clients interacting with the network want to find the existence of data on the network (whether sealed or unsealed) or have some governance around paying for / maintaining sealed/unsealed data.

For the existing public (announced) data retrieval / data access that we are seeing, I am wondering if there are any cold storage type use cases. I know there are SPs who have cold storage type offerings (such as PiKNiK), but they are likely for client data and not for public data. In this sense, to enable current retrieval flows smoothly, option 1 makes more sense.

The callout I would like to make (which has been mentioned earlier as well) is that once we establish/push a pattern in the ecosystem, it will require a lot of work to change, which is why I'm leaning a bit more towards option 2 even though it is the more complex option.
-
It would be helpful to get more information on usage of the network indexers etc., though not sure if we can have visibility into where the requests are coming from / a meaningful breakdown (such as IPFS gateways, SPs, data on ramps, etc.)
-
I propose an Option 3:
Interested in @willscott 's and @masih 's views.
-
I think the states you propose make sense, @LaurenSpiegel
-
@LaurenSpiegel I think that's a good proposal / next step given the current state of the network, and provides enough optionality for the future :)
-
@LaurenSpiegel, on the option 3 proposal:
Thoughts on how it occurs to me:
-
@masih , I believe some SPs delete the unsealed copies immediately. Others likely do it when they need more disk space. Right now, there is very little to no resurrection by unsealing, so the churn rate is likely low. However, we did recently learn that some SPs are sharing unsealed copies, so even if they deleted their local copy, it does not mean that they won't honor a retrieval request (and cache locally again once they've served that retrieval?). We are getting further info on how they are doing this, but it seems like something we should keep in mind: we should not remove a reference to the SP having access to the unsealed data just because boost/lotus doesn't see it on disk at any point in time.
-
Ouch. Not great for retrieval. Though, understandable as far as the (current) game theory goes?
Agreed. Continuing the Option 3 discussion, based on low churn level while keeping in mind the need for reducing complexity on Boost side, would it make sense to change Option 3.2 to:
-
On changing 3.2 to keep the indexer up to date on the state of unsealed data: I would only want to do this if there is a real impact on TTFB. Is this really the case, especially with bitswap, where multiple addresses can be checked and it's noisy anyway? It's possible some reputation schemes might want to take into account whether the SP serves data it indexes; if we clean up the records, we remove the ability to do that. Though I agree that if there's a real impact on TTFB, it's worth the sacrifice.
-
I don't know if there are autoretrieve metrics that would answer that. Maybe @hannahhoward or @rvagg know? I am curious how we would know this, and whether we can truly measure the effect, if we are returning a generic "Not Found" in response to the unsealed copy being deleted. So maybe the first iteration here is to change error codes such that we can observe the effect of this on TTFB.
I am not sure I follow; how would records reflecting the true state remove the ability to measure reputation?
-
I am sorry, I must have clicked close issue by mistake!
-
Totally agree we should have clearer error messaging if possible. Autoretrieve will not help answer performance on bitswap though.
The idea is that SPs would potentially stop announcing CIDs so that they don't get hit with retrievals for them, and then don't get dinged on reputation for not serving them. The counter is that a thorough reputation system needs a better source of what should be indexed, then check that and check retrieval (but not all reputation checkers might have that better source, and having more signals is better than fewer).
Could have been a Freudian click. 😄 We should discuss this in real time in the new year to get to some resolution.
-
We have a sync discussion to talk about this later today, so to quickly summarize we're effectively deciding on option 2 or the newly proposed option 3. As far as I see it, option 3 is a partial implementation of option 2. We still care about announced indexes for sealed data, except option 3 isn't notifying the indexers when unsealed copy removals are detected. As I see these as progressive implementations, I'm not overly opposed to starting with option 3, but I do think the UX here is bad for retrieval clients:
It's not a payment or access problem, it's an eventual consistency problem that will never become consistent. From a retrieval client standpoint, I am now inclined to not trust what's in IPNI if that data never gets updated. It's not IPNI's fault, but will likely get some blame as it's the first step in the retrieval process. If Boost can't detect an unsealed copy, that data isn't retrievable and someone has to manually intervene to make it so. What is option 3 optimizing for?
I don't understand the first part of this: how are they sharing unsealed copies in a way that allows retrieval to work without manual intervention?
-
Based on live discussion today, we propose to:
@brendalee to set up sync with lotus team |
-
Note for implementation: ensure SPs can manage pricing/access for retrievals where a sector has expired. An SP may wish to change retrieval requirements until a sector is renewed. This should be doable with deal filters, but we should ensure that flow works well and that it's clearly documented. Example: SP Sally is serving free retrievals of Piece
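As a sketch of how an SP might enforce this with a deal filter: Boost-style filters are external commands that receive a JSON proposal on stdin and accept or reject via exit code. The field names below, and the `sectorExpired` flag in particular, are hypothetical; a real filter would inspect whatever fields the actual proposal JSON exposes.

```go
// retrievalfilter is a sketch of an external retrieval filter binary.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Proposal holds only the fields this sketch cares about (invented names).
type Proposal struct {
	PieceCID      string `json:"pieceCid"`
	SectorExpired bool   `json:"sectorExpired"` // hypothetical flag
}

// accept rejects free retrievals for pieces whose sector has expired,
// pushing those clients to renegotiate until the sector is renewed.
func accept(p Proposal) bool {
	return !p.SectorExpired
}

func main() {
	p := Proposal{PieceCID: "example-piece"} // default demo proposal
	_ = json.NewDecoder(os.Stdin).Decode(&p) // overlay stdin JSON if provided
	if !accept(p) {
		os.Exit(1) // non-zero exit = reject the retrieval
	}
	fmt.Println("accept", p.PieceCID)
}
```

The SP would point their retrieval-filter config option at this binary; swapping the policy in `accept` changes behavior without touching Boost itself.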
-
As we've reached consensus here and as I should have originally created this as a discussion thread, I'm going to move this to the Discussion board and create an issue to track the engineering effort here.
-
The purpose of this issue is to clarify and discuss the various potential state changes for indexing over the lifecycle of a deal that has been marked for indexing. This is not intended to cover the existing lifecycle of indexing, but the desired lifecycle. Once solidified, we can create accompanying issues to resolve discrepancies in the current implementation to match the desired state.
Legend
The Lifecycle of Indexing
🍏 Indexing a new Deal
Decision: deals explicitly marked with `SkipIPNIAnnounce=true` via PSD 1.2.1 should NOT be announced. All other deals should.

Note: Not covering how deals are identified for indexing in this section, as there is a separate effort to solidify requirements around that. See #689 and filecoin-project/notary-governance#666 for more details.
For discussion purposes, let’s assume that once the above issues are complete, there will be some way to identify if a specific deal should be indexed or not, and that there will be a mechanism to account for this for existing deals.
When a new deal has been successfully published, if an unsealed copy exists and the deal is marked for indexing, it should be immediately registered with the index provider/marked for indexing.
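The rule above can be sketched as a pair of predicates. The `Deal` struct and its field names are illustrative only, not Boost's actual deal type:

```go
package main

import "fmt"

// Deal captures just the fields relevant to the indexing decision.
// Field names are invented for illustration.
type Deal struct {
	Published        bool // deal successfully published on chain
	HasUnsealedCopy  bool // an unsealed copy of the piece exists
	SkipIPNIAnnounce bool // client asked to skip network-indexer announcement
}

// shouldIndexLocally: every published deal with an unsealed copy gets a
// local index, regardless of announcement preferences.
func shouldIndexLocally(d Deal) bool {
	return d.Published && d.HasUnsealedCopy
}

// shouldAnnounce: announce to network indexers only when the deal is
// locally indexable and was not marked SkipIPNIAnnounce at proposal time.
func shouldAnnounce(d Deal) bool {
	return shouldIndexLocally(d) && !d.SkipIPNIAnnounce
}

func main() {
	d := Deal{Published: true, HasUnsealedCopy: true, SkipIPNIAnnounce: true}
	fmt.Println(shouldIndexLocally(d), shouldAnnounce(d)) // true false
}
```

Separating the two predicates mirrors the lifecycle: local indexing tracks the unsealed copy, while announcement is additionally gated by the deal's flag.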
🍏 Deletion of an unsealed copy
Decision:
- Remove the local indexes (for deals with `SkipIPNIAnnounce=false`).
- Announce the deletions to the network indexers (for deals with `SkipIPNIAnnounce=false`).
When an unsealed copy is deleted today, indexes are not removed. There is currently support in the Network Indexers to include metadata on whether or not the data is unsealed, but it's not being leveraged correctly today (all announced indexes are being marked as unsealed).
We need a mechanism to detect the removal of unsealed copies (as they can be `rm`'d manually). The section on Repairing Indexes below speaks to how this might be accomplished. Upon detection of deletion, we can perform one of the following actions (we need to decide between the options):

Option 1 - Remove the indexes (recommended): When an unsealed copy is removed, as the unsealing process is a non-trivial operation, we should assume the copy will not become available in a short time frame. As such, the local indexes should be removed and we should announce the deletions to the network indexers. This frees up space both locally and on the indexers. If unsealed copies were expected to be created/deleted often, then this option might be less reasonable, but this is not the case today.
Option 2 - Update index metadata: When an unsealed copy is removed, we can update the metadata of the indexes for that deal to specify that no unsealed copy exists. This would still allow discovery of the SP who has the content, but retrieval would not function without an unseal. The advantage of this option is that a client could pay the unseal price to get the data, knowing who has it. However, it's worth noting that retrieval flows requiring unsealing are not particularly clear and would likely need further work to become viable.
If this option is chosen, we may want to change the indexing logic of sector expiration and will definitely need to change how removed sectors are handled.
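To make the two options concrete, here is a hedged sketch of the deletion handler. The `IndexStore` interface and its method names are invented for illustration, not Boost's real types:

```go
package main

import "fmt"

// IndexStore abstracts the local index plus network-indexer announcements.
// This interface is hypothetical, for illustration only.
type IndexStore interface {
	RemoveIndexes(pieceCID string) error            // drop local indexes, announce deletion
	SetUnsealedFlag(pieceCID string, ok bool) error // update per-deal metadata
}

// onUnsealedCopyDeleted applies whichever policy is chosen.
func onUnsealedCopyDeleted(s IndexStore, pieceCID string, keepSealedIndexes bool) error {
	if keepSealedIndexes {
		// Option 2: keep the indexes, but mark the data as sealed-only.
		return s.SetUnsealedFlag(pieceCID, false)
	}
	// Option 1: unsealing is slow, so treat the copy as gone and free the space.
	return s.RemoveIndexes(pieceCID)
}

// logStore just records which action was taken, for demonstration.
type logStore struct{ last string }

func (l *logStore) RemoveIndexes(p string) error { l.last = "removed " + p; return nil }
func (l *logStore) SetUnsealedFlag(p string, ok bool) error {
	l.last = fmt.Sprintf("flag %s unsealed=%v", p, ok)
	return nil
}

func main() {
	s := &logStore{}
	onUnsealedCopyDeleted(s, "piece-1", false) // option 1
	fmt.Println(s.last)
	onUnsealedCopyDeleted(s, "piece-1", true) // option 2
	fmt.Println(s.last)
}
```

Note that under option 2 the same handler would also need to fire on sector removal, matching the point above about removed sectors.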
🍏 A sector is unsealed
Decision:
When we detect a sector has been unsealed, and that sector is eligible for indexing, it should be registered with the index provider for reindexing (assuming unsealed deletion option 1 is selected).
🍏 Expiration of a sector
Decision:
As long as the unsealed copy of the sector exists, the indexes should also exist. No changes should occur until the unsealed copy is removed.
🍏 Removal of a sector
Decision: no-op; index changes are triggered only by changes to the unsealed copy (announcements apply to deals with `SkipIPNIAnnounce=false`).
Same as sector expiration, this should be a no-op, as index changes would be triggered from changes to the unsealed copy only.
If unsealed deletion option 2 is selected, removal of a sector when there is no unsealed copy, will require deletion of indexes and announcement to the network indexers.
🍏 Repairing Indexes
Decision: failed piece lookups should return a clear error (e.g. `piece not found`).
One of the issues facing retrieval reliability is index metadata getting out of sync with unsealed copies (or the lack thereof). There are several reasons this may be occurring, but it often requires manual intervention by SPs to repair, and visibility into when this needs to happen is not clear. A proposal that has been discussed recently is to have an automatic repair job for indexing, to automatically ensure that unsealed copies eligible for indexing receive an integrity check and are repaired if there is an issue. This would NOT include automatic unsealing of data, as this is a resource-intensive process.
An extension of this proposal, given that unsealed copies may be deleted or created manually by SPs, is to have new index creation and repair all belong to this "repair" service. This service could be a background process that is continually repairing/registering/removing indexes with limited resource consumption. This could remove some operational overhead for common errors reported with retrievals. Specifics of how this could/should work can be fleshed out in a followup issue.
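One way such a repair service could be structured, as a hedged sketch: the `PieceState` shape, the scan callback, and the action names are all assumptions, not an existing design.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// PieceState is what the scanner observes for one piece (hypothetical shape).
type PieceState struct {
	PieceCID       string
	UnsealedOnDisk bool // unsealed copy present and readable
	IndexHealthy   bool // local index exists and passes an integrity check
}

// repairAction decides what the repair job should do for one piece.
// Note: it never triggers unsealing, which is too resource intensive.
func repairAction(p PieceState) string {
	switch {
	case p.UnsealedOnDisk && !p.IndexHealthy:
		return "reindex" // rebuild local index from the unsealed copy, re-announce
	case !p.UnsealedOnDisk && p.IndexHealthy:
		return "remove" // option 1: drop indexes and announce the removal
	default:
		return "noop"
	}
}

// runRepairLoop scans on a fixed interval until the context is cancelled,
// bounding resource use by doing one sweep per tick.
func runRepairLoop(ctx context.Context, interval time.Duration, scan func() []PieceState) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			for _, p := range scan() {
				if a := repairAction(p); a != "noop" {
					fmt.Printf("repair: %s -> %s\n", p.PieceCID, a)
				}
			}
		}
	}
}

func main() {
	fmt.Println(repairAction(PieceState{PieceCID: "p", UnsealedOnDisk: true}))
}
```

Keeping the policy in a pure function like `repairAction` also makes it easy to extend later, e.g. if option 2 is chosen, the "remove" branch would become a metadata update instead.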
Related Issues & Discussions