How to enable "Purge Items" Feature ? #93

Open
tsuyoshihamano opened this issue Jan 26, 2021 · 13 comments

Comments

@tsuyoshihamano

Hi, Tsuyoshi from Raytion GmbH here.
I read about the Purge Items feature, which is also partially mentioned in your documentation here.
I assume this feature cleans up items that are stored in the crawl DB but were not fetched in the previous job.
Will those documents also be deleted from the Solr collection (they are also registered as documents)?
If yes, does this feature need to be enabled explicitly? We are currently not seeing our unvisited items being deleted.
Does the same mechanism also apply to access controls?

Thank you in advance.

@mwmitchell
Contributor

Hi @tsuyoshihamano. What version of the connectors-sdk and Fusion are you using?

@tsuyoshihamano
Author

Hi @mwmitchell,
we are using SDK version 3.0.0 and Fusion 5.3.

@mtibbit-lucidworks

@mwmitchell are you familiar with this issue? Christian at Raytion reported it's currently blocking them on the Yammer/MS Teams connector work. We have our update with Raytion tomorrow morning if it'd be helpful for you to join.

@puneetkhanal
Contributor

@tsuyoshihamano The "Purge Items" feature should work for connectors that do recrawls. Yes, it should delete items from the crawlDb as well as from the content collection if they are not modified. This also holds for AccessControlItem entries in the crawlDb, but it does not delete anything from the AccessControl collection.

Purge items should be enabled by default in 5.3.

Are you emitting a checkpoint, as in an incremental crawl?

@tsuyoshihamano
Author

Thanks @puneetkhanal,
we are not emitting checkpoints. Our documents are emitted directly, without any checkpoints.
I think I need more explanation of how deletions are detected internally within your infrastructure.
I expected documents in the crawl DB to be detected as deleted if they were not emitted in the subsequent job. Could you confirm? Also, for access controls, we need to emit the deletion explicitly for them to be deleted from the Access Control collection, right?

@puneetkhanal
Contributor

@tsuyoshihamano I looked further into purging stray items. Currently, purging stray items works only in a special case: the connector needs to emit a checkpoint and emit candidates with isTransient:false. This was built for a scenario where a customer had an incremental connector that emitted a checkpoint but could not determine which items to delete, so in that case we look at the crawlDb and delete outdated items.

We would like to understand more about your use case. If you are implementing a recrawl or incremental connector, you need to emit a delete for an item in order to remove it from the Solr collection. Purge items works only in the special case mentioned above (and because it exists for that special case only, it is better not to rely on it, as it is subject to change).

Ideally, the connector should figure out by itself which items need to be deleted and emit a delete for each of them. The same holds for AccessControlItem.
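
A minimal sketch of how a connector might work out its own deletions, assuming it can persist the set of item ids it emitted in the previous crawl. `DeletionDiff` and `idsToDelete` are hypothetical names used for illustration only and are not part of the connectors SDK; each returned id would then be handed to whatever delete emit the SDK version in use provides.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of the connectors SDK. */
final class DeletionDiff {

    private DeletionDiff() {}

    /**
     * Returns the ids that were emitted in the previous crawl but are
     * absent from the current one, i.e. the items the connector should
     * emit a delete for. The result is sorted for stable output.
     */
    static Set<String> idsToDelete(Set<String> previousIds, Set<String> currentIds) {
        // Copy first so the caller's set is left untouched.
        Set<String> stale = new TreeSet<>(previousIds);
        stale.removeAll(currentIds);
        return stale;
    }
}
```

The connector would store `currentIds` at the end of each crawl so that it becomes `previousIds` for the next one.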

@tsuyoshihamano
Author

@puneetkhanal,
It sounds like our use case is similar to the first one you described. Currently we emit a candidate for each document. Do we need a checkpoint as well, or is one of the two sufficient? If the candidate alone is sufficient, do we need to set isTransient to false explicitly (the documentation says it is false by default for candidates)? Do we need to set further information (e.g. metadata) on the candidate? The "Purge Items" feature does not work with our current implementation.

@puneetkhanal
Contributor

@tsuyoshihamano Yeah, isTransient is false by default, so you don't need to do anything. You need to emit a checkpoint at the end of the first crawl (metadata is optional). In the next crawl, your connector will receive only that checkpoint item, and based on that checkpoint you can emit further candidates.

@tsuyoshihamano
Author

@puneetkhanal, would the next crawl then also emit a checkpoint at the end of the crawl? Will the deletions then happen based on the diff of candidates between the checkpoints of different crawls?

@puneetkhanal
Contributor

@tsuyoshihamano Yeah, a subsequent crawl will update the checkpoint with any additional information that may be required for the next crawl, and whenever the next crawl ends, it will check the crawlDb for stray or obsolete items and delete them.

@puneetkhanal
Contributor

It checks for items that have not been modified in the current crawl and then deletes them from the crawlDb and from the Solr collection (the content collection).
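
The end-of-crawl check described above can be modelled with a small in-memory sketch. This is a toy model for illustration, not Fusion's implementation: `ToyCrawlDb` and its methods are hypothetical, and the real crawlDb tracks more state than a crawl counter.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of stray-item purging; hypothetical, not Fusion code. */
final class ToyCrawlDb {

    // item id -> number of the crawl in which the item was last emitted
    private final Map<String, Integer> lastSeen = new TreeMap<>();
    private int currentCrawl = 0;

    /** Marks the beginning of a new crawl. */
    void startCrawl() {
        currentCrawl++;
    }

    /** Records that an item was emitted (touched) in the current crawl. */
    void emit(String id) {
        lastSeen.put(id, currentCrawl);
    }

    /**
     * Called when a crawl ends: removes every item that was not touched
     * in the current crawl and returns the removed ids in sorted order.
     * In the model, these are the ids that would also be deleted from
     * the content collection.
     */
    List<String> purgeStrayItems() {
        List<String> purged = new ArrayList<>();
        Iterator<Map.Entry<String, Integer>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> entry = it.next();
            if (entry.getValue() < currentCrawl) {
                purged.add(entry.getKey());
                it.remove();
            }
        }
        return purged;
    }

    boolean contains(String id) {
        return lastSeen.containsKey(id);
    }
}
```

In this model, an item emitted in the first crawl but not in the second is purged when the second crawl ends, while items emitted in both crawls are kept.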

@tsuyoshihamano
Author

Thanks @puneetkhanal,
will give it a try now. So, to get rid of access controls in the Access Control collection, a manual delete via deleteAccessControl() is the only way, right?

@puneetkhanal
Contributor

Yeah, that is the correct way.

/**
 * Example Usage: {@code fetchContext.newDeleteAccessControlItem(id)
 *                            .withQuery(Collections.singletonMap("name","xyz"), false)
 *                            .emit();
 *                }
 */
