Service-specific document types #89
It's a tough question. If we had to decide right now, I'd say no. To better assess the size of the issue and the potential value, do we know how many such service-specific documents are in ToSBack (as a total count, as a proportion of all documents, and as a proportion of all service providers)?
@AdrienFines have you encountered such service-specific document types among the documents you have added manually?
Usually not, except for Twitter (for instance: https://help.twitter.com/en/rules-and-policies/twitter-reach-limited), but I understand that Twitter is quite a special case anyway.
Thanks to this issue and #88, we have made major changes to the way document types are declared and handled. The work in progress can be followed in #90, and input is welcome 🙂 If #90 is merged, I would be fine with accepting more niche document types. It will always be possible to remove documents that turn out to be unique later, or to find some other way to handle them.
Service-specific document types that should be created are listed in this mapping table from ToSBack types to CGUs types. These two, hopefully with a slightly widened scope, should be created:
Do we want to crawl a document like https://www.foxnews.com/closed-captioning ?