Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When limits are hit, property is not populated correctly #79

Open
benoit74 opened this issue Oct 3, 2024 · 3 comments
Open

When limits are hit, property is not populated correctly #79

benoit74 opened this issue Oct 3, 2024 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@benoit74
Copy link
Collaborator

benoit74 commented Oct 3, 2024

See e.g. https://zimit.kiwix.org/api/v1/requests/781385ac-9640-4232-8ffd-59a0bb68d0a0 or https://zimit.kiwix.org/api/v1/requests/a4e9121f-e73d-4e94-8fa4-b2a87db8e6d9

Both tasks have hit the 2 hours limit, but the API says that the limit has not been hit.

Something is obviously broken somewhere.

Note this is probably an upstream issue, but one need to dive to confirm this.

@benoit74 benoit74 added the bug Something isn't working label Oct 3, 2024
@benoit74 benoit74 self-assigned this Oct 3, 2024
@vivekmuthe4

This comment has been minimized.

@rgaudin
Copy link
Member

rgaudin commented Oct 14, 2024

I believe this is simply set based on the return code of the crawler. I know there have been discussion around return codes and we updated recently so I'd search for this first.

@benoit74
Copy link
Collaborator Author

benoit74 commented Nov 8, 2024

Here is the situation: in browsertrix crawler, when progress is reported, the following structure is used:

{
  "crawled": 2,
  "total": 2,
  "pending": 0,
  "failed": 0,
  "limit": {
    "max": 2,
    "hit": true
  },
  "pendingPages": []
}

The limit.max and limit.hit are populated solely based on the --limit CLI argument (maximum number of pages to crawl) and not on --sizeLimit and --timeLimit nor --diskUtilization.

In zimit, we reuse these two properties to populate the task progress JSON file.

In zimfarm worker, we use the task progress JSON file to set the task container.progress property.

In zimit-frontend we use the container.progress.limit.hit to decide wether we need to indicate a limit has been reached and ZIM is incomplete or not.

All that been said, I don't know how / where we should implement a fix:

  1. browsertrix crawler is already returning exit code 11 when any of --sizeLimit or --timeLimit or --diskUtilization limits are hit. We already detect this in zimit but do nothing of the information expect logging it (see https://github.com/openzim/zimit/blob/15b72022cebf42143dda2eb32b6508b4576bcc5f/src/zimit/zimit.py#L571-L572) ; we could modify zimit to set limit.hit=True when such a code occurs ; this is probably a fast track solution ... but it feels a bit dirty since we modify crawler status "on-the-fly" ... I promise it will be quite confusing / source of other bugs in the future
  2. we can request to modify browsertrix crawler to give details about all limits, and not only --limit ; then modify zimit (if needed, looks like not, it assume limit is a dict with any keys) and zimfarm and zimit-frontend to properly handle new limit values ; probably the cleanest / most future proof solution ... but it is going to take time and discussions (e.g. in browsertrix crawler, the whole crawl status is logged, do we really want to get this even bigger?) ... and the added value to the user is very limited since anyway his ZIM is incomplete
  3. we can implement a middle ground: consider that task progress should only contain a crawl_interrupted property, and properly set this in zimit ; currently zimfarm and zimit-frontend do not need / use more details ; and the amount of limits possible in every scraper can be significant, but all could have a notion of crawl_interrupted, and it could be useful in Zimfarm to be able to still get a ZIM of an interrupted crawl for diagnosis, but never publish such a ZIM to the library (e.g. only push it to S3)

I'm personally in favor of option 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants