Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: SITES-23133 [Importer] Report on the progress of a import job via a new jobs endpoint #508

Merged
merged 13 commits into from
Sep 24, 2024

Conversation

bhellema
Copy link
Contributor

@bhellema bhellema commented Sep 23, 2024

When querying the job progress via the endpoint eg. /tools/import/jobs/e551798c-eb17-49bc-beb3-4e3737734db4/progress

The result over time will look something like this:

{
	"pending": 11,
	"redirect": 0,
	"running": 0,
	"completed": 1,
	"failed": 0
}
// next request
{
	"pending": 9,
	"redirect": 0,
	"running": 0,
	"completed": 3,
	"failed": 0
}
// next request
{
	"pending": 0,
	"redirect": 0,
	"running": 0,
	"completed": 12,
	"failed": 0
}

When getting the job status and the processor will also add the redirectCount to the result:

{
	"id": "7e646a90-0f4b-4355-8d63-20681adfc360",
	...,
	"successCount": 12,
	"failedCount": 0,
	"redirectCount": 0
}

@bhellema bhellema self-assigned this Sep 23, 2024
@bhellema bhellema added the enhancement New feature or request label Sep 23, 2024
Copy link

This PR will trigger a patch release when merged.

src/controllers/import.js Outdated Show resolved Hide resolved
* - failed: URLs that have failed to import.
* - pending: URLs that are still being processed.
* - redirected: URLs that have been redirected.
* - running - URLs that are currently being imported.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will running ever be non-zero? Since we do not have access to the data layer from the content-scraper, we do not update the DB when we begin processing a URL.

Copy link
Contributor Author

@bhellema bhellema Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is interesting. We could do two things I guess with this.

  1. remove running as it will never be non zero.
  2. implement a new feature that will allow the content-scraper to message back / use a lamda to update the status of the import url job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there value in constantly updating this field?

  • Currently, it will always be 0 or 1 since we have configured only a single Lambda invocation for the import queues, so effectively sequential processing
  • The cost of additional writes to the DB each time a new URL begins processing will add up for larger jobs

Copy link
Contributor Author

@bhellema bhellema Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is value at this point updating the field, at this point. I think at this point I should just eliminate the concept of running from the api-service. However if we start running jobs in parallel, then I think it might be interesting to revisit this.

Comment on lines -177 to -186
let updatedOptions = { ...options };

if (importScript) {
updatedOptions = { ...updatedOptions, hasCustomImportJs: true };
}

if (customHeaders) {
updatedOptions = { ...updatedOptions, hasCustomHeaders: true };
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will we expose these values for consumption both by API clients, as well as for use by our dashboards? cc @swetabar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have to provide new columns in the job for these values. We can log an new issue for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SITES-25634

src/support/import-supervisor.js Outdated Show resolved Hide resolved
READ: 'imports.read', // allows users to read the import jobs created with their API key
WRITE: 'imports.write', // allows users to create new import jobs
READ_ALL: 'imports.read_all', // allows users to view all import jobs
WRITE_ALL_DOMAINS: 'imports.write_all_domains', // allows users to import across any domain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include this new scope in spacecat-shared as well? I'm planning to add it with my PR for https://jira.corp.adobe.com/browse/SITES-22756

@bhellema bhellema merged commit 78ef726 into main Sep 24, 2024
4 checks passed
@bhellema bhellema deleted the bhellema/SITES-23133 branch September 24, 2024 15:45
solaris007 pushed a commit that referenced this pull request Sep 24, 2024
# [1.66.0](v1.65.12...v1.66.0) (2024-09-24)

### Features

* SITES-23133 [Importer] Report on the progress of a import job via a new jobs endpoint ([#508](#508)) ([78ef726](78ef726))
@solaris007
Copy link
Member

🎉 This issue has been resolved in version 1.66.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request released
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants